Conversation
|
Good to see that you fixed the off-by-one in read() :-) There are also no allowances for signals causing interrupted syscalls. That would simply cause the application to bail and exit. That brings us to a general LCNC issue: error handling. There are many more instances in LCNC programs and utilities that are not exactly well-behaved when it comes to error resilience and recovery. But that would require a complete review and a well-defined strategy for how to cope. |
|
Is this stuff even thread safe? It looks like it uses emcsched.cc, and addProgram there just uses push_back on a std::list... did I miss a lock? Seems incorrect. |
|
I like where this cppcheck warning brought us. As I understand it, pthread_detach declares the pthread_t reusable once the thread has returned, but correct me if I'm wrong. At one of the very last lines, just prior to the invocation of sockMain, another pthread_create is performed. That one should be detached too, but I'll wait for clarification on the position of this pthread_detach. @rmu75, I also just skimmed through emcsched.cc and emcsched.hh - the queueStatus is an enum that may indeed benefit immediately from some thread safety, since testing for a condition and changing its status should not be interrupted. And addProgram changes the queue and then triggers a queue sort, which may come as a disturbing inconvenience to anyone iterating through the queue at the same time. Should we track this as a separate issue? |
You are right! It is incorrect and not thread safe. There are probably many more problems with thread safety. These would not have been seen as the "normal" use is to attach only one remote client to send stuff. Seen in this light, the choice is to limit to one client only (or use a select/poll construct to serialize access) or to review and fix the entire code path to be thread safe. |
A detached thread will have its resources purged when it exits. That means that the return value cannot be retrieved, for which you would have to use a join. In this case, it doesn't matter because no return value is used. Therefore, a simple detach should suffice.
That doesn't matter. That specific thread is the "queue runner" and there is only one during the lifetime of the program. Any resources associated with it will automatically die when the program dies. |
|
We have two separate threads already. One is started in linuxcnc/src/emc/usr_intf/schedrmt.cc (line 1289 in 46c3b99), the other in linuxcnc/src/emc/usr_intf/emcsched.cc (line 313 in 46c3b99). |
Done it anyway; it may eventually avoid false-positive memory leaks in valgrind. |
|
I think the likelihood that concurrency problems are triggered in real applications of schedrmt is very low, and in this case it would be acceptable to just document the shortcomings. Introducing a global mutex that is held while processing data from read() should be possible and relatively easy. As no significant amounts of data are moved, that should have no performance implications. The proper way to fix this amounts to a rewrite with boost::asio or equivalent (which would also get rid of the threads, btw), or a port to Python. |
|
With a global mutex, I think only invocation of |
|
with something like this

```cpp
class MutexGuard {
    static pthread_mutex_t mutex_;
public:
    MutexGuard() { while (pthread_mutex_lock(&mutex_)) /* sleep? exit? */; }
    ~MutexGuard() { pthread_mutex_unlock(&mutex_); }
};

pthread_mutex_t MutexGuard::mutex_ = PTHREAD_MUTEX_INITIALIZER;
```

it should be only a matter of creating a local MutexGuard object at the start of the functions to be mutex'ed. Don't pthread_exit though. |
|
There is no thread-die-exit-or-else afaics, so just mutexing the two functions should handle the problem. Then, thisQuit() is called from a signal handler and uses exit(0). However, you should use _exit(0) when called from a signal handler. |
|
@smoe, did you see my review comments? |
Well, yes and no; I had not mentally parsed the exit -> _exit one. This is all a bit beyond my daily routines - what is the signal meant to stop? From how I interpret this all now, I was wrong to remove the exit from the implementation of shutdown, but what flavour of exit is appropriate when the shutdown code also frees everything? |
Don't we need to expand any such global mutex also to emcsched.cc, etc? |
|
I would keep the mutexes in the main loop of the threads. Blocking read returns data -> acquire mutex -> do stuff -> release mutex -> go back to blocking read of socket. |
I can answer that question for myself now - no, we don't, since schedrmt is an application on its own, so there is no other route to invoke the same lower-level routines. Sorry, this seems rather obvious, but somehow I needed that. |
I suggest leaving that stuff alone. interp isn't thread safe anyway and can't be made so without extensive rearchitecture of the canon interface.
Then we should remove the label "2.9-must-fix". |
|
Hm. I admit I don't know how many threads are executed in parallel that possibly use strtok, but the problem is real, as in this little test program:

```cpp
#include <stdio.h>   // printf
#include <string.h>  // strtok, strtok_r
#include <pthread.h>

const char delim[] = " ,?";
const int num = 500;

// Thread body using strtok: all threads share one hidden save pointer.
void *f(void *) {
    for (int i = 0; i < num; i++) {
        char a[] = "hello world, how are you?";
        char *s1 = strtok(a, delim);
        printf("1: %s\n", s1);
        char *s2 = strtok(NULL, delim);
        printf("2: %s\n", s2);
        char *s3 = strtok(NULL, delim);
        printf("3: %s\n", s3);
    }
    return nullptr;
}

// Thread body using strtok_r: each thread keeps its own save pointer.
void *g(void *) {
    char *safeptr = nullptr;
    for (int i = 0; i < num; i++) {
        char a[] = "hello world, how are you?";
        safeptr = NULL;
        char *s1 = strtok_r(a, delim, &safeptr);
        printf("1: %s\n", s1);
        char *s2 = strtok_r(NULL, delim, &safeptr);
        printf("2: %s\n", s2);
        char *s3 = strtok_r(NULL, delim, &safeptr);
        printf("3: %s\n", s3);
    }
    return nullptr;
}

int main() {
    for (int i = 0; i < 100; i++) {
        pthread_t t;
        int res = pthread_create(&t, NULL, g, NULL); // swap g for f to compare
        if (res) {
            fprintf(stderr, "E: %d -> %d\n", i, res);
        } else {
            pthread_detach(t);
        }
    }
}
```

which produces garbled output with strtok (function f) and the expected output with strtok_r (function g; I did not use join to wait, so the counts are not equal). And certainly what LinuxCNC is doing is not thread-safe, in the sense that nothing prohibits the tool table from being edited simultaneously with some thread reading it, because no locking mechanism is implemented. What we have here is that someone pressing the button to reread the tool table may lead to complete nonsense in some other part of LinuxCNC that is auto-updating the contents of an .ini file or whatever. Completely unrelated, so it cannot be addressed by protecting critical areas with semaphores. And dangerous. Just because those routines use strtok instead of strtok_r. Since I am for some weird reason enjoying this, I suggest you just let me perform those strtok to strtok_r transitions, and it shall not be to the disadvantage of LinuxCNC. |
|
There is no question that strtok is not thread-safe and that concurrent use will mess up the save pointer or worse. In general, internal LinuxCNC interfaces are not thread safe anyway; strtok is not the only problem. All the stuff that is written in Python is implicitly mutexed by the global interpreter lock. The interpreter canon interface has no way to identify the context of a calling interpreter, so it really doesn't make sense to run two interpreters in one process at the same time. IME concurrency problems do show up in program usage sooner rather than later. The only reason no problems are known in emcrsh and emcsched is that nobody is/was using those with multiple connections doing stuff at the same time. |
|
Editing and reloading the tool table in the same process at the same time from two different threads seems a rather theoretical possibility; why would anybody do that and expect it to work? It won't make sense in any case. Also, all stuff that goes through NML is serialized automatically: messages are put into a message queue, and one giant semi-endless loop with a corresponding switch statement processes those messages, so there is no danger of anything happening concurrently there. |
|
The question about thread safety is of course a serious one. As said, there are many places that are not thread safe by design or implementation and we have not (yet) seen disasters because most user/usage patterns do not exploit parallelisms that would expose the races. It should be a long term goal to update and fix all programs to adhere to proper thread safe standards and fix things. But I see no reason to expedite this update process at this moment unless there is a use case that demands it or while we are fixing a particular program anyway. For now, fixing as we get along should be fine. |
|
Fixing things is of course OK and necessary, but we should not begin to redesign stuff to become threadsafe/reentrant just for the heck of it, or make stuff multi-threaded needlessly. Just the opposite: stuff like emcrsh should be redone to not need multiple threads IMHO. |
|
Well, I did previously suggest to use a simple select/poll loop to make it simpler. |
|
There is a subtle distinction between multithreading (which, in my limited understanding, is mostly what we are discussing) and functions being reentrant (which strtok is not, but strtok_r is). I'll prepare the strtok->strtok_r transition against 2.9. And I am happily anticipating your select/poll loop. |
Working on it :-) |
The select/poll loop, so I hope :-) Have the strtok_r transition prepared locally. |
|
Actually,... I've been doing a rewrite of linuxcncrsh (well, emcrsh.cc), making it single threaded and more of a C++ program. It started out as a small patch, but I went a bit further when I found all the off-by-one and buffer problems. Oh, and strtok is now also history. It is almost done. Currently testing it, smoothing things out and fixing inconsistencies. That whole process uncovered a lot more funny stuff that may be necessary to do in schedrmt.cc too. |
|
Do you have ideas for including your (some of your) tests as part of the CI ? |
Not yet, but yes, they need to be added. |
|
An update on the linuxcncrsh rewrite: Most code now just works. I've also re-enabled the test in The idea that you can make linuxcncrsh single threaded is not supported by the code. The problem is the fact that NML will stall the entire process when you run synchronously (WAIT_MODE DONE). That is a problem and can only be solved by having the session (i.e. a connected socket) run in its own thread. The next thing is that linuxcncrsh mostly is single threaded because it uses an Now, a real question is whether it makes sense, at all, to have multiple sessions to linuxcncrsh that can control a single machine. There may be scenarios, but I'd need someone to tell me which scenario that would be. However, if so, then this probably needs to be multiplexed onto one single NML connection, but that is speculation on my part at this time. BTW, there is no guard against instantiating linuxcncrsh multiple times on different ports and connecting a session from each instance. I have no idea how that may lead to catastrophic machine pulp, broken iron, lost limbs and other unfortunate faults. How interactions of multiple sessions are to be handled needs to be discussed. FWIW, I'm currently stuck on issuing |
|
To help Andy a bit with his decision making towards accepting your work:
And... maybe https://linuxcnc.org/docs/html/man/man1/linuxcncrsh.1.html is not meant to be ultimately safe in the first place? It can be started in parallel to a GUI, so you can already unknowingly give commands that contradict (limb-cuttingly) what the machine is doing. It is a bit funny for us to care about thread-safety when that same thread-safe tool may indeed ruin any real-live thread (and limb) with ease when interfering with a running program?! |
No, it is working much better now :-) It is just that I've been digging and discovered all these issues as I went along. Many I could solve without discussion, but some need discussion and this is the time to ask some of those questions and share reflections. The test in tests/linuxcncrsh-tcp works just fine. I am working on getting the test in tests/linuxcncrsh to work because it is much more comprehensive (currently disabled in master).
Yeah, that is a very odd situation. It appears as if something in LCNC terminates when the command is issued and that wraps it all up. Need to do some more digging. It reminds me a bit of the failure of multiple spindles I was tracing. It took me a whole day to realize that the wrong hal-file was referenced in the INI. And the INI has the SPINDLES variable in the wrong place. Then, homing was again a problem. It needs a small sleep pause before issuing the next home command. Otherwise the homing process is still running and you get an error.
I'd not have it terminate immediately. There is an argument for being able to kill a session from another session. Whether that can be done safely and with valid machine-state needs to be determined. There already is a limited set of available commands when not in "enable" mode. This can be useful for more things.
Indeed, you have a very valid point. OTOH, it would be nice to prevent some obvious risk or at least warn about it. You can rebuild a crashed machine. Regrowing limbs is not (yet) common practice ;-) |
It can make sense in case you have multiple HMIs / operator stations. The only real problem is that error messages and user messages can only be received by one endpoint, and if you connect multiple GUIs it is more or less non-deterministic which GUI receives a particular message. The status, OTOH, is polled, and while in theory there could be races where one GUI receives the status another GUI polled, that is harmless in my experience. I wrote a debug-tool GUI that polls at screen refresh rate, and it is no problem to run that in parallel to some "proper" GUI; in fact, I use it also as a kind of DRO on a second screen on a mill. In a sense, multiple sessions / GUIs / operator stations are like attaching a pendant that allows you to jog and "cycle run". On a real machine, all safety-relevant stuff has to be external to the linuxcnc control anyway. Also, linuxcncrsh is something that has to be explicitly enabled / started. |
|
multiplexing multiple sessions onto one NML session is probably a bad idea. |
Ah, but then shcom.{cc,hh} needs to become a proper class that can be instantiated for every connection to encapsulate separate NML sessions within the same process. That is no problem, I already had a feeling it would be necessary. There is also other work in those files because I detected static buffers that need to go anyway. When that is done, then there should be no problem in accepting multiple sessions. |
|
see https://github.com/rmu75/cockpit/blob/main/src/shcom.cpp and https://github.com/rmu75/cockpit/blob/main/include/shcom.hh I also have variants that talk zeroMQ. |
Thanks, that is a good start. I'll have a deeper look and see what I can "steal" ;-) |
|
You should probably connect to user and error channels only once. Or not connect at all -- it won't work correctly if there is a real GUI in parallel. |
|
There are three NML channels: emcCommand(W), emcStatus(R) and emcError(R). All are connected to namespace Which of these is the 'user' channel? The problem is that if two processes send a command on emcCommand, then the first reader of emcError will retrieve any current error. Interestingly, the emcStatus channel is only peek'ed at and never actually read. I am a bit baffled as to how this is supposed to work at all... |
|
It has been some time since I last looked at that stuff; it seems I misremembered about a "user" channel. I didn't want to look too closely; there are too many things that induce headache. Like https://github.com/LinuxCNC/linuxcnc/blob/master/src/libnml/nml/nml.cc#L1260 and https://github.com/LinuxCNC/linuxcnc/blob/master/src/libnml/nml/nml.cc#L1281. Maybe there is some subtle stuff I miss, but to me it looks like a copy/paste error. I assume that every code path that is not used by linuxcnc contains some subtle bugs (like messages containing long truncating those to 32 bit, or the really strange length calculations in the ascii and display updaters). |
|
Yeah, I tried going down that rabbit hole... At least I can explain the peek() on the status channel: it is because the nml file declares that buffer as not queued. Both emcCommand and emcError are queued buffers, hence the write/read. Unfortunately, I've not seen an unread(). Anyway, multiple processes may interfere with each other on the basis of the serial number. It seems that a lot of the synchronization between writing a command and reading status is based on the serial number. No one is checking the serial number of the error, so I don't even know whether any error is correlated to the last command. |
This bit has been known to be a problem for decades. The first thing to "look" gets the error status, and might or might not inform the user. |
Ok then. My suspicions confirmed. The real problem is that there are separate, uncorrelated queued read and write channels that share the backing store. Analysis: emc.emcCommand(R) (emctaskmain.cc) receives the commands from either emcsrv.emcCommand(W) or xemc.emcCommand(W) (for the default nml file). However, it cannot reply to the appropriate ???.emcError channel because these are shared by both emcsrv and xemc. To fix this, you need a back channel for emcError that matches the command connection. This is not a scalable solution, because you need to create complex nml configs with an error buffer for each connection and also track that in the task handling, which is a near impossibility. The changes required are not worth the effort. The way it looks, NML is inappropriate for handling this particular case. For this you'd want a socket connection and sendmsg()/recvmsg(). That means zeroMQ would be much better than NML at handling the communication, including status, because it makes a simple bidirectional connection. @rmu75, you said something about files using zeroMQ. Have you hacked an LCNC version, ripped out NML and replaced it with zeroMQ? |
I didn't rip out anything, but added all nml messages as flatbuf structures and played around a bit stuffing those into zeromq sockets. See https://github.com/rmu75/linuxcnc/tree/rs/zmq-experiments and https://github.com/rmu75/cockpit/tree/rs/zmq. It is just a hack; nothing is integrated into the build system, and generated files are checked in. If you want to try it, you will need to install the flatbuffer and zeromq dev libs and invoke flatc by hand so the generated stuff matches your installed flatbuf libs. flatc needs "--gen-object-api" IIRC. I did look at machinekit and their machinetalk stuff (google protobuf and zeromq), and I wondered why they didn't go the full length and rip out NML, but instead built this "bridge". It turned out it is not that hard to feed an alternate message format into task. I didn't like protobuf though; flatbuf is more efficient and doesn't suffer from a problematic compatibility story and a 2-to-3 version transition reminiscent of Python's. IIRC in my zeromq implementation, if you connect the error "channel", you receive error messages. Also, status updates are not "peeked"; messages are sent to each connection of the status socket. |
Took the patch proposed by @BsAtHome for #3700 and prepared this PR for it. It looks fine to me but I also did not test it. Any takers?
The most important bits are
There are some meta talking points: