pytest hangs while running tests #110
those processes are indeed xdist slaves.
I don't know if it's related, but I've seen parallel runs with xdist hang for a long time on my Jenkins, with the most recent output being:
@RonnyPfannschmidt, what kind of information should I provide? It happened only once; I was not able to reproduce it a second time.
@telles-simbiose all involved packages and versions; it would also help if we could take a look at the test suite. Deadlocks triggered by rare race conditions are not uncommon in distributed systems, and xdist running slaves is a distributed system.
We commonly (about 1 in 20 runs) get this issue on Jenkins and on local machines.
We see a similar issue here, where tests seem to finish but then still hang. In this specific case,
Yeah, I've suffered from this issue sometimes. --fulltrace shows where it locks up:
Try using pytest-timeout: pytest --timeout=<seconds>. This will kill the hanging test and let the execution move on for you.
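Besides the command-line flag, pytest-timeout also offers a per-test marker. A minimal sketch, assuming the plugin is installed (the 60-second cap is only illustrative):

```python
# test_example.py - sketch using the pytest-timeout plugin
# (pip install pytest-timeout). The marker caps a single test; the
# --timeout flag or the "timeout" ini option caps every test in the run.
import time

import pytest


@pytest.mark.timeout(60)  # fail this test if it runs longer than 60 seconds
def test_does_not_hang_forever():
    time.sleep(1)  # stand-in for real work; a hang here would be aborted
```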
Hey @telles-simbiose and @BlackHobbiT, did you manage to make it work? We suffer from the same issue: the test run hangs at 93% with all the workers busy, and only killing the process in Task Manager lets the run continue. When that worker is killed, the report for that specific test is also lost. Thanks
I'm facing a very weird issue with py.test -vv -n 2 test1: it gets stuck after this.
Ensure that individual tests never block for over 1800 seconds; this also helps to avoid locking up in pytest-xdist parallel testing mode: pytest-dev/pytest-xdist#110
@tamaskakuszi as far as I remember, wiping __pycache__ dirs sometimes helps.
@RonnyPfannschmidt we're also seeing this intermittently. If you'd like access to one of our environments, I can make that happen. Thanks!
@JacobCallahan shoot me more details on the work channel
Any updates about this topic, @RonnyPfannschmidt @JacobCallahan? We are facing this in the CI of the company I work for. We use xdist to run the tests in parallel, and it seems this happens from time to time when fail fast is enabled and the test session is aborted. We are then left with a zombie process stuck running this command, as far as I can tell: https://github.com/pytest-dev/execnet/blame/d7ca9815734a4efb168c3ef997858e38c040fc70/execnet/gateway_io.py#L58 It would make sense, as we are using xdist. I don't really understand what this line is supposed to do, but it looks like some old workaround possibly? I could also create an issue in execnet if that is of any use.
this line bootstraps execnet; the rest is fed as commands over stdio
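For context, a hedged sketch of the shape of that bootstrap pattern (not the actual execnet code): the parent starts a bare interpreter whose only job is to read one line from stdin, eval it into a string of source code, and exec that source; the real worker code arrives over the pipe instead of the command line.

```python
import subprocess
import sys

# Start a child with the same one-liner seen in pstree: it blocks on
# stdin.readline(), eval()s the line into a source string, then exec()s it.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys;exec(eval(sys.stdin.readline()))"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

# repr() turns the payload into a single eval()-able line, which is exactly
# what the child's readline()/eval() pair expects.
payload = "print('worker bootstrapped')"
child.stdin.write((repr(payload) + "\n").encode())
child.stdin.flush()
child.stdin.close()

print(child.stdout.read().decode())  # -> worker bootstrapped
child.wait()
```

A wedged worker looks like a child of this kind sitting in that blocking read with nothing left on the other end of the pipe.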
Okay, any idea why processes can be left hanging on that command? The command shows up in pstree and is waiting to read a file descriptor, I guess stdin, but nothing is being written there by any process.
That seems like the control process died and the worker is waiting for the shutdown command. Fetching a stack trace with gdb is only partially helpful, as the IO is handled in multiple threads and the state of the worker is unclear.
Thanks for your time Ronny, I tried digging around with gdb but basically only found a reference back to the code I mentioned. The rest of the trace was in C, so it went a bit over my head.
@Bruniz that's unclear; it's entirely possible the suite is hanging somewhere in C and the shutdown isn't reaching it.
@Bruniz what exactly do you mean by fail fast? A paste of the command plus its output would be a big help.
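For the next time a worker wedges, a lighter-weight alternative to attaching gdb is to register a Python-level stack dump on a signal from conftest.py. A hedged sketch (POSIX only; the choice of SIGUSR1 is arbitrary):

```python
# conftest.py - register a traceback dump that can be triggered from outside
# with `kill -USR1 <worker pid>`, as an alternative to gdb for a stuck
# xdist worker. Runs at import time, so every worker picks it up.
import faulthandler
import signal
import sys

# Dump the Python stacks of all threads in this process to stderr whenever
# the process receives SIGUSR1; the test run itself is not interrupted.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

The output may end up interleaved with captured output, but it at least shows where each worker's Python frames are when the hang occurs.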
I'm seeing this problem in our CI now, too.
What other information can I provide?
Here's the log leading up to the hang:
[and nothing more...]
Trying the tests in my development environment, they hang too. Since I'm running interactively there, I get slightly more output from pytest:
Never mind! It turned out that one parametrized scenario of one test was sitting in an infinite loop, which caused the test run to hang a few tests short of the end. Once I found and addressed the problem, the tests no longer hang. That's great, but was there some way I could have found this more easily?
The timeout plugin tends to be a great help for hangups.
Anybody want to contribute a change to the docs mentioning pytest-timeout?
😁 Indeed, especially if it points out the fact that it is hard to figure out which test is hanging without it!
@nicoddemus I'm wondering if xdist should identify all currently running tests and their phases whenever a node exceeds a predetermined timeframe. A further expansion of this might be printing stack traces.
Sounds good @RonnyPfannschmidt, indeed it makes sense for a new option to at least warn the user if a test has been running for X seconds (configurable, perhaps with a reasonable default of, say, 120s). However, I would leave the job of cancelling long-running tests to pytest-timeout.
Indeed, a debugging print is fine, but the terminate gun ought to be opt-in.
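As a rough illustration of the print-only variant discussed here (not an existing xdist option), a conftest.py hook could arm a stack dump that fires only if a test runs past a threshold; the 120-second value is just the default floated above:

```python
# conftest.py - hedged sketch of "warn and print stack traces, but don't
# kill": uses the stdlib faulthandler module around each test call.
import faulthandler

import pytest

WARN_AFTER = 120  # seconds


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    # Arm a traceback dump that triggers only if the test body is still
    # running after WARN_AFTER seconds; exit=False keeps the test alive.
    faulthandler.dump_traceback_later(WARN_AFTER, exit=False)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
```

Recent pytest versions also expose a similar built-in via the faulthandler_timeout ini option.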
F39 pip isn't affected, but on rpm the tests get stuck until the OOM killer triggers. There are multiple similar reports upstream but no fix. To unblock the unit tests, the F39 RPM job will be skipped for now. F39 pip covers py39, py310, py311 and py312. CI jobs that run into the issue: https://jenkins-pagure.apps.ocp.cloud.ci.centos.org/job/pull-requests/276/ https://jenkins-pagure.apps.ocp.cloud.ci.centos.org/job/pull-requests/277/ GitHub issues that report similar problems: pytest-dev/pytest-xdist#110 pytest-dev/pytest-xdist#661 pytest-dev/pytest-xdist#872 pytest-dev/pytest-xdist#1005
We had this also, and it seems that lowering the worker count helps. It would be great if you could set a negative value, i.e. "-n=-1", which could mean "logical minus 1 core", to allow as many cores as possible to be used whilst minimising the deadlocking risk. See the sketch below for one way to approximate this today.
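One workaround along these lines is to override what -n auto resolves to from conftest.py. A hedged sketch, assuming the pytest_xdist_auto_num_workers hook available in recent pytest-xdist releases:

```python
# conftest.py - make `-n auto` mean "logical cores minus one" so one core
# stays free for the controller process.
import os


def pytest_xdist_auto_num_workers(config):
    # Never return fewer than 1 worker, even on a single-core machine.
    return max((os.cpu_count() or 2) - 1, 1)
```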
Hello everyone,
Last night I left my test suite running until this morning, but I noticed that it hadn't finished running all the tests. Looking at htop, I saw some strange processes that had been running for a really long time, as shown in this screenshot. Looking at the test output, the last tests run were all run by the same worker gw2 (there were 4 workers running), and since there were 3 processes import sys;exec(eval(sys.stdin.readline())) running for 13+ hours, I think those 3 workers were just stuck somehow.