
pytest hangs while running tests #110

Open
telles-simbiose opened this issue Jan 11, 2017 · 31 comments
telles-simbiose commented Jan 11, 2017

Hello everyone,
Last night I left my test suite running, but by this morning it still hadn't finished. Looking at htop, I noticed some strange processes that had been running for a really long time, as shown in this screenshot:

Looking at the test output, I saw that the last tests were all run by the same worker, gw2 (there were 4 workers running). Since there were 3 processes running `import sys;exec(eval(sys.stdin.readline()))` for 13+ hours, I think those 3 workers were just stuck somehow.

@RonnyPfannschmidt
Member

Those processes are indeed xdist slaves; however, with the amount of information you provided, we are entirely unable to help.


Daenyth commented Jan 12, 2017

I don't know if it's related, but I've seen parallel runs with xdist hang for a long time on my Jenkins, with the most recent output being `scheduling tests via LoadScheduling`.

@telles-simbiose
Author

@RonnyPfannschmidt, what kind of information should I provide? It happened only once; I was not able to reproduce it a second time.

@RonnyPfannschmidt
Member

@telles-simbiose all involved packages and versions; it would also help if we could take a look at the test suite.

Deadlocks triggered by rare race conditions are not uncommon in distributed systems, and xdist running slaves is a distributed system.


pseudotensor commented Mar 10, 2018

We commonly (about 1 run in 20) hit this issue on Jenkins and on local machines.

15:56:10 ============================= test session starts ==============================
15:56:10 platform linux -- Python 3.6.4, pytest-3.3.2, py-1.5.2, pluggy-0.6.0 -- /****_env/bin/python3.6
15:56:10 cachedir: .cache
15:56:10 rootdir: /opt/****, inifile: setup.cfg
15:56:10 plugins: xdist-1.22.0, forked-0.2, cov-2.4.0
15:56:10 gw0 I / gw1 I / gw2 I / ... / gw39 I
15:56:10 [gw0] linux Python 3.6.4 cwd: /opt/****
[... identical "[gwN] linux Python 3.6.4 cwd: /opt/****" lines for gw1 through gw39 (15:56:11-15:56:19) elided ...]
15:56:19 [gw0] Python 3.6.4 (default, Mar  9 2018, 23:26:36)  -- [GCC 5.4.0 20160609]
[... identical "[gwN] Python 3.6.4 ..." interpreter lines for gw1 through gw39 (15:56:19-15:56:20) elided ...]
17:09:46 Cancelling nested steps due to timeout
17:09:46 Sending interrupt signal to process
17:09:49 Makefile:240: recipe for target 'test_integration2' failed
17:09:49 make: *** [test_integration2] Terminated
17:09:49 Terminated
[Pipeline] }
[Pipeline] // timeout
[Pipeline] sh
17:09:49 [dai-tests-2gpu_dev-NUMSKB2Q3XX56UBUAOGHD33FZR7UOQYGSSWMHKM3RXLMS6ITB5GQ] Running shell script
17:09:49 + echo Jenkins job ABORTED due to exception, non-zero errorcode, or Jenkins timeout: script returned exit code 143
[Pipeline] sh
17:09:49 [tests-2gpu_dev-NUMSKB2Q3XX56UBUAOGHD33FZR7UOQYGSSWMHKM3RXLMS6ITB5GQ] Running shell script
17:09:49 + mkdir -p build/test-reports
17:09:49 + echo <?xml version="1.0" encoding="utf-8"?>
17:09:49 <testsuite errors="0" failures="1" name="pytest" skips="0" tests="0" time="0">
17:09:49 <testcase classname="tests" file="tests" name="tests" time="0"><failure message="JOB WAS KILLED BECAUSE Exception">script returned exit code 143</failure></testcase>
17:09:49 </testsuite>
jenkins  22406  0.0  0.0   1148     4 ?        Ss   01:30   0:00      |   \_ /dev/init -- ./run.sh /opt/h2oai make test
jenkins   2536  0.0  0.0  19716  3156 ?        S    01:32   0:00      |           \_ /bin/bash -c pytest --color=yes --durations=10 -s -v --fulltrace --full-trace --junit-xml=build/test-reports/unit-test_test_models.xml -n auto tests/test_models 2> ./tmp/test_models.29966_2018.03.05-09:32:12
jenkins   2537  0.0  0.0 3063628 29844 ?       Sl   01:32   0:01      |               \_ /h2oai_env/bin/python3.6 /env/bin/pytest --color=yes --durations=10 -s -v --fulltrace --full-trace --junit-xml=build/test-reports/unit-test_test_models.xml -n auto tests/test_models
jenkins   2541  0.2  0.1 3606224 161340 ?      Sl   01:32   0:09      |                   \_ /h2oai_env/bin/python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))
[... roughly 40 further identical "python3.6 -u -c import sys;exec(eval(sys.stdin.readline()))" xdist worker processes (a few with child processes of their own) elided ...]
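When workers appear wedged like this, the standard-library faulthandler module can show where a Python process is blocked; pytest can also arm it for you via the faulthandler_timeout ini option (built into pytest since 5.0, earlier available through the pytest-faulthandler plugin). A minimal self-contained sketch of dumping every thread's stack to a file:

```python
import faulthandler
import tempfile

# Dump the stack of every thread in this process to a temporary file.
# In a hung pytest run, setting `faulthandler_timeout = <seconds>` in the
# ini file makes pytest emit a similar dump when a test exceeds the limit.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    trace = f.read()

# Each thread is listed as "... (most recent call first):" followed by frames.
assert "most recent call first" in trace
```

This only shows where the dumping process itself is blocked; inspecting an already-hung worker from outside needs an external tool or a pre-armed timeout.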


s1113950 commented Aug 1, 2018

We see a similar issue here:

py2-e2e runtests: commands[0] | py.test -vv -n 8 tests/e2e
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.7.0, py-1.5.4, pluggy-0.7.1 -- /usr/share/mlt/.venv/bin/python2
cachedir: .pytest_cache
rootdir: /usr/share/mlt, inifile: tox.ini
plugins: xdist-1.22.5, forked-0.2, cov-2.5.1
[gw0] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw1] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw2] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw3] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw4] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw5] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw6] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw7] linux2 Python 2.7.12 cwd: /usr/share/mlt               
[gw0] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw2] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw1] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw4] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw3] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw7] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw5] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]
[gw6] Python 2.7.12 (default, Dec  4 2017, 14:50:18)  -- [GCC 5.4.0 20160609]   
gw0 [20] / gw1 [20] / gw2 [20] / gw3 [20] / gw4 [20] / gw5 [20] / gw6 [20] / gw7 [20]
scheduling tests via LoadScheduling

tests/e2e/test_config_updates.py::TestConfig::test_add_remove_config 
tests/e2e/test_config_updates.py::TestConfig::test_update_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[hello-world] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[experiments] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-distributed] 
tests/e2e/test_config_updates.py::TestConfig::test_config_list 
[gw6] [  5%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_config_list 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-dist-mnist] 
[gw4] [ 10%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_add_remove_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[horovod] 
[gw3] [ 15%] PASSED tests/e2e/test_config_updates.py::TestConfig::test_update_config 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tensorboard] 
[gw1] [ 20%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[hello-world] 
[gw5] [ 25%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_enable_sync 
[gw3] [ 30%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tensorboard] 
[gw5] [ 35%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_enable_sync 
[gw7] [ 40%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch] 
[gw2] [ 45%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[experiments] 
[gw0] [ 50%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[pytorch-distributed] 
[gw4] [ 55%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[horovod] 
[gw6] [ 60%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploying_templates[tf-dist-mnist] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_debug_wrapper 
tests/e2e/test_templates.py::TestTemplates::test_templates 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[hello-world] 
tests/e2e/test_templates.py::TestTemplates::test_local_templates 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[tf-distributed] 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_check_logs 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_no_push_deploy 
tests/e2e/test_deploy_flow.py::TestDeployFlow::test_watch_build_and_deploy_no_push 
[gw3] [ 65%] PASSED tests/e2e/test_templates.py::TestTemplates::test_templates 
[gw0] [ 70%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[tf-distributed] 
[gw7] [ 75%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_deploy_check_logs 
[gw5] [ 80%] PASSED tests/e2e/test_templates.py::TestTemplates::test_local_templates 
[gw2] [ 85%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_interactive_deploy[hello-world] 
[gw6] [ 90%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_watch_build_and_deploy_no_push 
[gw1] [ 95%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_no_push_deploy 
[gw4] [100%] PASSED tests/e2e/test_deploy_flow.py::TestDeployFlow::test_debug_wrapper 
Makefile:177: recipe for target 'test-e2e-all-circleci' failed
make: *** [test-e2e-all-circleci] Terminated
Too long with no output (exceeded 20m0s)

The tests seem to finish, but the run still hangs. In this specific case, the Python 2 tests appeared to finish but hung, and the Python 3 tests never started (we use tox to trigger tests).


BlackHobbiT commented Jan 30, 2020

Yeah, I suffer from this issue sometimes. --fulltrace shows the lock here:

platform linux -- Python 3.6.1, pytest-5.0.1, py-1.5.3, pluggy-0.12.0 -- /usr/bin/python3.6
cachedir: .pytest_cache
rootdir: /home/BlackHobbiT/path/to/test
inifile: pytest.ini
plugins: forked-1.0.2, xdist-1.29.0, allure-pytest-2.7.0
[gw0] linux Python 3.6.1 cwd: /home/BlackHobbiT/path/to/test
[gw0] Python 3.6.1 (default, Sep 7 2017, 16:36:03) -- [GCC 6.3.0 20170406]


config = <_pytest.config.Config object at 0x7f1cac7e4da0>, doit = <function _main at 0x7f1caccb0f28>

    def wrap_session(config, doit):
        """Skeleton command line program"""
        session = Session(config)
        session.exitstatus = ExitCode.OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x7f1cac7e4da0>, session = <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>

    def _main(config, session):
        """ default command line protocol for initialization, session,
        running tests and reporting. """
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:257: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>, args = (), kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}
notincall = set()

    def __call__(self, *args, **kwargs):
        if args:
            raise TypeError("hook calling supports only keyword arguments")
        assert not self.is_historic()
        if self.spec and self.spec.argnames:
            notincall = (
                set(self.spec.argnames) - set(["__multicall__"]) - set(kwargs.keys())
            )
            if notincall:
                warnings.warn(
                    "Argument(s) {} which are declared in the hookspec "
                    "can not be found in this hook call".format(tuple(notincall)),
                    stacklevel=2,
                )
>       return self._hookexec(self, self.get_hookimpls(), kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py:289: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f1caee2ab00>, hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    def _hookexec(self, hook, methods, kwargs):
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook, methods, kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
        methods,
        kwargs,
>       firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
    )

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def pytest_runtestloop(self):
        self.sched = self.config.hook.pytest_xdist_make_scheduler(
            config=self.config, log=self.log
        )
        assert self.sched is not None
    
        self.shouldstop = False
        while not self.session_finished:
>           self.loop_once()

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def loop_once(self):
        """Process one callback from one of the workers."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                raise RuntimeError("Unexpectedly no active workers available")
            try:
>               eventcall = self.queue.get(timeout=2.0)

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <queue.Queue object at 0x7f1cab5f76d8>, block = True, timeout = 2.0

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.
    
        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
>                   self.not_empty.wait(remaining)

/usr/lib/python3.6/queue.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7f1cab4f2120>, 0)>, timeout = 1.9999943980947137

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.
    
        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.
    
        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.
    
        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).
    
        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.
    
        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   KeyboardInterrupt

/usr/lib/python3.6/threading.py:299: KeyboardInterrupt

@mavo123
Copy link

mavo123 commented Nov 13, 2020

Try pytest-timeout (pytest --timeout=&lt;seconds&gt;); it will kill the hanging test and let the execution move on for you.
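A minimal configuration sketch for that suggestion (the values are illustrative, and the pytest-timeout plugin must be installed):

```ini
# pytest.ini -- illustrative values; requires the pytest-timeout plugin
[pytest]
# fail any single test that runs longer than 300 seconds
timeout = 300
# the "thread" method also works on platforms without SIGALRM
timeout_method = thread
```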

@tamaskakuszi
Copy link

Hey @telles-simbiose and @BlackHobbiT,

Did you manage to make it work? We suffer from the same issue: the test run hangs at 93%. All the workers are busy, and only killing the stuck process in Task Manager lets the run continue. When that worker is killed, the report for that specific test is also lost.

Thanks

@vishakha-vonage
Copy link

I'm facing a very weird issue.
When I run my tests in parallel:
Only one browser opens, even when I execute pytest -n 2
After the first test succeeds, pytest just hangs and never times out
Has anyone faced this issue?

py.test -vv -n 2
=================================================================================================== test session starts ===================================================================================================
platform darwin -- Python 3.7.3, pytest-5.4.1, py-1.10.0, pluggy-0.13.1 --
///*
cachedir: .pytest_cache
rootdir: //*, inifile: pytest.ini
plugins: xdist-1.31.0, allure-pytest-2.8.40, repeat-0.9.1, forked-1.1.3, timeout-1.4.2
[gw0] darwin Python 3.7.3 cwd:
/
/*
[gw1] darwin Python 3.7.3 cwd:
/**/*
collected 8 items
[gw0] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)]
[gw1] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)]
gw0 [8] / gw1 [8]
scheduling tests via MarkBasedScheduler

test1
[gw1] [ 12%] PASSED test1

It will get stuck after this.

gtristan added a commit to apache/buildstream that referenced this issue Aug 18, 2021
Ensure that individual tests never block for over 1800 seconds; this also helps
to avoid locking up in pytest-xdist parallel testing mode:

    pytest-dev/pytest-xdist#110
@BlackHobbiT
Copy link

@tamaskakuszi as far as I remember, wiping __pycache__ dirs sometimes helps.

@JacobCallahan
Copy link

@RonnyPfannschmidt we're also seeing this intermittently. If you'd like access to one of our environments, I can make that happen. Thanks!

@RonnyPfannschmidt
Copy link
Member

@JacobCallahan shoot me more details on the work channel

@Bruniz
Copy link

Bruniz commented Feb 20, 2023

Any updates on this topic @RonnyPfannschmidt @JacobCallahan?

We are facing this in the CI at the company I work for. We use xdist to run the tests in parallel, and it seems this happens from time to time when fail-fast is enabled and the test session is aborted. We are then left with a zombie process stuck running this command https://github.com/pytest-dev/execnet/blame/d7ca9815734a4efb168c3ef997858e38c040fc70/execnet/gateway_io.py#L58 as far as I can tell, which would make sense since we are using xdist.

I don't really understand what this line is supposed to do, but it looks like it might be some old workaround?

I could also create an issue in execnet if that is of any use.

@RonnyPfannschmidt
Copy link
Member

This line bootstraps execnet; the rest of the worker code is fed as a command over stdin.
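To illustrate what that bootstrap does, here is a small self-contained sketch (not execnet itself): the parent spawns a child with the same `import sys;exec(eval(sys.stdin.readline()))` command line seen in ps, then feeds it a repr()'d source string over stdin. If the parent dies before writing anything, the child blocks on that readline forever, which matches the zombie workers described in this thread.

```python
import subprocess
import sys

# The same bootstrap command line that shows up in ps/pstree for xdist
# workers: the child blocks on stdin until the parent sends it a
# repr()'d chunk of source code to execute.
BOOTSTRAP = "import sys;exec(eval(sys.stdin.readline()))"

child = subprocess.Popen(
    [sys.executable, "-u", "-c", BOOTSTRAP],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

# execnet sends its real worker source here; we send a stub instead.
child.stdin.write(repr("print('worker ready')") + "\n")
out, _ = child.communicate(timeout=10)
print(out.strip())  # worker ready
```

If the `write` line is removed, the child never returns from `readline` and has to be killed by hand, exactly like the stuck `gw` workers.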

@Bruniz
Copy link

Bruniz commented Feb 20, 2023

Okay, any idea why processes can be left hanging on that command? The command shows up in pstree, and the process is waiting to read a file descriptor, I guess stdin, but no process is writing anything to it.

@RonnyPfannschmidt
Copy link
Member

It seems like the control process died and the worker is still waiting for the shutdown command. Fetching a stack trace with gdb is only partially helpful, since the I/O is handled in multiple threads, so the state of the worker is unclear.
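When gdb only shows C frames, the third-party py-spy tool (an assumption here — it is not part of pytest-xdist and must be installed separately) can attach from outside and print the Python-level stacks of a stuck worker without disturbing it:

```shell
# 12345 is a hypothetical PID: substitute the PID of a stuck
# "import sys;exec(eval(sys.stdin.readline()))" process from ps/pstree.
py-spy dump --pid 12345
```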

@Bruniz
Copy link

Bruniz commented Feb 20, 2023

Thanks for your time Ronny. I tried digging around with gdb, but I basically only found a reference back to the code I mentioned; the rest of the trace was in C, so it went a bit over my head.
Is the interrupt from pytest then possibly not handled correctly? We only use the built-in fail-fast in pytest and don't kill anything extra.

@RonnyPfannschmidt
Copy link
Member

@Bruniz that's unclear; it's entirely possible the suite is hanging somewhere in C and the shutdown isn't reaching it.
With the information provided I'm unable to make an educated guess.

@RonnyPfannschmidt
Copy link
Member

@Bruniz by fail-fast, do you mean -x?

Having a paste of the command and its output would be a big help.

@webbnh
Copy link

webbnh commented Jul 29, 2023

I'm seeing this problem in our CI now, too.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jenkins        1  0.0  0.0   1000     4 ?        Ss   02:32   0:00 /run/podman-init -- ./build.sh
jenkins        7  0.0  0.0   4072  3272 ?        S    02:32   0:00 /bin/bash -e ./build.sh
jenkins      227  1.6  0.0 850404 36284 ?        Sl   02:34   0:31 /usr/bin/python3 -P /usr/bin/tox
jenkins      387  0.3  0.0  86652 79016 ?        S    02:35   0:05 /var/tmp/jenkins/tox/.pkg/bin/python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
jenkins      450  0.0  0.0   4204  3320 ?        S    02:35   0:00 /bin/bash /home/jenkins/pbench/exec-tests /var/tmp/jenkins/tox/py39
jenkins      560  0.0  0.0   2332   916 ?        S    02:35   0:00 /usr/bin/time --format=\n\nCommand: '%C'\nExit status: %x\nTimings: user %Us, system %Ss, elapsed %es (%E, %P)\nMemory: max RSS %Mk, minor pf: %R, major pf: %F, swaps %W\nContext switches: inv %c, vol %w, signals %k\nI/O: fs in %I, fs out %O, socket in %r, socket out %s\n pytest --tb=native -n auto --basetemp=/var/tmp/jenkins/tox/py39/tmp --cov=/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench --cov-report xml:cov/report.xml -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
jenkins      561  2.2  0.0 2587628 241432 ?      Sl   02:35   0:41 /var/tmp/jenkins/tox/py39/bin/python /var/tmp/jenkins/tox/py39/bin/pytest --tb=native -n auto --basetemp=/var/tmp/jenkins/tox/py39/tmp --cov=/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench --cov-report xml:cov/report.xml -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
jenkins      567  0.5  0.0 307108 149232 ?       Sl   02:35   0:09 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      570  0.4  0.0 299348 140964 ?       Sl   02:35   0:07 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      573  0.5  0.0 308988 150952 ?       Sl   02:35   0:10 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      576  0.4  0.0 302416 144392 ?       Sl   02:35   0:07 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      579  0.5  0.0 311952 153348 ?       Sl   02:35   0:10 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      582  0.5  0.0 305180 149068 ?       Sl   02:35   0:09 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
jenkins      585  0.6  0.0 313972 155868 ?       Sl   02:35   0:11 /var/tmp/jenkins/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
 [and 25 more worker processes in similar states]

What other information can I provide?

$ /var/tmp/jenkins/tox/py39/bin/python --version
Python 3.9.16
$ /var/tmp/jenkins/tox/py39/bin/pytest --version
pytest 7.4.0

Here's the log leading up to the hang:

py39: install_deps> pip install --cache-dir=/var/tmp/jenkins/tox/cache --progress-bar off --prefix=/var/tmp/jenkins/tox/py39 -r /home/jenkins/pbench/agent/requirements.txt -r /home/jenkins/pbench/agent/test-requirements.txt -r /home/jenkins/pbench/client/requirements.txt -r /home/jenkins/pbench/server/requirements.txt -r /home/jenkins/pbench/server/test-requirements.txt
.pkg: install_requires> python -I -m pip install 'setuptools>=46.1.3' 'wheel>=0.34.2'
.pkg: _optional_hooks> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: get_requires_for_build_sdist> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: install_requires_for_build_sdist> python -I -m pip install pbr
.pkg: prepare_metadata_for_build_wheel> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: build_sdist> python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
py39: install_package> pip install --cache-dir=/var/tmp/jenkins/tox/cache --progress-bar off --prefix=/var/tmp/jenkins/tox/py39 --force-reinstall --no-deps /var/tmp/jenkins/tox/.tmp/package/1/pbench-0.0.1.dev2674.tar.gz
py39: commands[0]> bash -c '/home/jenkins/pbench/exec-tests /var/tmp/jenkins/tox/py39 '



Running agent,client,server python3-based unit tests via pytest

Package                  Version
------------------------ -------------
alembic                  1.11.1
aniso8601                9.0.1
appier                   1.21.7
bcrypt                   4.0.1
Bcrypt-Flask             1.0.2
beautifulsoup4           4.9.3
blinker                  1.6.2
boto3                    1.17.97
botocore                 1.20.97
bottle                   0.12.25
bs4                      0.0.1
cachetools               4.2.2
certifi                  2021.5.30
cffi                     1.15.1
chardet                  4.0.0
click                    8.1.6
coverage                 7.2.7
cryptography             41.0.2
docutils                 0.20.1
elasticsearch            7.13.4
exceptiongroup           1.1.2
execnet                  2.0.2
filelock                 3.12.2
Flask                    2.3.2
Flask-Cors               4.0.0
Flask-HTTPAuth           4.8.0
Flask-JWT-Extended       4.5.2
Flask-Migrate            4.0.4
Flask-RESTful            0.3.10
Flask-SQLAlchemy         3.0.5
freezegun                1.2.2
gitdb                    4.0.10
GitPython                3.1.32
google-api               0.1.12
google-api-core          1.30.0
google-api-python-client 2.9.0
google-auth              1.31.0
google-auth-httplib2     0.1.0
google-auth-oauthlib     0.4.4
googleapis-common-protos 1.53.0
greenlet                 2.0.2
gunicorn                 21.2.0
hiredis                  2.2.3
httplib2                 0.19.1
humanize                 4.7.0
idna                     2.10
ifaddr                   0.2.0
importlib-metadata       6.8.0
iniconfig                2.0.0
itsdangerous             2.1.2
Jinja2                   3.1.2
jmespath                 0.10.0
lockfile                 0.12.2
Mako                     1.2.4
MarkupSafe               2.1.3
mock                     5.1.0
oauthlib                 3.1.1
packaging                20.9
pbench                   0.0.1.dev2674
pip                      22.3.1
pluggy                   1.2.0
pquisby                  0.0.12
protobuf                 3.17.3
psutil                   5.9.5
psycopg2                 2.9.6
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.21
pyesbulk                 2.1.1
PyJWT                    2.8.0
pyparsing                2.4.7
pytest                   7.4.0
pytest-cov               4.1.0
pytest-dependency        0.5.1
pytest-freezegun         0.4.2
pytest-helpers-namespace 2019.1.8
pytest-mock              3.11.1
pytest-xdist             3.3.1
python-daemon            3.0.1
python-dateutil          2.8.1
python-pidfile           3.1.1
pytz                     2019.1
PyYAML                   6.0.1
redis                    3.5.3
requests                 2.25.1
requests-mock            1.11.0
requests-oauthlib        1.3.0
responses                0.23.1
rsa                      4.7.2
s3transfer               0.4.2
sdnotify                 0.3.2
setuptools               65.5.1
sh                       2.0.4
six                      1.16.0
smmap                    5.0.0
soupsieve                2.2.1
SQLAlchemy               2.0.19
SQLAlchemy-Utils         0.41.1
state-signals            1.0.1
tomli                    2.0.1
types-PyYAML             6.0.12.11
typing_extensions        4.7.1
uritemplate              3.0.1
urllib3                  1.26.5
Werkzeug                 2.3.6
wheel                    0.38.4
zipp                     3.16.2
============================= test session starts ==============================
platform linux -- Python 3.9.16, pytest-7.4.0, pluggy-1.2.0
cachedir: /var/tmp/jenkins/tox/py39/.pytest_cache
rootdir: /home/jenkins/pbench
configfile: pytest.ini
plugins: cov-4.1.0, freezegun-0.4.2, helpers-namespace-2019.1.8, requests-mock-1.11.0, mock-3.11.1, xdist-3.3.1, dependency-0.5.1
created: 32/32 workers
32 workers [1521 items]

[and nothing more...]

@webbnh
Copy link

webbnh commented Jul 29, 2023

The tests hang in my development environment, too. Since I'm running interactively there, I get slightly more output from pytest:

=============== test session starts ================
platform linux -- Python 3.9.16, pytest-7.4.0, pluggy-1.2.0
cachedir: /var/tmp/wscales/tox/py39/.pytest_cache
rootdir: /home/wscales/pbench
configfile: pytest.ini
plugins: requests-mock-1.11.0, xdist-3.3.1, mock-3.11.1, helpers-namespace-2019.1.8, freezegun-0.4.2, dependency-0.5.1, cov-4.1.0
8 workers [1521 items]  
..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ss....s.s.....s.........s..........................s..s...s.....s...s.s............................................................................................................................s..s..s...s......ss..s..s..s......s.s......ss.sss....s......s.....................................................

The ps output looks pretty much the same (only there are 8 workers instead of 32):

$ ps -auxww
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
wscales        1  0.0  0.0   1084     4 pts/0    Ss   04:32   0:00 /run/podman-init -- tox
wscales        2  2.4  0.0 776700 38664 pts/0    Sl+  04:32   0:04 /usr/bin/python3 -P /usr/bin/tox
wscales      162  4.4  0.1  88460 81804 pts/0    S+   04:32   0:06 /var/tmp/wscales/tox/.pkg/bin/python /usr/lib/python3.11/site-packages/pyproject_api/_backend.py True setuptools.build_meta
wscales      226  0.0  0.0   4204  3244 pts/0    S+   04:32   0:00 /bin/bash /home/wscales/pbench/exec-tests /var/tmp/wscales/tox/py39
wscales      270  0.0  0.0   2332   912 pts/0    S+   04:32   0:00 /usr/bin/time --format=\n\nCommand: '%C'\nExit status: %x\nTimings: user %Us, system %Ss, elapsed %es (%E, %P)\nMemory: max RSS %Mk, minor pf: %R, major pf: %F, swaps %W\nContext switches: inv %c, vol %w, signals %k\nI/O: fs in %I, fs out %O, socket in %r, socket out %s\n pytest --tb=native -n auto --basetemp=/var/tmp/wscales/tox/py39/tmp --cov=/var/tmp/wscales/tox/py39/lib/python3.9/site-packages/pbench --cov-report html:/var/tmp/wscales/tox/py39/cov/html -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
wscales      271  7.8  0.1 698088 107188 pts/0   Sl+  04:32   0:09 /var/tmp/wscales/tox/py39/bin/python /var/tmp/wscales/tox/py39/bin/pytest --tb=native -n auto --basetemp=/var/tmp/wscales/tox/py39/tmp --cov=/var/tmp/wscales/tox/py39/lib/python3.9/site-packages/pbench --cov-report html:/var/tmp/wscales/tox/py39/cov/html -rs --pyargs pbench.test.unit.common pbench.test.unit.agent pbench.test.functional.agent pbench.test.unit.client pbench.test.unit.server
wscales      276 15.3  0.2 322884 164796 pts/0   Sl+  04:32   0:18 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      280  5.6  0.1 215148 126652 pts/0   Sl+  04:32   0:06 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      284 15.6  0.3 355472 196948 pts/0   Sl+  04:32   0:18 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      288 14.8  0.3 359424 197008 pts/0   Sl+  04:32   0:17 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      292 13.5  0.2 324888 166400 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      296 13.3  0.2 326292 167844 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      300 13.5  0.2 319792 161728 pts/0   Sl+  04:32   0:16 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))
wscales      304 11.6  0.2 341464 180100 pts/0   Sl+  04:32   0:14 /var/tmp/wscales/tox/py39/bin/python -u -c import sys;exec(eval(sys.stdin.readline()))

@webbnh
Copy link

webbnh commented Jul 29, 2023

Nevermind!...

It turned out that one parametrize'd scenario of one test was sitting in an infinite loop, and that caused the test run to hang a few tests short of the end. Once I found and addressed the problem, the tests no longer hang.

That's great, but was there some way that I should have been able to find this more easily?

@RonnyPfannschmidt
Copy link
Member

The timeout plugin tends to be a great help for hangups
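Besides the plugin, the stdlib `faulthandler` module can serve as a hang watchdog on its own; here is a minimal self-contained sketch (the 1-second timeout and 2-second "hang" are illustrative — pytest's own `faulthandler_timeout` ini option and pytest-timeout build on the same mechanism):

```python
import faulthandler
import tempfile
import time

# Watchdog sketch: if the program is still running when the timer
# fires, every thread's traceback is dumped, pointing at the frame
# where the hang (e.g. an infinite loop in one test) is sitting.
with tempfile.TemporaryFile(mode="w+") as report_file:
    faulthandler.dump_traceback_later(1, file=report_file)
    time.sleep(2)  # stand-in for a test stuck past the watchdog window
    faulthandler.cancel_dump_traceback_later()
    report_file.seek(0)
    report = report_file.read()

print("Timeout" in report)  # True -- the dump header starts with "Timeout"
```

In a real suite the dump would go to stderr (the default `file`), so each xdist worker's hang shows up in its captured output.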

@nicoddemus
Copy link
Member

Anybody wants to contribute a change to the docs mentioning pytest-timeout? That would be excellent.

@webbnh
Copy link

webbnh commented Aug 2, 2023

Anybody wants to contribute a change to the docs mentioning pytest-timeout? That would be excellent.

😁 Indeed, especially if it points out the fact that it is hard to figure out which test is hanging without it!

@RonnyPfannschmidt
Copy link
Member

@nicoddemus I'm wondering if xdist should identify all currently running tests and their phases whenever a node exceeds a predetermined timeframe

An even further expansion of this might be printing stack traces

@nicoddemus
Copy link
Member

Sounds good @RonnyPfannschmidt, indeed it makes sense for a new option to at least warn the user if a test has been running for X seconds (configurable, perhaps with a reasonable default of, say, 120s). However, I would leave the job of cancelling long-running tests to pytest-timeout.

@RonnyPfannschmidt
Copy link
Member

Indeed, a debugging print is fine, but the terminate gun ought to be opt-in

pypingou pushed a commit to Pagure/pagure that referenced this issue Apr 19, 2024
F39 pip isn't affected, but on RPM the tests get stuck
until the OOM killer triggers. There are multiple
similar reports upstream but no fix. To unblock
the unit tests, the F39 RPM run will be skipped for now.
F39 pip covers py39, py310, py311 and py312.

CI jobs that run into the issue:
https://jenkins-pagure.apps.ocp.cloud.ci.centos.org/job/pull-requests/276/
https://jenkins-pagure.apps.ocp.cloud.ci.centos.org/job/pull-requests/277/

GitHub issues that report similar issues:
pytest-dev/pytest-xdist#110
pytest-dev/pytest-xdist#661
pytest-dev/pytest-xdist#872
pytest-dev/pytest-xdist#1005
@jcampbell05
Copy link

jcampbell05 commented Feb 10, 2025

We hit this as well. It seems that setting -n logical can consume too many threads and, in some cases, leads to xdist waiting for the program under test to complete while that program is waiting for a thread — but the thread pool has been exhausted by xdist. We fixed it by setting xdist to fewer workers than the number of CPU cores, which makes the tests stable again.

It would be great if you could pass a negative value, i.e. -n=-1, meaning "logical minus 1 core", to use as many cores as possible while minimising the deadlock risk.
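There is no built-in negative value for -n today, but the "logical minus one" count can be computed in the shell; a sketch for Linux, where nproc reports logical CPUs:

```shell
# Leave one logical core free for the program under test,
# but never drop below one worker.
CORES="$(nproc)"
WORKERS="$(( CORES > 1 ? CORES - 1 : 1 ))"
echo "$WORKERS"
# then run the suite as:  pytest -n "$WORKERS"
```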
