Leaked paramiko ssh connection threads prevent rho from exiting #15
What's odd is that I think it gave up authenticating long before those errors:
2015-12-17 10:35:14,544 rho WARNING - Found ssh on 10.139.28.10:22, but no auths worked. (ssh_jobs/244)
Thanks for reporting the issue. Do you expect rho to be able to connect to these iLOs? I want to make sure I can somehow recreate a scenario where rho tries but can't log in and hangs.
As a baseline, I'm running CentOS 7 and have pulled the latest rho from GitHub. At the moment I'm stepping through a remote debugger with the network isolated to the single iLO. I've set a breakpoint here:
Right now it is looping through about 12 auths (which will all likely fail, and that is fine/expected).
$ rpm -qi python-paramiko
The config is basically a single IP at the moment (it was a /24), and there are about 15 auths to try, including an SSH key as the final attempt; the others are just root/password.
I lost connection with the debugger, so I'm running it alone (no debugger) against a single IP to see what happens.
While running, each connection uses two sockets / fds. I assume this is expected.
Looks like a new pair is created for each auth request. Ah, and it just finished:
2015-12-17 11:44:38,043 paramiko.transport DEBUG - starting thread (client mode): 0xe4b990L (transport/1428)
...and exited "successfully" -- meaning, no hang. So the hang only happens when we have more connections. I'm going to try this http://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application and run the full subnet again, to see if I can get a stack trace when it hangs.
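The StackOverflow approach linked above generally amounts to installing a signal handler that dumps every thread's stack on demand. A minimal sketch (Python 2.7-compatible; nothing here is taken from rho's code):

```python
# Dump every thread's stack to stderr when the process receives SIGUSR1.
import signal
import sys
import traceback

def dump_stacks(signum, frame):
    lines = []
    for thread_id, stack in sys._current_frames().items():
        lines.append("Thread %s:\n%s" % (thread_id, "".join(traceback.format_stack(stack))))
    sys.stderr.write("\n".join(lines) + "\n")

signal.signal(signal.SIGUSR1, dump_stacks)
```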
I end up entering a state where I get this output:
But I still have one socket connected per lsof -- same IP:
...which eventually disappears after a
But then rho doesn't exit. I can't break in on SIGUSR1 to get the stack trace. If I use pstack on the pid, all threads are blocked on sem_wait().
I've added this to get periodic aggregated stack traces:
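A periodic, aggregated dumper along these lines -- a daemon thread that wakes on an interval, groups threads sharing an identical stack, and prints each unique stack once with the matching thread IDs -- is one way to do it (a sketch, not rho's actual snippet):

```python
# Periodically dump aggregated stack traces: threads with identical stacks are
# grouped so each unique stack prints once with the list of thread IDs.
import sys
import threading
import time
import traceback
from collections import defaultdict

def start_stack_dumper(interval=60):
    def dump_loop():
        while True:
            time.sleep(interval)
            stacks = defaultdict(list)
            for thread_id, frame in sys._current_frames().items():
                stacks["".join(traceback.format_stack(frame))].append(str(thread_id))
            for stack, thread_ids in stacks.items():
                sys.stderr.write("ThreadIDs: %s\n%s\n" % (thread_ids, stack))
    t = threading.Thread(target=dump_loop)
    t.daemon = True   # don't let the dumper itself keep the process alive
    t.start()
```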
And upped self.max_threads to 256 to shorten the run time.
Ok, as it finishes, I do have threads blocked on sockets:
And then when I'm in the "finished but hung" state, I get the following. We're blocked on Queues... are we missing a message in self.all_tasks_done.wait()?
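For reference, Queue.join() -- and the all_tasks_done condition it waits on -- only returns once task_done() has been called as many times as put(); if a consumer dies before calling task_done(), the waiter blocks forever. A minimal illustration (Python 2.7):

```python
# Queue.join() blocks on all_tasks_done until task_done() has been called once
# per put(). A consumer that raises before calling task_done() leaves the
# unfinished-task count above zero, and join() never returns.
import Queue

q = Queue.Queue()
q.put("host-1")
q.put("host-2")

q.get()
q.task_done()

q.get()
# If the worker raised here and skipped q.task_done(), the join() below would
# hang forever -- the same signature as the "finished but hung" state.
q.task_done()

q.join()  # returns only because every put() was matched by a task_done()
```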
Ah, Queue24 is written as a compatibility class... but I'm using Python 2.7.x. I'm going to try running with Queue24 as a direct derivation of Queue, with no overrides. If that doesn't fix it, then I wonder if an exception is causing a missing task_done message on one of the queues. Edit: changing the Queue class didn't seem to fix it, but I've left it as a direct derivation for now.
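In other words, the experiment reduces Queue24 to effectively a plain subclass (a sketch of the idea, not the class as shipped):

```python
# On Python 2.7 the compatibility shim isn't needed, so the experiment makes
# Queue24 nothing more than the stdlib Queue under another name.
import Queue

class Queue24(Queue.Queue):
    pass
```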
I've found an exception that I think isn't caught. Do you realize PkgInfoParseException is derived from BaseException?
This causes a thread to except all the way out, since you only catch Exception:
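A standalone demonstration of why that matters (the class body and handler here are illustrative, not rho's actual code):

```python
# An exception derived from BaseException sails past "except Exception", so a
# worker thread that only catches Exception dies with it unhandled.
class PkgInfoParseException(BaseException):  # parent as currently declared
    pass

def parse_pkg_info():
    raise PkgInfoParseException("unparseable package info")

try:
    parse_pkg_info()
except Exception:
    print("caught")                                # never reached
except BaseException:
    print("escaped the 'except Exception' block")  # this branch runs
```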
Changing the parent class of PkgInfoParseException from BaseException to Exception seems to have cleared it up. Edit: nope, more info coming.
There still seems to be a problem with the ssh threads hanging on recv(); I think they aren't getting shut down after the exception. I've tried adding a close in a couple of places.
There might be a smarter way to deal with it, like always closing at the end of run(), or in a finally: after the exception, perhaps with the close itself also wrapped in a try block. I think what is happening is that SshThread isn't recycling self.ssh, so it never gets GC'ed on the last iteration if an error happens. That keeps the underlying paramiko thread from ever finishing, which prevents the script from exiting. Another alternative might be to make the paramiko threads daemon threads, but I think it would be better to ensure self.ssh is always closed at the end of every iteration. Those changes, together with the s/BaseException/Exception/ change, seem to have improved reliability further.
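The per-iteration version of that idea might look roughly like this, assuming the worker keeps a paramiko.SSHClient in self.ssh (everything except SSHClient and close() is hypothetical, not rho's actual code):

```python
# Always close self.ssh at the end of every iteration, even on failure, so the
# underlying paramiko transport thread can exit and the object can be GC'ed.
import paramiko

class SshWorker(object):
    def __init__(self):
        self.ssh = None

    def process_one_host(self, host):
        try:
            self.ssh = paramiko.SSHClient()
            self.connect_and_scan(host)          # hypothetical per-host work
        except Exception as e:
            self.record_error(host, e)           # hypothetical error handling
        finally:
            if self.ssh is not None:
                try:
                    self.ssh.close()             # stops the transport thread
                except Exception:
                    pass                         # a failed close shouldn't kill the worker
                self.ssh = None                  # drop the reference entirely
```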
A couple more places might need the connection closed:
Still getting hung/orphaned threads stuck on recv(). run() might also need to clean up at the end:
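That could look something like the following, under the same assumptions as the sketch above (a Thread that pulls hosts from a Queue and keeps its client in self.ssh; hypothetical structure, not rho's actual run()):

```python
# Clean up however run() exits: the inner finally keeps the queue's task_done
# accounting correct, and the outer finally tears down the ssh connection.
import threading

class SshThreadSketch(threading.Thread):
    def __init__(self, host_queue):
        threading.Thread.__init__(self)
        self.host_queue = host_queue
        self.ssh = None
        self.quit = False

    def run(self):
        try:
            while not self.quit:
                host = self.host_queue.get()
                try:
                    self.process_one_host(host)  # hypothetical per-host work
                finally:
                    self.host_queue.task_done()  # never lose a task_done
        finally:
            if self.ssh is not None:
                try:
                    self.ssh.close()             # releases paramiko's transport thread
                except Exception:
                    pass
```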
The tool still isn't completing because some systems are still stuck in auth -- after 12 hours:
ThreadIDs: ['139837625333504', '139838028183296', '139838811657984', '139836517496576', '139837642118912', '139837482657536', '139837516228352']
As I said, I found that some connections hung overnight on authentication. As a workaround, I'm running rho against a single IP at a time (found via nmap), with rho wrapped in the Linux 'timeout' tool, and a bash script to parallelize the runs. This tool really needs some testing on more diverse/realistic networks and environments.
rho seems to get stuck scanning some iLOs that are on our network. If I watch lsof I can see the connections, and then they eventually disappear. I get these messages as they do:
2015-12-17 10:43:20,575 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:21,670 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:22,949 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:24,369 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:44:24,480 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:44:25,398 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
And then nothing else happens. rho hangs indefinitely after that -- it doesn't seem to recognize that the ssh jobs have expired.
strace doesn't help -- it just shows all threads hanging on futex()
I think this is happening while rho is trying to log in with different auths.