
RFC: Rewrite PipeStream and SshCommand stream handling #144

Closed · wants to merge 3 commits

@hifi (Contributor) commented Dec 28, 2016

  • More efficient when moving large amounts of binary data
    • Speedup on Mono: ~1.71x
    • Speedup on Win10: ~1.39x
  • No in-memory buffering of data
    • Except if you use the Execute() method to return a result
      string
    • The result string will be empty if you have read the output
      stream completely
  • Unless you close the error stream or have something reading it,
    execution can block entirely, as the event handler will hang
    indefinitely while no one reads the other end
  • Reading from PipeStream will now always block until EOF
  • Writing to PipeStream will now always block until a reader has
    completely emptied it

This has not been tested with anything other than SshCommand, and only in a very limited way. I'm planning to add InputStream on top of this if the refactor is headed in the right direction.

@hifi (Contributor, Author) commented Dec 29, 2016

InputStream is now implemented. It is now possible to fully control the process execution and implement things like git+ssh or rsync+ssh on top of SSH.NET.

@drieseng (Member)

Great contributions.
I'll review them in detail ASAP.
Can you start by adding more unit tests for PipeStream?
I was planning to use a linked-list implementation for PipeStream, but that should not affect the behavior (and as such, it should not affect the tests you add).
I'll try to find time for that, and based on performance tests we'll either use that implementation or yours.
OK?

@hifi (Contributor, Author) commented Dec 29, 2016

I didn't see any reason to buffer data in the pipe, as that only increases memory use without improving throughput. My PipeStream implementation keeps a single window of data and expects someone to read it; writing blocks until the window has been completely read out.

The absolute best way to handle the incoming data from SSH would be to (optionally) drop the streams and use events instead, as that would completely eliminate read blocking and the unnecessary threading/locking it causes.

I'm not 100% sure what the expected behavior of PipeStream is: there was a non-blocking exception mechanism that I removed, and there's some confusion in my commits around disposing the streams and letting the last buffer be read out. Flushing was removed entirely, for instance, because I didn't see the need for it in the simple version of PipeStream, but that might not be the right move.

Regarding performance, I have some doubts a linked list would be any faster: it increases the amount of work to be done when a reader is ready to empty the pipe at all times (as with Stream.CopyTo()), so the list would mostly contain a single element. It would also require supporting partial reads of a written byte buffer, which adds complexity.

A linked list would also change the behavior from (my) blocking back to non-blocking, allowing the writer to exhaust memory if the reader is slow for any reason, whereas usually the transfer speed is governed by the complete blocking chain.

The original PipeStream implementation resulted in around 50-55% CPU use, while mine stayed around 20-30%. Throughput during my test was a constant 1 Gbit/s with both implementations, with OpenSSH hitting the local network limit.
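The single-window design described above can be sketched roughly like this (a simplified, hypothetical illustration, not the actual PR code): the pipe holds at most one pending buffer, and Write does not return until a reader has drained it.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of a single-window blocking pipe: one pending
// buffer at a time; Write blocks until a reader has consumed it all.
public sealed class SingleWindowPipe
{
    private readonly object _sync = new object();
    private byte[] _window;      // current pending data, null when empty
    private int _pos;            // read position within _window
    private bool _eof;

    public void Write(byte[] buffer, int offset, int count)
    {
        lock (_sync)
        {
            // Wait until any previous window has been fully read out.
            while (_window != null)
                Monitor.Wait(_sync);

            _window = new byte[count];
            Buffer.BlockCopy(buffer, offset, _window, 0, count);
            _pos = 0;
            Monitor.PulseAll(_sync);

            // Block the writer until the reader empties this window.
            while (_window != null)
                Monitor.Wait(_sync);
        }
    }

    public int Read(byte[] buffer, int offset, int count)
    {
        lock (_sync)
        {
            while (_window == null && !_eof)
                Monitor.Wait(_sync);
            if (_window == null)
                return 0; // EOF

            int n = Math.Min(count, _window.Length - _pos);
            Buffer.BlockCopy(_window, _pos, buffer, offset, n);
            _pos += n;
            if (_pos == _window.Length)
            {
                _window = null; // window drained, release the writer
                Monitor.PulseAll(_sync);
            }
            return n;
        }
    }

    public void Close()
    {
        lock (_sync) { _eof = true; Monitor.PulseAll(_sync); }
    }
}
```

With a Stream.CopyTo-style reader always ready, the writer is released almost immediately, which is why the strict blocking costs little in practice.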

@drieseng (Member)

I think we can agree that the original implementation left "some" room for improvement.
However, I don't think we should block writing until the buffer has been read.
Your implementation may be the fastest and cover your specific use case, but let's see if we can support more use cases (without sacrificing all too much performance).

@hifi (Contributor, Author) commented Dec 29, 2016

Streams are supposed to be blocking (right?) so the default being blocking would be something I'd expect. The Result property and EndExecute method both empty out the stream properly with my PipeStream implementation after shuffling SshCommand a bit so it should work even with the blocking behavior.

Making it all not block would be best done by having events as it removes the pipes completely and if no one is listening then the data is simply just lost. Blocking the writer when not using events looks like the right call to me.

The big issue is handling reading both stdout and stderr simultaneously. They can both become large if buffered in-memory and no one is reading them. The simple solution for the user of the api would be to dispose the extended stream immediately to prevent it from blocking if the process writes to stderr and the user does not care about it.
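With blocking semantics, the usual caller-side answer to the stderr problem is the standard drain-on-a-background-thread pattern; a generic Stream sketch (illustrative names, not SSH.NET API):

```csharp
using System.IO;
using System.Threading;

// General pattern: drain a stream you don't care about on a background
// thread so the producer never blocks waiting for someone to read it.
public static class StreamDrain
{
    public static Thread DrainInBackground(Stream s)
    {
        var t = new Thread(() =>
        {
            var scratch = new byte[4096];
            // Read and discard until the stream reports EOF.
            while (s.Read(scratch, 0, scratch.Length) > 0) { }
        }) { IsBackground = true };
        t.Start();
        return t;
    }
}
```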

In summary, changing the api of SshCommand and/or adding another command class for blocking use might be worth a shot to support different use cases.

I'm a little annoyed when people use the TTY method to run simple commands (with the Expect kludge), when SshCommand is usually the right tool for anything non-interactive. Getting it user-friendly and performant at the same time is something I have a lot of interest in.

@hifi (Contributor, Author) commented Dec 29, 2016

I thought about the blocking problem more and I think a simple works-for-everyone solution would be to make SshCommand export the streams as PipeStream instead of plain Stream and have a property for switching between blocking and non-blocking (buffering) mode.

Depending on your performance testing: if a linked list is not significantly slower than my raw byte[] approach when a reader is always available, switching between blocking and non-blocking mode would just add a lock on the writer while there's something in the list; alternatively, it could switch between a list and a single byte[] buffer if there's a major performance difference.

I can see this benefiting my use case as well: I could keep OutputStream blocking but let ExtendedOutputStream buffer everything in memory when I'm not expecting much of it. Buffering on all streams would also make a nice default, with the property exposed to switch modes on demand.

What do you think?

@drieseng (Member)

I'm still inclined to go for the buffered (non-blocking write, blocking read) approach.
However, give me some time to look into it.
Meanwhile, do you have (more or less self-contained) integration tests that you use for testing both input and output streams?

@hifi (Contributor, Author) commented Dec 29, 2016

You can execute 'cat' on the remote system and loop everything back through it, from InputStream to OutputStream. I used some random binary data (files), hashed it after piping it through, and measured the CPU time while it was running.
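The integrity check can be sketched like this; a MemoryStream stands in for the remote 'cat' here, since the real test would wire the InputStream and OutputStream of an SshCommand running cat (stream names per this PR's proposed API):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Sketch of the loopback verification idea: hash the payload before
// and after it round-trips through a pipe. A seekable MemoryStream
// stands in for the remote 'cat' loopback in this illustration.
public static class LoopbackCheck
{
    public static bool RoundTripsIntact(byte[] payload, MemoryStream loopback)
    {
        loopback.Write(payload, 0, payload.Length);
        loopback.Position = 0;

        var echoed = new MemoryStream();
        loopback.CopyTo(echoed);

        using (var sha = SHA256.Create())
        {
            // Compare SHA-256 digests of the original and echoed data.
            return Convert.ToBase64String(sha.ComputeHash(payload))
                == Convert.ToBase64String(sha.ComputeHash(echoed.ToArray()));
        }
    }
}
```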

@hifi (Contributor, Author) commented Dec 30, 2016

https://gist.github.com/hifi/9213b8981a938be7fd2dda4790a40dfb is an alternative PipeStream that queues everything. It seems to be about as heavy as the original implementation (on Mono), but the code is a bit cleaner. The reader blocks but the writer doesn't.
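In outline, the queueing variant looks something like this (a simplified sketch, not the gist's actual code): writes copy the buffer into a queue and return immediately, while reads block until data arrives or the pipe is closed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Sketch of the buffering variant: writes enqueue copies and never
// block; reads block until data is available or the pipe is closed.
public sealed class BufferingPipe
{
    private readonly Queue<byte[]> _chunks = new Queue<byte[]>();
    private readonly object _sync = new object();
    private byte[] _current;  // chunk currently being read out
    private int _pos;
    private bool _closed;

    public void Write(byte[] buffer, int offset, int count)
    {
        var copy = new byte[count];
        Buffer.BlockCopy(buffer, offset, copy, 0, count);
        lock (_sync)
        {
            _chunks.Enqueue(copy);
            Monitor.Pulse(_sync);
        }
    }

    public int Read(byte[] buffer, int offset, int count)
    {
        lock (_sync)
        {
            while (_current == null && _chunks.Count == 0 && !_closed)
                Monitor.Wait(_sync);
            if (_current == null && _chunks.Count == 0)
                return 0; // closed and fully drained

            if (_current == null) { _current = _chunks.Dequeue(); _pos = 0; }
            int n = Math.Min(count, _current.Length - _pos);
            Buffer.BlockCopy(_current, _pos, buffer, offset, n);
            _pos += n;
            if (_pos == _current.Length) _current = null;
            return n;
        }
    }

    public void Close()
    {
        lock (_sync) { _closed = true; Monitor.PulseAll(_sync); }
    }
}
```

The trade-off versus the single-window version is exactly the one debated above: the writer is never throttled, so a slow reader lets the queue grow without bound.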

@hifi (Contributor, Author) commented Dec 30, 2016

More testing reveals that the buffering version I gisted has almost no performance impact compared to the blocking one in this pull request, making it the clear winner since it supports both workflows.

On Mono, however, the buffering implementation is CPU-intensive, but that's something that needs to be tackled on the Mono side.

@hifi (Contributor, Author) commented Dec 31, 2016

Here are some actual benchmarks of the two implementations on three different runtimes: https://github.com/hifi/PipeBench

@drieseng (Member)

Let me start by saying: thanks for all the effort you've put into this (while getting nothing but semi-negative feedback from me). I don't have much time to spend on this right now (I've got two branches called Sofie and Lisa that require attention).

Usage of BlockingCollection is a no-go, since it's only available as of .NET 4.
I think performance can be improved (and allocations reduced) by using a linked-list approach.
I don't like the fact that the buffer size is limited to 64 KB.

Feel free to disregard these "negative" remarks until I have time to work with you on this.

@hifi (Contributor, Author) commented Jan 1, 2017

Technically, BlockingCollection is exactly what you asked for: it allows the writer to write without blocking (unless a reader blocks it). I'll see if I can find an alternative solution.

There is no buffer size limitation in my BlockingCollection approach, as it copies the write buffers as-is into the collection. I can replace BlockingCollection with a home-grown solution for better backwards compatibility; it might also help with performance on Mono.

Personal branches always before GitHub. Take care!

@hifi (Contributor, Author) commented Jan 1, 2017

Added an implementation of BufferingPipeStream to the benchmark that does its own locking around a Queue. It is the fastest implementation of them all on Microsoft runtimes.

@drieseng (Member) commented Jan 2, 2017

Apart from both memory and performance optimizations, one of the most important changes in the new implementation is that Read(byte[] buffer, int offset, int count) will block until either the requested number of bytes is available or the stream is closed.
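For contrast, that guarantee is exactly the loop that callers of a conventional Stream.Read (which may return fewer bytes than requested) have to write themselves; roughly:

```csharp
using System.IO;

// The new contract, expressed as the loop callers previously needed:
// keep reading until 'count' bytes arrive or the stream signals EOF.
public static class StreamExtras
{
    public static int ReadExactly(Stream s, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = s.Read(buffer, offset + total, count - total);
            if (n == 0) break; // stream closed before 'count' bytes arrived
            total += n;
        }
        return total;
    }
}
```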

Before we can merge this new implementation, we should:

  • Add unit tests that fully cover the new implementation.

  • Review the places in our codebase where we use PipeStream and (may) count on it blocking only until some data is available.

  • Decide whether to use PipeStream.Dispose() to mark the stream as EOF, or introduce a separate (internal) method for this purpose.

  • Modify our codebase to use the appropriate mechanism to signal that no more data is available.

  • Fix unit tests that are broken due to the new behavior.

  • Ensure we document the public-facing part of this breaking change for the next release, and provide a clear indication where users may be confronted with this change and how they can or should deal with it.

@drieseng (Member) commented Jan 2, 2017

@hifi What do you think about this: should we use PipeStream.Dispose() to mark the stream as EOF, or introduce a separate (internal) method for this purpose?

@hifi (Contributor, Author) commented Jan 3, 2017

@drieseng The Read() method is allowed to throw ObjectDisposedException by definition so it might be better to set EOF with an internal method.

@hifi (Contributor, Author) commented Jan 4, 2017

@drieseng Would an empty write do it? A write with zero count would queue normally, but when the reader hits that empty buffer it would also return, without ever dequeuing it from the linked list?

@hifi (Contributor, Author) commented Jan 4, 2017

Actually, I had a discussion about streams on IRC and my eyes were opened: in C#, a Stream instance should only handle one direction at a time. A pipe should consist of two streams, where closing the writer stream also signals EOF. We should do that instead.

@drieseng (Member) commented Jan 4, 2017

Yeah, we could go for a PipeInputStream/PipeOutputStream approach.
I don't like the fact that in .NET a stream exposes both read and write members.

@hifi (Contributor, Author) commented Jan 4, 2017

https://gist.github.com/hifi/3882f07f9064ce87e1a43d27a83888b3

I haven't even run this once, it might be overkill, and I may have refactored too much, but in this design you create a Pipe that has two public properties, InputStream and OutputStream. You can then decide who gets which; each end should only allow the operations you're supposed to perform, and closing InputStream would signal EOF.

Tried to share as much code as possible between the streams.
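In outline, the two-stream shape looks like this (hypothetical names using the Java-style convention settled on later in the thread, and a byte-at-a-time queue purely for brevity): the Pipe owns the shared state, hands out a write-only and a read-only stream, and disposing the writer's end signals EOF.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Hypothetical sketch of the two-stream pipe design; not the gist code.
public sealed class Pipe
{
    private readonly Queue<byte> _data = new Queue<byte>();
    private readonly object _sync = new object();
    private bool _closed;

    public Stream OutputStream { get; }  // writer's end (write-only)
    public Stream InputStream { get; }   // reader's end (read-only)

    public Pipe()
    {
        OutputStream = new WriterEnd(this);
        InputStream = new ReaderEnd(this);
    }

    private sealed class WriterEnd : Stream
    {
        private readonly Pipe _p;
        public WriterEnd(Pipe p) { _p = p; }

        public override void Write(byte[] buffer, int offset, int count)
        {
            lock (_p._sync)
            {
                for (int i = 0; i < count; i++) _p._data.Enqueue(buffer[offset + i]);
                Monitor.PulseAll(_p._sync);
            }
        }

        protected override void Dispose(bool disposing)
        {
            // Closing the writer marks EOF for the reader.
            lock (_p._sync) { _p._closed = true; Monitor.PulseAll(_p._sync); }
            base.Dispose(disposing);
        }

        public override bool CanRead => false;
        public override bool CanWrite => true;
        public override bool CanSeek => false;
        public override void Flush() { }
        public override int Read(byte[] b, int o, int c) => throw new NotSupportedException();
        public override long Seek(long o, SeekOrigin s) => throw new NotSupportedException();
        public override void SetLength(long v) => throw new NotSupportedException();
        public override long Length => throw new NotSupportedException();
        public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    }

    private sealed class ReaderEnd : Stream
    {
        private readonly Pipe _p;
        public ReaderEnd(Pipe p) { _p = p; }

        public override int Read(byte[] buffer, int offset, int count)
        {
            lock (_p._sync)
            {
                while (_p._data.Count == 0 && !_p._closed)
                    Monitor.Wait(_p._sync);
                int n = Math.Min(count, _p._data.Count);
                for (int i = 0; i < n; i++) buffer[offset + i] = _p._data.Dequeue();
                return n; // 0 means the writer was disposed (EOF)
            }
        }

        public override bool CanRead => true;
        public override bool CanWrite => false;
        public override bool CanSeek => false;
        public override void Flush() { }
        public override void Write(byte[] b, int o, int c) => throw new NotSupportedException();
        public override long Seek(long o, SeekOrigin s) => throw new NotSupportedException();
        public override void SetLength(long v) => throw new NotSupportedException();
        public override long Length => throw new NotSupportedException();
        public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    }
}
```

Handing OutputStream to the producer and InputStream to the consumer keeps each side honest: neither end can perform the operations it isn't supposed to.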

@drieseng (Member) commented Jan 4, 2017

I think I'd prefer to split PipeStream into PipeInputStream and PipeOutputStream, and maybe have PipeInputStream "read" from PipeOutputStream (not using the Stream.Read(byte[] buffer, int offset, int count) method on PipeOutputStream of course)

@drieseng (Member) commented Jan 4, 2017

Want me to propose something? I think I'll have time tomorrow or Friday.

@hifi (Contributor, Author) commented Jan 5, 2017

You got the streams mixed up, but sure, propose something so we have things to compare. One design should stand out as the cleanest and easiest to understand.

I thought about splitting the pipe into two streams, but the "ownership" of the data seemed odd to me if one of them had some sort of authority. That's why I used a nested class to hold the state/data and shared it between the two ends. They could be split into two different classes instead of a single class that plays one role depending on how it's initialized.

I'll keep experimenting with the design as well.

@drieseng (Member) commented Jan 5, 2017

I don't think I mixed up the streams, but feel free to correct me.
Those are the names used in Java.
An InputStream in Java is actually a read-only stream.

I won't have time to work on this today :(

@hifi (Contributor, Author) commented Jan 6, 2017

So it was me who got them mixed up. I'm looking into this today to get them separated and properly named.

@hifi (Contributor, Author) commented Jan 11, 2017

Since SshCommand is essentially a remote System.Diagnostics.Process, why wouldn't we mimic that API as much as we can? It's something people are already used to, and it provides everything required.

Edit: What I mean is that we could instead deprecate SshCommand itself and create an SshProcess that's API-compatible with Process. This wouldn't break existing usage in the next release, but would provide something to transition to.
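Purely as a hypothetical sketch of that idea (none of these members exist in SSH.NET; the names simply mirror System.Diagnostics.Process):

```csharp
using System.IO;

// Hypothetical shape for the SshProcess proposal: mirror the parts of
// System.Diagnostics.Process that matter for remote execution.
public interface ISshProcess
{
    StreamWriter StandardInput { get; }   // like Process.StandardInput
    StreamReader StandardOutput { get; }  // like Process.StandardOutput
    StreamReader StandardError { get; }   // like Process.StandardError
    int ExitCode { get; }
    bool HasExited { get; }
    void WaitForExit();
}

// A trivial in-memory fake, just to show the intended call pattern.
public sealed class FakeSshProcess : ISshProcess
{
    public StreamWriter StandardInput { get; } = new StreamWriter(new MemoryStream());
    public StreamReader StandardOutput { get; } =
        new StreamReader(new MemoryStream(System.Text.Encoding.ASCII.GetBytes("hello\n")));
    public StreamReader StandardError { get; } = new StreamReader(new MemoryStream());
    public int ExitCode => 0;
    public bool HasExited => true;
    public void WaitForExit() { }
}
```

As suggested later in the thread, such a class could be unit tested by running the same command locally through Process and comparing results.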

@21region

Will it be possible to emulate the following command with this new functionality?

cat file.tar.gz | ssh user@server 'tar -C /path/to/dest/dir -xjf -'

This capability is very much desired in our project. For now we are planning to implement it ourselves, but it would be great to see it in an official build of SSH.NET.

@hifi (Contributor, Author) commented Jan 11, 2017

That is one possible use case for this PR with stdin support.

@drieseng (Member)

@hifi The little time I had this week, I've been working on issue #100. I'm making good progress there, so I should soon be able to pick up work on this issue.

@drieseng (Member)

@hifi Can we move back to using your PipeBench in order to arrive at a stable API that we both feel comfortable with?

I could also use your help on the performance fix I'm preparing for issue #100.
Note that I still need to commit or submit a PR for this.

@hifi (Contributor, Author) commented Jan 18, 2017

Sure. PipeBench needs a little cleaning to make it compatible with the double-stream approach, which should be the one we aim for.

@drieseng (Member) commented Jan 25, 2017

@hifi Could you have a look at performance changes I committed for SFTP download? More specifically, please have a look at SftpFileReader and tell me what you would change or how you would've implemented it.

There's still quite some work left to do in that class, mostly around exception handling, but more importantly around cancellation (e.g. how do you cancel a blocking wait on a semaphore without a CancellationToken?). I've been thinking about having our own implementation of CancellationToken for the frameworks we target that don't provide one, but I'm sure there are simpler solutions ...

@alanmcgovern

Out of curiosity, what needs to be done to get this PR completed? I love the idea of exposing a RemoteProcess type to mimic System.Diagnostics.Process. That approach would clean up some of my code nicely!

@drieseng (Member)

I think we kind of abandoned the process idea, and were discussing the design of the PipeStream.
I proposed to first finish it in @hifi's PipeBench and then integrate it into SSH.NET. Since then I mostly worked on other issues.

I first want to release 2016.1.0-beta2 (about time, it seems) before I continue working on this issue, but I'd gladly welcome you to join the discussion (and - if possible - help with the implementation).

@hifi (Contributor, Author) commented Feb 14, 2017

Did we abandon the process idea? I just proposed it a few messages up. The idea of making it look like a remote System.Diagnostics.Process does intrigue me, if it makes sense to people.

I haven't been active around this project for a few weeks now but I'll get back to it.

@drieseng (Member)

"Abandon" was perhaps a bit too strongly worded. I thought we discussed it and decided to at least postpone it (and instead focus first on the PipeStream design and implementation). But I can't find any trace of that, so I'm probably mistaken. My bad.

Personally, I'd still prefer to move forward like this:

  • Design and implement the PipeStream (including unit tests)
  • Analyze the impact of replacing the current PipeStream implementation.
  • If this would cause too many breaking changes, then we MUST introduce a new remote-process concept.
  • If not, we integrate the new implementation into SSH.NET and document any change in behavior.

I'd happily discuss and design the remote-process concept (even if the new PipeStream implementation does not introduce too many breaking changes), but I know my time is limited (even though I already spend way too much time on SSH.NET).

If you guys have enough time available, then please create a new issue for this; I'll participate in the discussions on a best-effort basis.

@mloefler

All the above seems to be quite demanding (timewise).
So I have some quick suggestions:

  • Make OutputStream protected in SshCommand, since at the end of the
    execution it is always empty
  • Add an OnData event to SshCommand so I can implement my own
    streaming, optionally avoiding filling the OutputStream, so memory
    consumption is no longer an issue.

(Sorry, new here, but I just wasted quite some time figuring out why OutputStream never returns anything :-) )

@hifi (Contributor, Author) commented Apr 10, 2017

I kind of dropped the ball on this PR and moved on to other things. I'd still like to get it going but need to find the motivation again. Regardless, the performance is excellent and could even surpass SFTP reads, given the performance issues SFTP had.

@drieseng (Member)

I'd also like this and other PRs to get going again, but I have too little time and would like more interaction/discussion with other devs (e.g. on issue #194).

@mloefler

I would not be able to help on those issues, but I will try to do a few things (since I need them anyway):

  • Add an OnData event to SshCommand, which triggers from inside Channel_DataReceived. This enables me to implement my own data handling and/or streaming.
  • Add a class to 'stream' lines based on input from OnData.
  • Modify the OutputStream getter to throw if the stream has already been consumed.

I can make the changes available.
(Background: my ssh command extracts large sets of data that I need to process. Large meaning it could cause a memory issue if all of it resides in memory.)

@thenadz commented May 13, 2017

All, any word on this PR? A working OutputStream for SshCommand would be really nice. Right now I've hacked together some code wrapping OutputStream to force the necessary blocking as a temporary workaround on my end. If this PR is too large in scope to land in the near future, would you consider putting something out just to address this smaller chunk of needed (and easily addressed) functionality?

@drieseng mentioned this pull request Sep 11, 2017
@WalkerNA commented Jan 8, 2018

Any progress on this PR? Our project has a need that mimics 21region's example code:

cat file.tar.gz | ssh user@server 'tar -C /path/to/dest/dir -xjf -'

@hifi (Contributor, Author) commented Jan 9, 2018

I think the easiest way is to create something with an API similar to System.Diagnostics.Process and try it out like that, instead of trying to fix the old SshCommand. The old class could then be slowly deprecated, and there would be no need to maintain compatibility.

Unit testing the proposed SshProcess(?) class could be done by comparing local execution with Process.

I haven't made any progress on this yet, so it's open to be picked up, as long as there's a pre-accepted solution so no one ends up doing more non-mergeable work than I already did.

@masaeedu commented Jan 22, 2018

I don't know if this is going to work, but I'm cornered enough that I'm now going to attempt the following:

  • Create 3 temp files on the remote system
  • Use UploadFile and DownloadFile with SFTP to obtain input/output streams that actually work in a sensible way
  • Use SshCommand to start whatever I actually wanted to run under a shell instance, with I/O file descriptors redirected to these files. Totally ignore any streams SshCommand provides

Wish me luck 😄

@ghjkl799 commented Mar 9, 2020

Any progress on this PR? Our project has a need that mimics 21region's example code:

cat file.tar.gz | ssh user@server 'tar -C /path/to/dest/dir -xjf -'

Hi

Looking for the same.
Any update on progress?

@elgonzo commented Apr 3, 2020

Any progress on this would be most appreciated.

Just recently, I was trying to utilize blocking reads with SSH.NET's output PipeStream, but it seems the logic associated with the PipeStream.BlockLastReadBuffer property is either unintentional/broken or not really useful. If I want a blocking read of X bytes, why would the PipeStream.ReadAvailable(int count) method insist on blocking until X+1 bytes have been received?

I am not sure whether it would make sense to file an issue report with regard to this (provided I cannot find an equivalent report, of course), considering the issue would probably disappear (or, in the worst case, morph into a different one) if the changes discussed here were implemented in SSH.NET...

@Livius90 commented Sep 4, 2022

When will it be available? I would like to send disk image data from C# to the stdin pipe of dd over SSH, like in this example.

I would like to do the same thing that works on Linux:

# dd bs=16M if=/dev/sda | ssh root@serverB "dd bs=16M of=/dev/sda"

@WojciechNagorski (Collaborator) commented Sep 15, 2022

There is a fork that has this feature:
https://www.nuget.org/packages/TrapTech.SSH.NET
https://github.com/TrapTech/SSH.NET/
But I wish I could use the official package.

@Livius90

@Rob-Hague (Collaborator)

SshCommand.CreateInputStream was added in version 2024.0.0 in #1293 (https://sshnet.github.io/SSH.NET/examples.html#stream-data-to-a-command) and PipeStream was rewritten in #1399. I'll put this one to rest.
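For later readers, the shipped API looks roughly like this (a sketch adapted from the linked examples page; the host, credentials and the 'cat' command are placeholders, and it needs the Renci.SshNet package plus a reachable server, so treat the details as approximate):

```csharp
using System;
using System.IO;
using System.Text;
using Renci.SshNet;

using (var client = new SshClient("host", "user", "password"))
{
    client.Connect();
    using (SshCommand command = client.CreateCommand("cat"))
    {
        IAsyncResult asyncResult = command.BeginExecute();

        // Stream data to the remote command's stdin; disposing the
        // input stream signals EOF to the remote process.
        using (Stream inputStream = command.CreateInputStream())
        {
            byte[] data = Encoding.UTF8.GetBytes("Hello world!");
            inputStream.Write(data, 0, data.Length);
        }

        command.EndExecute(asyncResult);
        Console.WriteLine(command.Result); // expect the input echoed back by 'cat'
    }
}
```

This is the shipped equivalent of the `cat file | ssh ... 'tar ...'` and `dd | ssh ... dd` use cases asked about earlier in the thread.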

@Rob-Hague closed this May 23, 2024