Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 122 streaming operation #125

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

j4m3z0r
Copy link
Contributor

@j4m3z0r j4m3z0r commented Feb 8, 2018

Modified IndexedCapturingReader that only reads as much of the file as is necessary.

…echanism if the stream doesn't support the Length operation. Addresses drewnoakes#122 by allowing for streamed parsing of metadata.
try
{
// For some reason, FileStreams are faster in contiguous mode. Since this is such a a commont case, we
// specifically check for it.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the underlying OS pre-fetches data. Many read operations are sequential. In fact, when you create a FileStream there's a ctor overload that lets you specify Sequential vs RandomAccess (IIRC) which presumably hints the platform of how the reader will behave. For the places that we create FileStream we could experiment there to see if it matters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so what I didn't understand is why the performance would have changed: in the case where it previously pre-loaded the whole file by reading sequentially (on FileStreams), we still do that. Likewise, the cost of the lookup was all but eliminated by memoizing the last looked up chunk (only something like 5% of calls ended up requesting a chunk other than the last one it returned). But the new code is still ~50% slower at running the test-suite.

Alas I am developing on a Mac, and the Xamarin-based toolchain here doesn't have a robust profiler (nor do I have VS Ultimate (or whatever it's called) on my Windows machine). If you do have access to a profiler, I'd be very interested to know what you see.

@drewnoakes
Copy link
Owner

This looks very cool and I'm keen to take it for a proper spin and learn more about it.

The failing unit tests suggest there are still some pending issues though. Have you looked into the reported problems there?

The comments you've added mention a perf hit relating to dictionary lookups. If data's mostly read sequentially then perhaps one approach is to have the dictionary's values be a Chunk type that allows linking without needing to do another lookup:

struct Chunk
{
    public byte[] Bytes { get; }
    /// <summary>Gets and sets the subsequent and adjacent chunk, if loaded.</summary>
    public Chunk? Next { get; set; }
}

I haven't thought this through to deeply. Just throwing the idea out there.

// If we know the length of the stream ahead of time, we can allocate a Dictionary with enough slots
// for all the chunks. We 2X it to try to avoid hash collisions.
var chunksCapacity = 2 * (_stream.Length / chunkLength);
_chunks = new Dictionary<int, byte[]>((int) chunksCapacity);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most data streams we would be unlikely to load all the chunks in the file, so perhaps preallocating enough buckets for the entire file is unnecessary and may even hurt performance by spreading entries across more cache lines than are strictly needed. We'd need to measure performance to be sure one way or another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added this after doing a series of performance tests to ensure it would be a win. It's relatively small (~5%) but it was measurable and repeatable.

@drewnoakes
Copy link
Owner

Would also be good to create some more unit tests that drive the class with stream implementations with different Length and CanSeek values.

@j4m3z0r
Copy link
Contributor Author

j4m3z0r commented Feb 12, 2018

Seems like I can't respond to all of your comments inline, so let me answer here:

  • No, I haven't yet looked into unit-test failures. It's quite possible that they are the cause of the performance regression.
  • Regarding separately testing Length and CanSeek separately, that seems doable. Inherit from MemoryStream and overload, or something. I'll put it on my list.
  • What's the best workflow for investigating unit-test failures? I was a bit confused by the contribution guidelines, which seemed to suggest that I could see failures by running git diff on the images repo, but that showed differences even on a build from before these changes.

At present this branch solves the initial problem I set out to solve (it can now process raw files from the network without reading in the entire stream; a ~60X performance improvement for my use-case), so it will be at least a few days before I can look into this again.

@drewnoakes
Copy link
Owner

Inherit from MemoryStream and overload

Yep, this is pretty much what I was thinking. Shouldn't be too hard. Can write one set of assertions across a single data stream, then have code test it in all permutations of settings to validate all modes operate correctly.

The test failures suggest that the new reader implementation is producing incorrect values. Did you compare output before/after this change on any files? This may only trigger for some kinds of streams.

What's the best workflow for investigating unit-test failures? I was a bit confused by the contribution guidelines, which seemed to suggest that I could see failures by running git diff on the images repo, but that showed differences even on a build from before these changes.

The unit tests are stand-alone and don't require the image repo. I think you said you're using Rider, so it should be possible to run them directly there. IIRC you just right click the tests project in the solution view.

As for confusion regarding the images repo, you're right that it's confusing. I've opened an issue regarding this.

…s repeats the IndexedCapturingReaderTests, for each of the permutations of seekable and length supported.
@j4m3z0r
Copy link
Contributor Author

j4m3z0r commented Apr 16, 2018

Hi there, quick ping on this -- is there anything further needed here?

@drewnoakes
Copy link
Owner

Hi James, this hasn't fallen off my radar but I have been busy elsewhere and this PR will take more than just a few minutes to review and think through properly. It's a core change and it seems great but I still want to give it a proper look through. I will also consider the approach in the context of the Java implementation, as it'd be nice to keep parity as much as possible. Thanks for your patience.

@j4m3z0r
Copy link
Contributor Author

j4m3z0r commented Apr 26, 2018

No problem, Drew. Thanks for the note and for being a diligent maintainer. I’ve done my best to make sure that it’s right, but I have a lot less experience with his codebase than you do, so I appreciate the attention to detail.

This was referenced May 2, 2018
@j4m3z0r
Copy link
Contributor Author

j4m3z0r commented Jul 10, 2018

Hi there, just checking in on this. Totally understand if you're still swamped; just want to make sure it doesn't bit-rot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants