Issue 122 streaming operation #125

j4m3z0r · 2018-02-08T23:27:05Z

Modified IndexedCapturingReader that only reads as much of the file as is necessary.

…echanism if the stream doesn't support the Length operation. Addresses drewnoakes#122 by allowing for streamed parsing of metadata.

… comment.

…file unless it's needed.

drewnoakes · 2018-02-11T22:38:08Z

MetadataExtractor/IO/IndexedCapturingReader.cs

+            try
+            {
+                // For some reason, FileStreams are faster in contiguous mode. Since this is such a a commont case, we
+                // specifically check for it.


I think the underlying OS pre-fetches data. Many read operations are sequential. In fact, when you create a FileStream there's a ctor overload that lets you specify Sequential vs RandomAccess (IIRC) which presumably hints the platform of how the reader will behave. For the places that we create FileStream we could experiment there to see if it matters.

Right, so what I didn't understand is why the performance would have changed: in the case where it previously pre-loaded the whole file by reading sequentially (on FileStreams), we still do that. Likewise, the cost of the lookup was all but eliminated by memoizing the last looked up chunk (only something like 5% of calls ended up requesting a chunk other than the last one it returned). But the new code is still ~50% slower at running the test-suite.

Alas I am developing on a Mac, and the Xamarin-based toolchain here doesn't have a robust profiler (nor do I have VS Ultimate (or whatever it's called) on my Windows machine). If you do have access to a profiler, I'd be very interested to know what you see.

drewnoakes · 2018-02-11T22:49:42Z

This looks very cool and I'm keen to take it for a proper spin and learn more about it.

The failing unit tests suggest there are still some pending issues though. Have you looked into the reported problems there?

The comments you've added mention a perf hit relating to dictionary lookups. If data's mostly read sequentially then perhaps one approach is to have the dictionary's values be a Chunk type that allows linking without needing to do another lookup:

struct Chunk
{
    public byte[] Bytes { get; }
    /// <summary>Gets and sets the subsequent and adjacent chunk, if loaded.</summary>
    public Chunk? Next { get; set; }
}

I haven't thought this through to deeply. Just throwing the idea out there.

drewnoakes · 2018-02-11T22:52:16Z

MetadataExtractor/IO/IndexedCapturingReader.cs

+                // If we know the length of the stream ahead of time, we can allocate a Dictionary with enough slots
+                // for all the chunks. We 2X it to try to avoid hash collisions.
+                var chunksCapacity = 2 * (_stream.Length / chunkLength);
+                _chunks = new Dictionary<int, byte[]>((int) chunksCapacity);


For most data streams we would be unlikely to load all the chunks in the file, so perhaps preallocating enough buckets for the entire file is unnecessary and may even hurt performance by spreading entries across more cache lines than are strictly needed. We'd need to measure performance to be sure one way or another.

I only added this after doing a series of performance tests to ensure it would be a win. It's relatively small (~5%) but it was measurable and repeatable.

drewnoakes · 2018-02-11T22:58:05Z

Would also be good to create some more unit tests that drive the class with stream implementations with different Length and CanSeek values.

j4m3z0r · 2018-02-12T17:31:33Z

Seems like I can't respond to all of your comments inline, so let me answer here:

No, I haven't yet looked into unit-test failures. It's quite possible that they are the cause of the performance regression.
Regarding separately testing Length and CanSeek separately, that seems doable. Inherit from MemoryStream and overload, or something. I'll put it on my list.
What's the best workflow for investigating unit-test failures? I was a bit confused by the contribution guidelines, which seemed to suggest that I could see failures by running git diff on the images repo, but that showed differences even on a build from before these changes.

At present this branch solves the initial problem I set out to solve (it can now process raw files from the network without reading in the entire stream; a ~60X performance improvement for my use-case), so it will be at least a few days before I can look into this again.

drewnoakes · 2018-02-12T20:22:56Z

Inherit from MemoryStream and overload

Yep, this is pretty much what I was thinking. Shouldn't be too hard. Can write one set of assertions across a single data stream, then have code test it in all permutations of settings to validate all modes operate correctly.

The test failures suggest that the new reader implementation is producing incorrect values. Did you compare output before/after this change on any files? This may only trigger for some kinds of streams.

What's the best workflow for investigating unit-test failures? I was a bit confused by the contribution guidelines, which seemed to suggest that I could see failures by running git diff on the images repo, but that showed differences even on a build from before these changes.

The unit tests are stand-alone and don't require the image repo. I think you said you're using Rider, so it should be possible to run them directly there. IIRC you just right click the tests project in the solution view.

As for confusion regarding the images repo, you're right that it's confusing. I've opened an issue regarding this.

…s repeats the IndexedCapturingReaderTests, for each of the permutations of seekable and length supported.

…he test cases.

j4m3z0r · 2018-04-16T21:00:58Z

Hi there, quick ping on this -- is there anything further needed here?

drewnoakes · 2018-04-26T17:10:44Z

Hi James, this hasn't fallen off my radar but I have been busy elsewhere and this PR will take more than just a few minutes to review and think through properly. It's a core change and it seems great but I still want to give it a proper look through. I will also consider the approach in the context of the Java implementation, as it'd be nice to keep parity as much as possible. Thanks for your patience.

j4m3z0r · 2018-04-26T17:14:03Z

No problem, Drew. Thanks for the note and for being a diligent maintainer. I’ve done my best to make sure that it’s right, but I have a lot less experience with his codebase than you do, so I appreciate the attention to detail.

j4m3z0r · 2018-07-10T15:04:50Z

Hi there, just checking in on this. Totally understand if you're still swamped; just want to make sure it doesn't bit-rot.

j4m3z0r added 3 commits February 7, 2018 11:32

Query underlying Stream object for Length, but fallback to original m…

01c6a2f

…echanism if the stream doesn't support the Length operation. Addresses drewnoakes#122 by allowing for streamed parsing of metadata.

Cache status of Length being supported on the file stream, per Drew's…

b6b61ff

… comment.

Initial version of IndexedCapturingReader that won't read the entire …

6f7e2f0

…file unless it's needed.

drewnoakes force-pushed the master branch from 457c501 to 4fadf7f Compare February 8, 2018 23:31

j4m3z0r mentioned this pull request Feb 8, 2018

Allow for streaming operation #122

Open

Merge branch 'master' into issue-122-streaming-operation

67bac12

drewnoakes reviewed Feb 11, 2018

View reviewed changes

j4m3z0r added 2 commits April 10, 2018 13:11

Add test cases for the streamable IO described in drewnoakes#122. Thi…

6160e1f

…s repeats the IndexedCapturingReaderTests, for each of the permutations of seekable and length supported.

Fixes a couple of issues with length handling that were revealed by t…

626355f

…he test cases.

This was referenced May 2, 2018

RandomAccessStream #130

Closed

RandomAccessStream + ReaderInfo #131

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue 122 streaming operation #125

Issue 122 streaming operation #125

j4m3z0r commented Feb 8, 2018

drewnoakes Feb 11, 2018

j4m3z0r Feb 12, 2018

drewnoakes commented Feb 11, 2018

drewnoakes Feb 11, 2018

j4m3z0r Feb 12, 2018

drewnoakes commented Feb 11, 2018

j4m3z0r commented Feb 12, 2018

drewnoakes commented Feb 12, 2018

j4m3z0r commented Apr 16, 2018

drewnoakes commented Apr 26, 2018

j4m3z0r commented Apr 26, 2018

j4m3z0r commented Jul 10, 2018

Issue 122 streaming operation #125

Are you sure you want to change the base?

Issue 122 streaming operation #125

Conversation

j4m3z0r commented Feb 8, 2018

drewnoakes Feb 11, 2018

Choose a reason for hiding this comment

j4m3z0r Feb 12, 2018

Choose a reason for hiding this comment

drewnoakes commented Feb 11, 2018

drewnoakes Feb 11, 2018

Choose a reason for hiding this comment

j4m3z0r Feb 12, 2018

Choose a reason for hiding this comment

drewnoakes commented Feb 11, 2018

j4m3z0r commented Feb 12, 2018

drewnoakes commented Feb 12, 2018

j4m3z0r commented Apr 16, 2018

drewnoakes commented Apr 26, 2018

j4m3z0r commented Apr 26, 2018

j4m3z0r commented Jul 10, 2018