Temporary files in outlier detection median computation should read faster #8774

Open
stscijgbot-jp opened this issue Sep 11, 2024 · 4 comments · May be fixed by #8782

Comments

@stscijgbot-jp
Collaborator

Issue JP-3741 was created on JIRA by Ned Molter:

Runtimes can be very long for outlier detection on large associations when the in_memory parameter is set to False.  When all models cannot be stored in memory at once, the median is computed in spatial sections as follows:

  • One model is loaded into memory
  • The requested spatial section is read from the model and stored in memory somewhere else
  • The model is deleted from memory
  • This is repeated for all the models until the same spatial section has been read from all images
  • The median over that group is computed and stored
  • The group is discarded
  • The process is repeated for all sections

As currently written, this operation requires a number of load operations equal to n_sections * n_models, which becomes very large, especially for small sections. Because a datamodel is opened to read each section, these loads are extremely inefficient: they pull in arrays that are not needed for the median (e.g. the error and dq arrays), and schema validation runs on every open.
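
For illustration, a minimal sketch of that sectioned loop is below. The helper names are hypothetical (`open_model` and `model_paths` stand in for the step's actual datamodel handling), and the real implementation differs in detail:

```python
import numpy as np

def sectioned_median(model_paths, open_model, shape, n_sections):
    """Median across models, computed one spatial section at a time."""
    ny, nx = shape
    section_height = max(1, ny // n_sections)
    median = np.empty((ny, nx), dtype=np.float32)
    for start in range(0, ny, section_height):
        rows = slice(start, min(start + section_height, ny))
        # Every section re-opens every model, so the total number of
        # load operations is n_sections * n_models.
        sections = []
        for path in model_paths:
            model = open_model(path)                  # full datamodel load + schema validation
            sections.append(model.data[rows].copy())  # keep only this spatial section
            del model                                 # drop the model from memory
        median[rows] = np.nanmedian(np.stack(sections), axis=0)
        del sections                                  # discard the group
    return median
```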

As shown by the comments on the linked ticket JP-3706, as well as a long discussion on the associated pull request, one effect of these inefficiencies is that the runtime is highly sensitive to the choice of section size, and there is no clearly "best" section size to use as a default.

There are multiple ways to fix this, ranging from quick-and-dirty to elegant and hard to implement.

One of the quicker options is simply to store the temporary files as plain numpy arrays and load them as such, rather than going through datamodels. The current plan is to implement a draft of this, then test how much it improves the runtime of the file I/O and reduces the sensitivity of the overall step runtime to the section size.
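
As a rough sketch of that quick fix (file naming and helper names here are illustrative, not the step's actual API), the temporary files could be plain `.npy` dumps of the science arrays, with sections read back through a memory map:

```python
import numpy as np

def write_temp_arrays(models, tmpdir):
    """Dump each model's science array once to a plain .npy file."""
    paths = []
    for i, model in enumerate(models):
        path = f"{tmpdir}/model_{i}.npy"
        np.save(path, model.data)  # science array only: no err/dq arrays, no schema validation
        paths.append(path)
    return paths

def read_section(path, rows):
    """Read one spatial section without loading the whole array into memory."""
    return np.asarray(np.load(path, mmap_mode="r")[rows])
```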

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

The proposed intermediate solution of simply using numpy arrays instead of datamodels for the repeated I/O unfortunately only leads to marginal runtime gains, and it does not resolve the issue that the runtime is highly sensitive to the choice of chunk size. See the two attached files comparing the runtimes for a similar buffer size. Using datamodels and a 7 MB buffer, the runtime of the load operations is 122 seconds, whereas using numpy arrays and a slightly larger 10 MB buffer, the runtime is 105 seconds. I also checked the difference for a 1 MB buffer, and again there is little difference: both take ~1000 seconds total for the load operations, i.e., the runtime scales ~linearly with the inverse of the chunk size.

This small test justifies attempting the more nuanced (but more challenging to implement) approach suggested by Brett Graham in this GitHub comment.
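
For reference, one way such an approach could look, assuming the core idea is to reorganize the temporary storage so that a spatial section can be read back for all models at once (a hedged sketch, not necessarily the design in the linked comment or the eventual PR):

```python
import numpy as np

def build_disk_stack(model_paths, open_model, shape, tmpfile):
    """Write every model's science array once into an on-disk (n_models, ny, nx) cube."""
    stack = np.lib.format.open_memmap(
        tmpfile, mode="w+", dtype=np.float32, shape=(len(model_paths),) + shape
    )
    for i, path in enumerate(model_paths):
        model = open_model(path)  # each model is opened exactly once
        stack[i] = model.data
        del model
    stack.flush()
    return stack

def median_by_section(stack, section_height):
    """Median over models, reading one spatial section of the cube at a time."""
    n_models, ny, nx = stack.shape
    median = np.empty((ny, nx), dtype=np.float32)
    for start in range(0, ny, section_height):
        rows = slice(start, min(start + section_height, ny))
        # Pull just this section of every model into RAM, avoiding per-model file opens
        median[rows] = np.nanmedian(np.asarray(stack[:, rows]), axis=0)
    return median
```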

@stscijgbot-jp
Collaborator Author

stscijgbot-jp commented Sep 12, 2024

Comment by Ned Molter on JIRA:

With some help from Brett, I made a draft of this which can be seen here.  The change appears to make a large difference in the amount of runtime spent on file I/O.  I'm attaching a third runtime graph for the same input association, but this time using the PR branch.  The amount of time spent on file I/O went from roughly 100 seconds to 0.6 seconds.  The buffer size is now computed based on the size of the input image data arrays (this can be changed), but for comparison with the other tests it amounted to ~3 MB per model.  I'm also attaching a memory profile.
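
For illustration only, a per-model buffer target could translate into a section height roughly like this (hypothetical helper; the exact computation in the draft may differ):

```python
def rows_per_section(ny, nx, itemsize=4, buffer_bytes=3 * 2**20):
    """Rows per spatial section such that one model's float32 section stays near buffer_bytes."""
    rows = buffer_bytes // (nx * itemsize)
    return int(min(max(1, rows), ny))
```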

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

A note on the memory profiles. I have attached memray_flamegraph_output_3741.html for my test association on the PR branch with in_memory=False, and memray_flamegraph_output_3741_main.html for the same test association on the master branch also with in_memory=False.

Peak memory usage on main: 5.6 GB

Peak memory usage on PR branch: 5.4 GB

So they are nearly identical, and if anything the PR branch is using a bit less memory.  I take this as evidence that the PR branch really did get its speedups from decreasing file I/O and not from inadvertently using more memory.

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

I'm adding memray outputs for in_memory=True on both the main branch and the PR branch.  As discussed further on the PR, refactoring resample_many_to_many improved memory usage for in_memory=True.  On main, my test dataset had a peak memory usage of 34 GB (memray_flamegraph_inmemory_main.html), whereas on the PR branch it used 21 GB (memray-flamegraph-output_inmemory_resample_refactor4.html).

 
