Temporary files in outlier detection median computation should read faster #8774

Open
stscijgbot-jp opened this issue Sep 11, 2024 · 4 comments · May be fixed by #8782

Comments

@stscijgbot-jp
Collaborator

Issue JP-3741 was created on JIRA by Ned Molter:

Runtimes can be very long for outlier detection on large associations when the in_memory parameter is set to False.  When all models cannot be stored in memory at once, the median is computed in spatial sections as follows:

  • One model is loaded into memory
  • The requested spatial section is read from the model and stored in memory somewhere else
  • The model is deleted from memory
  • This is repeated for all the models until the same spatial section has been read from all images
  • The median over that group is computed and stored
  • The group is discarded
  • The process is repeated for all sections

As currently written, this operation requires a number of load operations equal to n_sections * n_models, which becomes very large, especially for small sections. Because a datamodel is opened to read each section, these loads are extremely inefficient: they pull in arrays that are not needed for the median (e.g. the error and dq arrays), and schema validation runs on every open.
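
For illustration, a minimal sketch of that sectioned loop is below. The helper names are hypothetical (`open_model` and `model_paths` stand in for the step's actual datamodel handling), and the real implementation differs in detail:

```python
import numpy as np

def sectioned_median(model_paths, open_model, shape, n_sections):
    """Median across models, computed one spatial section at a time."""
    ny, nx = shape
    section_height = max(1, ny // n_sections)
    median = np.empty((ny, nx), dtype=np.float32)
    for start in range(0, ny, section_height):
        rows = slice(start, min(start + section_height, ny))
        # Every section re-opens every model, so the total number of
        # load operations is n_sections * n_models.
        sections = []
        for path in model_paths:
            model = open_model(path)                  # full datamodel load + schema validation
            sections.append(model.data[rows].copy())  # keep only this spatial section
            del model                                 # drop the model from memory
        median[rows] = np.nanmedian(np.stack(sections), axis=0)
        del sections                                  # discard the group
    return median
```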

As shown by the comments on the linked ticket JP-3706, as well as a long discussion on the associated pull request, one effect of these inefficiencies is that the runtime is highly sensitive to the choice of section size, and there is no clearly "best" section size to use as a default.

There are multiple ways to fix this, ranging from quick-and-dirty to elegant and hard to implement.

One of the quicker options is simply to store the temporary files as plain numpy arrays and load them as such, rather than going through datamodels. The current plan is to implement a draft of this, then test how much it improves the runtime of the file I/O and reduces the sensitivity of the overall step runtime to the section size.
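
As a rough sketch of that quick fix (file naming and helper names here are illustrative, not the step's actual API), the temporary files could be plain `.npy` dumps of the science arrays, with sections read back through a memory map:

```python
import numpy as np

def write_temp_arrays(models, tmpdir):
    """Dump each model's science array once to a plain .npy file."""
    paths = []
    for i, model in enumerate(models):
        path = f"{tmpdir}/model_{i}.npy"
        np.save(path, model.data)  # science array only: no err/dq arrays, no schema validation
        paths.append(path)
    return paths

def read_section(path, rows):
    """Read one spatial section without loading the whole array into memory."""
    return np.asarray(np.load(path, mmap_mode="r")[rows])
```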

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

The proposed intermediate solution of simply using numpy arrays instead of datamodels for the repeated I/O unfortunately only leads to marginal runtime gains, and it does not resolve the issue that the runtime is highly sensitive to the choice of chunk size. See the two attached files comparing the runtimes for a similar buffer size. Using datamodels and a 7 MB buffer, the runtime of the load operations is 122 seconds, whereas using numpy arrays and a slightly larger 10 MB buffer, the runtime is 105 seconds. I also checked the difference for a 1 MB buffer, and again there is little difference: both take ~1000 seconds total for the load operations, i.e., the runtime scales ~linearly with the inverse of the chunk size.

This small test justifies attempting the more nuanced (but more challenging to implement) approach suggested by Brett Graham in this GitHub comment.
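
For reference, one way such an approach could look, assuming the core idea is to reorganize the temporary storage so that a spatial section can be read back for all models at once (a hedged sketch, not necessarily the design in the linked comment or the eventual PR):

```python
import numpy as np

def build_disk_stack(model_paths, open_model, shape, tmpfile):
    """Write every model's science array once into an on-disk (n_models, ny, nx) cube."""
    stack = np.lib.format.open_memmap(
        tmpfile, mode="w+", dtype=np.float32, shape=(len(model_paths),) + shape
    )
    for i, path in enumerate(model_paths):
        model = open_model(path)  # each model is opened exactly once
        stack[i] = model.data
        del model
    stack.flush()
    return stack

def median_by_section(stack, section_height):
    """Median over models, reading one spatial section of the cube at a time."""
    n_models, ny, nx = stack.shape
    median = np.empty((ny, nx), dtype=np.float32)
    for start in range(0, ny, section_height):
        rows = slice(start, min(start + section_height, ny))
        # Pull just this section of every model into RAM, avoiding per-model file opens
        median[rows] = np.nanmedian(np.asarray(stack[:, rows]), axis=0)
    return median
```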

@stscijgbot-jp
Collaborator Author

stscijgbot-jp commented Sep 12, 2024

Comment by Ned Molter on JIRA:

With some help from Brett, I made a draft of this which can be seen here.  The change appears to make a large difference in the amount of runtime spent on file I/O.  I'm attaching a third runtime graph for the same input association, but this time using the PR branch.  The amount of time spent on file I/O went from roughly 100 seconds to 0.6 seconds.  The buffer size is now computed based on the size of the input image data arrays (this can be changed), but for comparison with the other tests it amounted to ~3 MB per model.  I'm also attaching a memory profile.
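
For illustration only, a per-model buffer target could translate into a section height roughly like this (hypothetical helper; the exact computation in the draft may differ):

```python
def rows_per_section(ny, nx, itemsize=4, buffer_bytes=3 * 2**20):
    """Rows per spatial section such that one model's float32 section stays near buffer_bytes."""
    rows = buffer_bytes // (nx * itemsize)
    return int(min(max(1, rows), ny))
```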

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

A note on the memory profiles. I have attached memray_flamegraph_output_3741.html for my test association on the PR branch with in_memory=False, and memray_flamegraph_output_3741_main.html for the same test association on the master branch also with in_memory=False.

Peak memory usage on main: 5.6 GB

Peak memory usage on PR branch: 5.4 GB

So they are nearly identical, and if anything the PR branch is using a bit less memory.  I take this as evidence that the PR branch really did get its speedups from decreasing file I/O and not from inadvertently using more memory.

@stscijgbot-jp
Collaborator Author

Comment by Ned Molter on JIRA:

I'm adding memray outputs for in_memory=True on both the main branch and the PR branch.  As discussed further on the PR, refactoring resample_many_to_many improved memory usage for in_memory=True.  On main, my test dataset had a peak memory usage of 34 GB (memray_flamegraph_inmemory_main.html), whereas on the PR branch it used 21 GB (memray-flamegraph-output_inmemory_resample_refactor4.html).

 
