Make in-memory processing the default for steps of image3 pipeline #8779

stscijgbot-jp opened this issue Sep 12, 2024 · 9 comments
stscijgbot-jp commented Sep 12, 2024

Issue JP-3744 was created on JIRA by Ned Molter:

In a discussion with Jesse Doggett it was realized that ops typically prefers to process everything in memory when possible, even if it requires a high-memory node, because of potential issues with temporary file I/O.

In light of this, it may make sense to make in_memory=True the default for calwebb_image3 and in particular outlier detection.

The effects of this would be:

  • Faster runtimes through the image3 pipeline for typical/small-ish associations
  • Potential out-of-memory failures happening more often, requiring processing on higher-memory nodes
  • Possible out-of-memory failures even on the godzilla nodes. Special processing instructions would be required to switch in_memory to False to allow these to be processed successfully
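For context, this switch can already be flipped per-run without changing the default. A command-line sketch (the association filename is hypothetical; the parameter spelling assumes the usual stpipe `--steps.<step>.<param>` override convention):

```shell
# Run calwebb_image3 with in-memory processing for this one job only
# (hypothetical association file; default behavior elsewhere is unchanged)
strun calwebb_image3 jw01568-o002_image3_00001_asn.json \
    --in_memory=true \
    --steps.outlier_detection.in_memory=true
```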

Some questions we had for Hien Tran or other members of Team Coffee are:

  • What is the typical size (on disk) of associations that are run through calwebb_image3?  How many associations are so large that their total file size is close to the entire allocation of memory on a typical machine in ops?  What about on the godzilla node?  Would you anticipate that some of the higher-memory systems could get overloaded if this change were made?
  • How would this interact with the new process that guesses how big a machine will be needed to process a dataset? (I think someone said this was being worked on by Ru Kein?)
  • While the pipeline is being run, do the machines have access to local/fast I/O?  The in_memory=False mode of calwebb_image3 relies on reading and writing lots of temporary files, so if there's no fast I/O this will cause long runtimes.

 


Comment by David Law on JIRA:

Let's also keep in mind the impact of any such changes on non-STScI users who may typically be working with a total of 16 GB RAM.  Would this be a change in behavior from how things are handled now?


Comment by Jesse Doggett on JIRA:

I'm not sure we have the answers to your questions above. It might be good to have a meeting with Hien Tran and Eric Barron to make sure we understand the questions and how we might go about finding the answers.

 

 


Comment by Jesse Doggett on JIRA:

I started our godzilla test job. It has two associations, jw01568-o002_20240911t173952_image3_00001/2. All related files are available on tljwdmscsched2.stsci.edu. The input, output, and association files are in:

  • /ifs/archive/test/jwst/store/doggett/tests/godzilla01/dtest2/JWSTDP-2024.2.2-4~c1042679/2024-09-11-153935/sdp/asn_creation/cal/level3

The _00001 association has completed and its log files are in:

  • /ifs/archive/test/jwst/test2/info/owl/logs/doggett_jw01568-o002_20240911t173952_image3_00001_1726097084.820033

The _00002 association just started. It was submitted last night, but we didn't have a large enough machine configured for it. We got one set up and it's running now. Its log files are in:

  • /ifs/archive/test/jwst/test2/info/owl/logs/doggett_jw01568-o002_20240911t173952_image3_00002_1726097185.148363


Comment by Ned Molter on JIRA:

That would be great, Jesse; I'll send a message to all of you tomorrow to schedule something for early next week. And thanks for starting the test; I'll take a look tomorrow.

 

David, yes, it would be different. Right now the default is in_memory=False, which means (among other things) that in outlier detection the median is computed piecewise, in sections, to save memory. I agree that the end-user experience is a concern. The other memory improvements that went along with implementing ModelLibrary will help somewhat, but for larger associations it is still easy to exceed 16 GB with in_memory=True. On the flip side, runtimes are faster when everything is in memory, so that option might be preferred when users have relatively small associations.
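The sectioned median mentioned above can be illustrated with a stand-alone sketch (pure Python with toy data; not the actual outlier-detection code). The idea behind in_memory=False is that only one row-section of the image stack needs to be resident at a time, trading extra I/O passes for a smaller footprint:

```python
from statistics import median

# Toy stand-in for a stack of 2-D images: n_images x n_rows x n_cols nested
# lists (hypothetical shapes; the real step works on detector-sized arrays).
images = [[[float(i + r + c) for c in range(4)] for r in range(6)]
          for i in range(5)]

def sectioned_median(stack, section_rows=2):
    """Per-pixel median over the stack, computed one row-section at a time.

    In the real pipeline each section would be re-read from temporary files,
    so only `section_rows` rows per image are held in memory at once.
    """
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for start in range(0, n_rows, section_rows):
        for r in range(start, min(start + section_rows, n_rows)):
            for c in range(n_cols):
                result[r][c] = median(img[r][c] for img in stack)
    return result

# The sectioned result matches the all-in-memory median exactly;
# only the peak memory profile differs.
full = [[median(img[r][c] for img in images) for c in range(4)]
        for r in range(6)]
assert sectioned_median(images) == full
```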

 

If, theoretically, we did want to have the parameter False in ops but True for an end user, is that possible to do?


Comment by Ned Molter on JIRA:

A group of us, including Brett Graham, Tyler Pauly, Jesse Doggett, Eric Barron, and Hien Tran, just had a preliminary discussion about this.  The takeaways were:

Setting the default to in_memory=True for outlier detection is something that Team Coffee would like to consider moving forward with.  Paraphrasing Eric, "if it can be done in memory, it should be".  However, they prefer that we not change the default in the upcoming build, so that we can better understand the consequences of the memory/runtime improvements that are already part of 11.1 without also changing these defaults.

For large associations that fail due to out-of-memory constraints, we should investigate ways to rescue those jobs and get them to process successfully by changing the in_memory parameter to False instead of True.  Multiple potential ways to specify this were discussed, including config files and additional command-line parameters.
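The rescue idea can be sketched as a retry wrapper (hypothetical; `run_image3` here is a dummy stand-in for whatever actually launches calwebb_image3, and the size threshold is made up):

```python
def run_image3(asn, in_memory=True):
    # Dummy stand-in: pretend very large associations blow past available
    # memory when processed fully in memory.
    if in_memory and asn["n_members"] > 100:
        raise MemoryError("out of memory")
    return {"asn": asn["name"], "in_memory": in_memory, "status": "done"}

def run_with_rescue(asn):
    """Try in-memory first; on an out-of-memory failure, resubmit with
    in_memory=False (the sectioned, on-disk mode)."""
    try:
        return run_image3(asn, in_memory=True)
    except MemoryError:
        return run_image3(asn, in_memory=False)

big = {"name": "big_mosaic_asn", "n_members": 250}
result = run_with_rescue(big)
assert result["status"] == "done" and result["in_memory"] is False
```

In practice the fallback would be communicated via a config file or an extra command-line parameter, per the options discussed above, rather than an in-process retry.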

As of the upcoming build 11.1, both memory and runtime performance should improve for both True and False settings of the in_memory parameter.  This means that Ru Kein's program to guess the memory usage of jobs probably needs to be retrained for the level 3 pipelines, and would probably need to be retrained again after changing the default in outlier detection (if this is indeed done).  We should initiate a follow-up conversation with Ru.

 


Comment by Ru Kein on JIRA:

Have been following along on this thread and others, as well as a brief discussion with Brett G. As Ned mentions in the comment from 9/16, we will have to live with the fact that the model is likely going to underestimate a fair number of datasets that now use more memory than they used to; the plan is to retrain the model on data reprocessed with the new code changes and get that into the next release.

One thought I have (not sure how feasible this is, or if it's even a priority): the only way around this would be if there were some way to estimate the increase in memory usage incurred by the code change (in_memory=True). Then you could use that rough estimate on top of the model prediction to act as a buffer. In other words, the model predicts 80GB for dataset X, and we can calculate (roughly) that dataset X is going to use 20-50 GB more than it used to because of in_memory=True, so we put it on the next biggest node instead. Whereas dataset Y is estimated at 25GB, so adding 20-50GB more to that results in using the same node size (100GB) as before. That would theoretically prevent a lot of memory failures/rescues from occurring during that interim period.
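The buffering idea reduces to a few lines (a sketch with assumed node tiers and the 20-50 GB range from the comment, not the real ops configuration):

```python
# Assumed node tiers, illustrative only — not the actual ops machine sizes.
NODE_SIZES_GB = [16, 100, 200, 500]

def pick_node(predicted_gb, extra_high=50):
    """Pad the model's prediction by the pessimistic end of the estimated
    in_memory=True overhead, then pick the smallest node that fits."""
    padded = predicted_gb + extra_high
    for size in NODE_SIZES_GB:
        if size >= padded:
            return size
    raise RuntimeError("no node large enough; needs special handling")

# Dataset X: predicted 80 GB -> padded 130 GB -> bump to the next node up.
assert pick_node(80) == 200
# Dataset Y: predicted 25 GB -> padded 75 GB -> same 100 GB node as before.
assert pick_node(25) == 100
```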


Comment by Ned Molter on JIRA:

Ru Kein Tyler Pauly I just attached a script to this ticket that attempts to guess the memory usage of outlier detection and resample without running those steps.  At present, it's able to calculate the memory usage to within roughly +/-30% of the actual usage.  The way it works is to figure out the size of the resampled array based on the s_regions of all the inputs plus a pixel scale, then account for all allocations by hand, which is tractable because all the important ones are integer multiples of either the input data size or the output data size.
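A back-of-the-envelope version of that accounting (the copy counts here are illustrative placeholders, not values from the attached script, and the shapes are made up):

```python
BYTES_PER_PIX = 4  # float32

def estimate_peak_gb(input_shape, output_shape,
                     input_copies=3, output_copies=2):
    """Peak ~= (hand-counted copies of one input array)
             + (hand-counted copies of the resampled output array).

    input_copies / output_copies stand in for the per-allocation bookkeeping
    the real script does; in practice they come from reading the step code.
    """
    in_bytes = input_shape[0] * input_shape[1] * BYTES_PER_PIX
    out_bytes = output_shape[0] * output_shape[1] * BYTES_PER_PIX
    return (input_copies * in_bytes + output_copies * out_bytes) / 1e9

# A 2048x2048 input resampled onto a hypothetical 12000x12000 mosaic grid:
# the output term dominates, as expected for large mosaics.
est = estimate_peak_gb((2048, 2048), (12000, 12000))
```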

This is definitely a work in progress, and it's also a moving target: some open pull requests will modify the memory usage of these steps.  However, I think this is much more straightforward, and I'm guessing accurate, than using a machine learning algorithm, and I'm hoping that after Build 11.2 delivery these steps will be in a more stable state.  Please let me know what you think, and whether this appears useful enough for us to discuss next steps.

I encourage you to try this out on your own datasets.  I'm sure there are bugs and ways to improve the accuracy of the estimate.

(note: you will need to install specific branches of stcal and jwst to get this to run: instructions in the docstring at the top)

(note: you can disable the plot by setting save_plot="")


Comment by Ru Kein on JIRA:

Ned Molter are you able to provide a list of the datasets you tested this on, along with the estimates vs actual memory usage?


Comment by Ned Molter on JIRA:

I've only tested this on one dataset so far, which is a subset of a large NIRCam mosaic.  See attached asn json file.  [^small_asn.json]

The output from the script for in_memory=False mode was:

  • Estimated peak memory usage for OutlierDetectionStep: 4.987201224677027 GB
  • Estimated peak memory usage for ResampleStep: 11.490969524246395 GB
  • True peak memory usage for OutlierDetectionStep: 4.020512127317488 GB
  • True peak memory usage for ResampleStep: 10.445950175635517 GB

 

Please note, it looks like we are going to change the API of https://github.com/emolter/stcal@AL-837 before it is merged to main.  I will provide an updated script at that time. Sorry that this is still in flux, but I wanted to share my progress anyway so we can decide how to move forward.
