Make in-memory processing the default for steps of image3 pipeline #8779
Comment by Jesse Doggett on JIRA: I'm not sure we have the answers to your questions above. It might be good to have a meeting with Hien Tran and Eric Barron to make sure we understand the questions and how we might go about finding the answers.
Comment by Jesse Doggett on JIRA: I started our godzilla test job. It has two associations, jw01568-o002_20240911t173952_image3_00001/2. All related files are available on tljwdmscsched2.stsci.edu. The input, output, and association files are in:
The _00001 association has completed and its log files are in:
The _00002 association just started. It was submitted last night, but we didn't have a large enough machine configured for it. We got one set up and it's running now. Its log files are in:
Comment by Ned Molter on JIRA: That would be great, Jesse; I'll send a message to all of you tomorrow to schedule something for early next week. And thanks for starting the test; I'll take a look tomorrow.
David, yes it would be different. Right now, the default is
If, theoretically, we did want to have the parameter
Comment by Ned Molter on JIRA: A group of us including Brett Graham, Tyler Pauly, Jesse Doggett, Eric Barron, and Hien Tran just had a preliminary discussion about this. The take-aways were:
- Setting the default to in_memory=True for outlier detection is something that Team Coffee would like to consider moving forward with. Paraphrasing Eric, "if it can be done in memory, it should be". However, they prefer that we do not change the default in the upcoming build, so that we can better understand the consequences of the memory/runtime improvements that are already part of 11.1 without changing these defaults.
- For large associations that fail due to out-of-memory constraints, we should investigate ways to rescue those jobs and get them to process successfully by changing the in_memory parameter to False instead of True. Multiple potential ways to specify this were discussed, including config files and additional command-line parameters.
- As of the upcoming build 11.1, both memory and runtime performance should improve for both True and False settings of the in_memory parameter. This means that Ru Kein's program to guess the memory usage of jobs probably needs to be retrained for the level 3 pipelines, and would probably need to be retrained again after changing the default in outlier detection (if this is indeed done). We should initiate a follow-up conversation with Ru.
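As a rough sketch of what the command-line rescue option discussed above might look like (this is an assumption about a possible interface, not a decided design; the association filename is a placeholder), the existing strun parameter-override syntax could be used to flip the step back to on-disk processing:

```shell
# Hypothetical rescue of an association that failed out-of-memory,
# forcing outlier detection back to temporary-file processing.
strun calwebb_image3 jw01568-o002_image3_asn.json \
    --steps.outlier_detection.in_memory=False
```

A config file carrying the same override would be the other route mentioned; either way the job itself stays unchanged and only the in_memory setting differs.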
Have been following along on this thread and others, as well as a brief discussion with Brett G. As Ned mentions in the comment from 9/16, we will have to live with the fact that the model is likely going to underestimate a fair number of datasets that now use more memory than they used to, and plan to retrain the model on data reprocessed using the new code changes and get that into the next release.
One thought I have (not sure how feasible this is, or if it's even a priority): the only way around this would be if there were some way to estimate the rate of increased memory usage incurred by the code change (in_memory=True). Then we could use that rough estimate on top of the model prediction to act as a buffer. In other words, the model predicts 80GB for dataset X, and we can calculate (roughly) that dataset X is going to use 20-50 more GB than it used to because of in_memory=True, so we put it on the next biggest node instead. Whereas dataset Y is estimated at 25GB, so adding 20-50GB more to that results in using the same node size (100GB) as before. That would theoretically prevent a lot of memory fails/rescues from occurring during that interim period.
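The node-selection idea above can be sketched in a few lines of Python. The node tiers and the 50GB buffer are illustrative assumptions taken from the worked example in the comment, not real ops configuration:

```python
# Illustrative node tiers (GB); not the actual ops hardware inventory.
NODE_SIZES_GB = [50, 100, 200, 400]

def pick_node(predicted_gb, buffer_gb=50):
    """Pick the smallest node that fits the model prediction plus a
    safety buffer covering the expected in_memory=True increase."""
    needed = predicted_gb + buffer_gb
    for size in NODE_SIZES_GB:
        if size >= needed:
            return size
    raise ValueError(f"no node large enough for {needed} GB")

# Dataset X: predicted 80 GB; the buffer pushes it up a tier.
print(pick_node(80))   # 200
# Dataset Y: predicted 25 GB; still fits the same 100 GB node.
print(pick_node(25))   # 100
```

The buffer acts exactly as described: it changes the node assignment only for datasets near a tier boundary, leaving small jobs where they were.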
Comment by Ned Molter on JIRA: Ru Kein Tyler Pauly I just attached a script to this ticket that attempts to guess the memory usage of outlier detection and resample without running those steps. At present, it's able to calculate the memory usage to within, say, +/-30% of the actual usage. The way it works is to figure out the size of the resampled array based on the s_regions of all the inputs plus a pixel scale, then account for all allocations by hand, which is tractable because all the important ones are integer multiples of either the input data size or the output data size. This is definitely a work in progress, and it's also a moving target: some open pull requests will modify the memory usage of these steps. However, I think this is much more straightforward, and I'm guessing more accurate, than using a machine learning algorithm, and I'm hoping that after the Build 11.2 delivery these steps will be in a more stable state. Please let me know what you think, and whether this appears useful enough for us to discuss next steps. I encourage you to try this out on your own datasets; I'm sure there are bugs and ways to improve the accuracy of the estimate. (Note: you will need to install specific branches of stcal and jwst to get this to run; instructions are in the docstring at the top.) (Note: you can disable the plot by setting save_plot="".)
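A toy version of the approach described above, not the attached script: derive the output (resampled) array size from the mosaic footprint and pixel scale, then model peak memory as integer multiples of the input and output array sizes. The copy counts and the example numbers are made-up placeholders:

```python
# Sketch of footprint-based memory estimation; all coefficients here
# are hypothetical, not values from the real script.
BYTES_PER_PIXEL = 4  # float32 science array

def output_npix(footprint_deg2, pixel_scale_arcsec):
    """Pixels needed for a resampled output covering the given footprint."""
    pixel_area_deg2 = (pixel_scale_arcsec / 3600.0) ** 2
    return footprint_deg2 / pixel_area_deg2

def estimate_peak_gb(input_npix, footprint_deg2, pixel_scale_arcsec,
                     input_copies=3, output_copies=4):
    """Peak memory modeled as a few whole copies of an input array plus a
    few whole copies of the output array (the integer multiples noted above)."""
    out_npix = output_npix(footprint_deg2, pixel_scale_arcsec)
    peak_bytes = (input_copies * input_npix
                  + output_copies * out_npix) * BYTES_PER_PIXEL
    return peak_bytes / 1e9

# Example: one 2048x2048 input, a 0.01 deg^2 footprint at 0.03 arcsec/pixel.
print(round(estimate_peak_gb(2048 * 2048, 0.01, 0.03), 2))  # 2.35
```

The real script would read the footprint from the s_regions of the association members; the structure of the estimate (a handful of input-sized and output-sized allocations) is the point here.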
Ned Molter are you able to provide a list of the datasets you tested this on, along with the estimates vs. actual memory usage?
Comment by Ned Molter on JIRA: I've only tested this on one dataset so far, which is a subset of a large NIRCam mosaic. See the attached asn json file: [^small_asn.json]
The output from the script for in_memory=False mode was:
Estimated peak memory usage for OutlierDetectionStep: 4.987201224677027 GB
Estimated peak memory usage for ResampleStep: 11.490969524246395 GB
True peak memory usage for OutlierDetectionStep: 4.020512127317488 GB
True peak memory usage for ResampleStep: 10.445950175635517 GB
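For reference, those numbers correspond to overestimates of roughly 24% (outlier detection) and 10% (resample), both within the +/-30% band quoted earlier. A quick check:

```python
# Compare the script's estimates against the measured peaks reported above.
results = {
    "OutlierDetectionStep": (4.987201224677027, 4.020512127317488),
    "ResampleStep": (11.490969524246395, 10.445950175635517),
}

def percent_error(estimated_gb, true_gb):
    """Signed relative error of the estimate, in percent."""
    return (estimated_gb - true_gb) / true_gb * 100

for step, (est, true) in results.items():
    print(f"{step}: {percent_error(est, true):+.1f}%")
# OutlierDetectionStep: +24.0%
# ResampleStep: +10.0%
```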
Please note, it looks like we are going to change the API of
Issue JP-3744 was created on JIRA by Ned Molter:
In a discussion with Jesse Doggett it was realized that ops typically prefers to process everything in memory when possible, even if it requires a high-memory node, because of potential issues with temporary file I/O.
In light of this, it may make sense to make in_memory=True the default for calwebb_image3 and in particular outlier detection. The effects of this would be:
in_memory to False to allow these to be processed successfully
Some questions we had for Hien Tran or other members of Team Coffee are:
The in_memory=False mode of calwebb_image3 relies on reading and writing lots of temporary files, so if there's no fast I/O this will cause long runtimes.