Answer changes on Izumi after the hardware rebuild #2862

Open
3 of 5 tasks
ekluzek opened this issue Nov 6, 2024 · 7 comments
Labels
external: issue needs to be addressed elsewhere (submodule); issue here for the sake of project tracking
investigation: Needs to be verified and more investigation into what's going on.
non-bfb: Changes answers (incl. adding tests)

Comments

@ekluzek
Collaborator

ekluzek commented Nov 6, 2024

45 of the aux_clm tests show changes to answers after the hardware rebuild, which means most of our baselines will need to be regenerated if we want to use them going forward. There were 25 tests in aux_clm that did NOT change answers, but most did. I would expect the changes to be roundoff level.

Here's the list of answer changes from aux_clm:

ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-reduceOutput
ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-SNICARFRC
ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-reduceOutput
ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm50Bgc.izumi_nag.clm-ciso
ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso
ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso--clm-matrixcnOn
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-flexCN_FUN
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-flexCN_FUN--clm-matrixcnOn
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-luna
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-noFUN_flexCN
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-noFUN_flexCN--clm-matrixcnOn
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-reduceOutput
ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-o3lombardozzi2015
ERP_D_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.izumi_nag.clm-clm60cam7LndTuningModeLDust
ERP_D_P48x1.f10_f10_mg37.IHistClm60Bgc.izumi_nag.clm-decStart
ERP_D_P48x1.f10_f10_mg37.IHistClm60Bgc.izumi_nag.clm-decStart--clm-matrixcnOn_ignore_warnings
ERS_D.f10_f10_mg37.I1850Clm50BgcCrop.izumi_nag.clm-ciso_monthly_matrixcn_spinup
ERS_D.f10_f10_mg37.I1850Clm60Sp.izumi_nag.clm-ExcessIceStreams
ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold
ERS_Lm20_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-cropMonthlyNoinitial
ERS_Lm40_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_gnu.clm-cropMonthlyNoinitial
ERS_Ly3_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.izumi_gnu.clm-cropMonthOutput
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly--clm-matrixcnOn
SMS.f10_f10_mg37.I2000Clm50BgcCrop.izumi_gnu.clm-crop
SMS_D.f10_f10_mg37.I1850Clm60BgcCrop.izumi_nag.clm-ciso_soil_matrixcn_only
SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_gnu.clm-crop
SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.izumi_gnu.clm-ptsRLA
SMS_D_Ld1_P48x1.f10_f10_mg37.I2000Clm45BgcCrop.izumi_nag.clm-oldhyd
SMS_D_Ld1_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-datm_bias_correct_cruv7
SMS_D_Ld3.f10_f10_mg37.I2000Clm60Bgc.izumi_nag.clm-HillslopeD
SMS_D_Ld5.f10_f10_mg37.I1850Clm45BgcCrop.izumi_nag.clm-crop
SMS_D_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_alternate
SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesCold
SMS_D_Ld5.f45_f45_mg37.I2000Clm60Fates.izumi_nag.clm-FatesCold
SMS_D_Ld65.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-FireLi2024GSWP
SMS_D_Ld65.f10_f10_mg37.IHistClm60BgcCrop.izumi_nag.clm-cropMonthOutput--clm-RxCropCalsAdaptGGCMI
SMS_D_P48x1_Ld5.f10_f10_mg37.I2000Clm50BgcCrop.izumi_nag.clm-irrig_spunup
SMS_Ld5_D_P48x1.f10_f10_mg37.IHistClm50Bgc.izumi_nag.clm-monthly
SMS_Ld5_D_P48x1.f10_f10_mg37.IHistClm60Bgc.izumi_nag.clm-decStart
SMS_Ld5_Mmpi-serial.1x1_brazil.IHistClm60Bgc.izumi_gnu.clm-mimics
SMS_Ln9.f10_f10_mg37.I1850Clm45Bgc.izumi_gnu.clm-clm45cam4LndTuningModeZDustSoilErod
SMS_Ly3_Mmpi-serial.1x1_numaIA.I2000Clm50BgcDvCropQianRs.izumi_gnu.clm-ignor_warn_cropMonthOutputColdStart
SMS_Ly5_Mmpi-serial.1x1_smallvilleIA.IHistClm60BgcCropQianRs.izumi_gnu.clm-gregorian_cropMonthOutput

Definition of done:

  • Meet with Joseph about this
  • Assess if it's worth working on to get identical answers -- currently thinking NO
  • Run on queues with the old OS and see if baselines are identical
  • Assess if we should run the ECT test for the CESM system? CSEG says yes for CESM2.1
  • Chris will run CESM2.1 ECT test on Izumi
@ekluzek ekluzek added investigation Needs to be verified and more investigation into what's going on. next this should get some attention in the next week or two. Normally each Thursday SE meeting. external issue needs to be addressed elsewhere (submodule); issue here for the sake of project tracking non-bfb Changes answers (incl. adding tests) labels Nov 6, 2024
@ekluzek ekluzek self-assigned this Nov 6, 2024
@ekluzek
Collaborator Author

ekluzek commented Nov 6, 2024

I redid the ctsm5.3.009 baselines and called them

ctsm5.3.009.redo

@ekluzek
Collaborator Author

ekluzek commented Nov 6, 2024

The email on the rebuild, sent out on Oct/23/2024, was:

In preparation for the full Izumi migration to the new Alma Linux 8 operating system, I have prepared 12 existing Izumi nodes (i001-i012) for testing existing workflows before a full migration is performed after the CESM3 code freeze. Outside of the OS upgrade, there have been no major software adjustments to these systems; these systems are more or less similar to the already upgraded CGD Post Processors, with some additional cluster software and modules added. I do expect there to be a few minor issues with these nodes, but all existing workflows should work more or less the same as before.

For the next week, these nodes will only be available in a separate queue named "upgrade". This queue is closely aligned with the existing and popular "medium" queue, with the exception that all minimum requirements of the queue have been removed. All additional qsub parameters also work normally in this queue (eg, requesting a specific node, adjusting walltime, etc.). Below is an example of how to submit a job to the new "upgrade" queue:

[joneill@izumi ~]$ qsub -q upgrade ./your_model_run_script_here

I highly recommend using this queue (and nodes) for any non-CESM3 related work or any existing Izumi workflows, as the remaining nodes (and queues) are being held for CESM3 work. Assuming there are no major issues, these nodes (i001-i012) will be re-released into the general queues (short, long, medium, etc.) during next week's systems work (10/30), and the remaining nodes (i013-i030) will be placed into a separate queue reserved for CESM3 development until after the CESM3 code freeze is complete. At that point, all of Izumi will be migrated to the new Alma 8 Operating System.

If you do run into any issues with anything regarding Izumi software, please submit a request to [email protected]. This will help us keep track of all necessary changes as a team, and ensure that we don't overwrite or overlook any necessary updates to all Izumi nodes. While I expect most issues to be fairly simple, all solutions will need to be applied to all Izumi nodes (existing and future), to ensure the uniformity of the cluster.

@ekluzek
Collaborator Author

ekluzek commented Nov 7, 2024

Looking at the coupler fields for the test SMS.f10_f10_mg37.I2000Clm50BgcCrop.izumi_gnu.clm-crop, many fields differ only at roundoff level, but larger differences have already propagated to some fields after only a few days. So the change is likely roundoff level in origin, but it grows to larger than roundoff. It also only appears for the Nag and gnu compilers.

The differences for the Nag compiler seem to be roundoff level. For example, for the Nag equivalent of the test above, SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop:

grep RMS /scratch/cluster/erik/tests_ctsm539redoacl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop.GC.ctsm539redoacl_nag/run/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop.GC.ctsm539redoacl_nag.clm2.h0.2000-01-06-00000.nc.cprnc.out
RMS landfrac 4.7340E-17 NORMALIZED 7.5121E-17
RMS CH4_SURF_AERE_SAT 1.8671E-22 NORMALIZED 3.8879E-15
RMS CH4_SURF_DIFF_SAT 3.1481E-25 NORMALIZED 7.1050E-16
RMS FCH4 1.5601E-24 NORMALIZED 3.6108E-14
RMS FCH4TOCO2 1.5310E-21 NORMALIZED 2.1225E-14
RMS FCH4_DFSAT 4.6737E-26 NORMALIZED 2.9183E-13
RMS FINUNDATED 7.9302E-18 NORMALIZED 1.8425E-16
RMS NEM 1.5210E-21 NORMALIZED 6.5745E-14
RMS TOTCOLCH4 5.2349E-18 NORMALIZED 9.0652E-18
RMS VOLR 1.5554E-06 NORMALIZED 4.3977E-16
RMS VOLRMCH 1.1214E-06 NORMALIZED 5.2843E-16
RMS CONC_O2_SAT 8.3206E-14 NORMALIZED 1.5986E-12
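
To scan a longer cprnc output for fields that stand out, the `RMS ... NORMALIZED ...` lines can be parsed and filtered by normalized RMS. This is a minimal sketch that only assumes the line layout shown in the grep output above; the function names and the `1e-12` tolerance are hypothetical choices for illustration, not an agreed roundoff threshold.

```python
import re

# Matches lines such as "RMS FCH4 1.5601E-24 NORMALIZED 3.6108E-14".
RMS_LINE = re.compile(r"^\s*RMS\s+(\S+)\s+(\S+)\s+NORMALIZED\s+(\S+)")


def parse_rms_lines(text):
    """Return {field: (rms, normalized_rms)} for every RMS line in text."""
    results = {}
    for line in text.splitlines():
        match = RMS_LINE.match(line)
        if match:
            field, rms, norm = match.groups()
            results[field] = (float(rms), float(norm))
    return results


def fields_above(results, tolerance):
    """List fields whose normalized RMS exceeds the given tolerance."""
    return sorted(name for name, (_, norm) in results.items() if norm > tolerance)


# Three lines copied from the output above.
sample_output = """\
RMS landfrac 4.7340E-17 NORMALIZED 7.5121E-17
RMS FCH4_DFSAT 4.6737E-26 NORMALIZED 2.9183E-13
RMS CONC_O2_SAT 8.3206E-14 NORMALIZED 1.5986E-12
"""

parsed = parse_rms_lines(sample_output)
# 1e-12 is an arbitrary illustrative tolerance.
print(fields_above(parsed, 1e-12))
```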

The longer nag mpi-serial single point case ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly shows that the difference has propagated and no longer appears as roundoff.

@ekluzek
Collaborator Author

ekluzek commented Nov 7, 2024

And here's the email from the day of the rebuild on Oct/30th...

The Izumi cluster will be updated and rebooted.
This work should not affect any existing workflows.
The newly upgraded Izumi nodes 1 - 14 will be reintroduced to the standard Izumi queues.
If you experience any issues with these nodes, please contact [email protected] with the title: "Izumi Upgraded Node Issue [ATTN: Joseph ONeill/NRIT Linux]".
If possible, please include the job number in the body of the request
Izumi nodes 15 - 30 (i014 to i030) will be taken out of the standard Izumi queues until the planned OS upgrade is completed after the CESM3 code freeze.
These nodes will be locked to specific users working on CESM3. If you'd like to be added to this unique queue, please contact [email protected] with the title: "Izumi Special Queue Request [ATTN: Joseph ONeill/NRIT Linux]"

@wwieder wwieder removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Nov 7, 2024
@ekluzek
Collaborator Author

ekluzek commented Nov 12, 2024

I met with Joseph on this and we talked it over. He potentially sees ways to fix this if I worked with him on going through libraries. But this could be difficult, and it's impossible to predict a timeline for it.

From looking at our CTSM tests, it appears to be roundoff level: most field differences are small (appearing roundoff level) for short tests but grow for longer ones, which is normal for roundoff level changes that affect answers. To be certain of this we'd need to do more extensive testing, and it's unclear whether that's worth doing. We'll talk more at the CSEG meeting to evaluate.
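
As a toy illustration of why a roundoff-sized difference can grow with run length, here is a sketch using the logistic map. This is a stand-in for any nonlinear model, not CTSM; the map, the parameter `r=3.9`, the starting value, and the `1e-15` perturbation are all assumptions chosen just for the demonstration.

```python
def logistic_trajectory(x0, r=3.9, steps=200):
    """Iterate the logistic map x -> r * x * (1 - x) and return all states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs that differ only by a roundoff-sized perturbation in the initial state.
base = logistic_trajectory(0.3)
perturbed = logistic_trajectory(0.3 + 1e-15)

# Absolute difference between the two runs at each step.
diffs = [abs(a - b) for a, b in zip(base, perturbed)]

for step in (1, 10, 50, 100, 200):
    print(step, diffs[step])
```

The early differences stay near machine precision while the later ones become comparable to the state itself, which is the same qualitative pattern as short tests versus multi-year tests.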

I wondered about having a warning that this might change answers. However, since it was just an OS change and NOT a hardware change, he didn't anticipate changes to answers, which is reasonable. His process included a stepwise iteration allowing for people to test beforehand, so his process needn't change. We should maybe anticipate that more things can change answers, though, and elect to do more testing in cases like this.

@ekluzek
Collaborator Author

ekluzek commented Nov 12, 2024

Working with Joseph, we figured I should try building and running on a queue with the old OS to verify that answers are identical to our old baselines. This is worth doing, so I will do that.

The old OS is in the "upgrade" queue, and on the login node. Other queues have the new OS.

@ekluzek
Collaborator Author

ekluzek commented Nov 13, 2024

Running a baseline comparison on the "upgrade" queue, which has the old OS, I get identical answers for the tests that ran. The nag mpi-serial tests still mostly didn't run. @fischer-ncar is going to run the ECT test on Izumi with CESM2.1 to assess whether these changes truly are roundoff level.

Projects
Status: Todo
Development

No branches or pull requests

2 participants