
cam6_4_050: clubb_intr GPUization #1175

Open
wants to merge 15 commits into base: cam_development

Conversation

huebleruwm

This only modifies clubb_intr.F90 and doesn't require a new version of clubb. The purpose of this is to add OpenACC (acc) directives in order to offload computations to GPUs. Besides the directives, this mainly consists of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (nzm_clubb and nzt_clubb) that improves readability and will make it easier to merge with future versions of clubb. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, which this version of clubb_intr is also compatible with.
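
To illustrate the kind of transformation involved, here is a minimal sketch of replacing vector notation with an explicit, OpenACC-annotated loop (the array names are made up for the example, not taken from clubb_intr.F90):

! Before: whole-array (vector) notation, evaluated on the CPU
!   tmp(1:ncol,1:nzt_clubb) = a(1:ncol,1:nzt_clubb) * b(1:ncol,1:nzt_clubb)

! After: explicit loops that an OpenACC directive can offload
!$acc parallel loop gang vector collapse(2) default(present)
do k = 1, nzt_clubb
   do i = 1, ncol
      tmp(i,k) = a(i,k) * b(i,k)
   end do
end do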

This is BFB on CPUs (tested with Intel), runs with Intel+debugging, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (do_clubb_mf, do_rainturb, do_cldcool, clubb_do_icesuper, single_column), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.

If something ends up being wrong with these changes, then this version, an earlier commit that contains only a new OpenACC data statement and some timer additions, would be nice to get in at least.

@Katetc Katetc self-requested a review October 21, 2024 21:43
Collaborator
@nusbaume nusbaume left a comment

Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

! to zero.
fcor(:) = 0._r8
! Set the ztodt timestep in pbuf for SILHS
ztodtptr(:) = 1.0_r8*hdtime
Collaborator

Maybe this is a question for @Katetc, but why multiply by one instead of just using hdtime directly?

Author
@huebleruwm huebleruwm Oct 28, 2024

I have no idea why this is, but I was wondering the same thing. The comment implies it's only for SILHS though, which has a dubious future in cam, so I left it untouched.

Collaborator

I very vaguely remember some kind of PGI compiler issue with pointers and some PBuf variables like 10 years ago... and this was some kind of tacky patch. I'm sure you could remove this multiplier without an issue now.

Comment on lines +2863 to +2864
!$acc state1, state1%q, state1%u, state1%v, state1%t, state1%pmid, state1%s, state1%pint, &
!$acc state1%zm, state1%zi, state1%pdeldry, state1%pdel, state1%omega, state1%phis, &
Collaborator
@nusbaume nusbaume Oct 24, 2024

Just curious as to why you have to bring in state1 in addition to the relevant state1 variables (e.g. state1%q)?

Author
@huebleruwm huebleruwm Oct 28, 2024

Honestly I don't know either anymore. I remember at one point having trouble with this and finding out that you did have to allocate the derived type itself on the GPU (create/copyin), so I just kept doing it that way, but I discovered recently that it's completely unnecessary. However, I'm not sure if the lack of such need is something NVIDIA has implemented, or if it's part of the API standard, or if the trouble I encountered forever ago was just a bug. Common advice on the internet is to copy in the user-defined type itself, but that might be for types that contain scalars or fixed-size arrays, since those values are contained in the memory storing the type variable.

So long story short - just in case we need it.
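
For readers following along, the pattern being discussed is roughly the one below; whether the bare state1 entry is strictly required is exactly the open question (this is a sketch, not the full clause list from the PR, and some_result_array is a placeholder name):

! Derived type itself plus the member arrays used on the device
!$acc data copyin( state1, state1%q, state1%u, state1%v, state1%t ) &
!$acc      copyout( some_result_array )
   ! ... !$acc parallel loop regions reading state1%q etc. ...
!$acc end data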

… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
@huebleruwm
Author

> Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

Good finds, I took you up on every suggestion.

@huebleruwm
Author

huebleruwm commented Oct 28, 2024

I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I took a guess that the biggest performance killer was a massive loop I made to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seemed to be the culprit: I split the loop up here around line 2991, and the result was slightly faster than the original code.
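
Roughly, the fix was to go from one big fused zeroing loop to separate per-array loops, along these lines (array names hypothetical):

! Fused version (suspected cause of the slowdown):
do k = 1, nzm_clubb
   do i = 1, ncol
      arr1(i,k) = 0._r8
      arr2(i,k) = 0._r8
      ! ... many more arrays ...
   end do
end do

! Split version (each array zeroed in its own loop):
do k = 1, nzm_clubb
   do i = 1, ncol
      arr1(i,k) = 0._r8
   end do
end do
do k = 1, nzm_clubb
   do i = 1, ncol
      arr2(i,k) = 0._r8
   end do
end do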

Here's the timing output comparison now. I ran these a couple times and got roughly the same results.

From the cam_dev head I started with (cam6_4_038)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 135.3464    124.0819    366     146.0524    42     
  clubb_tend_cam                               512    512    MULTIPLE 132.0139    121.0235    366     142.8016    166    
    clubb_tend_cam_i_loop                      512    512    MULTIPLE 111.1615    102.4363    366     123.0577    166    

From the head of this branch (75f51d1)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 132.7466    123.5249    366     144.4218    149    
  clubb_tend_cam                               512    512    MULTIPLE 129.4563    120.4303    366     140.8711    149    
    clubb_tend_cam:ACCR                        512    512    MULTIPLE 107.8117    100.6424    366     117.3765    149    
      clubb_tend_cam:advance_clubb_core_api    512    512    MULTIPLE 91.2902     84.0197     366     100.7022    149    
      clubb_tend_cam:flip-index                512    512    MULTIPLE 11.2684     9.9628      220     13.7559     455    
   clubb_tend_cam:NAR                          512    512    MULTIPLE 21.4333     19.0054     401     24.0401     76     
                      qneg3                    512    512    MULTIPLE 0.5783      0.5268      474     0.6311      66     
   clubb_tend_cam:acc_copyin                   512    512    MULTIPLE 0.0059      0.0045      358     0.0081      260    
   clubb_tend_cam:acc_copyout                  512    512    MULTIPLE 0.0038      0.0025      314     0.0061      149    

So it seems these changes made clubb_tend_cam faster by ~2%.

I ran all the same tests mentioned above to confirm these new changes.

Collaborator
@Katetc Katetc left a comment

This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!

My comments are mostly questions so I can understand what is going on as best as possible.

I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.

sclrm_forcing, & ! Passive scalar forcing [{units vary}/s]
sclrm, & ! Passive scalar mean (thermo. levels) [units vary]
sclrp2, & ! sclr'^2 (momentum levels) [{units vary}^2]
sclrp3, & ! sclr'^3 (thermo. levels) [{units vary}^3]
sclrprtp, & ! sclr'rt' (momentum levels) [{units vary} (kg/kg)]
sclrpthlp, & ! sclr'thlp' (momentum levels) [{units vary} (K)]
wpsclrp ! w'sclr' (momentum levels) [{units vary} m/s]
wpsclrp, & ! w'sclr' (momentum levels) [{units vary} m/s]
sclrpthvp_inout ! sclr'th_v' (momentum levels) [{units vary} (K)]
Collaborator

This variable had a straight "pverp" vertical dimension before, so if "top_lev" is not 1 then it will be dimensioned differently here... however, I think that's correct. Not sure it was meant to be at all levels no matter what.

Author

You're right that it's dimensioned differently, but that was intentional. I noticed that we never use the whole array, only ever up to pverp+1-top_lev, which is now nzm_clubb, so I set it to that instead.
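
As a sketch of the convention (nzt_clubb shown by analogy; check the PR for the exact definitions and array shapes):

nzm_clubb = pverp + 1 - top_lev   ! CLUBB momentum (interface) levels actually used
nzt_clubb = pver  + 1 - top_lev   ! CLUBB thermodynamic (midpoint) levels

! so, schematically, a declaration like
!   real(r8) :: sclrpthvp_inout(ncol, pverp)     ! old: always pverp levels
! becomes
!   real(r8) :: sclrpthvp_inout(ncol, nzm_clubb) ! new: only the levels used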

if ( single_column .and. .not. scm_cambfb_mode ) then
call endrun(subr//': (single_column && !scm_cambfb_mode)=.true. not available when compiling with OpenACC')
end if
#endif
Collaborator

I'm still learning about GPUs, but I'm curious why we have to have an endrun here instead of just running non-GPU code on the CPUs?

Author
@huebleruwm huebleruwm Nov 11, 2024

That was my first instinct too. It would take some acc update host and acc update device directives to move the data we need back and forth inside the if-statements, and there would be no issue with doing so. The reason I didn't do that is that I was testing the GPU results against a CPU ensemble baseline that I had run with those options turned off. So to confirm those changes I would've had to run another baseline with those options turned on, which would've taken at least another day, and I was trying to get this in quickly. Also, I asked Vince about the options and he indicated they aren't run often or were more like test options that didn't turn out well.

So I thought it was safest to leave the code in those sections untouched and just disable the options for GPU runs. I did do a little test to confirm the CPU code ran to completion with them enabled though, except for clubb_do_adv, which didn't run to begin with due to some error in cnst_add called from clubb_register_cam.
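
For reference, the update-directive alternative mentioned above would look something like this (routine and variable names are placeholders, not the actual code paths in question):

if (do_rainturb) then
   !$acc update host( rcm_inout, rtm_in )      ! bring needed fields back to the CPU
   call some_cpu_only_routine( rcm_inout, rtm_in )
   !$acc update device( rcm_inout )            ! push modified fields back to the GPU
end if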


endif

call t_startf('clubb_tend_cam:flip-index')
Collaborator

So, I might be confused, but it looks like this little else section above isn't included in a timer. Is that what we want? It's probably not a big deal, but I think timers should cover everything.

Author

The whole if statement is inside of a larger timing section labelled "ACCR" (for accelerated region). The if part isn't GPUized though, so it ends the "ACCR" timer and starts the "NAR" timer (non-accelerated region), then does the opposite at the end. This was needed because the if bit contains a call to calc_ustar, which I didn't want to GPUize (just so that all changes were contained in clubb_intr), but the else region is GPUized, so its cost is included in the surrounding ACCR timer.
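
In other words, the timer structure around that if/else is roughly (condition and arguments elided; need_ustar_branch is a placeholder):

call t_startf('clubb_tend_cam:ACCR')        ! accelerated-region timer
! ... GPUized loops ...
if ( need_ustar_branch ) then
   call t_stopf ('clubb_tend_cam:ACCR')
   call t_startf('clubb_tend_cam:NAR')      ! non-accelerated region
   call calc_ustar( ... )                   ! CPU-only call, intentionally not GPUized
   call t_stopf ('clubb_tend_cam:NAR')
   call t_startf('clubb_tend_cam:ACCR')
else
   ! GPUized branch; its cost stays inside the surrounding ACCR timer
end if
! ... more GPUized code ...
call t_stopf('clubb_tend_cam:ACCR')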

do i=1,ncol
rvm_in(i,k) = rtm_in(i,k) - rcm_inout(i,k)
end do
end do

call update_xp2_mc_api( gr, nlev+1, ncol, dtime, cloud_frac_inout, &
call t_startf('clubb_tend_cam:update_xp2_mc_api')
Collaborator

I understand why this timer is just around the call to update_xp2_mc_api, but I kind of think it should include the double loop above that runs right before it every time, just to include all related calculations. Or add a whole "rain_turb" timer around this one.

Author

I think the "rain_turb" timer makes the most sense. I don't know why I didn't do that to begin with; I've changed the timers in commit b1a243c.
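
With that change the region reads roughly as below (the outer timer label and loop bounds are illustrative; the actual names are whatever went into b1a243c):

call t_startf('clubb_tend_cam:rain_turb')
do k = 1, nzt_clubb
   do i = 1, ncol
      rvm_in(i,k) = rtm_in(i,k) - rcm_inout(i,k)
   end do
end do
call t_startf('clubb_tend_cam:update_xp2_mc_api')
call update_xp2_mc_api( gr, nlev+1, ncol, dtime, cloud_frac_inout, ... )
call t_stopf ('clubb_tend_cam:update_xp2_mc_api')
call t_stopf ('clubb_tend_cam:rain_turb')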

if (macmic_it==1) vpwp_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) thlp2_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) wpthlp_clubb_gw_mc(:ncol,:) = 0._r8
call t_stopf('clubb_tend_cam:flip-index')

do t=1,nadv ! do needed number of "sub" timesteps for each CAM step
Collaborator

This loop: it's kind of hard to tell, but I think it's the larger loop for the cld_mac_mic run. Shouldn't this have an !$acc tag?

Author

This is the overall sub-stepping loop, so it has to be run sequentially. It also contains the call to advance_clubb_core and some other smaller routines, so it wouldn't be possible to offload it to the GPU.
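
Schematically, the structure is a sequential host loop with offloadable work inside each iteration (a sketch, not the actual code):

do t = 1, nadv   ! each sub-step depends on the previous one, so this loop stays
                 ! sequential on the host and gets no !$acc directive
   ! ... GPUized setup loops for this sub-step ...
   call advance_clubb_core_api( ... )       ! plus other smaller routine calls
   ! ... GPUized loops applying this sub-step's results ...
end do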

@peverwhee peverwhee changed the title clubb_intr GPUization cam6_4_048: clubb_intr GPUization Nov 11, 2024
@peverwhee peverwhee changed the title cam6_4_048: clubb_intr GPUization cam6_4_049: clubb_intr GPUization Nov 12, 2024
@peverwhee peverwhee changed the title cam6_4_049: clubb_intr GPUization cam6_4_050: clubb_intr GPUization Nov 13, 2024