Scalability issue in decompInit_lnd #2995
Labels: bfb (bit-for-bit), enhancement, investigation, next, performance
In the decompInit_lnd routine (in src/main/decompInitMod.F90), there's a performance issue in the section below:
CTSM/src/main/decompInitMod.F90, lines 226 to 236 at 7ff6061
This happens because the main loop, which sits outside the quoted section, runs over every gridcell, and the inner loop runs over the number of clumps, which is at least equal to the number of PEs. For km-scale runs this gets quite large: e.g., on the 3.75 km test case the outer loop is ~42M iterations and the inner loop is >~40K, since that's the minimum PE count we're able to run the case on. The conditional does restrict the inner loop to only ~12.4M of the 42M cells, but it's still a problem for the cost to scale with the number of cores.
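For readers without the file open, the shape of the problem is roughly the following. This is a minimal sketch with hypothetical names and scaled-down sizes, not the actual code at lines 226 to 236:

```fortran
program nested_loop_sketch
  ! Minimal sketch of the cost pattern: variable names, the per-clump test,
  ! and the sizes are illustrative, not the real decompInit_lnd code.
  use iso_fortran_env, only : int64
  implicit none
  integer, parameter :: ncells  = 1200    ! stands in for ~42M gridcells
  integer, parameter :: nclumps = 400     ! stands in for >= ~40K clumps/PEs
  logical :: is_active(ncells)
  integer :: an, cid
  integer(int64) :: ntests

  is_active = .false.
  is_active(1:ncells:3) = .true.          ! roughly the 12.4M-of-42M active ratio
  ntests = 0

  do an = 1, ncells                       ! outer loop: every gridcell
     if (is_active(an)) then              ! conditional limits the inner loop
        do cid = 1, nclumps               ! inner loop: every clump (>= PE count)
           ntests = ntests + 1            ! stand-in for the per-clump tests
        end do
     end if
  end do

  ! total work grows as (active cells) * (clumps), i.e. with the core count
  write(*,'(a,i0)') 'inner-loop iterations: ', ntests
end program nested_loop_sketch
```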
In terms of timing data, the loop above took between 754 and 1044 seconds across PEs, averaging 804 seconds. That's ~86% of the InitializeRealize call.
I've got a few ideas on things to try, including simply restructuring the complex conditionals as a temporary work-around (see the sketch below), as well as saving/reading a decomposition, but I'd welcome insights from folks who know the land model.
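To make the first of those ideas concrete, here's one possible shape of a work-around, sketched under the assumption that the inner loop ultimately amounts to finding the clump that owns each cell; the names and the ownership rule below are hypothetical, not the real conditionals:

```fortran
! Sketch under the assumption that the inner loop just locates the clump a
! cell belongs to; names and the ownership rule are hypothetical.
subroutine build_cell_to_clump(ncells, nclumps, cell_to_clump)
  implicit none
  integer, intent(in)  :: ncells, nclumps
  integer, intent(out) :: cell_to_clump(ncells)
  integer :: an

  ! Build the mapping once, in a single O(ncells) pass, instead of
  ! re-deriving it inside the per-cell loop.
  do an = 1, ncells
     cell_to_clump(an) = mod(an - 1, nclumps) + 1   ! stand-in ownership rule
  end do
end subroutine build_cell_to_clump

! In the cell loop, the clump search then becomes a direct lookup,
!   cid = cell_to_clump(an)
! so the loop stays O(ncells) rather than O(ncells * nclumps).
```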
Perhaps the other issue here, which may be more important in the end, is that it seems like 'clumps' is allocated on every rank, with an entry for every rank?
CTSM/src/main/decompInitMod.F90, line 118 at 7ff6061
Again, I welcome input here - this is a challenge for memory scalability, at least, and unless this is needed elsewhere, we should move to a local-to-the-rank structure.
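As a hypothetical sketch of what a local-to-the-rank structure could look like (the real clump_type and its users are in decompInitMod.F90, and any place that looks up another rank's clump would need a different mechanism, e.g. communication or a global-to-local index):

```fortran
program local_clumps_sketch
  ! Hypothetical sketch: allocate clump metadata only for the clumps this
  ! rank owns, instead of an array of size nclumps on every rank.
  implicit none

  type :: clump_info
     integer :: owner_pe = -1
     integer :: ncells   = 0
     integer :: begg     = 0   ! illustrative fields only
     integer :: endg     = 0
  end type clump_info

  type(clump_info), allocatable :: clumps_local(:)
  integer :: nclumps, npes, iam, nclumps_local

  nclumps = 40000            ! total clumps (>= number of PEs)
  npes    = 40000
  iam     = 0                ! this rank's id; would come from MPI in practice

  ! each rank holds only its share of clumps, so per-rank memory stays
  ! roughly constant as the PE count grows, instead of growing with it
  nclumps_local = nclumps / npes + merge(1, 0, iam < mod(nclumps, npes))
  allocate(clumps_local(nclumps_local))

  write(*,'(a,i0)') 'clumps allocated on this rank: ', nclumps_local
end program local_clumps_sketch
```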
Anyway, just getting the issue in - I think I can create work-arounds for the near-term needs, but would be happy to chat with any land folks on this and see if we can get a SIF or some other way to focus on addressing this soon, too.
Thanks!