Scalability issue in decompInit_lnd #2995

Open
briandobbins opened this issue Mar 9, 2025 · 1 comment
Labels
bfb: bit-for-bit
enhancement: new capability or improved behavior of existing capability
investigation: Needs to be verified and more investigation into what's going on.
next: this should get some attention in the next week or two. Normally each Thursday SE meeting.
performance: idea or PR to improve performance (e.g. throughput, memory)

Comments

@briandobbins
Contributor

In the decompInit_lnd routine (in src/main/decompInitMod.F90), there's a performance issue in the section below:

do m = 1,nclumps
   ! shift the start index of every clump ordered after cid (by owner, then clump index)
   if ((clumps(m)%owner > clumps(cid)%owner) .or. &
       (clumps(m)%owner == clumps(cid)%owner .and. m > cid)) then
      clumps(m)%begg = clumps(m)%begg + 1
   endif
   ! extend the end index of cid itself and of every clump ordered after it
   if ((clumps(m)%owner > clumps(cid)%owner) .or. &
       (clumps(m)%owner == clumps(cid)%owner .and. m >= cid)) then
      clumps(m)%endg = clumps(m)%endg + 1
   endif
enddo

The cost comes from the nesting: this loop sits inside the main loop over every grid cell, and it iterates over all clumps, whose count is at least equal to the number of PEs. For km-scale runs, both end up being quite large. E.g., on the 3.75km test case, the outer loop is ~42M iterations and the inner loop is >~40K, since that's the minimum number of PEs we're able to run the case on. A conditional does restrict the inner loop to only ~12.4M of the 42M cells, but it's still a problem to have the work scale with the number of cores.

In terms of data, the loop above took between 754 and 1044 seconds per PE, averaging 804 seconds across all PEs. That's ~86% of the InitializeRealize call.
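
To make the scaling concrete: with the inner loop running for ~12.4M cells over >~40K clumps, that is on the order of 12.4M × 40K ≈ 5×10^11 conditional evaluations. One possible restructuring, sketched here in Python for brevity rather than Fortran, counts the cells per clump once and then takes a single running sum over the clumps in (owner, index) order. This is only a sketch under stated assumptions: the names `cell_to_clump`, `bounds_naive`, and `bounds_prefix` are hypothetical, clump ids are 0-based, and the starting values `begg = 1` and `endg = 0` are an assumption about how the bounds are initialized before the quoted loop.

```python
def bounds_naive(cell_to_clump, owner, begg0=1, endg0=0):
    """Mirror the quoted loop: every cell visits every clump."""
    n = len(owner)
    begg = [begg0] * n  # assumed initial value, see lead-in
    endg = [endg0] * n  # assumed initial value, see lead-in
    for cid in cell_to_clump:
        for m in range(n):
            # m is ordered strictly after cid by (owner, clump index)
            after = owner[m] > owner[cid] or (owner[m] == owner[cid] and m > cid)
            if after:
                begg[m] += 1
            if after or m == cid:
                endg[m] += 1
    return begg, endg


def bounds_prefix(cell_to_clump, owner, begg0=1, endg0=0):
    """Same result from one pass over cells plus one pass over clumps."""
    n = len(owner)
    ncells = [0] * n
    for cid in cell_to_clump:
        ncells[cid] += 1
    # Visit clumps in the same (owner, clump index) order the conditionals imply.
    order = sorted(range(n), key=lambda m: (owner[m], m))
    begg = [0] * n
    endg = [0] * n
    running = 0  # cells belonging to clumps already visited
    for m in order:
        begg[m] = begg0 + running
        running += ncells[m]
        endg[m] = endg0 + running
    return begg, endg
```

The first function costs (cells × clumps) steps, the second (cells + clumps log clumps). Whether the second form stays bit-for-bit with what decompInit_lnd produces would need checking against the actual routine.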

I've got a few ideas on things to try, including simply changing the complex conditionals as a temporary work-around, as well as saving/reading a decomposition, but would welcome insights from folks who know the land model.
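
As one possible shape for the conditional work-around: the two compound tests only ask whether clump m comes after (or at) clump cid when clumps are ordered by owner and then by index. If that order is computed once, the per-cell update can walk just the clumps after cid with no compound conditional at all. The Python sketch below is illustrative only; `build_order` and `add_cell` are hypothetical names, clump ids are 0-based, and the work is still proportional to the clump count per cell, so this would be a stopgap rather than a fix.

```python
def build_order(owner):
    """Return clumps sorted by (owner, index) and each clump's position in that order."""
    order = sorted(range(len(owner)), key=lambda m: (owner[m], m))
    pos = [0] * len(owner)
    for p, m in enumerate(order):
        pos[m] = p
    return order, pos


def add_cell(cid, order, pos, begg, endg):
    """Apply one cell's update for clump cid without the compound conditionals."""
    endg[cid] += 1  # the ">=" test differs from the ">" test only at m == cid
    for p in range(pos[cid] + 1, len(order)):
        m = order[p]  # every clump strictly after cid in (owner, index) order
        begg[m] += 1
        endg[m] += 1
```

`build_order` runs once, before the loop over cells; `add_cell` then replaces the body of the quoted inner loop.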

Perhaps the other issue here, which may be more important in the end, is that it seems like 'clumps' is allocated on every rank for every rank?

allocate(clumps(nclumps), stat=ier)

Again, I welcome input here - this is a challenge for memory scalability, at least, and unless this is needed elsewhere, we should move to a local-to-the-rank structure.
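
To put a rough number on the replication: if every rank allocates all `nclumps` entries and `nclumps` is about equal to the number of ranks, the aggregate entry count grows with the square of the rank count. At ~40K each, that is about 1.6×10^9 entries in total, versus ~40K if each rank held only its own. A minimal Python sketch of that bookkeeping follows. The round-robin owner assignment and the function names are assumptions for illustration, not taken from decompInitMod.F90, and it leaves open whether other code needs information about remote clumps.

```python
def owned_clump_ids(rank, nclumps, npes):
    """0-based clump ids owned by `rank`, assuming a round-robin assignment."""
    return list(range(rank, nclumps, npes))


def stored_entries(nclumps, npes, replicated):
    """Total clump entries held across all ranks."""
    if replicated:
        # every rank holds the full table
        return nclumps * npes
    # each rank holds only the clumps it owns
    return sum(len(owned_clump_ids(r, nclumps, npes)) for r in range(npes))
```

For example, `stored_entries(40000, 40000, True)` gives 1,600,000,000 entries, while the local-only layout gives 40,000.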

Anyway, just getting the issue in - I think I can create work-arounds for the near-term needs, but would be happy to chat with any land folks on this and see if we can get a SIF or some other way to focus on addressing this soon, too.

Thanks!

@ekluzek added the enhancement, investigation, bfb, performance, and next labels on Mar 10, 2025

@wwieder
Contributor

wwieder commented Mar 11, 2025

Thanks for creating this issue, @briandobbins. Let us know what timeline would be helpful for addressing this. As you know, we're kind of slammed with prepping for CLM6 / CESM3, so addressing this after the release will be more realistic. That said, we don't want poor scalability hindering your high-res work.
