Wrong data with MPI send/recv and pipelining on Intel GPUs #7139
Comments
I can reproduce this on Aurora with commit
Thanks for the reproducer. It appears that with GPU pipelining there are scenarios where chunks can be written into receive buffers out of order. I created PR #7182 to fix it.
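For context, here is a conceptual sketch (plain C++, not MPICH's actual code) of why chunk placement matters in a pipelined receive: each staged chunk has to be copied to the offset it corresponds to in the user buffer, not appended in the order the chunks happen to complete. The struct and function names are made up for illustration.

#include <cstddef>
#include <cstring>

// Conceptual sketch only -- not MPICH's implementation. A pipelined receive
// stages a large message through bounce buffers in chunks; chunks can
// complete out of order, so each one must be placed by its own offset.
struct Chunk {
    size_t offset;     // where this chunk belongs in the user buffer
    size_t len;        // number of bytes in this chunk
    const char *data;  // staged pipeline buffer holding the chunk
};

// Correct: place each chunk at the offset it carries.
void place_chunk(char *recv_buffer, const Chunk &c) {
    std::memcpy(recv_buffer + c.offset, c.data, c.len);
}

// Buggy pattern: append chunks in completion order; the reassembled buffer
// is permuted whenever chunks complete out of order.
void append_chunk(char *recv_buffer, size_t &cursor, const Chunk &c) {
    std::memcpy(recv_buffer + cursor, c.data, c.len);
    cursor += c.len;
}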
I also confirmed this fixes the reproducer; however, I now get hangs for some specific cases when running a full application with pipelining and without setting a larger buffer size. The cases seem to involve messages of different sizes, where some messages are much larger than the rest. I don't know the exact requirements yet and don't have a simple reproducer, but I will keep trying to see if I can get one.
I'm now getting hangs when running this test case with the newly compiled MPICH currently available in the alcf_kmd_val Aurora queue.
I was able to reproduce the hang on 2 nodes with 1 process per node. The backtrace of the two ranks is:
rank 1:
I guess they are both in the Waitall, waiting for something.
I can reproduce the hang with the main branch; however, our latest drop version works. Could you try the drop version that was installed on Aurora?
Yes, it works fine with the current software stack on Aurora, thanks. I'm still seeing another hang with pipelining that depends on the relative message sizes. I don't have a simple reproducer, but I can reproduce it with an application test case.
and the run script
As is, this will hang. Uncommenting the buffer-size line, or swapping the 'geom' with the commented-out one, will make it run to completion. The "row" rank order makes the messages passed between nodes the same size as the larger of the messages passed within a node (this is the case that hangs). For the default rank order, the messages between nodes are the same size as the smaller of the messages within a node, and it doesn't hang.
@jcosborn please try the Intel-provided drop: It seems to run with that version. I don't know what has changed in the default build.
Yes, it works with mpich/opt/4.2.3-intel.
Just ran the reproducer on Aurora; it seems to be working:
I'll try #7139 (comment) next...
We're getting incorrect results in application code when using MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 if the buffer size MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ isn't set large enough. Setting it larger seems to work, but MPI should still give correct results (possibly with a performance hit), or report an error, when it is not set large enough. The full code is fairly complicated, but I have a simple reproducer which can somewhat reproduce the issue. The reproducer fails easily if the buffer size is set lower than the default, but it doesn't seem to fail for the default size on up to 8 nodes. With a buffer size of 512k it fails easily on 4 nodes, and with 256k it fails regularly on 2 nodes.
Reproducer
sendrecvgpu.cc
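The source of sendrecvgpu.cc isn't reproduced here. As a rough sketch only (hypothetical names, message size, and rank pairing; not the actual reproducer), a SYCL + MPI program exercising the same pattern, device buffers exchanged with nonblocking send/recv and then verified on the host, could look like this:

// Hypothetical sketch, not the actual sendrecvgpu.cc from this issue.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size % 2) {                        // pairing below assumes an even rank count
        if (rank == 0) printf("run with an even number of ranks\n");
        MPI_Finalize();
        return 0;
    }

    sycl::queue q{sycl::gpu_selector_v};
    const int n = 1 << 22;                 // assumed size, large enough to be chunked by the pipeline
    double *sendbuf = sycl::malloc_device<double>(n, q);
    double *recvbuf = sycl::malloc_device<double>(n, q);
    q.fill(sendbuf, double(rank), n).wait();   // rank-dependent payload for verification

    int peer = rank ^ 1;                   // pair up ranks: 0<->1, 2<->3, ...
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Copy back to the host and check that every element matches the peer's rank.
    std::vector<double> host(n);
    q.memcpy(host.data(), recvbuf, n * sizeof(double)).wait();
    long errors = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != double(peer)) ++errors;
    printf("rank %d: %ld errors\n", rank, errors);

    sycl::free(sendbuf, q);
    sycl::free(recvbuf, q);
    MPI_Finalize();
    return 0;
}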
mpicxx -fsycl -qopenmp sendrecvgpu.cc -o sendrecvgpu
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024))
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
mpiexec -np 24 --ppn 12 ./sendrecvgpu