Incorrect results in receive buffer in GPU memory on Aurora #7302

Open

paboyle opened this issue Feb 13, 2025 · 15 comments

paboyle commented Feb 13, 2025

I have forensically tracked a correctness issue in a test on Aurora, eventually identifying
a GPU-memory-resident packet received from MPI_Sendrecv as containing incorrect data.
I talked to James Osborn and Xiaoyong Jin, who suggested I file it here.

The failure is deterministic for a specific problem size, but depends critically (and bizarrely) on the precise
history of GPU memory allocations and/or kernel calls used to initialize the arrays.
Presumably something is being remembered and cached somewhere. Deleting certain sections of kernel
calls fills the arrays differently, and correct results are then obtained.

At the point of failure, I:

  1. Performed MPI_Sendrecv with GPU-resident send and receive buffers.
  2. Pulled both buffers back to the host and computed their CRC32s.
  3. Performed a host-to-host MPI_Sendrecv of the "same" data.
  4. Computed the CRC32 of the host-to-host receive buffer.

Up to this point, I also ran the same sequence on Frontier with the same problem size and rank geometry.

  • CRCs match between Frontier and Aurora, until a single MPI_Sendrecv mismatches.
  • The Aurora host-to-host Sendrecv matches Frontier.
  • The Aurora device-to-device Sendrecv mismatches.
  • Incorrect results are obtained.

The key sequence demonstrating the failure is on Aurora in:

/home/paboyle/Pipeline/Grid.save/Grid/cshift/Cshift_mpi.h

	std::cerr << GridLogMessage << "SendToRecvFrom gpu send buf "<<send_buf_extract_mpi<<" gpu recv buf "<<recv_buf_extract_mpi<<std::endl;
	std::cerr << GridLogMessage << "SendToRecvFrom buffer size bytes "<<bytes<<std::endl;
	std::cerr << GridLogMessage << "SendToRecvFrom xmit_to_rank "<<xmit_to_rank<<" recv_from_rank "<<recv_from_rank<<std::endl;
	grid->Barrier();
	
	grid->SendToRecvFrom((void *)send_buf_extract_mpi,
			     xmit_to_rank,
			     (void *)recv_buf_extract_mpi,
			     recv_from_rank,
			     bytes);

	std::cerr << GridLogMessage << "SendToRecvFrom GPU-GPU done "<<std::endl;

	unsigned long  xcrc = crc32(0L, Z_NULL, 0);
	unsigned long  rcrc = crc32(0L, Z_NULL, 0);
	unsigned long  hrcrc = crc32(0L, Z_NULL, 0);
	unsigned char *hxbuf = (unsigned char *) malloc(bytes);
	unsigned char *hrbuf = (unsigned char *) malloc(bytes);
	acceleratorCopyFromDevice(send_buf_extract_mpi,hxbuf,bytes);
	xcrc = crc32(xcrc,(unsigned char *)hxbuf,bytes);

	acceleratorCopyFromDevice(recv_buf_extract_mpi,hrbuf,bytes);
	rcrc = crc32(rcrc,(unsigned char *)hrbuf,bytes);

	/*
	 * Send again host to host
	 */
	grid->SendToRecvFrom((void *)hxbuf,
			     xmit_to_rank,
			     (void *)hrbuf,
			     recv_from_rank,
			     bytes);
	std::cerr << GridLogMessage << "SendToRecvFrom HOST-HOST done "<<std::endl;

	hrcrc = crc32(hrcrc,(unsigned char *)hrbuf,bytes);
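
The excerpt stops after computing hrcrc; the check that follows is not shown, but from the log lines below it evidently compares the CRC of the GPU-received data against the CRC of the host-received data and flags a mismatch. A minimal sketch of the shape of that comparison, using the variables from the excerpt above (not the verbatim Grid code; the log format is copied from the output shown below):

	// Treat the host-host receive as the reference and flag a mismatch
	// with the GPU-GPU receive as FAILED (sketch only).
	if ( rcrc != hrcrc ) {
	  std::cerr << GridLogMessage << " FAILED MPI crcs xmit " << xcrc
	            << " recv " << rcrc << " hrecv " << hrcrc << std::endl;
	} else {
	  std::cerr << GridLogMessage << " MPI crcs xmit " << xcrc
	            << " recv " << rcrc << " hrecv " << hrcrc << std::endl;
	}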
	

Stderr is stored in files Grid.stderr.$mpirank.
From rank zero, the preceding sequence reports the device buffer addresses, the byte count and the CRCs.

In the first exchange below, the CRC of the data received into GPU memory (recv 3398638398) matches that of the data received into host memory (hrecv 3398638398).

However, the last Sendrecv in the logged sequence miscompares:
FAILED MPI crcs xmit 3004459347 recv 1101278579 hrecv 4248251369, and the buffer contents differ
in every entry.

Grid.stderr.0

Grid : Message : 8.574109 s : SendToRecvFrom gpu send buf 0xff00fffffbec0000 gpu recv buf 0xff00fffffc900000
Grid : Message : 8.574119 s : SendToRecvFrom buffer size bytes 147456
Grid : Message : 8.574127 s : SendToRecvFrom xmit_to_rank 1 recv_from_rank 1
Grid : Message : 8.577276 s : SendToRecvFrom GPU-GPU done 
Grid : Message : 8.577955 s : SendToRecvFrom HOST-HOST done 
Grid : Message : 8.578071 s :  MPI crcs xmit 124559292 recv 3398638398 hrecv 3398638398
Grid : Message : 8.578323 s : SendToRecvFrom gpu send buf 0xff00fffffe940000 gpu recv buf 0xff00fffffbd00000
Grid : Message : 8.578333 s : SendToRecvFrom buffer size bytes 147456
Grid : Message : 8.578341 s : SendToRecvFrom xmit_to_rank 1 recv_from_rank 1
Grid : Message : 8.578536 s : SendToRecvFrom GPU-GPU done 
Grid : Message : 8.578888 s : SendToRecvFrom HOST-HOST done 
Grid : Message : 8.579003 s :  MPI crcs xmit 3631437604 recv 2137761691 hrecv 2137761691
Grid : Message : 8.579120 s : SendToRecvFrom gpu send buf 0xff00fffffe7c0000 gpu recv buf 0xff00ffffffe00000
Grid : Message : 8.579129 s : SendToRecvFrom buffer size bytes 147456
Grid : Message : 8.579137 s : SendToRecvFrom xmit_to_rank 1 recv_from_rank 1
Grid : Message : 8.579324 s : SendToRecvFrom GPU-GPU done 
Grid : Message : 8.579979 s : SendToRecvFrom HOST-HOST done 
Grid : Message : 8.580150 s :  MPI crcs xmit 2420115748 recv 125244814 hrecv 125244814
Grid : Message : 8.580176 s : SendToRecvFrom gpu send buf 0xff00fffffe600000 gpu recv buf 0xff00fffffbf80000
Grid : Message : 8.580190 s : SendToRecvFrom buffer size bytes 147456
Grid : Message : 8.580202 s : SendToRecvFrom xmit_to_rank 1 recv_from_rank 1
Grid : Message : 8.580359 s : SendToRecvFrom GPU-GPU done 
Grid : Message : 8.580889 s : SendToRecvFrom HOST-HOST done 
Grid : Message : 8.581058 s :  FAILED MPI crcs xmit 3004459347 recv 1101278579 hrecv 4248251369

If I look at ALL 8 ranks, they all failed at the same packet.

Grid.stderr.0:Grid : Message : 8.581058 s :  FAILED MPI crcs xmit 3004459347 recv 1101278579 hrecv 4248251369
Grid.stderr.1:Grid : Message : 8.581038 s :  FAILED MPI crcs xmit 4248251369 recv 3228181958 hrecv 3004459347
Grid.stderr.2:Grid : Message : 8.580981 s :  FAILED MPI crcs xmit 800018516 recv 4155407109 hrecv 3700462029
Grid.stderr.3:Grid : Message : 8.580965 s :  FAILED MPI crcs xmit 3700462029 recv 1588848890 hrecv 800018516
Grid.stderr.4:Grid : Message : 8.580993 s :  FAILED MPI crcs xmit 661114845 recv 2819124774 hrecv 4263298989
Grid.stderr.5:Grid : Message : 8.580987 s :  FAILED MPI crcs xmit 4263298989 recv 170971885 hrecv 661114845
Grid.stderr.6:Grid : Message : 8.580958 s :  FAILED MPI crcs xmit 3227910482 recv 801109152 hrecv 2143185873
Grid.stderr.7:Grid : Message : 8.580957 s :  FAILED MPI crcs xmit 2143185873 recv 817668029 hrecv 3227910482
  • The xmit buffer CRC32s match those I produce on Frontier.
  • The host receive buffer CRCs match the expected xmit CRCs, but the GPU-side receive CRCs do NOT match the xmit CRCs.
  • E.g. rank 0 exchanges with rank 1 and transmits a buffer with CRC 3004459347; rank 1 receives this into CPU memory with the same CRC, but into GPU memory with the wrong CRC 1101278579.
  • Finally, if the correctly received host-host data is copied back to the device, replacing the incorrectly
    received data on the GPU (sketched below), the test runs to completion and gets the correct answers.
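
A minimal sketch of that copy-back workaround, placed immediately after the host-host Sendrecv in the excerpt above. It assumes acceleratorCopyToDevice(host_src, device_dst, bytes), mirroring the acceleratorCopyFromDevice call used earlier; the exact call in Grid may differ:

	// Overwrite the (incorrect) GPU receive buffer with the data that
	// arrived correctly via the host-host path (sketch only).
	acceleratorCopyToDevice(hrbuf, recv_buf_extract_mpi, bytes);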

This is running with the standard modules and without pipeline mode enabled; it's the standard vanilla MPICH installed on Aurora.

The software is in:

/home/paboyle/Pipeline/Grid.save/systems/Aurora/benchmarks

Binary:

Benchmark_dwf_fp32

Submission script:

bench1.pbs

To reconfigure and rebuild, assuming "/home/paboyle/Pipeline/Grid.save" is copied to some new location "GRID":

cd GRID
./bootstrap.sh
cd systems/Aurora
source sourceme.sh
source config-command
make -j 20 -C Grid
make -j 20 -C benchmarks

paboyle commented Feb 13, 2025

My immediate workaround is to add a configure option that performs all MPI transfers with a bounce through host memory first.
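
For illustration, a minimal sketch of what such a host-bounce path could look like. This is a hypothetical helper, not the actual Grid implementation; it assumes the acceleratorCopyFromDevice/acceleratorCopyToDevice calls shown in the excerpt above and a plain MPI_Sendrecv on the staging buffers:

	#include <mpi.h>
	#include <vector>
	#include <cstddef>

	// Hypothetical host-staged exchange: copy the device send buffer to the
	// host, do a host-to-host MPI_Sendrecv, then copy the result back to the
	// device receive buffer.
	void SendToRecvFromViaHost(void *device_send, int xmit_to_rank,
	                           void *device_recv, int recv_from_rank,
	                           std::size_t bytes, MPI_Comm comm)
	{
	  std::vector<unsigned char> hsend(bytes), hrecv(bytes);
	  acceleratorCopyFromDevice(device_send, hsend.data(), bytes);   // D2H
	  MPI_Sendrecv(hsend.data(), (int)bytes, MPI_BYTE, xmit_to_rank, 0,
	               hrecv.data(), (int)bytes, MPI_BYTE, recv_from_rank, 0,
	               comm, MPI_STATUS_IGNORE);
	  acceleratorCopyToDevice(hrecv.data(), device_recv, bytes);     // H2D
	}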

raffenet (Contributor) commented:

We need to nail down the MPICH configuration for dealing with GPU buffers. Can you also provide the output of an MPI program execution with MPIR_CVAR_DEBUG_SUMMARY=1 set in the environment? It doesn't have to be your full example, just any MPI program.


paboyle commented Feb 13, 2025

Sure! Working on it.


paboyle commented Feb 13, 2025

Required minimum FI_VERSION: 10006, current version: 10014
==== Capability set configuration ====
libfabric provider: tcp;ofi_rxm - 10.112.0.0/15
MPIDI_OFI_ENABLE_DATA: 0
MPIDI_OFI_ENABLE_AV_TABLE: 0
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 0
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 16
MPIDI_OFI_SOURCE_BITS: 23
MPIDI_OFI_TAG_BITS: 19
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 8388608
MAXIMUM TAG: 524288
==== Provider global thresholds ====
max_buffered_send: 64
max_buffered_write: 64
max_msg_size: 9223372036854775807
max_order_raw: -1
max_order_war: 0
max_order_waw: -1
tx_iov_limit: 4
rx_iov_limit: 4
rma_iov_limit: 4
max_mr_key_size: 8
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
==== GPU Init (ZE) ====
device_count: 1
subdevice_count: 0
=========================
[the preceding GPU Init (ZE) block repeats six more times]
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================
error checking    : disabled
QMPI              : disabled
debugger support  : disabled
thread level      : MPI_THREAD_SERIALIZED
thread CS         : per-vci
threadcomm        : enabled
==== data structure summary ====
sizeof(MPIR_Comm): 1912
sizeof(MPIR_Request): 496
sizeof(MPIR_Datatype): 280
================================

Environment in the wrapper script prior to mpiexec:

env | grep MPI
LMOD_FAMILY_MPI_VERSION=4.3.0rc3
MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=1
__LMOD_STACK_MPI_ROOT=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGE=
LMOD_FAMILY_COMPILER_VERSION=12.2.0
__LMOD_STACK_LMOD_MPI_VERSION=NC4zLjByYzMtZnptcmZ0YQ==
MPICH_OFI_NIC_POLICY=GPU
MPIF90=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta/bin/mpif90
__LMOD_STACK_MPICXX=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGEvYmluL21waWMrKw==
MPICXX=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta/bin/mpic++
__LMOD_STACK_MPICH_ROOT=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGE=
__LMOD_STACK_MPICC=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGEvYmluL21waWNj
MPIR_CVAR_INIT_SKIP_PMI_BARRIER=0
LMOD_MPI_VERSION=4.3.0rc3-fzmrfta
MPIR_CVAR_DEBUG_SUMMARY=1
__LMOD_STACK_MPIF90=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGEvYmluL21waWY5MA==
MPICC=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta/bin/mpicc
__LMOD_STACK_MPIF77=L29wdC9hdXJvcmEvMjQuMTgwLjMvc3BhY2svdW5pZmllZC8wLjguMC9pbnN0YWxsL2xpbnV4LXNsZXMxNS14ODZfNjQvb25lYXBpLTIwMjQuMDcuMzAuMDAyL21waWNoLTQuMy4wcmMzLWZ6bXJmdGEvYmluL21waWY3Nw==
SYCL_PROGRAM_COMPILE_OPTIONS=-ze-opt-large-register-file
LMOD_FAMILY_COMPILER=gcc
__LMOD_STACK_LMOD_MPI_NAME=bXBpY2g=
MPIF77=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta/bin/mpif77
MPICH_ROOT=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta
LMOD_MPI_NAME=mpich
__LMOD_STACK_MPIR_CVAR_INIT_SKIP_PMI_BARRIER=MA==
__LMOD_STACK_MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=MQ==
MPI_ROOT=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc3-fzmrfta
LMOD_FAMILY_MPI=mpich/opt

Earlier sections, likely irrelevant here:

==== GPU Init (ZE) ====
device_count: 1
subdevice_count: 0
=========================
==== CH4 runtime configurations ====
MPIDI_CH4_MT_MODEL: 0 (direct)
================================
==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10014
provider: cxi, score = 5, pref = -100, FI_FORMAT_UNSPEC [8]

[last line repeats]

provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.94
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.51.181
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.124
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.98
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.135
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.131
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.121
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.112.36.92
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::ff:fe02:7800
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:9dc5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:8119
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:811b
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:8e79
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:8e7b
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:7891
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::240:a6ff:fe91:7893
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.115.35.57
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::9a4f:eeff:fe16:690e
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fded:fcc2:56d2:2307:9a4f:eeff:fe16:690e
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
[the preceding block of tcp;ofi_rxm provider lines repeats three more times]


hzhou commented Feb 13, 2025

@paboyle Can you share the source code for Benchmark_dwf_fp32 (not necessarily publicly; any means works)?


paboyle commented Feb 13, 2025

The source is on Aurora, under:

/home/paboyle/Pipeline/Grid.save/

Rebuild instructions are at the bottom of the initial post.
It's a locally modified version of:

https://github.com/paboyle/Grid/

The modifications add a lot of debug/logging information (I rather had to tear the code apart tracking this down), such as the CRC computations that got me to the smoking gun. I'm reluctant to commit them, but I could make a feature branch and put them on GitHub if really necessary.


hzhou commented Feb 13, 2025

@paboyle A few things to try to help us narrow down the suspects:

  • Try setting MPIR_CVAR_NOLOCAL=1
  • Try setting MPIR_CVAR_CH4_IPC_GPU_P2P_THRESHOLD=1000000000 (or whatever limit exceeds your largest message size).


paboyle commented Feb 13, 2025

I've just done a recursive chmod on the subdirectories to be sure; you should be able to read them if you are on Aurora.


paboyle commented Feb 13, 2025

Will try these.


paboyle commented Feb 13, 2025

I was too hasty. There was a failure in one case. Double checking.


hzhou commented Feb 13, 2025

@paboyle Another thing to try -

  • Set MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=disabled


paboyle commented Feb 14, 2025

  • Setting only MPIR_CVAR_CH4_IPC_GPU_P2P_THRESHOLD=1000000000 : correct results
  • Setting only MPIR_CVAR_NOLOCAL=1 : correct results
  • Setting only MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=disabled : fails in the same way as the original


hzhou commented Feb 14, 2025

@paboyle Thanks for the tests. They confirm that the path in question is the GPU IPC path, and it doesn't seem to be related to a stale IPC cache. But to confirm, could you also try -

  • MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic ?

The CRC for the send buffer does not match the CRC for the receive buffer. Does that mean your datatype has gaps? Are there gaps in the receive buffer?


paboyle commented Feb 14, 2025

The data are contiguous buffers.

I run GPU kernels to gather/scatter the data into a contiguous buffer, as this is typically the fastest and simplest
case to present to MPI.
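
Schematically, the gather step packs strided lattice sites into a contiguous buffer before handing it to MPI. On the GPU this loop runs as a kernel (via Grid's accelerator_for macro); the function and variable names below are illustrative only, not Grid's actual Cshift code:

	#include <cstddef>
	#include <complex>

	using Complex = std::complex<double>;

	// Pack the face sites listed in 'table' into a contiguous send buffer.
	void gather_to_contiguous(const Complex *field,     // full lattice field
	                          const std::size_t *table, // offsets of the face sites
	                          Complex *send_buf,        // contiguous MPI send buffer
	                          std::size_t num)
	{
	  for (std::size_t i = 0; i < num; ++i) {
	    send_buf[i] = field[table[i]];                  // gather one site
	  }
	}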

I'll get onto the MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic case shortly.


paboyle commented Feb 14, 2025

P.S. I'm set up to take a unitrace log quite easily.

We could probably see the intra-node XeLink use happening under MPI when it fails, if that would be helpful.
I think Aurora login is down just now, so it will be later today.

Here's an earlier plot (not from the failing case). I can see MPI_Sendrecv sometimes uses D2D copies and sometimes
does H2D and D2H; I'm not sure what causes the difference. Would a plot like this of the failing sequence be useful?

[Image: unitrace timeline showing MPI_Sendrecv copy activity (D2D vs H2D/D2H)]
