--> loading model from /mnt/bn/nlhei-nas/yangming/pretrained_models/hunyuan-t2v-ckpts
Could not load Sliding Tile Attention.
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
capturable: False
differentiable: False
eps: 1e-08
foreach: None
fused: None
lr: 1e-05
maximize: False
weight_decay: 0.01
)
***** Running training *****
Num examples = 88
Dataloader size = 11
Num Epochs = 46
Resume training from step 0
Instantaneous batch size per device = 1
Total train batch size (w. data & sequence parallel, accumulation) = 2.0
Gradient Accumulation steps = 1
Total optimization steps = 2000
Total training parameters per FSDP shard = 1.602626568 B
Master weight dtype: torch.float32
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: Using internal network plugin.
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: Using internal network plugin.
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Using network IB
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Using network IB
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Using network IB
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Using network IB
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Using network IB
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ncclCommInitRank comm 0xd476ce0 rank 1 nranks 4 cudaDev 5 nvmlDev 5 busId 96000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ncclCommInitRank comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 nvmlDev 7 busId cc000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ncclCommInitRank comm 0xcac7e20 rank 2 nranks 4 cudaDev 6 nvmlDev 6 busId c8000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO ncclCommInitRank comm 0xd3deee0 rank 0 nranks 4 cudaDev 4 nvmlDev 4 busId 92000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO ncclCommInitRank comm 0xcc329b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 46000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO ncclCommInitRank comm 0xc251630 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO ncclCommInitRank comm 0xe43ab80 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO ncclCommInitRank comm 0xe1b29a0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ==========================================
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NVLS multicast support is not available on dev 7
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 11 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 8 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 9 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 10 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 11 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ==========================================
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ==========================================
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NVLS multicast support is not available on dev 5
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NVLS multicast support is not available on dev 6
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 4 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 12 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Ring 23 : 0 -> 1 -> 2
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 13 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 14 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO P2P Chunksize set to 524288
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 15 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 16 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 23 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO P2P Chunksize set to 524288
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Channel 05/0 : 2[6] -> 3[7] via P2P/CUMEM/read
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Channel 19/0 : 2[6] -> 3[7] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Channel 17/0 : 3[7] -> 0[4] via P2P/CUMEM/read
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Channel 00/0 : 0[4] -> 1[5] via P2P/CUMEM/read
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Connected all trees
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Connected all trees
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Connected all trees
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Connected all trees
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO ncclCommInitRank comm 0xcc329b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 46000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO ncclCommInitRank comm 0xe43ab80 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO ncclCommInitRank comm 0xe1b29a0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO ncclCommInitRank comm 0xc251630 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Connected all trees
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Connected all trees
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Connected all trees
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Connected all trees
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2087:6390 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2090:6388 [3] NCCL INFO Channel 00/1 : 3[3] -> 0[0] via P2P/CUMEM/read
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ncclCommInitRank comm 0xd476ce0 rank 1 nranks 4 cudaDev 5 nvmlDev 5 busId 96000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ncclCommInitRank comm 0xcac7e20 rank 2 nranks 4 cudaDev 6 nvmlDev 6 busId c8000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO ncclCommInitRank comm 0xd3deee0 rank 0 nranks 4 cudaDev 4 nvmlDev 4 busId 92000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2087:6390 [0] NCCL INFO Channel 28/1 : 0[0] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2088:6389 [1] NCCL INFO Channel 20/1 : 1[1] -> 2[2] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ncclCommInitRank comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 nvmlDev 7 busId cc000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2089:6387 [2] NCCL INFO Channel 31/1 : 2[2] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2087:6398 [0] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2087:6398 [0] NCCL INFO Using network IB
dc61-p18-t23-n020:2088:6399 [1] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2089:6401 [2] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2090:6400 [3] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2088:6399 [1] NCCL INFO Using network IB
dc61-p18-t23-n020:2089:6401 [2] NCCL INFO Using network IB
dc61-p18-t23-n020:2090:6400 [3] NCCL INFO Using network IB
dc61-p18-t23-n020:2094:6368 [7] NCCL INFO [Service thread] Connection closed by localRank 3
dc61-p18-t23-n020:2094:2441 [0] NCCL INFO comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 busId cc000 - Abort COMPLETE
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
File not found: /home/tiger/.icube/recently_opened
[07:21:04] [fdbd:dc61:10:529:2240:1ed5:bb00:ab][9723a7bf][$Nm] New connection established.
[07:21:06] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][5590d78a][$Nm] New connection established.
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] New connection established.
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] ExtensionHost will start connect with mode: ProxyServer
[07:21:07] env.json not found
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] <7869>[ProxyServer] Launched Extension Host Process, execArgv: --dns-result-order=ipv6first
[07:21:14] env.json not found
[07:25:26] [fdbd:dc61:10:529:2240:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] The client has reconnected.
[rank7]:[E310 07:26:53.822247581 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank7]:[E310 07:26:53.822563082 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 2 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank7]: Traceback (most recent call last):
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 663, in <module>
[rank7]: main(args)
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 368, in main
[rank7]: loss, grad_norm = train_one_step(
[rank7]: ^^^^^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 128, in train_one_step
[rank7]: sigmas = get_sigmas(
[rank7]: ^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 78, in get_sigmas
[rank7]: step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 78, in <listcomp>
[rank7]: step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: RuntimeError: a Tensor with 0 elements cannot be converted to Scalar
[rank7]:[E310 07:26:54.757697857 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 2 Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank7]:[E310 07:26:54.757741468 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E310 07:26:54.757748967 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E310 07:26:54.759129434 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe6428b9446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe5f8819762 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe5f8820ba3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe5f882260d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe642cf35c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89144 (0x7fe643a8f144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fe643b0f7dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
Environment
8XA100-SXM4-80GB
Describe the bug
--> loading model from /mnt/bn/nlhei-nas/yangming/pretrained_models/hunyuan-t2v-ckpts
Could not load Sliding Tile Attention.
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
capturable: False
differentiable: False
eps: 1e-08
foreach: None
fused: None
lr: 1e-05
maximize: False
weight_decay: 0.01
)
***** Running training *****
Num examples = 88
Dataloader size = 11
Num Epochs = 46
Resume training from step 0
Instantaneous batch size per device = 1
Total train batch size (w. data & sequence parallel, accumulation) = 2.0
Gradient Accumulation steps = 1
Total optimization steps = 2000
Total training parameters per FSDP shard = 1.602626568 B
Master weight dtype: torch.float32
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2087:2087 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2088:2088 [1] NCCL INFO NET/Plugin: Using internal network plugin.
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2090:2090 [3] NCCL INFO NET/Plugin: Using internal network plugin.
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Using network IB
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2091:2091 [4] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2093:2093 [6] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2094:2094 [7] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Using network IB
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Using network IB
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Using network IB
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2092:2092 [5] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO cudaDriverVersion 12040
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO Bootstrap : Using eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: No plugin found (none)
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net-none.so)
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory : when loading none
dc61-p18-t23-n020:2089:2089 [2] NCCL INFO NET/Plugin: Using internal network plugin.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET6
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to =eth0
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NCCL_IB_HCA set to mlx5
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Using network IB
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:fdbd:dc61:18:23::20<0>
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Using network IB
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ncclCommInitRank comm 0xd476ce0 rank 1 nranks 4 cudaDev 5 nvmlDev 5 busId 96000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ncclCommInitRank comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 nvmlDev 7 busId cc000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ncclCommInitRank comm 0xcac7e20 rank 2 nranks 4 cudaDev 6 nvmlDev 6 busId c8000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO ncclCommInitRank comm 0xd3deee0 rank 0 nranks 4 cudaDev 4 nvmlDev 4 busId 92000 commId 0x10787e24a8775d7e - Init START
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO ncclCommInitRank comm 0xcc329b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 46000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO ncclCommInitRank comm 0xc251630 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO ncclCommInitRank comm 0xe43ab80 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO ncclCommInitRank comm 0xe1b29a0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x4becccc0e6cf7e11 - Init START
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ==========================================
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO NVLS multicast support is not available on dev 7
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 11 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Pattern 1, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 8 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 9 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 10 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 11 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO === System : maxBw 240.0 totalBw 240.0 ===
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO CPU/0-1 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-82000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-86000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-90000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-92000 (0)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-94000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-96000 (1)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-af000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-b3000
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-c6000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-c8000 (2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-ca000 (1000c01010de13b8)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - GPU/0-cc000 (3)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + NVL[240.0] - NVS/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + SYS[10.0] - CPU/0
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO CPU/0-0 (1/1/2)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-a000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ==========================================
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-e000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - PCI/0-3c000 (1000c0101000ffff)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + PCI[24.0] - NIC/0-40000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO + SYS[10.0] - CPU/1
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ==========================================
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/92000 :GPU/0-92000 (0/5000.0/LOC) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/96000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (0/5000.0/LOC) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/C8000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (0/5000.0/LOC) GPU/0-cc000 (2/240.0/NVL) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO GPU/CC000 :GPU/0-92000 (2/240.0/NVL) GPU/0-96000 (2/240.0/NVL) GPU/0-c8000 (2/240.0/NVL) GPU/0-cc000 (0/5000.0/LOC) NVS/0-0 (1/240.0/NVL) CPU/0-1 (3/24.0/PHB) CPU/0-0 (4/10.0/SYS)
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO NVLS multicast support is not available on dev 5
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO NVLS multicast support is not available on dev 6
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 4 : GPU/0 GPU/1 GPU/2 GPU/3
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 12 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Ring 23 : 0 -> 1 -> 2
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 13 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 14 : 3 -> 0 -> 1
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO P2P Chunksize set to 524288
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 15 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 16 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Ring 23 : 3 -> 0 -> 1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO P2P Chunksize set to 524288
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Channel 05/0 : 2[6] -> 3[7] via P2P/CUMEM/read
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Channel 19/0 : 2[6] -> 3[7] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Channel 17/0 : 3[7] -> 0[4] via P2P/CUMEM/read
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Channel 00/0 : 0[4] -> 1[5] via P2P/CUMEM/read
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO Connected all trees
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO Connected all trees
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO Connected all trees
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO Connected all trees
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2089:6357 [2] NCCL INFO ncclCommInitRank comm 0xcc329b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 46000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2088:4363 [1] NCCL INFO ncclCommInitRank comm 0xe43ab80 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2090:5351 [3] NCCL INFO ncclCommInitRank comm 0xe1b29a0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2087:4030 [0] NCCL INFO ncclCommInitRank comm 0xc251630 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 14000 commId 0x4becccc0e6cf7e11 - Init COMPLETE
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO Connected all trees
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO Connected all trees
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO Connected all trees
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO Connected all trees
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc61-p18-t23-n020:2087:6390 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2090:6388 [3] NCCL INFO Channel 00/1 : 3[3] -> 0[0] via P2P/CUMEM/read
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2092:6356 [5] NCCL INFO ncclCommInitRank comm 0xd476ce0 rank 1 nranks 4 cudaDev 5 nvmlDev 5 busId 96000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2093:5686 [6] NCCL INFO ncclCommInitRank comm 0xcac7e20 rank 2 nranks 4 cudaDev 6 nvmlDev 6 busId c8000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2091:5685 [4] NCCL INFO ncclCommInitRank comm 0xd3deee0 rank 0 nranks 4 cudaDev 4 nvmlDev 4 busId 92000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2087:6390 [0] NCCL INFO Channel 28/1 : 0[0] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO TUNER/Plugin: Most recent plugin load returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory. All attempts to load 'libnccl-tuner.so none' also failed.
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
dc61-p18-t23-n020:2088:6389 [1] NCCL INFO Channel 20/1 : 1[1] -> 2[2] via P2P/CUMEM/read
dc61-p18-t23-n020:2094:5687 [7] NCCL INFO ncclCommInitRank comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 nvmlDev 7 busId cc000 commId 0x10787e24a8775d7e - Init COMPLETE
dc61-p18-t23-n020:2089:6387 [2] NCCL INFO Channel 31/1 : 2[2] -> 1[1] via P2P/CUMEM/read
dc61-p18-t23-n020:2087:6398 [0] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2087:6398 [0] NCCL INFO Using network IB
dc61-p18-t23-n020:2088:6399 [1] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2089:6401 [2] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2090:6400 [3] NCCL INFO Using non-device net plugin version 0
dc61-p18-t23-n020:2088:6399 [1] NCCL INFO Using network IB
dc61-p18-t23-n020:2089:6401 [2] NCCL INFO Using network IB
dc61-p18-t23-n020:2090:6400 [3] NCCL INFO Using network IB
dc61-p18-t23-n020:2094:6368 [7] NCCL INFO [Service thread] Connection closed by localRank 3
dc61-p18-t23-n020:2094:2441 [0] NCCL INFO comm 0xe6397f0 rank 3 nranks 4 cudaDev 7 busId cc000 - Abort COMPLETE
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
File not found: /home/tiger/.icube/recently_opened
[07:21:04] [fdbd:dc61:10:529:2240:1ed5:bb00:ab][9723a7bf][$Nm] New connection established.
[07:21:06] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][5590d78a][$Nm] New connection established.
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] New connection established.
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] ExtensionHost will start connect with mode: ProxyServer
[07:21:07] env.json not found
[07:21:07] [fdbd:dc61:c:225:f40:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] <7869>[ProxyServer] Launched Extension Host Process, execArgv: --dns-result-order=ipv6first
[07:21:14] env.json not found
[07:25:26] [fdbd:dc61:10:529:2240:1ed5:bb00:ab][3fb46809][ExtensionHostConnection] The client has reconnected.
[rank7]:[E310 07:26:53.822247581 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank7]:[E310 07:26:53.822563082 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 2 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank7]: Traceback (most recent call last):
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 663, in <module>
[rank7]: main(args)
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 368, in main
[rank7]: loss, grad_norm = train_one_step(
[rank7]: ^^^^^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 128, in train_one_step
[rank7]: sigmas = get_sigmas(
[rank7]: ^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 78, in get_sigmas
[rank7]: step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/mnt/bn/nlhei-nas/liwen.8459/VideoProj/FastVideo/fastvideo/train.py", line 78, in <listcomp>
[rank7]: step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: RuntimeError: a Tensor with 0 elements cannot be converted to Scalar
[rank7]:[E310 07:26:54.757697857 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 2 Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank7]:[E310 07:26:54.757741468 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E310 07:26:54.757748967 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E310 07:26:54.759129434 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe6428b9446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe5f8819762 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe5f8820ba3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe5f882260d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe642cf35c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89144 (0x7fe643a8f144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fe643b0f7dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
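The Python traceback above fails in `get_sigmas` because `(schedule_timesteps == t).nonzero()` returned no elements for at least one `t`, so `.item()` had nothing to convert. The log does not show why the timestep was missing from the schedule (a float rounding, dtype, or device mismatch are possibilities, but that is a guess). As a minimal plain-Python sketch of the same failure mode and of a more defensive lookup, assuming a hypothetical helper `find_step_index` and made-up schedule values that are not part of FastVideo:

```python
def find_step_index(schedule, t, tol=1e-4):
    """Return the index of the schedule entry closest to t.

    Hypothetical helper for illustration only. Raises ValueError with a
    descriptive message when no entry lies within tol, instead of failing
    later on an empty exact-match result.
    """
    if not schedule:
        raise ValueError("schedule is empty")
    # Nearest-entry lookup tolerates small float rounding differences.
    idx = min(range(len(schedule)), key=lambda i: abs(schedule[i] - t))
    if abs(schedule[idx] - t) > tol:
        raise ValueError(
            f"timestep {t!r} not found in schedule (closest is {schedule[idx]!r})"
        )
    return idx


# Made-up schedule values, chosen only to demonstrate the lookup.
schedule = [1000.0, 750.0, 500.0, 250.0]

# Exact equality finds nothing when the value is slightly off; this is the
# plain-Python analogue of nonzero() returning a tensor with 0 elements.
exact_matches = [i for i, s in enumerate(schedule) if s == 500.00001]
print(exact_matches)  # []

print(find_step_index(schedule, 500.00001))  # 2
```

A lookup like this would turn the opaque `RuntimeError` into a message naming the offending timestep, which may help narrow down whether the sampled timesteps and the scheduler's timesteps actually agree.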
Reproduction