All-to-all seems to trigger connection of many channels thus a large memory usage? #1685

Open
kwen2501 opened this issue Apr 16, 2025 · 3 comments

Comments

@kwen2501
Contributor

kwen2501 commented Apr 16, 2025

This refers to an intra-node all-to-all on 8 x H100s.

The user reports the following statistics on each GPU:

[0] before 1st all_to_all_single
[0] NV 1.06 GB
[0] after 1st all_to_all_single
[0] NV 4.34 GB 

The user noted that the reported numbers exclude memory used by PyTorch itself.
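
For context, here is a minimal sketch of the kind of measurement described above (the actual repro.py was not shared, so this script is an assumption). It reports driver-level usage via torch.cuda.mem_get_info and subtracts PyTorch's reserved memory, so what remains is dominated by non-PyTorch allocations such as NCCL's buffers.

import os
import torch
import torch.distributed as dist

def nv_memory_gb():
    # Driver-level usage (cudaMemGetInfo) minus PyTorch's reserved pool,
    # i.e. memory held by the process but not by the PyTorch allocator.
    free, total = torch.cuda.mem_get_info()
    return (total - free - torch.cuda.memory_reserved()) / 1e9

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()

    inp = torch.randn(world * 1024 * 1024, device="cuda")  # arbitrary payload
    out = torch.empty_like(inp)

    if rank == 0:
        print(f"[0] before 1st all_to_all_single\n[0] NV {nv_memory_gb():.2f} GB")
    dist.all_to_all_single(out, inp)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"[0] after 1st all_to_all_single\n[0] NV {nv_memory_gb():.2f} GB")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with torchrun --standalone --nproc-per-node 8, as in the commands further below.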

I did a run with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT; indeed, as many as 32 channels appear to be created per pair of GPUs.

devgpu263:557295:558643 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 02/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 03/1 : 0[0] -> 2[2] via P2P/CUMEM
...
devgpu263:557295:558643 [0] NCCL INFO Channel 28/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 29/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 30/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 31/1 : 0[0] -> 2[2] via P2P/CUMEM

I further turned on NCCL_DEBUG_SUBSYS=ALLOC and can see two types of buffers being allocated: one of size 10485760 bytes, the other of size 2097152 bytes.

If we use grep -c to count the number of allocations for each type, we see:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun --standalone --nproc-per-node 8 repro.py | grep "\[0\]" | grep -c "size 10485760"
224
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun --standalone --nproc-per-node 8 repro.py | grep "\[0\]" | grep -c "size 2097152"
224

Note that the above counts are per GPU (rank 0 only), due to the grep "\[0\]" filter.

Running some number matching:

  • 224 seems to be 32 (channels) * 7 (peers).
  • 10485760 + 2097152 = 12 MB (per channel).
  • 12 MB * 224 = 2.6 GB, which seems to be in the ballpark of the memory increase reported by the user.
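
A quick check of that arithmetic (sizes in bytes):

channels, peers = 32, 7
per_channel = 10485760 + 2097152           # 10 MiB + 2 MiB = 12 MiB per channel
allocations = channels * peers             # 224, matching both grep counts
print(per_channel * allocations / 2**30)   # ~2.62 GiB, roughly the reported jump
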
@kwen2501
Contributor Author

cc @stas00 @ngimel @wconstab

@kwen2501
Contributor Author

btw, is it possible to add a "total allocated" field in ALLOC's log lines? That would make the accounting easier.
For example (the added part is the trailing ", total allocated ..."):
[0] NCCL INFO Allocated shareable buffer 0x3a3e400000 size 2097152 ipcDesc 0x7fa7382a84e0, total allocated ...
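
In the meantime, a rough way to do that accounting externally is to parse the ALLOC lines and keep a running sum. A sketch (hypothetical sum_alloc.py, assuming lines of the form shown above with sizes in bytes):

import re
import sys

# Sum the "size <bytes>" field of NCCL ALLOC log lines read from stdin, e.g.:
#   NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun ... repro.py 2>&1 | python sum_alloc.py
total = 0
for line in sys.stdin:
    if "NCCL INFO Allocated" not in line:
        continue
    m = re.search(r"\bsize (\d+)\b", line)
    if m:
        total += int(m.group(1))
print(f"total allocated: {total} bytes ({total / 2**30:.2f} GiB)")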

@wconstab

I'd like to ask for an API to query NCCL's current allocation level. This could be integrated into PyTorch's memory profiler. It's common for people to notice that the NV driver APIs show X GB allocated while the PyTorch allocator shows Y = X-2 or X-3 GB reserved, and then they start asking us where the delta comes from. It's impossible for us to answer this in general, but since NCCL is a very commonly used library with torch and uses significant memory, adding this should be pretty helpful.
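
For illustration only, here is a hypothetical shape such a query could take on the PyTorch side. Nothing like nccl_allocated_bytes exists today; it is invented here purely to show how the X-vs-Y delta could be broken down:

import torch

def memory_breakdown(device=0):
    free, total = torch.cuda.mem_get_info(device)
    driver_used = total - free                       # "X GB" as seen by the NV driver
    reserved = torch.cuda.memory_reserved(device)    # "Y GB" held by the PyTorch allocator
    # Hypothetical API -- does not exist; shown only to illustrate the request.
    nccl = torch.cuda.nccl_allocated_bytes(device)
    other = driver_used - reserved - nccl            # CUDA context, cuBLAS workspaces, ...
    return {"driver_used": driver_used, "torch_reserved": reserved,
            "nccl": nccl, "other": other}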
