All-to-all seems to trigger connection of many channels thus a large memory usage? #1685

Open
kwen2501 opened this issue Apr 16, 2025 · 3 comments

Comments

@kwen2501
Contributor

kwen2501 commented Apr 16, 2025

This refers to an intra-node all-to-all on 8 x H100s.

The user reports the following statistics on each GPU:

[0] before 1st all_to_all_single
[0] NV 1.06 GB
[0] after 1st all_to_all_single
[0] NV 4.34 GB 

The user noted that the reported numbers exclude memory used by PyTorch itself.
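
For context, here is a minimal sketch of the kind of measurement described above (the actual repro.py was not shared, so this script is an assumption). It reports driver-level usage via torch.cuda.mem_get_info and subtracts PyTorch's reserved memory, so what remains is dominated by non-PyTorch allocations such as NCCL's buffers.

import os
import torch
import torch.distributed as dist

def nv_memory_gb():
    # Driver-level usage (cudaMemGetInfo) minus PyTorch's reserved pool,
    # i.e. memory held by the process but not by the PyTorch allocator.
    free, total = torch.cuda.mem_get_info()
    return (total - free - torch.cuda.memory_reserved()) / 1e9

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()

    inp = torch.randn(world * 1024 * 1024, device="cuda")  # arbitrary payload
    out = torch.empty_like(inp)

    if rank == 0:
        print(f"[0] before 1st all_to_all_single\n[0] NV {nv_memory_gb():.2f} GB")
    dist.all_to_all_single(out, inp)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"[0] after 1st all_to_all_single\n[0] NV {nv_memory_gb():.2f} GB")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with torchrun --standalone --nproc-per-node 8, as in the commands further below.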

I did a run with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT; indeed, as many as 32 channels appear to be created per pair of GPUs.

devgpu263:557295:558643 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 02/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 03/1 : 0[0] -> 2[2] via P2P/CUMEM
...
devgpu263:557295:558643 [0] NCCL INFO Channel 28/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 29/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 30/1 : 0[0] -> 2[2] via P2P/CUMEM
devgpu263:557295:558643 [0] NCCL INFO Channel 31/1 : 0[0] -> 2[2] via P2P/CUMEM

I further turned on NCCL_DEBUG_SUBSYS=ALLOC and can see two types of buffers being allocated: one of size 10485760 bytes, the other of size 2097152 bytes.

If we use grep -c to count the number of allocations for each type, we see:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun --standalone --nproc-per-node 8 repro.py | grep "\[0\]" | grep -c "size 10485760"
224
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun --standalone --nproc-per-node 8 repro.py | grep "\[0\]" | grep -c "size 2097152"
224

Note that the above counts are per GPU (rank 0 only), due to the grep "\[0\]" filter.

Running some number matching:

  • 224 seems to be 32 (channels) * 7 (peers).
  • 10485760 + 2097152 = 12 MB (per channel).
  • 12 MB * 224 = 2.6 GB, which seems to be in the ballpark of the memory increase reported by the user.
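
A quick check of that arithmetic (sizes in bytes):

channels, peers = 32, 7
per_channel = 10485760 + 2097152           # 10 MiB + 2 MiB = 12 MiB per channel
allocations = channels * peers             # 224, matching both grep counts
print(per_channel * allocations / 2**30)   # ~2.62 GiB, roughly the reported jump
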
@kwen2501
Contributor Author

cc @stas00 @ngimel @wconstab

@kwen2501
Contributor Author

btw, is it possible to add a "total allocated" field in ALLOC's log lines? That would make the accounting easier.
For example (the added part is the trailing ", total allocated ..."):
[0] NCCL INFO Allocated shareable buffer 0x3a3e400000 size 2097152 ipcDesc 0x7fa7382a84e0, total allocated ...
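
In the meantime, a rough way to do that accounting externally is to parse the ALLOC lines and keep a running sum. A sketch (hypothetical sum_alloc.py, assuming lines of the form shown above with sizes in bytes):

import re
import sys

# Sum the "size <bytes>" field of NCCL ALLOC log lines read from stdin, e.g.:
#   NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALLOC torchrun ... repro.py 2>&1 | python sum_alloc.py
total = 0
for line in sys.stdin:
    if "NCCL INFO Allocated" not in line:
        continue
    m = re.search(r"\bsize (\d+)\b", line)
    if m:
        total += int(m.group(1))
print(f"total allocated: {total} bytes ({total / 2**30:.2f} GiB)")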

@wconstab

I'd like to ask for an API to query NCCL's current allocation level. This could be integrated into PyTorch's memory profiler. It's common for people to notice that the NV driver APIs show X GB allocated while the PyTorch allocator shows Y = X-2 or X-3 GB reserved, and then they start asking us where the delta comes from. It's impossible for us to answer this in general, but since NCCL is a very commonly used library with torch and uses significant memory, adding this should be pretty helpful.
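
For illustration only, here is a hypothetical shape such a query could take on the PyTorch side. Nothing like nccl_allocated_bytes exists today; it is invented here purely to show how the X-vs-Y delta could be broken down:

import torch

def memory_breakdown(device=0):
    free, total = torch.cuda.mem_get_info(device)
    driver_used = total - free                       # "X GB" as seen by the NV driver
    reserved = torch.cuda.memory_reserved(device)    # "Y GB" held by the PyTorch allocator
    # Hypothetical API -- does not exist; shown only to illustrate the request.
    nccl = torch.cuda.nccl_allocated_bytes(device)
    other = driver_used - reserved - nccl            # CUDA context, cuBLAS workspaces, ...
    return {"driver_used": driver_used, "torch_reserved": reserved,
            "nccl": nccl, "other": other}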
