BTW, is it possible to add a "total allocated" field to ALLOC's log lines? That would make the accounting easier.
For example (added part in bold):
[0] NCCL INFO Allocated shareable buffer 0x3a3e400000 size 2097152 ipcDesc 0x7fa7382a84e0, **total allocated ...**
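Until such a field exists, a rough workaround is to post-process the ALLOC lines. A minimal sketch (not an NCCL tool; the regex simply assumes the line format shown above) that sums the `size` field for one rank:

```python
import re
import sys

# Matches ALLOC lines of the form shown above, e.g.
# "[0] NCCL INFO Allocated shareable buffer 0x3a3e400000 size 2097152 ipcDesc 0x7fa7382a84e0"
ALLOC_RE = re.compile(r"\[(\d+)\] NCCL INFO Allocated .*?size (\d+)")

def total_allocated(log_path: str, rank: int = 0) -> int:
    """Sum the size field of every matching ALLOC line emitted by the given rank."""
    total = 0
    with open(log_path) as f:
        for line in f:
            m = ALLOC_RE.search(line)
            if m and int(m.group(1)) == rank:
                total += int(m.group(2))
    return total

if __name__ == "__main__":
    print(total_allocated(sys.argv[1]))
```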
I'd like to ask for an API to query NCCL's current allocation level. This could be integrated into PyTorch's memory profiler. It's common for people to notice that the NVIDIA driver APIs show X GB allocated while the PyTorch allocator shows Y = X - 2 or X - 3 GB reserved, and then they start asking us where the delta comes from. It's impossible for us to answer this in general, but since NCCL is very commonly used with PyTorch and uses significant memory, adding this should be pretty helpful.
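For illustration only, here is a sketch of the delta such an API would help explain. `torch.cuda.mem_get_info` and `torch.cuda.memory_reserved` are real PyTorch calls; the NCCL-side query itself does not exist yet, and any name for it below is hypothetical:

```python
import torch

def memory_breakdown(device: int = 0) -> dict:
    """Compare driver-reported usage with PyTorch's caching-allocator view."""
    free, total = torch.cuda.mem_get_info(device)   # driver-level view (cudaMemGetInfo)
    reserved = torch.cuda.memory_reserved(device)   # PyTorch caching allocator
    used_by_driver = total - free
    return {
        "driver_used_bytes": used_by_driver,
        "torch_reserved_bytes": reserved,
        # The delta users keep asking about: CUDA context, cuBLAS/cuDNN workspaces,
        # NCCL buffers, etc. A query like the one requested here (hypothetical name,
        # e.g. "ncclMemGetAllocated") would let profilers attribute the NCCL share.
        "unaccounted_bytes": used_by_driver - reserved,
    }

if __name__ == "__main__":
    print(memory_breakdown(0))
```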
Referring to an intra-node all-to-all on 8 x H100s.
The user reports the following statistics on each GPU:
The user noted that the reported numbers exclude memory used by PyTorch.
I did a run with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT`; indeed, it seems as many as 32 channels are created per pair of GPUs.

I further turned on `NCCL_DEBUG_SUBSYS=ALLOC` and can see two types of buffers being allocated: one of size 10485760 bytes, the other of size 2097152 bytes.

If we use `grep -c` to count the number of allocations of each type, we see:

Note that the above counts are per GPU because of `grep "\[0\]"`.

Running some number matching:
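A back-of-the-envelope sketch of that matching, with the two buffer sizes taken from the log above and the per-rank counts left as placeholders to be filled in from the `grep -c` output:

```python
# Placeholder counts -- fill in with the per-rank values from, e.g.:
#   grep "\[0\]" nccl.log | grep -c "size 10485760"
#   grep "\[0\]" nccl.log | grep -c "size 2097152"
count_10mib = 0
count_2mib = 0

MIB = 1024 * 1024
total_bytes = count_10mib * 10485760 + count_2mib * 2097152
print(f"ALLOC total on rank 0: {total_bytes / MIB:.1f} MiB "
      f"({count_10mib} x 10 MiB buffers + {count_2mib} x 2 MiB buffers)")
```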