Hello. I am using the newly released torchrec for our model training. I use VBE (variable batch embedding) to deduplicate the input data, which reduces the embedding lookup and communication work in the forward pass.
Specifically, I run the code with VBE on 8 GPUs (rank 0 to rank 7). It consistently fails in the forward pass: rank 1 always returns the correct lookup results, while every other rank always fails with the error below. To reproduce the error in a minimal setting, I reduced the model size so that it fits on 2 GPUs. There, the error appears on rank 0, and rank 1 still returns the correct lookup results.
Here is the error:
" Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c88cd897 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f59c887db25 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f59c89a5718 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f59c9ba3e36 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f59c9ba7f38 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f59c9bad5ac in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c9bae31c in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x7f5a1565cb75 in /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f5a17e91609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f5a17c5c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'"
I build the "feature_inverse_indices" as shown below.
...................................................................................................
sparse_feature_df = batch.batch_data[0].select(feature)
sparse_feature_df = sparse_feature_df.with_columns(sparse_feature_df[feature].apply(lambda x: "".join([str(i) for i in x])).alias("id"))
unique_df = sparse_feature_df.unique(maintain_order=True)
print(f"{feature}, sparse_feature_df: {sparse_feature_df.shape}, unique_df: {unique_df.shape}")
df_with_index = sparse_feature_df.join(unique_df.with_row_count(), on="id", how="left")
inverse_indices = df_with_index["row_nr"].to_numpy(zero_copy_only=True)
sparse_feature_inverse_indices.append(inverse_indices)
.........................................................................................................
sparse_feature_inverse_indices = torch.tensor(np.stack(sparse_feature_inverse_indices), dtype=torch.int64)
..........................................................................................................
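For comparison, here is a minimal sketch of the same first-occurrence deduplication in plain Python, with a sanity check that can run on the CPU before the indices are turned into a tensor. The helper names `build_inverse_indices` and `check_inverse_indices` are hypothetical and not part of torchrec or polars. The sketch keys each row by its tuple of ids rather than by a concatenated string, because "".join maps distinct rows such as [1, 23] and [12, 3] to the same key "123". Whether that collision, or an out-of-range index, is related to the illegal memory access above is not established here; it is only something that could be ruled out.

```python
from typing import Dict, List, Sequence, Tuple


def build_inverse_indices(
    rows: Sequence[Sequence[int]],
) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Deduplicate rows, keeping first-occurrence order.

    Returns (unique_rows, inverse) such that
    unique_rows[inverse[i]] == tuple(rows[i]) for every i.
    """
    first_seen: Dict[Tuple[int, ...], int] = {}
    unique_rows: List[Tuple[int, ...]] = []
    inverse: List[int] = []
    for row in rows:
        key = tuple(row)  # tuple key: [1, 23] and [12, 3] stay distinct
        if key not in first_seen:
            first_seen[key] = len(unique_rows)
            unique_rows.append(key)
        inverse.append(first_seen[key])
    return unique_rows, inverse


def check_inverse_indices(
    inverse: Sequence[int], num_unique: int, batch_size: int
) -> None:
    """Raise ValueError if the indices cannot address the deduplicated rows."""
    if len(inverse) != batch_size:
        raise ValueError(
            f"expected {batch_size} inverse indices, got {len(inverse)}"
        )
    out_of_range = [i for i in inverse if not 0 <= i < num_unique]
    if out_of_range:
        raise ValueError(
            f"inverse indices outside [0, {num_unique}): {out_of_range[:5]}"
        )


# Example: rows 0 and 2 are duplicates; row 1 would collide with them
# under a "".join key but is kept separate here.
rows = [[1, 23], [12, 3], [1, 23]]
unique_rows, inverse = build_inverse_indices(rows)
check_inverse_indices(inverse, len(unique_rows), len(rows))
# unique_rows == [(1, 23), (12, 3)], inverse == [0, 1, 0]
```

Running a check like this per feature, before np.stack and torch.tensor, would at least confirm that every feature yields exactly one index per sample and that each index points at an existing deduplicated row.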
Hi @xiexbing, could you try running your job with CUDA_LAUNCH_BLOCKING=1 and share the results? That would help pinpoint where the failure is coming from.
Could you also share a minimal repro of the issue? It is hard to debug from our end without more details.
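One way to apply the CUDA_LAUNCH_BLOCKING=1 suggestion, sketched under the assumption that the training script is a Python entry point started once per rank: set the variable at the very top of the script, before any CUDA work happens, so that kernel launches run synchronously and the reported stack trace is more likely to point at the failing call. The function name `enable_blocking_cuda_launches` is hypothetical. Exporting the variable in the shell that launches the job should have the same effect.

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask CUDA to run kernel launches synchronously (debugging only).

    Must be called before the process initializes CUDA, so in practice
    before importing torch or at least before the first CUDA call.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()

# Import torch and build the model only after the variable is set, e.g.:
# import torch
```

Synchronous launches slow training down considerably, so this setting is meant for a short debugging run rather than for regular jobs.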