CUDA kernel error when using VBE #2502

Open
xiexbing opened this issue Oct 22, 2024 · 2 comments

@xiexbing

xiexbing commented Oct 22, 2024

Hello there. I am using the newly released torchrec for our model training. I use VBE (variable batch embedding) to deduplicate the input data, which reduces the embedding lookup and communication work in the forward pass.
I ran the code on 8 GPUs (rank 0 to rank 7). It consistently fails in the forward pass: rank 1 always returns the correct lookup results, while every other rank returns the error below. To reproduce the error in a minimal setting, I reduced the model size to fit on 2 GPUs. In that setup the error appears on rank 0, and rank 1 still returns the correct lookup results.

Here is the error:
" Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c88cd897 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f59c887db25 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f59c89a5718 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f59c9ba3e36 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f59c9ba7f38 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f59c9bad5ac in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c9bae31c in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x7f5a1565cb75 in /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f5a17e91609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f5a17c5c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'"

I build the "feature_inverse_indices" as shown below.
...................................................................................................
# Select the column for this sparse feature.
sparse_feature_df = batch.batch_data[0].select(feature)
# Build a string key per row by concatenating the ids in that row.
sparse_feature_df = sparse_feature_df.with_columns(sparse_feature_df[feature].apply(lambda x: "".join([str(i) for i in x])).alias("id"))
# Drop duplicate rows, keeping first-occurrence order.
unique_df = sparse_feature_df.unique(maintain_order=True)
print(f"{feature}, sparse_feature_df: {sparse_feature_df.shape}, unique_df: {unique_df.shape}")
# Map every original row to the position ("row_nr") of its unique row via the key.
df_with_index = sparse_feature_df.join(unique_df.with_row_count(), on="id", how="left")
inverse_indices = df_with_index["row_nr"].to_numpy(zero_copy_only=True)
sparse_feature_inverse_indices.append(inverse_indices)
.........................................................................................................
sparse_feature_inverse_indices = torch.tensor(np.stack(sparse_feature_inverse_indices), dtype=torch.int64)
..........................................................................................................
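
For comparison, here is a minimal pure-Python sketch of the same deduplication step, with a sanity check on the resulting indices. It is an illustration under assumptions, not the code from the training job: the helper names `dedup_with_inverse` and `check_inverse` are hypothetical, and the sample rows are made up. It keys each row on a tuple rather than a concatenated string, because joining ids without a separator maps distinct rows such as [1, 23] and [12, 3] to the same key. Whether such a collision occurs in the real data is not known from this report.

```python
from typing import List, Sequence, Tuple


def dedup_with_inverse(
    rows: Sequence[Sequence[int]],
) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Return (unique_rows, inverse) with unique_rows[inverse[i]] == tuple(rows[i])."""
    first_seen = {}  # tuple key -> position in unique_rows
    unique_rows: List[Tuple[int, ...]] = []
    inverse: List[int] = []
    for row in rows:
        key = tuple(row)  # unambiguous key, unlike "".join(str(i) for i in row)
        if key not in first_seen:
            first_seen[key] = len(unique_rows)
            unique_rows.append(key)
        inverse.append(first_seen[key])
    return unique_rows, inverse


def check_inverse(inverse: Sequence[int], num_unique: int, batch_size: int) -> None:
    """Raise ValueError if the indices cannot safely index the deduplicated rows."""
    if len(inverse) != batch_size:
        raise ValueError(f"expected {batch_size} indices, got {len(inverse)}")
    for position, index in enumerate(inverse):
        if not 0 <= index < num_unique:
            raise ValueError(f"index {index} at position {position} is out of range")


# Made-up sample batch: rows 0 and 2 are duplicates, row 1 is distinct.
rows = [[1, 23], [12, 3], [1, 23]]
unique_rows, inverse = dedup_with_inverse(rows)
check_inverse(inverse, num_unique=len(unique_rows), batch_size=len(rows))

# The separator-free string key cannot tell rows 0 and 1 apart.
keys_collide = "".join(str(i) for i in rows[0]) == "".join(str(i) for i in rows[1])
```

Running a check like `check_inverse` on each feature before building the int64 tensor would confirm that every index is in range and that there is exactly one index per original row.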

@xiexbing
Author

@joshuadeng, could you please take a look at this? Thanks!

@sarckk
Member

sarckk commented Dec 27, 2024

Hi @xiexbing, could you try running your job with CUDA_LAUNCH_BLOCKING=1 and share the results? That would help pinpoint where the failure is coming from.
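
One way to set the variable, sketched here on the assumption that the job is started from a Python entry script, is to put it in the environment before torch initializes CUDA. Exporting it in the shell that launches the job should work equally well, since each rank inherits the launcher's environment.

```python
import os

# Set this before torch initializes CUDA (simplest: before `import torch`),
# so kernel launches run synchronously and the error surfaces at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
# ... rest of the training script ...
```

With synchronous launches the job runs slower, but the stack trace should point at the kernel that actually faults rather than at the NCCL watchdog.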

Could you also share a minimal repro of the issue? It is hard to debug from our end without more details.
