
RuntimeError: Triton Error [CUDA]: invalid device context #700

Open
andymvp2018 opened this issue Aug 13, 2024 · 4 comments
Labels
type/bug An issue about a bug

Comments

@andymvp2018

🐛 Describe the bug

h100-196-003:0 err: wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
h100-196-003:0 err: Traceback (most recent call last):
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 347, in <module>
h100-196-003:0 err: main(cfg)
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 319, in main
h100-196-003:0 err: trainer.fit()
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 1152, in fit
h100-196-003:0 err: metrics = self.train_step(batch, reduce_global_loss=should_log_this_step)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 781, in train_step
h100-196-003:0 err: ce_batch_loss, z_batch_loss = self.train_batch(batch)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 758, in train_batch
h100-196-003:0 err: loss.backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
h100-196-003:0 err: torch.autograd.backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
h100-196-003:0 err: _engine_run_backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
h100-196-003:0 err: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
h100-196-003:0 err: return user_fn(self, *args)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 882, in backward
h100-196-003:0 err: out = call_compiled_backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 831, in call_compiled_backward
h100-196-003:0 err: out = call_func_at_runtime_with_args(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
h100-196-003:0 err: out = normalize_as_list(f(args))
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 906, in __call__
h100-196-003:0 err: return self.get_current_callable()(inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 784, in run
h100-196-003:0 err: return model(new_inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 934, in _run_from_cache
h100-196-003:0 err: return compiled_graph.compiled_artifact(inputs)
h100-196-003:0 err: File "/tmp/torchinductor_dejasu/yz/cyzj56loyzqxbsmpxbkpn2snn62qzjk6zvqc7nhgbi262jwngmlr.py", line 80, in call
h100-196-003:0 err: triton_poi_fused_div_0.run(tangents_1, buf0, 1, grid=grid(1), stream=stream0)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 670, in run
h100-196-003:0 err: return launcher(
h100-196-003:0 err: File "<string>", line 7, in launcher
h100-196-003:0 err: RuntimeError: Triton Error [CUDA]: invalid device context
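
For context, the traceback points at an Inductor-generated Triton kernel (triton_poi_fused_div_0, the division in the backward of the loss reduction) that is launched from loss.backward() on a torch.compile'd train step. A minimal sketch of that pattern, with a placeholder model rather than OLMo's, looks like this:

```python
# Sketch of the failing pattern: a torch.compile'd step whose backward
# launches Inductor-generated Triton kernels, analogous to OLMo's
# train_batch when `compile` is enabled in the config.
import torch

model = torch.nn.Linear(1024, 1024).cuda()

@torch.compile  # placeholder for OLMo compiling its train step
def train_batch(x: torch.Tensor) -> torch.Tensor:
    return model(x).mean()

loss = train_batch(torch.randn(8, 1024, device="cuda"))
loss.backward()  # the compiled backward's Triton kernels run here
```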

Versions

Python 3.10.14
-e git+https://github.com/allenai/OLMo.git@4332c3224030a321c5894df18f97049b10a56582#egg=ai2_olmo
ai2-olmo-core==0.1.0
aiohappyeyeballs==2.3.5
aiohttp==3.10.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
black==23.12.1
boltons==24.0.0
boto3==1.34.158
botocore==1.34.158
build==1.2.1
cached_path==1.6.3
cachetools==5.4.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==43.0.0
datasets==2.20.0
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
exceptiongroup==1.2.2
face==20.1.1
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.5.0
ftfy==6.2.3
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.33.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.2.0
importlib_resources==6.4.0
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.2
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==2.0.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyproject_hooks==1.1.0
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.10.2
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.5.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
SecretStorage==3.3.3
sentry-sdk==2.12.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.1
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.0
triton==2.3.1
trouting==0.3.3
twine==5.1.1
types-setuptools==71.1.0.20240806
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.6
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2

andymvp2018 added the type/bug label on Aug 13, 2024
@2015aroras
Collaborator

Can you share the config file? My guess is that setting compile: null in your config could get rid of this issue.

@andymvp2018
Author

@2015aroras, I use the exact configs/official/OLMo-7B.yaml and only modify the training data (I change the training data path). So are you suggesting adding compile: null to that official OLMo-7B.yaml?

@2015aroras
Collaborator

You should try replacing

compile:
  fullgraph: false

with compile: null. I don't think the compile option affects the training loss; it just affects the throughput.
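
For reference, the same change can be scripted with OmegaConf (which OLMo's configs already use). This is only a hedged sketch, and the file paths are placeholders:

```python
# Hedged sketch: write a copy of the official config with compilation
# disabled (equivalent to `compile: null` in the YAML).
# File paths are placeholders; adjust them to your checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/official/OLMo-7B.yaml")
cfg.compile = None  # same effect as `compile: null`
OmegaConf.save(cfg, "configs/official/OLMo-7B-nocompile.yaml")
```

Then point scripts/train.py at the new config file as usual.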

@RithvikKolla

Ran into the same issue. I want to use compile to improve throughput. Is there any fix for the above? Is compile not supported for OLMo models?
