
RuntimeError: Triton Error [CUDA]: invalid device context #700

Open
andymvp2018 opened this issue Aug 13, 2024 · 4 comments
Labels
type/bug An issue about a bug

Comments

@andymvp2018

🐛 Describe the bug

h100-196-003:0 err: wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
h100-196-003:0 err: Traceback (most recent call last):
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 347, in <module>
h100-196-003:0 err: main(cfg)
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 319, in main
h100-196-003:0 err: trainer.fit()
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 1152, in fit
h100-196-003:0 err: metrics = self.train_step(batch, reduce_global_loss=should_log_this_step)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 781, in train_step
h100-196-003:0 err: ce_batch_loss, z_batch_loss = self.train_batch(batch)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 758, in train_batch
h100-196-003:0 err: loss.backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
h100-196-003:0 err: torch.autograd.backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
h100-196-003:0 err: _engine_run_backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
h100-196-003:0 err: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
h100-196-003:0 err: return user_fn(self, *args)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 882, in backward
h100-196-003:0 err: out = call_compiled_backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 831, in call_compiled_backward
h100-196-003:0 err: out = call_func_at_runtime_with_args(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
h100-196-003:0 err: out = normalize_as_list(f(args))
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 906, in __call__
h100-196-003:0 err: return self.get_current_callable()(inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 784, in run
h100-196-003:0 err: return model(new_inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 934, in _run_from_cache
h100-196-003:0 err: return compiled_graph.compiled_artifact(inputs)
h100-196-003:0 err: File "/tmp/torchinductor_dejasu/yz/cyzj56loyzqxbsmpxbkpn2snn62qzjk6zvqc7nhgbi262jwngmlr.py", line 80, in call
h100-196-003:0 err: triton_poi_fused_div_0.run(tangents_1, buf0, 1, grid=grid(1), stream=stream0)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 670, in run
h100-196-003:0 err: return launcher(
h100-196-003:0 err: File "<string>", line 7, in launcher
h100-196-003:0 err: RuntimeError: Triton Error [CUDA]: invalid device context
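
For context, the traceback points at an Inductor-generated Triton kernel (triton_poi_fused_div_0, the division in the backward of the loss reduction) that is launched from loss.backward() on a torch.compile'd train step. A minimal sketch of that pattern, with a placeholder model rather than OLMo's, looks like this:

```python
# Sketch of the failing pattern: a torch.compile'd step whose backward
# launches Inductor-generated Triton kernels, analogous to OLMo's
# train_batch when `compile` is enabled in the config.
import torch

model = torch.nn.Linear(1024, 1024).cuda()

@torch.compile  # placeholder for OLMo compiling its train step
def train_batch(x: torch.Tensor) -> torch.Tensor:
    return model(x).mean()

loss = train_batch(torch.randn(8, 1024, device="cuda"))
loss.backward()  # the compiled backward's Triton kernels run here
```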

Versions

Python 3.10.14
-e git+https://github.com/allenai/OLMo.git@4332c3224030a321c5894df18f97049b10a56582#egg=ai2_olmo
ai2-olmo-core==0.1.0
aiohappyeyeballs==2.3.5
aiohttp==3.10.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
black==23.12.1
boltons==24.0.0
boto3==1.34.158
botocore==1.34.158
build==1.2.1
cached_path==1.6.3
cachetools==5.4.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==43.0.0
datasets==2.20.0
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
exceptiongroup==1.2.2
face==20.1.1
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.5.0
ftfy==6.2.3
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.33.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.2.0
importlib_resources==6.4.0
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.2
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==2.0.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyproject_hooks==1.1.0
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.10.2
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.5.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
SecretStorage==3.3.3
sentry-sdk==2.12.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.1
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.0
triton==2.3.1
trouting==0.3.3
twine==5.1.1
types-setuptools==71.1.0.20240806
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.6
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2

andymvp2018 added the type/bug label on Aug 13, 2024
@2015aroras
Collaborator

Can you share the config file? My guess is that setting compile: null in your config could get rid of this issue.

@andymvp2018
Author

@2015aroras, I use the exact configs/official/OLMo-7B.yaml and only modify the training data (I change the training data path). So are you suggesting adding compile: null to that official OLMo-7B.yaml?

@2015aroras
Collaborator

You should try replacing

compile:
  fullgraph: false

with compile: null. I don't think the compile option affects the training loss; it just affects the throughput.
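
For reference, the same change can be scripted with OmegaConf (which OLMo's configs already use). This is only a hedged sketch, and the file paths are placeholders:

```python
# Hedged sketch: write a copy of the official config with compilation
# disabled (equivalent to `compile: null` in the YAML).
# File paths are placeholders; adjust them to your checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/official/OLMo-7B.yaml")
cfg.compile = None  # same effect as `compile: null`
OmegaConf.save(cfg, "configs/official/OLMo-7B-nocompile.yaml")
```

Then point scripts/train.py at the new config file as usual.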

@RithvikKolla

Ran into the same issue. I want to use compile to improve throughput. Is there any fix for the above? Is compile not supported for OLMo models?
