Context

Many users have encountered runtime errors of the form "Expected input tensors to have device cuda, found device cpu", despite having cast all inputs to CUDA. Examples include #1123, #1401, and #730. We have traced these issues to internally-generated tensors originating from functions such as aten::full, aten::masked_fill_, and prim::NumToTensor, which initialize tensors on the CPU by default. When tensors generated by these functions are passed from a Torch engine to a Torch-TensorRT engine, the Torch-TRT engine throws an error, since it expects all of its inputs to be CUDA tensors.
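To illustrate how such a tensor arises, here is a minimal PyTorch sketch (the function and its names are hypothetical, not taken from the BART source): a factory call like torch.full with no explicit device argument creates its output on the CPU, no matter where the surrounding tensors live.

```python
import torch

def make_fill_tensor(input_ids: torch.Tensor, fill_id: int = 1) -> torch.Tensor:
    # aten::full: with no device= argument, the new tensor is created on
    # the CPU by default, regardless of where input_ids resides
    return torch.full(input_ids.shape, fill_id)

ids = torch.zeros(2, 4, dtype=torch.long)  # imagine this moved to "cuda:0" in practice
fill = make_fill_tensor(ids)
# fill.device is cpu even if ids were on CUDA; passing
# device=input_ids.device to torch.full would avoid the mismatch
```

If this CPU tensor is later handed to a TensorRT engine expecting CUDA inputs, the error above is raised.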
The primary source of this issue, as observed when converting BART models, has been the prim::NumToTensor + aten::ScalarImplicit paradigm, which is used primarily to convert Scalars to Tensors (and vice versa) so they can be packed into Tensor lists and consumed by Torch-TensorRT engines. This is a useful construct, since all inputs to Torch-TRT engines must be Tensors; however, these Tensors are not automatically cast to CUDA, causing errors at runtime. Many related issues stem from the use and complexity of single-element (0D) tensors in Torch-TRT, including #829 and #956.
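In eager-Python terms, the paradigm corresponds roughly to the sketch below (variable names are illustrative, not taken from the BART graph): prim::NumToTensor amounts to wrapping a Python int in a 0D tensor, and aten::ScalarImplicit / aten::item to unwrapping it.

```python
import torch

seq_len_int = 7                      # e.g. the result of an aten::size call
seq_len = torch.tensor(seq_len_int)  # prim::NumToTensor: a 0D (dimensionless) tensor,
                                     # created on the CPU with dtype int64 by default
tensor_list = [seq_len]              # packed into a Tensor list for an engine input
recovered = tensor_list[0].item()    # aten::item / aten::ScalarImplicit: back to a Scalar
```

The 0D tensor produced here is exactly the kind of CPU-resident value that trips up a downstream Torch-TRT engine expecting CUDA inputs.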
Proposed Solution and Implementation
The proposed solution aims to:
- Avoid crashes by moving tensors found on incorrect devices at runtime
- Fix known problematic cases of internally-generated tensors, to avoid this overhead at runtime
- Give more meaningful feedback to users when tensor device issues arise
With these goals in mind, we have implemented PR #1416, which consists of four main additions to the code:

1. A runtime check to ensure that tensor inputs (whether user-generated or internal) are located on the correct device, avoiding errors of the sort mentioned above, together with a more informative error message indicating that the issue could be internal and that an issue report may be warranted (TensorRT/core/runtime/execute_engine.cpp, lines 81 to 98 in ec6aa6b)
2. A lowering pass to fix known problematic cases of internally-generated tensors, such as casting the output of aten::full to the correct device (TensorRT/core/lowering/passes/device_casting.cpp, lines 35 to 78 in ec6aa6b)
3. A bugfix (lowering pass) for aten::ScalarImplicit, which fails to perceive 0D tensors cast to CUDA as 0D tensors (aten::item does not exhibit this issue but has functionally the same behavior) (TensorRT/core/lowering/passes/device_casting.cpp, lines 80 to 98 in ec6aa6b)
4. Test cases covering the new lowering passes (TensorRT/tests/core/lowering/test_device_casting.cpp, lines 12 to 194 in ec6aa6b)
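The runtime check in the first addition can be pictured with the following Python sketch. It is a simplified stand-in for the C++ logic in execute_engine.cpp, not the actual implementation; the function name and warning text are hypothetical.

```python
import warnings
import torch

def ensure_inputs_on_device(inputs, target_device: str):
    """Move any misplaced input tensors to the engine's device, warning the
    user that the offending tensor may have been internally generated."""
    moved = []
    for i, t in enumerate(inputs):
        if t.device != torch.device(target_device):
            warnings.warn(
                f"Input {i} is on {t.device}, expected {target_device}; "
                "moving it at runtime. If you did not create this tensor "
                "yourself, it may be internally generated by Torch; "
                "consider filing an issue report."
            )
            t = t.to(target_device)
        moved.append(t)
    return moved
```

In practice target_device would be the engine's device (e.g. "cuda:0"); moving tensors at runtime avoids the crash, at the cost of a device copy, which is why the lowering passes also fix the known cases ahead of time.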
Investigation Details

A brief overview of the investigation into this issue and the steps toward a solution is given here. A canonical trace originating from a BART model compilation is shown below:
TorchTRT-Compiled Graph Snippet
graph(%input_0 : __torch__.transformers.models.bart.modeling_bart.BartForConditionalGeneration_trt, %input_1 : Tensor, %input_2 : Tensor):
  %2 : int = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %3 : int = prim::Constant[value=9223372036854775807]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %4 : int = prim::Constant[value=1]()
  %5 : int = prim::Constant[value=-1]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %6 : NoneType = prim::Constant()
  %7 : int = prim::Constant[value=3]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %8 : Device = prim::Constant[value="cuda:0"]()
  %9 : bool = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %10 : Tensor = aten::slice(%input_1, %2, %2, %3, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %11 : Tensor = aten::slice(%10, %4, %2, %5, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %12 : int = aten::size(%input_1, %2) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %13 : int = aten::size(%input_1, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %14 : int[] = prim::ListConstruct(%12, %13)
  %15 : Tensor = aten::clone(%11, %6) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %shifted_input_ids.1 : Tensor = aten::new_zeros(%input_1, %14, %7, %2, %8, %9) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020 : __torch__.torch.classes.tensorrt.Engine = prim::GetAttr[name="__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020"](%input_0)
  %19 : Tensor[] = prim::ListConstruct(%shifted_input_ids.1, %15)
  %20 : Tensor[] = tensorrt::execute_engine(%19, %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020)
  %21 : Tensor = prim::ListUnpack(%20)
  %22 : Tensor = prim::Constant[value={2}]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:68:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %24 : int = prim::Constant[value=-1]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %25 : int = prim::Constant[value=1]()
  %26 : Tensor = aten::fill_(%21, %22) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:68:0
  %27 : Tensor = aten::eq(%shifted_input_ids.1, %23) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %seq_len.1 : Tensor = prim::NumToTensor(%13) # :0:0
  ### The line above runs in Torch, where the NumToTensor call returns a dimensionless CPULong{} Tensor
  %29 : int[] = prim::ListConstruct(%24, %13)
  %30 : Tensor = aten::reshape(%input_1, %29)
  %input_ids0.1 : Tensor = aten::masked_fill_(%shifted_input_ids.1, %27, %25) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0 : __torch__.torch.classes.tensorrt.Engine = prim::GetAttr[name="__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0"](%input_0)
  %33 : Tensor[] = prim::ListConstruct(%input_2, %30, %seq_len.1)
  %34 : Tensor[] = tensorrt::execute_engine(%33, %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0)
  %35 : Tensor, %36 : Tensor, %37 : Tensor = prim::ListUnpack(%34)
  ### The line above runs in Torch-TRT, where the unpacked value is expected to be a CUDAInt{} tensor, but what is found is the CPULong{} Tensor
  %38 : int = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %39 : int = prim::Constant[value=4]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:120:0
  %40 : NoneType = prim::Constant()
  %41 : Device = prim::Constant[value="cuda:0"]()
  %42 : bool = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %43 : Scalar = aten::ScalarImplicit(%37) # :0:0
  ### The above function schema expects a dimensionless tensor, and it seems that simply casting %37 to GPU + Int type also causes it to fail [ScalarImplicit expects a 0D Tensor, and the casts might impart a dimension on it]
  %positions.2 : Tensor = aten::arange(%38, %43, %39, %40, %41, %42) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:120:0
As detailed in the comments within the code snippet above, the CUDA/CPU issue is a combination of three bugs: incorrect device placement, incorrect data types, and a violated function schema. Fixing all three of these issues is key to resolving this runtime bug. The device placement and aten::ScalarImplicit schema issues are resolved in PR #1416, and the data type issue is resolved in PR #1407. When both PRs are combined, all three of the above issues are addressed.
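The note that aten::item "does not exhibit this issue, but has functionally the same behavior" is what the aten::ScalarImplicit bugfix relies on. A small sketch of that behavior (the CUDA step is described in comments only, so the example runs on CPU):

```python
import torch

scalar_tensor = torch.tensor(5)  # 0D Long tensor, as prim::NumToTensor produces
value = scalar_tensor.item()     # aten::item: accepts a 0D tensor and returns a Scalar
# On a CUDA machine the same .item() call also succeeds after
# scalar_tensor.to("cuda:0"), whereas aten::ScalarImplicit can reject the
# cast tensor in TorchScript, which is what the lowering-pass bugfix addresses
```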