Context

Many users have encountered runtime errors of the form "Expected input tensors to have device cuda, found device cpu", despite having cast all inputs to CUDA. Examples include #1123, #1401, and #730. We have traced these issues to internally-generated tensors originating from functions such as aten::full, aten::masked_fill_, and prim::NumToTensor, which initialize tensors on the CPU by default. When tensors generated by these functions are passed from a Torch engine to a Torch-TensorRT engine, the Torch-TRT engine throws an error, since it expects all of its inputs to be CUDA tensors.
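To illustrate how such a tensor arises, here is a minimal PyTorch sketch (the function and its names are hypothetical, not taken from the BART source): a factory call like torch.full with no explicit device argument creates its output on the CPU, no matter where the surrounding tensors live.

```python
import torch

def make_fill_tensor(input_ids: torch.Tensor, fill_id: int = 1) -> torch.Tensor:
    # aten::full: with no device= argument, the new tensor is created on
    # the CPU by default, regardless of where input_ids resides
    return torch.full(input_ids.shape, fill_id)

ids = torch.zeros(2, 4, dtype=torch.long)  # imagine this moved to "cuda:0" in practice
fill = make_fill_tensor(ids)
# fill.device is cpu even if ids were on CUDA; passing
# device=input_ids.device to torch.full would avoid the mismatch
```

If this CPU tensor is later handed to a TensorRT engine expecting CUDA inputs, the error above is raised.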
The primary source of this issue, as observed when converting BART models, has been the prim::NumToTensor + aten::ScalarImplicit paradigm, which is used primarily to convert Scalars to Tensors (and vice versa) so they can be packed into Tensor lists and consumed by Torch-TensorRT engines. This is a useful construct, since all inputs to Torch-TRT engines must be Tensors; however, these Tensors are not automatically cast to CUDA, causing errors at runtime. Many related issues stem from the use and complexity of single-element (0D) tensors in Torch-TRT, including #829 and #956.
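In eager-Python terms, the paradigm corresponds roughly to the sketch below (variable names are illustrative, not taken from the BART graph): prim::NumToTensor amounts to wrapping a Python int in a 0D tensor, and aten::ScalarImplicit / aten::item to unwrapping it.

```python
import torch

seq_len_int = 7                      # e.g. the result of an aten::size call
seq_len = torch.tensor(seq_len_int)  # prim::NumToTensor: a 0D (dimensionless) tensor,
                                     # created on the CPU with dtype int64 by default
tensor_list = [seq_len]              # packed into a Tensor list for an engine input
recovered = tensor_list[0].item()    # aten::item / aten::ScalarImplicit: back to a Scalar
```

The 0D tensor produced here is exactly the kind of CPU-resident value that trips up a downstream Torch-TRT engine expecting CUDA inputs.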
Proposed Solution and Implementation
The proposed solution aims to:
- Avoid crashes by moving tensors found on incorrect devices at runtime
- Fix known problematic cases of internally-generated tensors, to avoid this overhead at runtime
- Give more meaningful feedback to users when tensor device issues arise
With these goals in mind, we have implemented PR #1416, which consists of four main additions to the code:

1. A runtime check to ensure that tensor inputs (whether user-generated or internal) are located on the correct device, avoiding errors of the sort mentioned above, together with a more informative error message indicating that the issue could be internal and that an issue report may be warranted (TensorRT/core/runtime/execute_engine.cpp, lines 81 to 98 in ec6aa6b)
2. A lowering pass to fix known problematic cases of internally-generated tensors, such as casting the output of aten::full to the correct device (TensorRT/core/lowering/passes/device_casting.cpp, lines 35 to 78 in ec6aa6b)
3. A bugfix (lowering pass) for aten::ScalarImplicit, which fails to perceive 0D tensors cast to CUDA as 0D tensors (aten::item does not exhibit this issue but has functionally the same behavior) (TensorRT/core/lowering/passes/device_casting.cpp, lines 80 to 98 in ec6aa6b)
4. Test cases covering the new lowering passes (TensorRT/tests/core/lowering/test_device_casting.cpp, lines 12 to 194 in ec6aa6b)
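The runtime check in the first addition can be pictured with the following Python sketch. It is a simplified stand-in for the C++ logic in execute_engine.cpp, not the actual implementation; the function name and warning text are hypothetical.

```python
import warnings
import torch

def ensure_inputs_on_device(inputs, target_device: str):
    """Move any misplaced input tensors to the engine's device, warning the
    user that the offending tensor may have been internally generated."""
    moved = []
    for i, t in enumerate(inputs):
        if t.device != torch.device(target_device):
            warnings.warn(
                f"Input {i} is on {t.device}, expected {target_device}; "
                "moving it at runtime. If you did not create this tensor "
                "yourself, it may be internally generated by Torch; "
                "consider filing an issue report."
            )
            t = t.to(target_device)
        moved.append(t)
    return moved
```

In practice target_device would be the engine's device (e.g. "cuda:0"); moving tensors at runtime avoids the crash, at the cost of a device copy, which is why the lowering passes also fix the known cases ahead of time.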
Investigation Details

A brief overview of the investigation into this issue and the steps toward a solution is given here. A canonical trace originating from a BART model compilation is shown below:
TorchTRT-Compiled Graph Snippet
graph(%input_0 : __torch__.transformers.models.bart.modeling_bart.BartForConditionalGeneration_trt, %input_1 : Tensor, %input_2 : Tensor):
  %2 : int = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %3 : int = prim::Constant[value=9223372036854775807]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %4 : int = prim::Constant[value=1]()
  %5 : int = prim::Constant[value=-1]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %6 : NoneType = prim::Constant()
  %7 : int = prim::Constant[value=3]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %8 : Device = prim::Constant[value="cuda:0"]()
  %9 : bool = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %10 : Tensor = aten::slice(%input_1, %2, %2, %3, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %11 : Tensor = aten::slice(%10, %4, %2, %5, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %12 : int = aten::size(%input_1, %2) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %13 : int = aten::size(%input_1, %4) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %14 : int[] = prim::ListConstruct(%12, %13)
  %15 : Tensor = aten::clone(%11, %6) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %shifted_input_ids.1 : Tensor = aten::new_zeros(%input_1, %14, %7, %2, %8, %9) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020 : __torch__.torch.classes.tensorrt.Engine = prim::GetAttr[name="__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020"](%input_0)
  %19 : Tensor[] = prim::ListConstruct(%shifted_input_ids.1, %15)
  %20 : Tensor[] = tensorrt::execute_engine(%19, %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e020)
  %21 : Tensor = prim::ListUnpack(%20)
  %22 : Tensor = prim::Constant[value={2}]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:68:0
  %23 : int = prim::Constant[value=-100]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %24 : int = prim::Constant[value=-1]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:67:0
  %25 : int = prim::Constant[value=1]()
  %26 : Tensor = aten::fill_(%21, %22) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:68:0
  %27 : Tensor = aten::eq(%shifted_input_ids.1, %23) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %seq_len.1 : Tensor = prim::NumToTensor(%13) # :0:0
  ### The line above runs in Torch, where the NumToTensor call returns a dimensionless CPULong{} Tensor
  %29 : int[] = prim::ListConstruct(%24, %13)
  %30 : Tensor = aten::reshape(%input_1, %29)
  %input_ids0.1 : Tensor = aten::masked_fill_(%shifted_input_ids.1, %27, %25) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:72:0
  %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0 : __torch__.torch.classes.tensorrt.Engine = prim::GetAttr[name="__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0"](%input_0)
  %33 : Tensor[] = prim::ListConstruct(%input_2, %30, %seq_len.1)
  %34 : Tensor[] = tensorrt::execute_engine(%33, %__torch___transformers_models_bart_modeling_bart_BartForConditionalGeneration_trt_engine_0x557c2dc8e1c0)
  %35 : Tensor, %36 : Tensor, %37 : Tensor = prim::ListUnpack(%34)
  ### The line above runs in Torch-TRT, where the unpacked value is expected to be a CUDAInt{} tensor, but what is found is the CPULong{} Tensor
  %38 : int = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %39 : int = prim::Constant[value=4]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:120:0
  %40 : NoneType = prim::Constant()
  %41 : Device = prim::Constant[value="cuda:0"]()
  %42 : bool = prim::Constant[value=0]() # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:66:0
  %43 : Scalar = aten::ScalarImplicit(%37) # :0:0
  ### The above function schema expects a dimensionless tensor, and it seems that simply casting %37 to GPU + Int type also causes it to fail [ScalarImplicit expects a 0D Tensor, and the casts might impart a dimension on it]
  %positions.2 : Tensor = aten::arange(%38, %43, %39, %40, %41, %42) # /opt/conda/lib/python3.8/site-packages/transformers/models/bart/modeling_bart.py:120:0
As detailed in the comments within the code snippet above, the CUDA/CPU issue is a combination of three bugs: incorrect device placement, incorrect data types, and a violated function schema. Fixing all three of these issues is key to resolving this runtime bug. The device placement and aten::ScalarImplicit schema issues are resolved in PR #1416, and the data type issue is resolved in PR #1407. When both PRs are combined, all three of the above issues are addressed.
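The note that aten::item "does not exhibit this issue, but has functionally the same behavior" is what the aten::ScalarImplicit bugfix relies on. A small sketch of that behavior (the CUDA step is described in comments only, so the example runs on CPU):

```python
import torch

scalar_tensor = torch.tensor(5)  # 0D Long tensor, as prim::NumToTensor produces
value = scalar_tensor.item()     # aten::item: accepts a 0D tensor and returns a Scalar
# On a CUDA machine the same .item() call also succeeds after
# scalar_tensor.to("cuda:0"), whereas aten::ScalarImplicit can reject the
# cast tensor in TorchScript, which is what the lowering-pass bugfix addresses
```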