Support padding based on row_dim (torchrec part) (pytorch#2204)
Summary:
Pull Request resolved: pytorch#2204

The MX4 GEMM kernel requires the total number of elements in the KJT per rank to be divisible by 32.

Failed job: aps-350x_lite-b76673ccdc

RuntimeError: Number of inputs needs to be a multiple of group size
Exception raised from quantize_mx_cuda at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/quantize_ops/quantize_mx.cu:63 (most recent call first):
# 2 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
# 3 fbgemm_gpu::quantize_mx_cuda(at::Tensor const&, std::vector<long, std::allocator<long> > const&, long, long, long, double, long, bool, long)
# 4 std::decay<c10::guts::infer_function_traits<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::vector<long, std::allocator<long> > const&, long, long, long, double, long, bool, long), &fbgemm_gpu::quantize_mx_cuda>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::vector<long, std::allocator<long> > const&, long, long, long, double, long, bool, long> > >::type::return_type>::type c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoFunctor...

Reviewed By: sryap

Differential Revision: D58223717

fbshipit-source-id: 910d365b95b9c8d06b1ac4240b550816d723c9f0
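The padding idea behind the fix can be sketched as follows. This is a hypothetical illustration, not the actual torchrec change: the helper name `pad_to_group_size` and the zero-padding strategy are assumptions; only the group-size-of-32 requirement comes from the commit.

```python
import torch
import torch.nn.functional as F

GROUP_SIZE = 32  # MX4 quantization group size required by the kernel (from the commit)


def pad_to_group_size(values: torch.Tensor, group_size: int = GROUP_SIZE) -> torch.Tensor:
    """Right-pad a 1-D tensor with zeros so its length is a multiple of group_size.

    Hypothetical sketch: ensures quantize_mx_cuda's "multiple of group size"
    precondition holds before quantization.
    """
    remainder = values.numel() % group_size
    if remainder == 0:
        return values
    pad_len = group_size - remainder
    return F.pad(values, (0, pad_len))


# Example: 70 elements get padded up to 96, the next multiple of 32.
padded = pad_to_group_size(torch.randn(70))
assert padded.numel() % GROUP_SIZE == 0
```

Any padding added this way would need to be stripped (or accounted for) after dequantization on the receiving rank.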