Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api_server 方式部署有概率卡住 #2691

Closed
1 of 3 tasks
LiYtao opened this issue Oct 31, 2024 · 6 comments
Closed
1 of 3 tasks

api_server 方式部署有概率卡住 #2691

LiYtao opened this issue Oct 31, 2024 · 6 comments
Assignees

Comments

@LiYtao
Copy link

LiYtao commented Oct 31, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

使用api_server 方式部署internvl2-40B时有概率卡住,使用了4张A100,debug日志的最后一行是 [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start

Reproduction

CUDA_VISIBLE_DEVICES=0,3,4,7 NCCL_P2P_DIRECT_DISABLE=1 lmdeploy serve api_server /var/aigc/model/InternVL2-40B --tp 4 --server-port 23333 --log-level DEBUG

Environment

root@71dde6280c86:/# lmdeploy check_env
sys.platform: linux
Python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 90.1
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.19.0+cu118
LMDeploy: 0.6.2+
transformers: 4.46.1
gradio: Not Found
fastapi: 0.115.4
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	24-47,72-95	1
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	24-47,72-95	1
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	24-47,72-95	1
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	24-47,72-95	1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error traceback

[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x311fd20200 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x3120520200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x3120523200 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x9abd20200 with size 8388608
[TM][DEBUG] malloc buffer 0x3120d23200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x9ac520200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x9ac523200 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x44d9d20200 with size 8388608
[TM][DEBUG] malloc buffer 0x9acd23200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x1d65d20200 with size 8388608
[TM][DEBUG] malloc buffer 0x44da520200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x44da523200 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x1d66520200 with size 12288
[TM][DEBUG] malloc buffer 0x44dad23200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x1d66523200 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = curandStateXORWOW; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x1d66d23200 with size 12288
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef650c00000 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef651400000 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef41ec00000 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef41f400000 with size 8388608
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a8d826400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c26400 with size 1024
[TM][DEBUG] malloc buffer 0x7f0a8d826800 with size 1056
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69026400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = long unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75826400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a8d826e00 with size 262144
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a75826800 with size 1056
[TM][DEBUG] malloc buffer 0x7f0a8d866e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = long unsigned int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a8d867200 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a8d867600 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c26800 with size 1056
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = long unsigned int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a69026800 with size 1056
[TM][DEBUG] malloc buffer 0x7f0a8d867800 with size 1024
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a75826e00 with size 262144
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a8d867c00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = long unsigned int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75866e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a8d868000 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a75867200 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c26e00 with size 262144
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75867600 with size 512
[TM][DEBUG] malloc buffer 0x7f0a8d868400 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a69026e00 with size 262144
[TM][DEBUG] malloc buffer 0x7f08b1c66e00 with size 1024
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a8d868600 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a8d868a00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69066e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75867800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a8d868e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69067200 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a8d869200 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c67200 with size 1024
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a8d869400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75867c00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a8d869800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69067600 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a75868000 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c67600 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef6a0c00000 with size 8388608
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) start
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) stop
[TM][DEBUG] malloc buffer 0x7f0a69067800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c67800 with size 1024
[TM][DEBUG] malloc buffer 0x7f0a75868400 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a69067c00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a69068000 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69068400 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a75868600 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a69068600 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c67c00 with size 1024
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69068a00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75868a00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69068e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75868e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] malloc buffer 0x7f0a75869200 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a75869400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69069200 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f0a75869800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c68000 with size 1024
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69069400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c68400 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef6a1400000 with size 8388608
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) start
[TM][DEBUG] malloc buffer 0x7f08b1c68600 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) stop
[TM][DEBUG] malloc buffer 0x7f0a69069800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c68a00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c68e00 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef67cc00000 with size 8388608
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) start
[TM][DEBUG] malloc buffer 0x7f08b1c69200 with size 512
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) stop
[TM][DEBUG] malloc buffer 0x7f08b1c69400 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c69800 with size 1024
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef67d400000 with size 8388608
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) start
[TM][DEBUG] void turbomind::ftNcclStreamSynchronize(turbomind::NcclParam, turbomind::NcclParam, cudaStream_t) stop
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a8d869c00 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f08b1c69c00 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69069c00 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a75869c00 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef651c00000 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef651e00000 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7ef41fc00000 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] malloc buffer 0x7f0a69169c00 with size 1024
[TM][INFO] LlamaBatch<T>::Start()
[TM][DEBUG] malloc buffer 0x7ef41fe00000 with size 1048576
[TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = unsigned int; size_t = long unsigned int]
[TM][DEBUG] Cannot find buffer (nil), mallocing new one.
[TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
[TM][DEBUG] void turbomind::LlamaV2<T>::forwardUnified(T*, T*, T*, void**, const int*, const int*, const int*, const int*, const float*, const bool*, size_t, int, int, int*, const turbomind::Sequence**) [with T = __nv_bfloat16; size_t = long unsigned int]
[TM][DEBUG] malloc buffer 0x7f08b1d69c00 with size 1024
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][DEBUG] void turbomind::LlamaV2<T>::forwardUnified(T*, T*, T*, void**, const int*, const int*, const int*, const int*, const float*, const bool*, size_t, int, int, int*, const turbomind::Sequence**) [with T = __nv_bfloat16; size_t = long unsigned int]
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] start
[TM][DEBUG] ncclDataType_t turbomind::getNcclDataType() [with T = __nv_bfloat16] start
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] start
[TM][DEBUG] ncclDataType_t turbomind::getNcclDataType() [with T = __nv_bfloat16] start
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] stop
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: output_norm_weight
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: finished
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: rope_theta
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_block_counts
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: block_ptrs
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] stop
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: output_norm_weight
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: finished
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: rope_theta
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_block_counts
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: block_ptrs
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] malloc buffer 0x7f0a8d969c00 with size 1024
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][INFO] LlamaBatch<T>::Start()
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] void turbomind::LlamaV2<T>::forwardUnified(T*, T*, T*, void**, const int*, const int*, const int*, const int*, const float*, const bool*, size_t, int, int, int*, const turbomind::Sequence**) [with T = __nv_bfloat16; size_t = long unsigned int]
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] start
[TM][DEBUG] ncclDataType_t turbomind::getNcclDataType() [with T = __nv_bfloat16] start
[TM][DEBUG] void turbomind::ftNcclAllGather(const T*, T*, int, int, turbomind::NcclParam, cudaStream_t) [with T = __nv_bfloat16; cudaStream_t = CUstream_st*] stop
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: output_norm_weight
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: finished
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: rope_theta
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_block_counts
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: block_ptrs
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: pf_batch_size
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: dc_batch_size
[TM][DEBUG] T turbomind::Tensor::getVal() const [with T = int] start
[TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = int; size_t = long unsigned int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_q_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_k_len
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = int] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_input
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
@lzhangzz
Copy link
Collaborator

lzhangzz commented Nov 4, 2024

Maybe related to #2706

Copy link

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

@github-actions github-actions bot added the Stale label Nov 12, 2024
@DefTruth
Copy link
Contributor

遇到了相同的问题,50个实例开TP=2,有几个会卡死,其中一个卡GPU利用率为0,另一个为100%

@github-actions github-actions bot removed the Stale label Nov 14, 2024
@DefTruth
Copy link
Contributor

@LiYtao 看着是关了P2P, NCCL_P2P_DIRECT_DISABLE=1。我在不支持P2P的卡上测试也有类似的情况,请问您这边解决了这个问题了吗?

Copy link

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

@github-actions github-actions bot added the Stale label Nov 24, 2024
Copy link

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants