accuracy and performance of bfloat16 with bitblas linear #161

Open
AbedKhateeb2 opened this issue Aug 29, 2024 · 4 comments

@AbedKhateeb2

I tried to run the bfloat16 linear layer of BitBLAS, but I got a result that differs from torch.

output:
quantizing /decoder/block/0/layer/0/SelfAttention/k
torch linear took by avg 7.802581787109375e-05
BitBLAS Operator found in global_operator_cache.
bitblas linear took to init : 1.157283067703247 sec
bitblas linear took by avg 7.474946975708008e-05
torch compare : tensor(2.2344, device='cuda:0', dtype=torch.bfloat16)

The linear layer is from a pretrained model; the model was trained with bf16.
CUDA version: 12.1
GPU: A10G
OS: Ubuntu
BitBLAS version: bitblas==0.0.1.dev15

import time

import torch
from bitblas import Linear as BitBLASLinear

# `name` and `linear_layer` come from iterating over the pretrained model's modules
print(f"quantizing {name}")
in_features = linear_layer.in_features
out_features = linear_layer.out_features

opt_M = 1

class Custom(BitBLASLinear):
    # cast the output back to bfloat16 so it matches the rest of the bf16 model
    def forward(self, A):
        out = super().forward(A)
        out = out.to(torch.bfloat16)
        return out

input_tensor = torch.rand(opt_M, in_features).to(torch.bfloat16).cuda()

# warm up the torch linear for ~1 second
st = time.time()
while time.time() - st < 1.0:
    linear_layer(input_tensor)

times = 1000
with torch.no_grad():
    start_time = time.time()
    for _ in range(times):
        output_torch = linear_layer(input_tensor)
    end_time = time.time()
print(f"torch linear took by avg {(end_time - start_time) / times}")

start_time = time.time()
# bitblas_linear = Int8Linear(linear_module=linear_torch)
# BitBLASLinear.STORAGE_DTYPE = 'bfloat16'
bitblas_linear = Custom(
    linear_layer.in_features,
    linear_layer.out_features,
    bias=linear_layer.bias is not None,
    opt_M=opt_M,
    accum_dtype='float32',
    A_dtype='bfloat16',
    W_dtype='bfloat16',
)
bitblas_linear.load_and_transform_weight(linear_layer.weight.clone())
if linear_layer.bias is not None:
    bitblas_linear.bias.data = linear_layer.bias.data.clone()

# warm up the BitBLAS linear for ~1 second (this is included in the "init" time below)
st = time.time()
while time.time() - st < 1.0:
    bitblas_linear(input_tensor)
end_time = time.time()
print(f"bitblas linear took to init : {end_time - start_time} sec")
bitblas_linear.cuda()

with torch.no_grad():
    start_time = time.time()
    for _ in range(times):
        output_bitblas = bitblas_linear(input_tensor)
    end_time = time.time()
print(f"bitblas linear took by avg {(end_time - start_time) / times}")

print("torch compare : ", torch.mean(torch.abs(output_torch.to(torch.bfloat16) - output_bitblas.to(torch.bfloat16))))
@LeiWang1999
Contributor

Hi @AbedKhateeb2, a bfloat16-related test can be found at https://github.com/microsoft/BitBLAS/blob/main/testing/python/operators/test_general_matmul_bf16.py

Would you mind providing a simple unit test to reproduce? I cannot access the layer that you mentioned in your problem.

AbedKhateeb2 changed the title from "performance with bitblas linear" to "accuracy of bflaot 16 with bitblas linear" on Aug 29, 2024
AbedKhateeb2 changed the title from "accuracy of bflaot 16 with bitblas linear" to "accuracy and performance of bflaot 16 with bitblas linear" on Aug 29, 2024
@AbedKhateeb2
Author

Thank you @LeiWang1999 for your response 😃
Here is a standalone script. Environment:
torch 2.4.0
torchaudio 2.4.0
torchvision 0.19.0
bitblas 0.0.1.dev15
Python 3.10.14
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

import os
import time

import torch
import torchvision.models as models
from bitblas import Linear as BitBLASLinear

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Load a pre-trained VGG-16 model
vgg16 = models.vgg16(pretrained=True)

# Take a linear layer from the pre-trained model and move it to bf16 on the GPU
linear_layer = vgg16.classifier[3].to(torch.bfloat16).cuda()
linear_layer.bias = None
print(f"quantizing {linear_layer}")
in_features = linear_layer.in_features
out_features = linear_layer.out_features

opt_M = 1

input_tensor = torch.rand(opt_M, in_features).to(torch.bfloat16).cuda()

# warm up the torch linear for ~1 second
st = time.time()
while time.time() - st < 1.0:
    linear_layer(input_tensor)

times = 1000
with torch.no_grad():
    start_time = time.time()
    for _ in range(times):
        output_torch = linear_layer(input_tensor).to(torch.bfloat16)
    end_time = time.time()
print(f"torch linear took by avg {(end_time - start_time) / times}")

start_time = time.time()
bitblas_linear = BitBLASLinear(
    linear_layer.in_features,
    linear_layer.out_features,
    bias=linear_layer.bias is not None,
    opt_M=opt_M,
    accum_dtype='float32',
    A_dtype='bfloat16',
    W_dtype='bfloat16',
)
bitblas_linear.load_and_transform_weight(linear_layer.weight.clone())
if linear_layer.bias is not None:
    bitblas_linear.bias.data = linear_layer.bias.data.clone()

# warm up the BitBLAS linear for ~1 second (this is included in the "init" time below)
st = time.time()
while time.time() - st < 1.0:
    bitblas_linear(input_tensor)
end_time = time.time()
print(f"bitblas linear took to init : {end_time - start_time} sec")
bitblas_linear.cuda()

with torch.no_grad():
    start_time = time.time()
    for _ in range(times):
        output_bitblas = bitblas_linear(input_tensor)
    end_time = time.time()
print(f"bitblas linear took by avg {(end_time - start_time) / times}")

print("torch compare : ", torch.mean(torch.abs(output_torch.to(torch.bfloat16) - output_bitblas.to(torch.bfloat16))))

The result:
quantizing Linear(in_features=4096, out_features=4096, bias=False)
torch linear took by avg 7.706689834594727e-05
2024-08-29 17:20:22 [BitBLAS:WARNING]: [BitBLAS][Warning] with_zeros is not supported for int source format as int has a constant zeropoints already.
2024-08-29 17:20:23 [BitBLAS:WARNING]: [BitBLAS][Warning] with_zeros is not supported for int source format as int has a constant zeropoints already.
2024-08-29 17:20:25 [BitBLAS:WARNING]: [BitBLAS][Warning] with_zeros is not supported for int source format as int has a constant zeropoints already.
BitBLAS Operator found in global_operator_cache.
bitblas linear took to init : 10.60917353630066 sec
bitblas linear took by avg 9.263944625854492e-05
torch compare : tensor(2.0469, device='cuda:0', dtype=torch.bfloat16)

@LeiWang1999
Contributor

def forward(self, A, output=None):
    if A.dtype != torch.float16:
        A = A.half()
    A = self.bitblas_matmul.transform_input(A)

There does indeed exist a bug in BitBLASLinear that causes any input datatype to be cast to float16.
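
The fix itself landed upstream (see the next comment); conceptually, the cast should respect the operator's configured A_dtype instead of hard-coding float16. A rough sketch of that idea follows; the dtype lookup table and the self.A_dtype attribute are illustrative assumptions, not the exact code from the fix:

def forward(self, A, output=None):
    # Sketch only: resolve the configured activation dtype (e.g. 'bfloat16')
    # to a torch dtype instead of unconditionally calling A.half().
    # `self.A_dtype` and this mapping are assumed names, not BitBLAS's actual attributes.
    target_dtype = {"float16": torch.float16, "bfloat16": torch.bfloat16}[self.A_dtype]
    if A.dtype != target_dtype:
        A = A.to(target_dtype)
    A = self.bitblas_matmul.transform_input(A)
    ...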

LeiWang1999 changed the title from "accuracy and performance of bflaot 16 with bitblas linear" to "accuracy and performance of bfloat16 with bitblas linear" on Aug 30, 2024
@LeiWang1999
Contributor

Hi @AbedKhateeb2, take a look at PR #164.

You can check out this fix by installing the upstream BitBLAS with: pip install git+https://github.com/microsoft/BitBLAS.git
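
For a quick local sanity check (a hypothetical addition to the end of the standalone script above, not part of the PR), the mean absolute difference printed at the end should drop to near zero once the patched build is installed:

# hypothetical check appended to the end of the standalone script above,
# run after installing the patched BitBLAS build; the 1e-1 tolerance is an arbitrary loose bound
diff = torch.mean(torch.abs(output_torch.to(torch.bfloat16) - output_bitblas.to(torch.bfloat16)))
assert diff.item() < 1e-1, f"bf16 outputs still diverge: {diff.item()}"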
