
Commit

Updated nb
suppathak committed Aug 8, 2023
1 parent 672f677 commit 8e2222a
Showing 1 changed file with 110 additions and 58 deletions.
168 changes: 110 additions & 58 deletions notebooks/efficient-gpu-utilization.ipynb
@@ -41,6 +41,14 @@
"Image('../reports/figures/LLM_size.png')"
]
},
{
"cell_type": "markdown",
"id": "79f44610-1ef9-4359-a1d8-1486ebb43d7a",
"metadata": {},
"source": [
"[Source](https://www.tasq.ai/blog/large-language-models/)"
]
},
{
"cell_type": "markdown",
"id": "569c40f1-0bd0-4123-8697-0c24ab6e5ff4",
@@ -178,19 +186,10 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 4,
"id": "f97aadcb-860a-438e-a0c8-d626a1bdd7e0",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/app-root/lib64/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"outputs": [],
"source": [
"from torch import cuda\n",
"import scipy\n",
@@ -218,7 +217,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "18de57f3-0600-41c5-b4c8-4757184d8512",
"metadata": {},
"outputs": [
@@ -228,7 +227,7 @@
"text": [
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 37C | 0% | 0% || 15360MB | 5MB | 14923MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 33C | 0% | 0% || 15360MB | 5MB | 14923MB || Enabled | Disabled |\n"
]
}
],
@@ -261,7 +260,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "f436a4de-6fb1-46ce-902f-9f964c65cee8",
"metadata": {},
"outputs": [],
@@ -273,7 +272,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "9a77b62d-8eaf-4af9-a435-281ad2192638",
"metadata": {},
"outputs": [],
@@ -294,35 +293,52 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 10,
"id": "ff844f31-5bff-4874-93a3-b259e44e8163",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-3b and are newly initialized: ['lm_head.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Inference time: 1104.456901550293 ms\n"
"Mean Inference time: 453.4655570983887 ms\n"
]
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model_org = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\", torch_dtype=\"auto\")\n",
"\n",
"# Calculate the inference time.\n",
"start_time = time.time()\n",
"generate_from_model(model_org, tokenizer)\n",
"end_time = time.time()\n",
"# Number of times to run the inference\n",
"num_iterations = 10\n",
"total_inference_time = 0\n",
"\n",
"# Run inference multiple times\n",
"for _ in range(num_iterations):\n",
" start_time = time.time()\n",
" # Replace generate_from_model with your actual inference function\n",
" generate_from_model(model_org, tokenizer)\n",
" end_time = time.time()\n",
"\n",
"# Print the inference time.\n",
"inference_time = (end_time - start_time) * 1000\n",
"print(f\"Inference time: {inference_time} ms\")"
" inference_time = (end_time - start_time) * 1000\n",
" total_inference_time += inference_time\n",
"\n",
"# Calculate and print the mean inference time\n",
"mean_inference_time = total_inference_time / num_iterations\n",
"print(f\"Mean Inference time: {mean_inference_time} ms\")"
]
},
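A minimal standalone sketch of the timing pattern used in the cell above. The prompt, `max_new_tokens`, and the body of `generate_from_model` are assumptions here, standing in for the helper defined earlier in the notebook:

```python
# Sketch only: the prompt, max_new_tokens, and generate_from_model body are assumed,
# standing in for the helper defined earlier in the notebook.
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

def generate_from_model(model, tokenizer, prompt="Hello, my name is"):
    # Tokenize the prompt, move it to the model's device, and generate a short continuation.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Average the wall-clock latency over several runs to smooth out run-to-run variance.
num_iterations = 10
total_ms = 0.0
for _ in range(num_iterations):
    start = time.time()
    generate_from_model(model, tokenizer)
    total_ms += (time.time() - start) * 1000

print(f"Mean inference time: {total_ms / num_iterations:.1f} ms")
```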
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 11,
"id": "2ec11f85-5f07-41af-9a31-b4173f883f05",
"metadata": {},
"outputs": [
@@ -336,7 +352,7 @@
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 39C | 0% | 42% || 15360MB | 6499MB | 8429MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 40C | 98% | 42% || 15360MB | 6519MB | 8409MB || Enabled | Disabled |\n"
]
}
],
@@ -349,7 +365,7 @@
"id": "56138227-09cf-4696-a4fd-f8f5fa712241",
"metadata": {},
"source": [
"We see that **memory used is 6499MB** when we load the model without 8-bit quantization. The **inference time is ~1104 ms**."
"We see that **memory used is 6519MB** when we load the model without 8-bit quantization. The **inference time is ~453 ms**."
]
},
{
@@ -362,48 +378,52 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "a7c81530-013a-4b45-b2b4-ef41057dcd37",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-3b and are newly initialized: ['lm_head.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"===================================BUG REPORT===================================\n",
"Welcome to bitsandbytes. For bug reports, please run\n",
"\n",
"python -m bitsandbytes\n",
"\n",
" and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
"================================================================================\n",
"bin /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so\n",
"CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...\n",
"CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0\n",
"CUDA SETUP: Highest compute capability among GPUs detected: 7.5\n",
"CUDA SETUP: Detected CUDA version 118\n",
"CUDA SETUP: Loading binary /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...\n",
"Inference time: 2288.9277935028076 ms\n"
"Mean Inference time: 1659.8277807235718 ms\n"
]
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\", load_in_8bit=True, torch_dtype=torch.float16)\n",
"# Calculate the inference time.\n",
"start_time = time.time()\n",
"generate_from_model(model_8bit, tokenizer)\n",
"end_time = time.time()\n",
"\n",
"# Print the inference time.\n",
"inference_time = (end_time - start_time) * 1000\n",
"print(f\"Inference time: {inference_time} ms\")"
"# Number of times to run the inference\n",
"num_iterations = 10\n",
"total_inference_time = 0\n",
"\n",
"# Run inference multiple times\n",
"for _ in range(num_iterations):\n",
" start_time = time.time()\n",
" # Replace generate_from_model with your actual inference function\n",
" generate_from_model(model_8bit, tokenizer)\n",
" end_time = time.time()\n",
"\n",
" inference_time = (end_time - start_time) * 1000\n",
" total_inference_time += inference_time\n",
"\n",
"# Calculate and print the mean inference time\n",
"mean_inference_time = total_inference_time / num_iterations\n",
"print(f\"Mean Inference time: {mean_inference_time} ms\")"
]
},
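The same 8-bit load can also be expressed with `BitsAndBytesConfig`, which newer transformers releases prefer over the bare `load_in_8bit` flag. A sketch, assuming transformers >= 4.30 with bitsandbytes and accelerate installed:

```python
# Sketch: equivalent 8-bit load via BitsAndBytesConfig (assumes transformers >= 4.30,
# plus bitsandbytes and accelerate in the environment).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "bigscience/bloom-3b"
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                 # let accelerate place layers on the GPU
    quantization_config=quant_config,  # weights stored in int8 via bitsandbytes
    torch_dtype=torch.float16,         # non-quantized modules kept in fp16
)
```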
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"id": "0f0b37d0-0d16-4c09-bde5-b0f6c03a1702",
"metadata": {},
"outputs": [
@@ -417,20 +437,20 @@
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 37C | 0% | 28% || 15360MB | 4323MB | 10605MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 36C | 38% | 28% || 15360MB | 4323MB | 10605MB || Enabled | Disabled |\n"
]
}
],
"source": [
"GPUtil.showUtilization(all=True) "
"GPUtil.showUtilization(all=True)"
]
},
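GPUtil can also be queried programmatically instead of printing the table, which makes it easier to log memory before and after each model load. A short sketch using the library's `getGPUs` accessor:

```python
# Sketch: read the same utilization figures programmatically with GPUtil.
import GPUtil

for gpu in GPUtil.getGPUs():
    print(f"GPU {gpu.id} ({gpu.name}): "
          f"{gpu.memoryUsed:.0f} MB used / {gpu.memoryTotal:.0f} MB total, "
          f"load {gpu.load * 100:.0f}%")
```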
{
"cell_type": "markdown",
"id": "5f6dac3b-0970-4f53-afb7-1102827dbe66",
"metadata": {},
"source": [
"We see that **memory used is 4323MB** when we load the model without 8-bit quantization. The **inference time in this case is ~2288 ms**."
"We see that **memory used is 4323MB** when we load the model with 8-bit quantization. The **mean inference time in this case is ~1659 ms**."
]
},
{
@@ -443,12 +463,44 @@
},
{
"cell_type": "markdown",
"id": "a8a6250b-40f7-4496-9419-719272fbd834",
"id": "0f6183c4-39e4-4d46-88a5-763c9608a622",
"metadata": {},
"source": [
"From these observations, we can see that the model without 8-bit quantization utilizes more memory (6499 MB) compared to the model with 8-bit quantization (4323 MB). However, the model without quantization achieves faster inference times (1104 ms) compared to the quantized model (2288 ms).\n",
"\n",
"It is important to consider the trade-off between memory usage and inference time when choosing between these models. If memory constraints are a concern, the quantized model may be preferred. On the other hand, if faster inference times are crucial and memory is not a limiting factor, the non-quantized model might be a better choice."
"**Without 8-bit quantization**"
]
},
{
"cell_type": "markdown",
"id": "61303f4f-c5b1-40e5-996f-526df081fc59",
"metadata": {},
"source": [
"- Memory Used: 6519 MB\n",
"- Inference Time : 453 ms"
]
},
{
"cell_type": "markdown",
"id": "576ce426-753b-435f-8513-2dc3af4938c0",
"metadata": {},
"source": [
"**With 8-bit quantization**"
]
},
{
"cell_type": "markdown",
"id": "c2f66bc3-2010-4eaa-9150-3fba1a21feed",
"metadata": {},
"source": [
"- Memory Used: 4323 MB\n",
"- Inference Time : 1659 ms"
]
},
{
"cell_type": "markdown",
"id": "bf10f9ac-9757-4eff-ba9d-13ba1d935abe",
"metadata": {},
"source": [
"After applying 8-bit quantization to the model, there is a decrease of approximately **33.60%** in memory usage compared to the non-quantized version. However, this reduction in memory usage comes at the cost of an approximately **266.85%** increase in mean inference time. Therefore, while 8-bit quantization helps save memory, it also leads to a significant increase in inference time, indicating a trade-off between memory efficiency and computational speed. The choice between the two depends on the specific requirements and priorities of the application."
]
}
],
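The summary percentages in the closing cell follow directly from the measured values; a short check, using the printed means from the two runs:

```python
# Reproduce the summary figures from the measured values above.
mem_baseline, mem_int8 = 6519, 4323        # MB, from GPUtil after each load
time_baseline, time_int8 = 453.47, 1659.83 # ms, mean over 10 runs

memory_reduction = (mem_baseline - mem_int8) / mem_baseline * 100
latency_increase = (time_int8 - time_baseline) / time_baseline * 100

print(f"Memory reduction with 8-bit quantization: {memory_reduction:.1f}%")  # ~33.7%
print(f"Mean inference time increase:             {latency_increase:.1f}%")  # ~266.0%
```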
