
Commit

Updated nb
suppathak committed Aug 8, 2023
1 parent 672f677 commit 8e2222a
Showing 1 changed file with 110 additions and 58 deletions.
168 changes: 110 additions & 58 deletions notebooks/efficient-gpu-utilization.ipynb
@@ -41,6 +41,14 @@
"Image('../reports/figures/LLM_size.png')"
]
},
{
"cell_type": "markdown",
"id": "79f44610-1ef9-4359-a1d8-1486ebb43d7a",
"metadata": {},
"source": [
"[Source](https://www.tasq.ai/blog/large-language-models/)"
]
},
{
"cell_type": "markdown",
"id": "569c40f1-0bd0-4123-8697-0c24ab6e5ff4",
@@ -178,19 +186,10 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 4,
"id": "f97aadcb-860a-438e-a0c8-d626a1bdd7e0",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/app-root/lib64/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"outputs": [],
"source": [
"from torch import cuda\n",
"import scipy\n",
@@ -218,7 +217,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "18de57f3-0600-41c5-b4c8-4757184d8512",
"metadata": {},
"outputs": [
@@ -228,7 +227,7 @@
"text": [
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 37C | 0% | 0% || 15360MB | 5MB | 14923MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 33C | 0% | 0% || 15360MB | 5MB | 14923MB || Enabled | Disabled |\n"
]
}
],
@@ -261,7 +260,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "f436a4de-6fb1-46ce-902f-9f964c65cee8",
"metadata": {},
"outputs": [],
@@ -273,7 +272,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "9a77b62d-8eaf-4af9-a435-281ad2192638",
"metadata": {},
"outputs": [],
@@ -294,35 +293,52 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 10,
"id": "ff844f31-5bff-4874-93a3-b259e44e8163",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-3b and are newly initialized: ['lm_head.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Inference time: 1104.456901550293 ms\n"
"Mean Inference time: 453.4655570983887 ms\n"
]
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model_org = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\", torch_dtype=\"auto\")\n",
"\n",
"# Calculate the inference time.\n",
"start_time = time.time()\n",
"generate_from_model(model_org, tokenizer)\n",
"end_time = time.time()\n",
"# Number of times to run the inference\n",
"num_iterations = 10\n",
"total_inference_time = 0\n",
"\n",
"# Run inference multiple times\n",
"for _ in range(num_iterations):\n",
" start_time = time.time()\n",
" # Replace generate_from_model with your actual inference function\n",
" generate_from_model(model_org, tokenizer)\n",
" end_time = time.time()\n",
"\n",
"# Print the inference time.\n",
"inference_time = (end_time - start_time) * 1000\n",
"print(f\"Inference time: {inference_time} ms\")"
" inference_time = (end_time - start_time) * 1000\n",
" total_inference_time += inference_time\n",
"\n",
"# Calculate and print the mean inference time\n",
"mean_inference_time = total_inference_time / num_iterations\n",
"print(f\"Mean Inference time: {mean_inference_time} ms\")"
]
},
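A minimal standalone sketch of the timing pattern used in the cell above. The prompt, `max_new_tokens`, and the body of `generate_from_model` are assumptions here, standing in for the helper defined earlier in the notebook:

```python
# Sketch only: the prompt, max_new_tokens, and generate_from_model body are assumed,
# standing in for the helper defined earlier in the notebook.
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

def generate_from_model(model, tokenizer, prompt="Hello, my name is"):
    # Tokenize the prompt, move it to the model's device, and generate a short continuation.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Average the wall-clock latency over several runs to smooth out run-to-run variance.
num_iterations = 10
total_ms = 0.0
for _ in range(num_iterations):
    start = time.time()
    generate_from_model(model, tokenizer)
    total_ms += (time.time() - start) * 1000

print(f"Mean inference time: {total_ms / num_iterations:.1f} ms")
```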
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 11,
"id": "2ec11f85-5f07-41af-9a31-b4173f883f05",
"metadata": {},
"outputs": [
@@ -336,7 +352,7 @@
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 39C | 0% | 42% || 15360MB | 6499MB | 8429MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 40C | 98% | 42% || 15360MB | 6519MB | 8409MB || Enabled | Disabled |\n"
]
}
],
@@ -349,7 +365,7 @@
"id": "56138227-09cf-4696-a4fd-f8f5fa712241",
"metadata": {},
"source": [
"We see that **memory used is 6499MB** when we load the model without 8-bit quantization. The **inference time is ~1104 ms**."
"We see that **memory used is 6519MB** when we load the model without 8-bit quantization. The **inference time is ~453 ms**."
]
},
{
@@ -362,48 +378,52 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "a7c81530-013a-4b45-b2b4-ef41057dcd37",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-3b and are newly initialized: ['lm_head.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"===================================BUG REPORT===================================\n",
"Welcome to bitsandbytes. For bug reports, please run\n",
"\n",
"python -m bitsandbytes\n",
"\n",
" and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
"================================================================================\n",
"bin /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so\n",
"CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...\n",
"CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0\n",
"CUDA SETUP: Highest compute capability among GPUs detected: 7.5\n",
"CUDA SETUP: Detected CUDA version 118\n",
"CUDA SETUP: Loading binary /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...\n",
"Inference time: 2288.9277935028076 ms\n"
"Mean Inference time: 1659.8277807235718 ms\n"
]
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\", load_in_8bit=True, torch_dtype=torch.float16)\n",
"# Calculate the inference time.\n",
"start_time = time.time()\n",
"generate_from_model(model_8bit, tokenizer)\n",
"end_time = time.time()\n",
"\n",
"# Print the inference time.\n",
"inference_time = (end_time - start_time) * 1000\n",
"print(f\"Inference time: {inference_time} ms\")"
"# Number of times to run the inference\n",
"num_iterations = 10\n",
"total_inference_time = 0\n",
"\n",
"# Run inference multiple times\n",
"for _ in range(num_iterations):\n",
" start_time = time.time()\n",
" # Replace generate_from_model with your actual inference function\n",
" generate_from_model(model_8bit, tokenizer)\n",
" end_time = time.time()\n",
"\n",
" inference_time = (end_time - start_time) * 1000\n",
" total_inference_time += inference_time\n",
"\n",
"# Calculate and print the mean inference time\n",
"mean_inference_time = total_inference_time / num_iterations\n",
"print(f\"Mean Inference time: {mean_inference_time} ms\")"
]
},
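The same 8-bit load can also be expressed with `BitsAndBytesConfig`, which newer transformers releases prefer over the bare `load_in_8bit` flag. A sketch, assuming transformers >= 4.30 with bitsandbytes and accelerate installed:

```python
# Sketch: equivalent 8-bit load via BitsAndBytesConfig (assumes transformers >= 4.30,
# plus bitsandbytes and accelerate in the environment).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "bigscience/bloom-3b"
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                 # let accelerate place layers on the GPU
    quantization_config=quant_config,  # weights stored in int8 via bitsandbytes
    torch_dtype=torch.float16,         # non-quantized modules kept in fp16
)
```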
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"id": "0f0b37d0-0d16-4c09-bde5-b0f6c03a1702",
"metadata": {},
"outputs": [
@@ -417,20 +437,20 @@
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"| ID | Name | Serial | UUID || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |\n",
"-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 37C | 0% | 28% || 15360MB | 4323MB | 10605MB || Enabled | Disabled |\n"
"| 0 | Tesla T4 | 1564620006411 | GPU-e8a8cf27-187d-2821-36c5-e654c154fbb1 || 36C | 38% | 28% || 15360MB | 4323MB | 10605MB || Enabled | Disabled |\n"
]
}
],
"source": [
"GPUtil.showUtilization(all=True) "
"GPUtil.showUtilization(all=True)"
]
},
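GPUtil can also be queried programmatically instead of printing the table, which makes it easier to log memory before and after each model load. A short sketch using the library's `getGPUs` accessor:

```python
# Sketch: read the same utilization figures programmatically with GPUtil.
import GPUtil

for gpu in GPUtil.getGPUs():
    print(f"GPU {gpu.id} ({gpu.name}): "
          f"{gpu.memoryUsed:.0f} MB used / {gpu.memoryTotal:.0f} MB total, "
          f"load {gpu.load * 100:.0f}%")
```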
{
"cell_type": "markdown",
"id": "5f6dac3b-0970-4f53-afb7-1102827dbe66",
"metadata": {},
"source": [
"We see that **memory used is 4323MB** when we load the model without 8-bit quantization. The **inference time in this case is ~2288 ms**."
"We see that **memory used is 4323MB** when we load the model with 8-bit quantization. The **mean inference time in this case is ~1659 ms**."
]
},
{
@@ -443,12 +463,44 @@
},
{
"cell_type": "markdown",
"id": "a8a6250b-40f7-4496-9419-719272fbd834",
"id": "0f6183c4-39e4-4d46-88a5-763c9608a622",
"metadata": {},
"source": [
"From these observations, we can see that the model without 8-bit quantization utilizes more memory (6499 MB) compared to the model with 8-bit quantization (4323 MB). However, the model without quantization achieves faster inference times (1104 ms) compared to the quantized model (2288 ms).\n",
"\n",
"It is important to consider the trade-off between memory usage and inference time when choosing between these models. If memory constraints are a concern, the quantized model may be preferred. On the other hand, if faster inference times are crucial and memory is not a limiting factor, the non-quantized model might be a better choice."
"**Without 8-bit quantization**"
]
},
{
"cell_type": "markdown",
"id": "61303f4f-c5b1-40e5-996f-526df081fc59",
"metadata": {},
"source": [
"- Memory Used: 6519 MB\n",
"- Inference Time : 453 ms"
]
},
{
"cell_type": "markdown",
"id": "576ce426-753b-435f-8513-2dc3af4938c0",
"metadata": {},
"source": [
"**With 8-bit quantization**"
]
},
{
"cell_type": "markdown",
"id": "c2f66bc3-2010-4eaa-9150-3fba1a21feed",
"metadata": {},
"source": [
"- Memory Used: 4323 MB\n",
"- Inference Time : 1659 ms"
]
},
{
"cell_type": "markdown",
"id": "bf10f9ac-9757-4eff-ba9d-13ba1d935abe",
"metadata": {},
"source": [
"After applying 8-bit quantization to the model, there is a decrease of approximately **33.60%** in memory usage compared to the non-quantized version. However, this reduction in memory usage comes at the cost of an approximately **266.85%** increase in mean inference time. Therefore, while 8-bit quantization helps save memory, it also leads to a significant increase in inference time, indicating a trade-off between memory efficiency and computational speed. The choice between the two depends on the specific requirements and priorities of the application."
]
}
],
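The summary percentages in the closing cell follow directly from the measured values; a short check, using the printed means from the two runs:

```python
# Reproduce the summary figures from the measured values above.
mem_baseline, mem_int8 = 6519, 4323        # MB, from GPUtil after each load
time_baseline, time_int8 = 453.47, 1659.83 # ms, mean over 10 runs

memory_reduction = (mem_baseline - mem_int8) / mem_baseline * 100
latency_increase = (time_int8 - time_baseline) / time_baseline * 100

print(f"Memory reduction with 8-bit quantization: {memory_reduction:.1f}%")  # ~33.7%
print(f"Mean inference time increase:             {latency_increase:.1f}%")  # ~266.0%
```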
