diff --git a/docs/vertexai.md b/docs/vertexai.md
index f38201f..ed3c448 100644
--- a/docs/vertexai.md
+++ b/docs/vertexai.md
@@ -20,8 +20,20 @@ Google Cloud offers three flavors of Notebook instances: User-Managed, Google Ma
![2_select_notebook_name](/images/3_select_notebook_name.png)
5. On the _Environment_ tab, select `Debian 11` and select your desired Environment. Many of the tutorials specify a recommended environment. Don't worry about a startup script or metadata. Click **Continue**.
+ - The following environments are configured for **GPU use**.
+
+ ![GPU environments](/images/GPU_environments.png)
-6. Under _Machine type_ select your desired number of CPUs/GPUs. This is usually specified by the tutorial you are completing.
+6. Under _Machine type_ select your desired number of CPUs/GPUs. This is usually specified by the tutorial you are completing.
+
+- **Follow the steps below if you are using GPUs:**
+    - Click on the GPU dropdown menu and select your GPU processor.
+
+    ![GPU processors](/images/GPU_processor.png)
+    - Then check the box labeled **'Install NVIDIA GPU driver automatically for me'** to have your notebook install GPU drivers automatically.
+    - Finally, select the number of GPUs you wish to use. The number of available GPUs varies by machine type and GPU processor selected.
+
+ ![Number of GPUs](/images/GPU_numbers.png)
7. On the same page, click **Enable Idle Shutdown** and specify the idle minutes for shutdown. This means, if you close your browser and walk away without stopping your instance, it will shutdown automatically after this many minutes. We recommend 30 minutes.
@@ -69,3 +81,7 @@ Another thing worth noting is that when you run a cell, sometimes it doesn't pro
diff --git a/images/GCP_chatbot_results.png b/images/GCP_chatbot_results.png
new file mode 100644
index 0000000..b348b9f
Binary files /dev/null and b/images/GCP_chatbot_results.png differ
diff --git a/images/GPU_environments.png b/images/GPU_environments.png
new file mode 100644
index 0000000..2a80982
Binary files /dev/null and b/images/GPU_environments.png differ
diff --git a/images/GPU_numbers.png b/images/GPU_numbers.png
new file mode 100644
index 0000000..7269ec0
Binary files /dev/null and b/images/GPU_numbers.png differ
diff --git a/images/GPU_processor.png b/images/GPU_processor.png
new file mode 100644
index 0000000..530ba8d
Binary files /dev/null and b/images/GPU_processor.png differ
diff --git a/tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb b/tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb
new file mode 100644
index 0000000..09b3d8d
--- /dev/null
+++ b/tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb
@@ -0,0 +1,2590 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "92021b22-3fbf-4489-9e88-aea4d73f3529",
+ "metadata": {},
+ "source": [
+ "# Finetuning and Deploying Hugging Face Models on Vertex AI"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dfbc2ea5-1aca-4322-a15c-7bcbb3925ef6",
+ "metadata": {},
+ "source": [
+ "For this tutorial it is recommended to use 1 GPU to speed up processes, this notebooks was run using the machinetype n1-highcpu-8 (8 vCPUs, 7.199 GB RAM) on Tensorflow. Visit the following tutorial to set up notebooks that utilize: GPUs [Spinning up a Vertex AI Notebook](../../../docs/vertexai.md).\n",
+ "\n",
+ "This tutorial will focus on utilizing Hugging Face which is a repository for user to share and download machine learning models, datasets, and demos. For this tutorial we will load in a model and dataset from Hugging Face and train and test our model before deploying it on Vertex AI. The model we will be deploying is Flan T5 and the datasets is [ccdv/pubmed-summarization](https://HuggingFace.co/datasets/ccdv/pubmed-summarization). Steps will show how to hypertune a model locally and how to launch our custom training job on Vertex AI Training, these steps are based on Keras NLP Tutorials for [abstractive summarization](https://keras.io/examples/nlp/t5_hf_summarization/).\n",
+ "\n",
+ "You may be wondering why are we training a pretrained model? The reason for this is because we are fine tuning our pretrained model for optimal performance on a particular application, in our case summarizing scientific documents. This is not a necessary step anymore as new methods have been made to enhance model performance like zero-shot learning which we will go over in our next tutorial."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ab3668c-58c0-489a-aa4f-6f7e045b450f",
+ "metadata": {},
+ "source": [
+ "## Install Tools"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "161dea96-601e-4194-a285-51b0c4403e9d",
+ "metadata": {},
+ "source": [
+ "Hugging Face **transformers** are an open-source framework that allows you to utilize APIs and tools to download pretrained models, set hyperparameters, tokenize datasets, and further tune them to suite your needs. Here we are updating Vertex AI as well as installing the transformers package and **datasets** so that we can have access to Hugging Face datasets and as a bonus we are adding the S3 feature to help download datasets that may already be in a S3 bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "a6e5884b-ac90-42d4-aafd-d34d5495d24d",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Collecting transformers\n",
+ " Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/c1/bd/f64d67df4d3b05a460f281defe830ffab6d7940b7ca98ec085e94e024781/transformers-4.34.1-py3-none-any.whl.metadata\n",
+ " Downloading transformers-4.34.1-py3-none-any.whl.metadata (121 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m121.5/121.5 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hCollecting datasets\n",
+ " Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/7c/55/b3432f43d6d7fee999bb23a547820d74c48ec540f5f7842e41aa5d8d5f3a/datasets-2.14.6-py3-none-any.whl.metadata\n",
+ " Downloading datasets-2.14.6-py3-none-any.whl.metadata (19 kB)\n",
+ "Collecting rouge_score\n",
+ " Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
+ " Preparing metadata (setup.py) ... \u001b[?25ldone\n",
+ "\u001b[?25hCollecting evaluate\n",
+ " Obtaining dependency information for evaluate from https://files.pythonhosted.org/packages/70/63/7644a1eb7b0297e585a6adec98ed9e575309bb973c33b394dae66bc35c69/evaluate-0.4.1-py3-none-any.whl.metadata\n",
+ " Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)\n",
+ "Collecting keras_nlp\n",
+ " Obtaining dependency information for keras_nlp from https://files.pythonhosted.org/packages/37/d4/dfd85606db811af2138e97fc480eb7ed709042dd96dd453868bede0929fe/keras_nlp-0.6.2-py3-none-any.whl.metadata\n",
+ " Downloading keras_nlp-0.6.2-py3-none-any.whl.metadata (7.2 kB)\n",
+ "Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers) (3.12.4)\n",
+ "Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)\n",
+ " Obtaining dependency information for huggingface-hub<1.0,>=0.16.4 from https://files.pythonhosted.org/packages/ef/b5/b6107bd65fa4c96fdf00e4733e2fe5729bb9e5e09997f63074bb43d3ab28/huggingface_hub-0.18.0-py3-none-any.whl.metadata\n",
+ " Downloading huggingface_hub-0.18.0-py3-none-any.whl.metadata (13 kB)\n",
+ "Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from transformers) (1.23.5)\n",
+ "Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from transformers) (23.1)\n",
+ "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers) (6.0.1)\n",
+ "Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.10/site-packages (from transformers) (2023.8.8)\n",
+ "Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers) (2.31.0)\n",
+ "Collecting tokenizers<0.15,>=0.14 (from transformers)\n",
+ " Obtaining dependency information for tokenizers<0.15,>=0.14 from https://files.pythonhosted.org/packages/a7/7b/c1f643eb086b6c5c33eef0c3752e37624bd23e4cbc9f1332748f1c6252d1/tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n",
+ "Collecting safetensors>=0.3.1 (from transformers)\n",
+ " Obtaining dependency information for safetensors>=0.3.1 from https://files.pythonhosted.org/packages/20/4e/878b080dbda92666233ec6f316a53969edcb58eab1aa399a64d0521cf953/safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)\n",
+ "Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers) (4.66.1)\n",
+ "Requirement already satisfied: pyarrow>=8.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (9.0.0)\n",
+ "Requirement already satisfied: dill<0.3.8,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.3.1.1)\n",
+ "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets) (2.0.3)\n",
+ "Collecting xxhash (from datasets)\n",
+ " Obtaining dependency information for xxhash from https://files.pythonhosted.org/packages/80/8a/1dd41557883b6196f8f092011a5c1f72d4d44cf36d7b67d4a5efe3127949/xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n",
+ "Collecting multiprocess (from datasets)\n",
+ " Obtaining dependency information for multiprocess from https://files.pythonhosted.org/packages/35/a8/36d8d7b3e46b377800d8dec47891cdf05842d1a2366909ae4a0c89fbc5e6/multiprocess-0.70.15-py310-none-any.whl.metadata\n",
+ " Downloading multiprocess-0.70.15-py310-none-any.whl.metadata (7.2 kB)\n",
+ "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (2023.9.2)\n",
+ "Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets) (3.8.5)\n",
+ "Requirement already satisfied: absl-py in /opt/conda/lib/python3.10/site-packages (from rouge_score) (1.4.0)\n",
+ "Collecting nltk (from rouge_score)\n",
+ " Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m80.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: six>=1.14.0 in /opt/conda/lib/python3.10/site-packages (from rouge_score) (1.16.0)\n",
+ "Collecting responses<0.19 (from evaluate)\n",
+ " Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
+ "Requirement already satisfied: keras-core in /opt/conda/lib/python3.10/site-packages (from keras_nlp) (0.1.7)\n",
+ "Requirement already satisfied: rich in /opt/conda/lib/python3.10/site-packages (from keras_nlp) (13.5.3)\n",
+ "Requirement already satisfied: dm-tree in /opt/conda/lib/python3.10/site-packages (from keras_nlp) (0.1.8)\n",
+ "Collecting tensorflow-text (from keras_nlp)\n",
+ " Obtaining dependency information for tensorflow-text from https://files.pythonhosted.org/packages/0b/5f/8b301d2d0cea8334c22aaeb8880ce115ec34d7eba20f7b08c64202011a85/tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)\n",
+ "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (23.1.0)\n",
+ "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (3.2.0)\n",
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.4)\n",
+ "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n",
+ "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.2)\n",
+ "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.0)\n",
+ "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n",
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.16.4->transformers) (4.5.0)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->transformers) (3.4)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests->transformers) (1.26.16)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->transformers) (2023.7.22)\n",
+ "Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)\n",
+ " Obtaining dependency information for huggingface-hub<1.0,>=0.16.4 from https://files.pythonhosted.org/packages/aa/f3/3fc97336a0e90516901befd4f500f08d691034d387406fdbde85bea827cc/huggingface_hub-0.17.3-py3-none-any.whl.metadata\n",
+ " Downloading huggingface_hub-0.17.3-py3-none-any.whl.metadata (13 kB)\n",
+ "Requirement already satisfied: namex in /opt/conda/lib/python3.10/site-packages (from keras-core->keras_nlp) (0.0.7)\n",
+ "Requirement already satisfied: h5py in /opt/conda/lib/python3.10/site-packages (from keras-core->keras_nlp) (3.9.0)\n",
+ "Collecting dill<0.3.8,>=0.3.0 (from datasets)\n",
+ " Obtaining dependency information for dill<0.3.8,>=0.3.0 from https://files.pythonhosted.org/packages/f5/3a/74a29b11cf2cdfcd6ba89c0cecd70b37cd1ba7b77978ce611eb7a146a832/dill-0.3.7-py3-none-any.whl.metadata\n",
+ " Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)\n",
+ "Requirement already satisfied: click in /opt/conda/lib/python3.10/site-packages (from nltk->rouge_score) (8.1.7)\n",
+ "Requirement already satisfied: joblib in /opt/conda/lib/python3.10/site-packages (from nltk->rouge_score) (1.3.2)\n",
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.3.post1)\n",
+ "Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.3)\n",
+ "Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from rich->keras_nlp) (3.0.0)\n",
+ "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.10/site-packages (from rich->keras_nlp) (2.16.1)\n",
+ "Requirement already satisfied: tensorflow-hub>=0.13.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow-text->keras_nlp) (0.14.0)\n",
+ "Collecting tensorflow<2.15,>=2.14.0 (from tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for tensorflow<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/e2/7a/c7762c698fb1ac41a7e3afee51dc72aa3ec74ae8d2f57ce19a9cded3a4af/tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)\n",
+ "Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->keras_nlp) (0.1.2)\n",
+ "Requirement already satisfied: astunparse>=1.6.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (1.6.3)\n",
+ "Requirement already satisfied: flatbuffers>=23.5.26 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (23.5.26)\n",
+ "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.4.0)\n",
+ "Requirement already satisfied: google-pasta>=0.1.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.2.0)\n",
+ "Requirement already satisfied: libclang>=13.0.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (16.0.6)\n",
+ "Collecting ml-dtypes==0.2.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for ml-dtypes==0.2.0 from https://files.pythonhosted.org/packages/d1/1d/d5cf76e5e40f69dbd273036e3172ae4a614577cb141673427b80cac948df/ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)\n",
+ "Requirement already satisfied: opt-einsum>=2.3.2 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (3.3.0)\n",
+ "Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (3.20.3)\n",
+ "Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (68.2.2)\n",
+ "Requirement already satisfied: termcolor>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (2.3.0)\n",
+ "Requirement already satisfied: wrapt<1.15,>=1.11.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (1.14.1)\n",
+ "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.31.0)\n",
+ "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (1.48.0)\n",
+ "Collecting tensorboard<2.15,>=2.14 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for tensorboard<2.15,>=2.14 from https://files.pythonhosted.org/packages/73/a2/66ed644f6ed1562e0285fcd959af17670ea313c8f331c46f79ee77187eb9/tensorboard-2.14.1-py3-none-any.whl.metadata\n",
+ " Downloading tensorboard-2.14.1-py3-none-any.whl.metadata (1.7 kB)\n",
+ "Collecting tensorflow-estimator<2.15,>=2.14.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for tensorflow-estimator<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/d1/da/4f264c196325bb6e37a6285caec5b12a03def489b57cc1fdac02bb6272cd/tensorflow_estimator-2.14.0-py2.py3-none-any.whl.metadata\n",
+ " Downloading tensorflow_estimator-2.14.0-py2.py3-none-any.whl.metadata (1.3 kB)\n",
+ "Collecting keras<2.15,>=2.14.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for keras<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/fe/58/34d4d8f1aa11120c2d36d7ad27d0526164b1a8ae45990a2fede31d0e59bf/keras-2.14.0-py3-none-any.whl.metadata\n",
+ " Downloading keras-2.14.0-py3-none-any.whl.metadata (2.4 kB)\n",
+ "Requirement already satisfied: wheel<1.0,>=0.23.0 in /opt/conda/lib/python3.10/site-packages (from astunparse>=1.6.0->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.41.2)\n",
+ "Collecting grpcio<2.0,>=1.24.3 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp)\n",
+ " Obtaining dependency information for grpcio<2.0,>=1.24.3 from https://files.pythonhosted.org/packages/29/cc/e6883efbbcaa6570a0d2207ba53c796137f11293e47d11e2696f37b66811/grpcio-1.59.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading grpcio-1.59.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)\n",
+ "Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (2.23.0)\n",
+ "Requirement already satisfied: google-auth-oauthlib<1.1,>=0.5 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (1.0.0)\n",
+ "Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (3.4.4)\n",
+ "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.7.1)\n",
+ "Requirement already satisfied: werkzeug>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (2.1.2)\n",
+ "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (4.2.4)\n",
+ "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.3.0)\n",
+ "Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (4.9)\n",
+ "Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.10/site-packages (from google-auth-oauthlib<1.1,>=0.5->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (1.3.1)\n",
+ "Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /opt/conda/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (0.5.0)\n",
+ "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.10/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp) (3.2.2)\n",
+ "Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.7/7.7 MB\u001b[0m \u001b[31m98.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading datasets-2.14.6-py3-none-any.whl (493 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m493.7/493.7 kB\u001b[0m \u001b[31m57.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m17.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading keras_nlp-0.6.2-py3-none-any.whl (590 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m590.1/590.1 kB\u001b[0m \u001b[31m58.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m84.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m108.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m46.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m23.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m108.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m33.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (489.8 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m489.8/489.8 MB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m69.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading keras-2.14.0-py3-none-any.whl (1.7 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m89.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading tensorboard-2.14.1-py3-none-any.whl (5.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m113.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading grpcio-1.59.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.3/5.3 MB\u001b[0m \u001b[31m115.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading tensorflow_estimator-2.14.0-py2.py3-none-any.whl (440 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m53.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hBuilding wheels for collected packages: rouge_score\n",
+ " Building wheel for rouge_score (setup.py) ... \u001b[?25ldone\n",
+ "\u001b[?25h Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24932 sha256=57aa40a32d8d9171d43b9bc47cc3472fac0fb1192aa80eba9defb8e4ffd2352a\n",
+ " Stored in directory: /home/jupyter/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
+ "Successfully built rouge_score\n",
+ "Installing collected packages: xxhash, tensorflow-estimator, safetensors, nltk, ml-dtypes, keras, grpcio, dill, rouge_score, responses, multiprocess, huggingface-hub, tokenizers, transformers, tensorboard, datasets, tensorflow, evaluate, tensorflow-text, keras_nlp\n",
+ " Attempting uninstall: tensorflow-estimator\n",
+ " Found existing installation: tensorflow-estimator 2.12.0\n",
+ " Uninstalling tensorflow-estimator-2.12.0:\n",
+ " Successfully uninstalled tensorflow-estimator-2.12.0\n",
+ " Attempting uninstall: ml-dtypes\n",
+ " Found existing installation: ml-dtypes 0.3.1\n",
+ " Uninstalling ml-dtypes-0.3.1:\n",
+ " Successfully uninstalled ml-dtypes-0.3.1\n",
+ " Attempting uninstall: keras\n",
+ " Found existing installation: keras 2.12.0\n",
+ " Uninstalling keras-2.12.0:\n",
+ " Successfully uninstalled keras-2.12.0\n",
+ " Attempting uninstall: grpcio\n",
+ " Found existing installation: grpcio 1.48.0\n",
+ " Uninstalling grpcio-1.48.0:\n",
+ " Successfully uninstalled grpcio-1.48.0\n",
+ " Attempting uninstall: dill\n",
+ " Found existing installation: dill 0.3.1.1\n",
+ " Uninstalling dill-0.3.1.1:\n",
+ " Successfully uninstalled dill-0.3.1.1\n",
+ " Attempting uninstall: tensorboard\n",
+ " Found existing installation: tensorboard 2.12.3\n",
+ " Uninstalling tensorboard-2.12.3:\n",
+ " Successfully uninstalled tensorboard-2.12.3\n",
+ " Attempting uninstall: tensorflow\n",
+ " Found existing installation: tensorflow 2.12.0\n",
+ " Uninstalling tensorflow-2.12.0:\n",
+ " Successfully uninstalled tensorflow-2.12.0\n",
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
+ "apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.\u001b[0m\u001b[31m\n",
+ "\u001b[0mSuccessfully installed datasets-2.14.6 dill-0.3.7 evaluate-0.4.1 grpcio-1.58.0 huggingface-hub-0.17.3 keras-2.14.0 keras_nlp-0.6.2 ml-dtypes-0.2.0 multiprocess-0.70.15 nltk-3.8.1 responses-0.18.0 rouge_score-0.1.2 safetensors-0.4.0 tensorboard-2.14.1 tensorflow-2.14.0 tensorflow-estimator-2.14.0 tensorflow-text-2.14.0 tokenizers-0.14.1 transformers-4.34.1 xxhash-3.4.1\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install \"transformers\" \"datasets\" \"rouge_score\" \"evaluate\" \"keras_nlp\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e9b287a7-3035-48c1-8520-a7d3de4f925b",
+ "metadata": {},
+ "source": [
+ "## Download your dataset from Hugging Face"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dd58ed15-7e8f-4f24-b68f-4a5b5fcd5cf2",
+ "metadata": {},
+ "source": [
+ "We will be downloading Hugging Face dataset 'ccdv/pubmed-summarization' which contains the full article and their abstracts which will help train our model to summarize scientific articles. Once the dataset is loaded we'll split the data into train, test, and validation datasets. Since these are large datasets we will only be using 5% of dataset to help our process run faster."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 211,
+ "id": "8f59e17e-c006-45ee-be0b-766774f9d420",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "# load dataset\n",
+ "train, test, validation = load_dataset(\"ccdv/pubmed-summarization\", split=[\"train[:5%]\", \"test[:5%]\", \"validation[:5%]\" ])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "51d2756c-d280-4aec-a6fb-416dd91c3ba5",
+ "metadata": {},
+ "source": [
+ "Lets list the feaures of one of our datasets to determine what we will need to tokenize in a later step. this dataset features are 'article' and 'abstract'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 114,
+ "id": "760c9128-793a-4bed-a127-b92ef496e33b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset({\n",
+ " features: ['article', 'abstract'],\n",
+ " num_rows: 5996\n",
+ "})\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3446d1b6-af0c-4127-93c0-c47fde142546",
+ "metadata": {},
+ "source": [
+ "## Finetuning our Model Locally"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed6ddff1-2636-4e3b-88ee-e3c86c584245",
+ "metadata": {},
+ "source": [
+ "Now that we have our datasets we can upload our model which will be the small version of Flan T5."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b112574b-0e33-4c31-b12a-f1839024ea44",
+ "metadata": {},
+ "source": [
+ "\n",
+ "**Flan T5** is a text-to-text generation model and an advancement to the original T5 model and can be run on both CPUs and GPUs. **Text-to-text** is a method of creating text by using a neural network to generate new text from a given input. These T5 models can be fine-tuned for various zero shot NLP tasks that we have seen and heard of before: text classification, summarization, translation, and question-answering. Text-to-text is not to be confused by text2text generation which is a earlier version of T5 that is designed specifically for sequence-to-sequence tasks, such as machine translation and text generation and is limited to these task where as T5 models are more flexible due to the wider range of NPL tasks they can execute.\n",
+ "\n",
+ "Because it is a seq2seq class model we will be using the transformer **TFAutoModelForSeq2Seq** (specifically for tensorflow models) to help find a load our pretrained model architecture. Then we will assign an **AutoTokenizer** to preprocess the text of our inputs (the test, train, validation datasets) into an array of numbers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 185,
+ "id": "bfd433b3-9790-4a10-ac08-6c90c194d8b0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#model name\n",
+ "CHECKPOINT = \"google/flan-t5-small\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 184,
+ "id": "1988cbcb-4bec-4aa2-a356-a211584ceacb",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2023-11-03 15:13:42.327557: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
+ "2023-11-03 15:13:42.327603: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
+ "2023-11-03 15:13:42.327636: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
+ "2023-11-03 15:13:42.336037: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
+ "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+ "2023-11-03 15:13:44.543851: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:44.554372: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:44.557202: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:44.560698: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:44.563540: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:44.566113: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:45.308267: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:45.310177: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:45.311838: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
+ "2023-11-03 15:13:45.313437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5\n",
+ "/opt/conda/lib/python3.10/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer RandomNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.\n",
+ " warnings.warn(\n",
+ "All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.\n",
+ "\n",
+ "All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.\n",
+ "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer\n",
+ "\n",
+ "model = TFAutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)\n",
+ "tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f6ca0419-0075-4f62-becf-b859312cea22",
+ "metadata": {},
+ "source": [
+ "Now that we have loaded the architecture of our model and configured it to tokenize our inputs we can now implement a tokenization functions to start processing our datasets.\n",
+ "Since we are using a T5 model we will have to prefix the inputs with \"summarize:\" to know which task to perform. We create a preprocess function to append the prefix to each row within the \"article\" column of our dataset labeling them as inputs. The inputs are then tokenized, limited by a set max length, and truncated.\n",
+ "\n",
+ "A similar process is done for the \"abstract\" column within our dataset except we do not add the prefix and we labels them as **labels**.\n",
+ "\n",
+ "**What is Truncating?**\n",
+ "\n",
+ "Our group of inputs or batch will usually be different lengths which makes it hard to be converted to fixed-size tensors. To fix this problem **truncation** removes tokens ensure longer sequences will have the same length as the longest sequence in the batch which we have set to be **1024** for our inputs and **128** for our labels.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 212,
+ "id": "f101c309-f214-4b3f-b77b-d55491e48a59",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "prefix = \"summarize: \"\n",
+ "\n",
+ "def preprocess_function(examples):\n",
+ " inputs = [prefix + doc for doc in examples[\"article\"]]\n",
+ " model_inputs = tokenizer(inputs, max_length=1024, truncation=True)\n",
+ "\n",
+ " labels = tokenizer(text_target=\n",
+ " examples[\"abstract\"], max_length=128, truncation=True\n",
+ " )\n",
+ "\n",
+ " model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
+ "\n",
+ " return model_inputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "491efc2a-1679-4044-9f53-4aff27329856",
+ "metadata": {},
+ "source": [
+ "Now that we have our tokenized function the next step is to implement the **map** function to iterate the function **preprocess_function** over our loaded datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 210,
+ "id": "5e58eb58-a655-4e2b-8665-b4b770bc87a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tokenized_train = train.map(preprocess_function, batched=True)\n",
+ "\n",
+ "#tokenized_test = test.map(preprocess_function, batched=True)\n",
+ "\n",
+ "#tokenized_validation = validation.map(preprocess_function, batched=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "41bc1576-44dd-4604-b5db-6c57b711096a",
+ "metadata": {},
+ "source": [
+ "Lets look at the structure of one of our new tokenized datasets you should see 3 new features (**'input_ids', 'attention_mask', 'labels'**) making 5 features total:\n",
+ "\n",
+ "- **input_ids:** As our inputs are being tokenized an ID is assigned for each token, meaning as each text is broken up into sequences (which can be words or subwords) and converted to tokens within our dataset they are assign an ID.\n",
+ "- **attention_masks:** Tokens that should be ignored by the model usually represented by a 0. Masking can be done when some sequences are not the same length so they can not belong in the same tensor and need to be padded.\n",
+ "- **labels:** The new name of the abstract column that has been tokenized."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "80a25bc8-00db-4b8d-9b68-d52c5d6ca7fe",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset({\n",
+ " features: ['article', 'abstract', 'input_ids', 'attention_mask', 'labels'],\n",
+ " num_rows: 5996\n",
+ "})\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(tokenized_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "285d972c-6aa3-4072-a405-86e7dd82904e",
+ "metadata": {},
+ "source": [
+ "DataCollators are objects that dynamically pads the inputs and the labels in our batches, reverse to truncating **padding** adds a special padding token to ensure shorter sequences will have the same length as the longest sequence in the batch which a gain we set in out preprocess_function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "875ef33d-5ef3-4b07-b1de-6d471743a8ad",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import DataCollatorForSeq2Seq\n",
+ "data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=CHECKPOINT, return_tensors=\"tf\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b48b037-d666-417b-b28e-88b715a1083c",
+ "metadata": {},
+ "source": [
+ "Then the last step will be to set our data format to be suitable for Tensorflow using the function **'prepare_tf_dataset()'** by automatically inspecting your model and keep only the features that are necessary. As you can see there are only 2 of our features left represented in the dataset: **input_ids and attention_mask**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "6fcac5a8-912f-461f-bfab-990e472c01ca",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf_train_set = model.prepare_tf_dataset(\n",
+ " tokenized_train,\n",
+ " shuffle=True,\n",
+ " batch_size=10,\n",
+ " collate_fn=data_collator,\n",
+ " \n",
+ ")\n",
+ "\n",
+ "tf_test_set = model.prepare_tf_dataset(\n",
+ " tokenized_test,\n",
+ " shuffle=False,\n",
+ " batch_size=10,\n",
+ " collate_fn=data_collator,\n",
+ " \n",
+ ")\n",
+ "\n",
+ "tf_validation_set = model.prepare_tf_dataset(\n",
+ " tokenized_validation,\n",
+ " shuffle=False,\n",
+ " batch_size=10,\n",
+ " collate_fn=data_collator,\n",
+ " \n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "11aaf028-c713-4064-84cc-f699df3151ec",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(10, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(10, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(10, None), dtype=tf.int64, name=None))>\n"
+ ]
+ }
+ ],
+ "source": [
+ "print (tf_train_set)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7ad7839-4967-40fb-aa26-39cea71fa085",
+ "metadata": {},
+ "source": [
+ "**Learning rate** controls how much the model will change in response to the estimated error each time the model weights are updated. Too small of a learning rate could result very slow training process that could eventually get stuck, whereas a value too large may result in an unstable training process. Setting the **weight decay** helps to avoid overfitting, weights small, and avoid exploding gradient. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "50ed6068-763a-46a6-8aed-4862f84413a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AdamWeightDecay\n",
+ "optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)\n",
+ "model.compile(optimizer=optimizer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cd6a770a-e149-4970-a915-1362f544ef40",
+ "metadata": {},
+ "source": [
+ "Using the function metric_fn will help us calculate the **ROUGE** score between the ground-truth and predictions while training. ROUGE stands for **Recall-Oriented Understudy for Gisting Evaluation** this metric compares a reference sentence with what our model produces see if there is overlap if there is it calculates the precision and recall using the overlap.\n",
+ "\n",
+ "As an example say our model produced a sentence like so:\n",
+ "\n",
+ "**'the cat was found under the bed'**\n",
+ "\n",
+ "but the reference sentence normally written by a human is:\n",
+ "\n",
+ "**'the cat was under the bed'**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "f9c6b45b-7349-4965-938b-3a334ced3882",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using TensorFlow backend\n"
+ ]
+ }
+ ],
+ "source": [
+ "import keras_nlp\n",
+ "\n",
+ "rouge_l = keras_nlp.metrics.RougeL()\n",
+ "\n",
+ "\n",
+ "def metric_fn(eval_predictions):\n",
+ " predictions, labels = eval_predictions\n",
+ " decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)\n",
+ " for label in labels:\n",
+ " label[label < 0] = tokenizer.pad_token_id # Replace masked label tokens\n",
+ " decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
+ " result = rouge_l(decoded_labels, decoded_predictions)\n",
+ " # We will print only the F1 score, you can use other aggregation metrics as well\n",
+ " result = {\"RougeL\": result[\"f1_score\"]}\n",
+ "\n",
+ " return result"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b57e9aa9-9d7c-41d0-8ab1-5d79277c969d",
+ "metadata": {},
+ "source": [
+ "We will use the validation dataset for calculating our ROUGE score. While our ROUGE score is being calculated and our training is running its best to set up a **callback system**. A callback is an object that can perform actions at various stages of training and helps to write logs after every batch of training to monitor your metrics, periodically save your model to disk, and if need be do early stopping. Here we are using Keras call back system."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "06e014b1-e9d2-4d9f-a149-c6c0381f7407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers.keras_callbacks import KerasMetricCallback\n",
+ "metric_callback = KerasMetricCallback(\n",
+ " metric_fn, eval_dataset=tf_validation_set, predict_with_generate=True, use_xla_generation=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cd1b16ea-0e8a-4c78-875e-e3edea6cf043",
+ "metadata": {},
+ "source": [
+ "Before we start to train our model the last step will be to set how many batches of training we should do, the number of iterations is called **epochs**, we will set ours to 3. Now we can start to train our model using the function **'fit'** and save our artifacts to a directory. The artifact that holds our model will be a file named **tf_model.h5**. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "b8fd0c64-4d85-4b5e-86fe-538c7dc65da7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Epoch 1/3\n",
+ "599/599 [==============================] - ETA: 0s - loss: 2.5073"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/opt/conda/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py:371: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. recommend setting `max_new_tokens` to control the maximum length of the generation.\n",
+ " return py_builtins.overload_of(f)(*args)\n",
+ "2023-11-02 13:09:59.053088: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55faf0d80f50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:\n",
+ "2023-11-02 13:09:59.053132: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5\n",
+ "2023-11-02 13:10:00.019242: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less/Assert/Assert\n",
+ "2023-11-02 13:10:00.163195: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n",
+ "2023-11-02 13:10:00.714302: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less_1/Assert/Assert\n",
+ "2023-11-02 13:10:01.396732: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less/Assert/Assert\n",
+ "2023-11-02 13:10:02.853947: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8900\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n",
+ "2023-11-02 13:10:12.362168: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\n",
+ "2023-11-02 13:10:25.996665: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less/Assert/Assert\n",
+ "2023-11-02 13:10:26.174121: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less_1/Assert/Assert\n",
+ "2023-11-02 13:10:26.666553: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator shared/assert_less/Assert/Assert\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n",
+ "warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "599/599 [==============================] - 731s 1s/step - loss: 2.5073 - val_loss: 2.0886 - RougeL: 0.1196\n",
+ "Epoch 2/3\n",
+ "599/599 [==============================] - 662s 1s/step - loss: 2.3710 - val_loss: 2.0231 - RougeL: 0.1191\n",
+ "Epoch 3/3\n",
+ "599/599 [==============================] - 662s 1s/step - loss: 2.3102 - val_loss: 1.9996 - RougeL: 0.1172\n"
+ ]
+ }
+ ],
+ "source": [
+ "model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=metric_callback)\n",
+ "\n",
+ "model.save_pretrained('saved_model')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d1c35de0-4cdf-4c2b-9c0d-c2272b71b362",
+ "metadata": {},
+ "source": [
+ "## Testing the Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31f7f62a-ca17-4dcc-b25d-825572ee1630",
+ "metadata": {},
+ "source": [
+ "Here we will use a sample text that we want our model to summarize."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 98,
+ "id": "980d0053-0c3b-4d0d-91b9-9dd6e6dd3e64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text = \"Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a \\\n",
+ "highly transmissible and pathogenic coronavirus that emerged in late 2019 and has \\\n",
+ "caused a pandemic of acute respiratory disease, named ‘coronavirus disease 2019’ (COVID-19), \\\n",
+ "which threatens human health and public safety. In this Review, we describe the basic virology of \\\n",
+ "SARS-CoV-2, including genomic characteristics and receptor use, highlighting its key difference \\\n",
+ "from previously known coronaviruses. We summarize current knowledge of clinical, epidemiological and \\\n",
+ "pathological features of COVID-19, as well as recent progress in animal models and antiviral treatment \\\n",
+ "approaches for SARS-CoV-2 infection. We also discuss the potential wildlife hosts and zoonotic origin \\\n",
+ "of this emerging virus in detail.\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3156ef96-20ab-46ff-ae76-57d6ea54f0ff",
+ "metadata": {},
+ "source": [
+ "To predict the following tokenizes the text to gather the inputs, then uses **generate()** generate sequences of token ids for our model. We then decode our output to translate our tokenized output into text."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb34d7c2-a815-4c9d-bd8b-495aca8b2d02",
+ "metadata": {},
+ "source": [
+ "Below you will see that we have provided a paragraph about SARS-CoV-2 as our output, we also have some parameters that we specify to further tune our model to get a concise summary of what our text is about.\n",
+ "\n",
+ "- **Max_Length:** Max number of words to generate.\n",
+ "- **Num_Return_Sequences:** Number of different outputs to generate. For our example we want one sentence or sequence.\n",
+ "- **Temperature:** Controls randomness, higher values increase diversity meaning a more unique response make the model to think harder. Must be a number from 0 to 1.\n",
+ "- **Top_p (nucleus):** The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Must be a number from 0 to 1.\n",
+ "- **Top_k**: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens. This means the model choses the most probable words. Lower values eliminate fewer coherent words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 101,
+ "id": "fc2206c7-1bbf-41eb-8c63-abb17752d00d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.\n",
+ "\n",
+ "All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at saved_model.\n",
+ "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "'We describe the basic virology of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its role in preventing the pandemic of acute respiratory disease, named ‘coronavirus disease 2019’ (COVID-19), which threatens human health and public safety.'"
+ ]
+ },
+ "execution_count": 101,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n",
+ "inputs = tokenizer.encode(text, return_tensors=\"tf\")\n",
+ "\n",
+ "from transformers import TFAutoModelForSeq2SeqLM\n",
+ "\n",
+ "model = TFAutoModelForSeq2SeqLM.from_pretrained(\"saved_model\")\n",
+ "\n",
+ "outputs = model.generate(inputs, \n",
+ " max_length=1000,\n",
+ " num_return_sequences = 1,\n",
+ " do_sample=True, \n",
+ " temperature = 0.6,\n",
+ " top_k = 50, \n",
+ " top_p = 0.95,)\n",
+ "\n",
+ "tokenizer.decode(outputs[0], skip_special_tokens=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "af0f0064-d9b4-430c-90ae-d2e390d1b78c",
+ "metadata": {},
+ "source": [
+ "### Optional: Summarizing PDF Files"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1abe4b0b-8ee2-48f1-afed-0cd2796717a2",
+ "metadata": {},
+ "source": [
+ "The process of summarizing scientific PDF files is relatively the same except that we first need to extract the text from the PDF. To do so lets download a PDF file from PubMed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "id": "d69aa008-80c0-4a19-aa0f-8f5798673c47",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--2023-11-02 20:07:00-- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7784226/pdf/12248_2020_Article_532.pdf\n",
+ "Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110\n",
+ "Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 5757370 (5.5M) [application/pdf]\n",
+ "Saving to: ‘12248_2020_Article_532.pdf’\n",
+ "\n",
+ "12248_2020_Article_ 100%[===================>] 5.49M 7.25MB/s in 0.8s \n",
+ "\n",
+ "2023-11-02 20:07:01 (7.25 MB/s) - ‘12248_2020_Article_532.pdf’ saved [5757370/5757370]\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "! wget --user-agent=\"Chrome\" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7784226/pdf/12248_2020_Article_532.pdf"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6555b048-a76c-4f10-b13c-4ac93f436fe8",
+ "metadata": {},
+ "source": [
+ "We'll be downloading some tools that help us extract only the text from our pdf file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "1347b3cd-5ce0-44c9-864d-a688bcacb1d0",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+ "To disable this warning, you can either:\n",
+ "\t- Avoid using `tokenizers` before the fork if possible\n",
+ "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: fitz in /opt/conda/lib/python3.10/site-packages (0.0.1.dev2)\n",
+ "Collecting PyMuPDF\n",
+ " Obtaining dependency information for PyMuPDF from https://files.pythonhosted.org/packages/41/4a/530017aaf0a554aa6d9abd547932a02c0188962d12122fe611bf7a6d0c26/PyMuPDF-1.23.5-cp310-none-manylinux2014_x86_64.whl.metadata\n",
+ " Downloading PyMuPDF-1.23.5-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)\n",
+ "Requirement already satisfied: configobj in /opt/conda/lib/python3.10/site-packages (from fitz) (5.0.8)\n",
+ "Requirement already satisfied: configparser in /opt/conda/lib/python3.10/site-packages (from fitz) (6.0.0)\n",
+ "Requirement already satisfied: httplib2 in /opt/conda/lib/python3.10/site-packages (from fitz) (0.21.0)\n",
+ "Requirement already satisfied: nibabel in /opt/conda/lib/python3.10/site-packages (from fitz) (5.1.0)\n",
+ "Requirement already satisfied: nipype in /opt/conda/lib/python3.10/site-packages (from fitz) (1.8.6)\n",
+ "Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from fitz) (1.23.5)\n",
+ "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from fitz) (2.0.3)\n",
+ "Requirement already satisfied: pyxnat in /opt/conda/lib/python3.10/site-packages (from fitz) (1.6)\n",
+ "Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from fitz) (1.11.2)\n",
+ "Collecting PyMuPDFb==1.23.5 (from PyMuPDF)\n",
+ " Obtaining dependency information for PyMuPDFb==1.23.5 from https://files.pythonhosted.org/packages/cf/14/de59687368ad2c047b038b5b9b04e40bd5d486d5b36c6aef42c18c35ea2c/PyMuPDFb-1.23.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata\n",
+ " Downloading PyMuPDFb-1.23.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)\n",
+ "Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from configobj->fitz) (1.16.0)\n",
+ "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /opt/conda/lib/python3.10/site-packages (from httplib2->fitz) (3.1.1)\n",
+ "Requirement already satisfied: packaging>=17 in /opt/conda/lib/python3.10/site-packages (from nibabel->fitz) (23.1)\n",
+ "Requirement already satisfied: click>=6.6.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (8.1.7)\n",
+ "Requirement already satisfied: networkx>=2.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (3.1)\n",
+ "Requirement already satisfied: prov>=1.5.2 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (2.0.0)\n",
+ "Requirement already satisfied: pydot>=1.2.3 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (1.4.2)\n",
+ "Requirement already satisfied: python-dateutil>=2.2 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (2.8.2)\n",
+ "Requirement already satisfied: rdflib>=5.0.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (7.0.0)\n",
+ "Requirement already satisfied: simplejson>=3.8.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (3.19.2)\n",
+ "Requirement already satisfied: traits!=5.0,<6.4,>=4.6 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (6.3.2)\n",
+ "Requirement already satisfied: filelock>=3.0.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (3.12.4)\n",
+ "Requirement already satisfied: etelemetry>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (0.3.1)\n",
+ "Requirement already satisfied: looseversion in /opt/conda/lib/python3.10/site-packages (from nipype->fitz) (1.3.0)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->fitz) (2023.3.post1)\n",
+ "Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.10/site-packages (from pandas->fitz) (2023.3)\n",
+ "Requirement already satisfied: future>=0.16 in /opt/conda/lib/python3.10/site-packages (from pyxnat->fitz) (0.18.3)\n",
+ "Requirement already satisfied: lxml>=4.3 in /opt/conda/lib/python3.10/site-packages (from pyxnat->fitz) (4.9.3)\n",
+ "Requirement already satisfied: pathlib>=1.0 in /opt/conda/lib/python3.10/site-packages (from pyxnat->fitz) (1.0.1)\n",
+ "Requirement already satisfied: requests>=2.20 in /opt/conda/lib/python3.10/site-packages (from pyxnat->fitz) (2.31.0)\n",
+ "Requirement already satisfied: ci-info>=0.2 in /opt/conda/lib/python3.10/site-packages (from etelemetry>=0.2.0->nipype->fitz) (0.3.0)\n",
+ "Requirement already satisfied: isodate<0.7.0,>=0.6.0 in /opt/conda/lib/python3.10/site-packages (from rdflib>=5.0.0->nipype->fitz) (0.6.1)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.20->pyxnat->fitz) (3.2.0)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.20->pyxnat->fitz) (3.4)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.20->pyxnat->fitz) (1.26.16)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.20->pyxnat->fitz) (2023.7.22)\n",
+ "Downloading PyMuPDF-1.23.5-cp310-none-manylinux2014_x86_64.whl (4.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/4.3 MB\u001b[0m \u001b[31m46.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading PyMuPDFb-1.23.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m30.6/30.6 MB\u001b[0m \u001b[31m42.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hInstalling collected packages: PyMuPDFb, PyMuPDF\n",
+ "Successfully installed PyMuPDF-1.23.5 PyMuPDFb-1.23.5\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install \"fitz\" \"PyMuPDF\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3331b05f-d00e-4ff2-be24-edea732e4af2",
+ "metadata": {},
+ "source": [
+ "Now we can make a function **extract_text_from_pdf** to extract the text from the pdf and save it as a variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 95,
+ "id": "d77b6ffe-90e1-4a01-aa52-9cf93a9c5c85",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import fitz\n",
+ "def extract_text_from_pdf(pdf_path):\n",
+ " doc = fitz.open(pdf_path)\n",
+ " text = ''\n",
+ " for page in doc:\n",
+ " text += page.get_text()\n",
+ " return text\n",
+ "\n",
+ "text_pdf=extract_text_from_pdf('12248_2020_Article_532.pdf')"
+ ]
+ },
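+ {
+ "cell_type": "markdown",
+ "id": "7e2f9a10-1b2c-4d3e-8f40-a1b2c3d4e510",
+ "metadata": {},
+ "source": [
+ "As an optional sanity check, you can confirm the extraction worked by printing the length and the first few hundred characters of `text_pdf`:\n",
+ "\n",
+ "```\n",
+ "# Verify that text was actually extracted from the PDF\n",
+ "print(f\"{len(text_pdf)} characters extracted\")\n",
+ "print(text_pdf[:300])\n",
+ "```"
+ ]
+ },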
+ {
+ "cell_type": "markdown",
+ "id": "10e761c6-da93-4ca0-827e-a3715c20eb10",
+ "metadata": {},
+ "source": [
+ "Finally we'll follow the same steps we did before to encode our inputs, pass it to our model, and then decode our output. Notice how we increased the max_length of what is expected of our input."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 97,
+ "id": "9a1f5dbd-6a9f-4533-a12e-8a6c4073df74",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.\n",
+ "\n",
+ "All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at saved_model.\n",
+ "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n",
+ "inputs = tokenizer.encode(text_pdf, max_length=1000, truncation=True, return_tensors=\"tf\")\n",
+ "\n",
+ "from transformers import TFAutoModelForSeq2SeqLM\n",
+ "\n",
+ "model = TFAutoModelForSeq2SeqLM.from_pretrained(\"saved_model\")\n",
+ "\n",
+ "outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)\n",
+ "\n",
+ "tokenizer.decode(outputs[0], skip_special_tokens=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "257fb45f-7752-481d-a1f6-f3eeb7655fac",
+ "metadata": {},
+ "source": [
+ "## Finetuning our Model via Vertex AI Training API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ac841f6-c65e-4ebf-8c42-3030e2f92cb0",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Setting up our Datasets for Training "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fff825cf-86fb-4777-885b-9e981be831b7",
+ "metadata": {},
+ "source": [
+ "Although we have our datasets saved locally inorder to utilize the Vertex AI Training API we will need to store our datasets in a bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 233,
+ "id": "d3f49896-b2c1-47e6-a7cc-aca7753bb6c4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "# load dataset\n",
+ "train, test, validation = load_dataset(\"ccdv/pubmed-summarization\", split=[\"train[:5%]\", \"test[:5%]\", \"validation[:5%]\" ])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "1b91fe7c-5970-45c1-9401-1db3206a8ce9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#load in the storage package and name our bucket\n",
+ "from google.cloud import storage\n",
+ "BUCKET='flan-t5-model-resources'\n",
+ "client = storage.Client()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 105,
+ "id": "0066ad72-e451-41c0-b30a-c3a7dfa5f17c",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "ename": "Conflict",
+ "evalue": "409 POST https://storage.googleapis.com/storage/v1/b?project=cit-oconnellka-9999&prettyPrint=false: Your previous request to create the named bucket succeeded and you already own it.",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mConflict\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[105], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m#Create bucket\u001b[39;00m\n\u001b[1;32m 2\u001b[0m bucket \u001b[38;5;241m=\u001b[39m client\u001b[38;5;241m.\u001b[39mbucket(BUCKET)\n\u001b[0;32m----> 3\u001b[0m \u001b[43mbucket\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcreate\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/cloud/storage/bucket.py:972\u001b[0m, in \u001b[0;36mBucket.create\u001b[0;34m(self, client, project, location, predefined_acl, predefined_default_object_acl, timeout, retry)\u001b[0m\n\u001b[1;32m 925\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Creates current bucket.\u001b[39;00m\n\u001b[1;32m 926\u001b[0m \n\u001b[1;32m 927\u001b[0m \u001b[38;5;124;03mIf the bucket already exists, will raise\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 968\u001b[0m \u001b[38;5;124;03m (Optional) How to retry the RPC. See: :ref:`configuring_retries`\u001b[39;00m\n\u001b[1;32m 969\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 971\u001b[0m client \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_require_client(client)\n\u001b[0;32m--> 972\u001b[0m \u001b[43mclient\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcreate_bucket\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 973\u001b[0m \u001b[43m \u001b[49m\u001b[43mbucket_or_name\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 974\u001b[0m \u001b[43m \u001b[49m\u001b[43mproject\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproject\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 975\u001b[0m \u001b[43m \u001b[49m\u001b[43muser_project\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43muser_project\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 976\u001b[0m \u001b[43m \u001b[49m\u001b[43mlocation\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlocation\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 977\u001b[0m \u001b[43m \u001b[49m\u001b[43mpredefined_acl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpredefined_acl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 978\u001b[0m \u001b[43m \u001b[49m\u001b[43mpredefined_default_object_acl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpredefined_default_object_acl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 979\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 980\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mretry\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 981\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/cloud/storage/client.py:954\u001b[0m, in \u001b[0;36mClient.create_bucket\u001b[0;34m(self, bucket_or_name, requester_pays, project, user_project, location, data_locations, predefined_acl, predefined_default_object_acl, timeout, retry)\u001b[0m\n\u001b[1;32m 951\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m data_locations \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 952\u001b[0m properties[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcustomPlacementConfig\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdataLocations\u001b[39m\u001b[38;5;124m\"\u001b[39m: data_locations}\n\u001b[0;32m--> 954\u001b[0m api_response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_post_resource\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 955\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m/b\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 956\u001b[0m \u001b[43m \u001b[49m\u001b[43mproperties\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 957\u001b[0m \u001b[43m \u001b[49m\u001b[43mquery_params\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mquery_params\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 958\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 959\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mretry\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 960\u001b[0m \u001b[43m \u001b[49m\u001b[43m_target_object\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbucket\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 961\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 963\u001b[0m bucket\u001b[38;5;241m.\u001b[39m_set_properties(api_response)\n\u001b[1;32m 964\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m bucket\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/cloud/storage/client.py:618\u001b[0m, in \u001b[0;36mClient._post_resource\u001b[0;34m(self, path, data, query_params, headers, timeout, retry, _target_object)\u001b[0m\n\u001b[1;32m 557\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_post_resource\u001b[39m(\n\u001b[1;32m 558\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 559\u001b[0m path,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 565\u001b[0m _target_object\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 566\u001b[0m ):\n\u001b[1;32m 567\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Helper for bucket / blob methods making API 'POST' calls.\u001b[39;00m\n\u001b[1;32m 568\u001b[0m \n\u001b[1;32m 569\u001b[0m \u001b[38;5;124;03m Args:\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 615\u001b[0m \u001b[38;5;124;03m If the bucket is not found.\u001b[39;00m\n\u001b[1;32m 616\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 618\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_connection\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mapi_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 619\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mPOST\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 620\u001b[0m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 621\u001b[0m \u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 622\u001b[0m \u001b[43m \u001b[49m\u001b[43mquery_params\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mquery_params\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 623\u001b[0m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 624\u001b[0m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 625\u001b[0m \u001b[43m \u001b[49m\u001b[43mretry\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mretry\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 626\u001b[0m \u001b[43m \u001b[49m\u001b[43m_target_object\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m_target_object\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 627\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/cloud/storage/_http.py:72\u001b[0m, in \u001b[0;36mConnection.api_request\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m retry:\n\u001b[1;32m 71\u001b[0m call \u001b[38;5;241m=\u001b[39m retry(call)\n\u001b[0;32m---> 72\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mcall\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:349\u001b[0m, in \u001b[0;36mRetry.__call__..retry_wrapped_func\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 345\u001b[0m target \u001b[38;5;241m=\u001b[39m functools\u001b[38;5;241m.\u001b[39mpartial(func, \u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 346\u001b[0m sleep_generator \u001b[38;5;241m=\u001b[39m exponential_sleep_generator(\n\u001b[1;32m 347\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_initial, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_maximum, multiplier\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_multiplier\n\u001b[1;32m 348\u001b[0m )\n\u001b[0;32m--> 349\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mretry_target\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 350\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 351\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_predicate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 352\u001b[0m \u001b[43m \u001b[49m\u001b[43msleep_generator\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 353\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_timeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 354\u001b[0m \u001b[43m \u001b[49m\u001b[43mon_error\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon_error\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 355\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:191\u001b[0m, in \u001b[0;36mretry_target\u001b[0;34m(target, predicate, sleep_generator, timeout, on_error, **kwargs)\u001b[0m\n\u001b[1;32m 189\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m sleep \u001b[38;5;129;01min\u001b[39;00m sleep_generator:\n\u001b[1;32m 190\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 191\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mtarget\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 193\u001b[0m \u001b[38;5;66;03m# pylint: disable=broad-except\u001b[39;00m\n\u001b[1;32m 194\u001b[0m \u001b[38;5;66;03m# This function explicitly must deal with broad exceptions.\u001b[39;00m\n\u001b[1;32m 195\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m exc:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/google/cloud/_http/__init__.py:494\u001b[0m, in \u001b[0;36mJSONConnection.api_request\u001b[0;34m(self, method, path, query_params, data, content_type, headers, api_base_url, api_version, expect_json, _target_object, timeout, extra_api_info)\u001b[0m\n\u001b[1;32m 482\u001b[0m response \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_make_request(\n\u001b[1;32m 483\u001b[0m method\u001b[38;5;241m=\u001b[39mmethod,\n\u001b[1;32m 484\u001b[0m url\u001b[38;5;241m=\u001b[39murl,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 490\u001b[0m extra_api_info\u001b[38;5;241m=\u001b[39mextra_api_info,\n\u001b[1;32m 491\u001b[0m )\n\u001b[1;32m 493\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;241m200\u001b[39m \u001b[38;5;241m<\u001b[39m\u001b[38;5;241m=\u001b[39m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m300\u001b[39m:\n\u001b[0;32m--> 494\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m exceptions\u001b[38;5;241m.\u001b[39mfrom_http_response(response)\n\u001b[1;32m 496\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m expect_json \u001b[38;5;129;01mand\u001b[39;00m response\u001b[38;5;241m.\u001b[39mcontent:\n\u001b[1;32m 497\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m response\u001b[38;5;241m.\u001b[39mjson()\n",
+ "\u001b[0;31mConflict\u001b[0m: 409 POST https://storage.googleapis.com/storage/v1/b?project=cit-oconnellka-9999&prettyPrint=false: Your previous request to create the named bucket succeeded and you already own it."
+ ]
+ }
+ ],
+ "source": [
+ "#Create bucket\n",
+ "bucket = client.bucket(BUCKET)\n",
+ "bucket.create()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4fc7335e-cfa7-4b0e-8880-85edfa573772",
+ "metadata": {},
+ "source": [
+ "Convert our datasets to csv and upload to our bucket in one step!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "id": "1bfbbd92-4b2c-4e5c-95f8-d4e645a6ab24",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "094bf0b9c0bf44b0859f2b9c5f375e8c",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Creating CSV from Arrow format: 0%| | 0/6 [00:00, ?ba/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from io import BytesIO\n",
+ "\n",
+ "#convert train dataset to csv and push to GCS bucket\n",
+ "csv_buffer = BytesIO()\n",
+ "train.to_csv(csv_buffer)\n",
+ "client = storage.Client()\n",
+ "bucket = client.get_bucket(BUCKET)\n",
+ "bucket.blob('train.csv').upload_from_file(csv_buffer, 'text/csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "id": "dbf3f68f-acc8-4086-9b89-be0d3eacf898",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "e54494e07db241aa8537c6bce84558bd",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Creating CSV from Arrow format: 0%| | 0/1 [00:00, ?ba/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "#convert test dataset to csv and push to GCS bucket\n",
+ "csv_buffer = BytesIO()\n",
+ "test.to_csv(csv_buffer)\n",
+ "client = storage.Client()\n",
+ "bucket = client.get_bucket(BUCKET)\n",
+ "bucket.blob('test.csv').upload_from_file(csv_buffer, 'text/csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "id": "ea1773e1-dfe2-46b5-a63d-782101d79096",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "8b3cd00c2e42453b9c85320fd43360a5",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Creating CSV from Arrow format: 0%| | 0/1 [00:00, ?ba/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "#convert validation dataset to csv and push to GCS bucket\n",
+ "csv_buffer = BytesIO()\n",
+ "validation.to_csv(csv_buffer)\n",
+ "client = storage.Client()\n",
+ "bucket = client.get_bucket(BUCKET)\n",
+ "bucket.blob('validation.csv').upload_from_file(csv_buffer, 'text/csv')"
+ ]
+ },
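+ {
+ "cell_type": "markdown",
+ "id": "2d4e6f80-9a1b-4c2d-8e3f-556677889900",
+ "metadata": {},
+ "source": [
+ "Since the three cells above differ only in the split name, the same uploads can be written as one loop; this sketch assumes the `train`, `test`, and `validation` splits and the `bucket` defined earlier:\n",
+ "\n",
+ "```\n",
+ "for name, split in [('train', train), ('test', test), ('validation', validation)]:\n",
+ "    buf = BytesIO()\n",
+ "    split.to_csv(buf)   #serialize the split to CSV in memory\n",
+ "    buf.seek(0)         #rewind before uploading\n",
+ "    bucket.blob(f'{name}.csv').upload_from_file(buf, content_type='text/csv')\n",
+ "```"
+ ]
+ },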
+ {
+ "cell_type": "markdown",
+ "id": "f18630a7-109c-4f53-9233-1842f5c27029",
+ "metadata": {},
+ "source": [
+ "Here we will be saving the location of our datasets be used when we execute the training of our model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 257,
+ "id": "ebc1bc39-a554-473b-949a-d9588f6e7fb8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# save train_dataset to s3\n",
+ "training_input_path = f'gs://{BUCKET}/train.csv'\n",
+ "\n",
+ "# save test_dataset to s3\n",
+ "test_input_path = f'gs://{BUCKET}/test.csv'\n",
+ "\n",
+ "validation_input_path = f'gs://{BUCKET}/validation.csv'"
+ ]
+ },
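+ {
+ "cell_type": "markdown",
+ "id": "3f5a7b90-0c1d-4e2f-9a3b-667788990011",
+ "metadata": {},
+ "source": [
+ "If you want to double-check that all three CSVs landed in the bucket, you can list its contents (optional):\n",
+ "\n",
+ "```\n",
+ "#List every object currently stored in the bucket\n",
+ "for blob in client.list_blobs(BUCKET):\n",
+ "    print(blob.name)\n",
+ "```"
+ ]
+ },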
+ {
+ "cell_type": "markdown",
+ "id": "9204b6dc-8f6e-407e-8c68-a036a6a5b7c9",
+ "metadata": {},
+ "source": [
+ "### Training our Model via Vertex AI Training API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2f873f2f-90b8-4566-96f5-37a23a2294e1",
+ "metadata": {},
+ "source": [
+ "To train our model on Vertex AI Training API you must first create a custom AI job, this is done by creating a autopkg that holds your requirements.txt and task.py files is a specific structure like so: "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8eeafae2-a698-4a52-a4f1-cae550245d0b",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "autopkg-summarizer /\n",
+ " + requirements.txt\n",
+ " + trainer/\n",
+ " + task.py\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 103,
+ "id": "48d49de7-6d86-411e-9e6e-104763ae36e6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Creates the following directories and files\n",
+ "!mkdir autopkg-summarizer\n",
+ "!touch autopkg-summarizer/requirements.txt\n",
+ "!mkdir autopkg-summarizer/trainer\n",
+ "!touch autopkg-summarizer/trainer/task.py"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7bd953b5-8d21-4c07-adcc-23604d5d0279",
+ "metadata": {},
+ "source": [
+ "Add your requirements.txt file by adding the packages below:\n",
+ "```\n",
+ "nltk\n",
+ "transformers\n",
+ "keras_nlp\n",
+ "datasets\n",
+ "rouge_score\n",
+ "```"
+ ]
+ },
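+ {
+ "cell_type": "markdown",
+ "id": "4a6b8c10-1d2e-4f30-8b4c-778899001122",
+ "metadata": {},
+ "source": [
+ "One convenient way to do this from inside the notebook is the `%%writefile` cell magic, which writes the body of a code cell to a file:\n",
+ "\n",
+ "```\n",
+ "%%writefile autopkg-summarizer/requirements.txt\n",
+ "nltk\n",
+ "transformers\n",
+ "keras_nlp\n",
+ "datasets\n",
+ "rouge_score\n",
+ "```\n",
+ "\n",
+ "The same magic works for the training script below: start a code cell with `%%writefile autopkg-summarizer/trainer/task.py` and paste the script as the cell body."
+ ]
+ },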
+ {
+ "cell_type": "markdown",
+ "id": "443f0e45-dbe6-4ce9-b0ad-96a5fe52a455",
+ "metadata": {},
+ "source": [
+ "To create our training script we will be adding all the steps that we ran from the 'Finetuning our Model Locally' section of this tutorial to a file named task.py:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14ab5a7d-36bf-449d-b2cd-4f7e107c75de",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "import nltk\n",
+ "import argparse\n",
+ "from datasets import load_dataset\n",
+ "#import evaluate\n",
+ "import numpy as np\n",
+ "from transformers import create_optimizer, AdamWeightDecay, TFAutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, set_seed\n",
+ "import tensorflow as tf\n",
+ "from tensorflow import keras\n",
+ "from transformers.keras_callbacks import KerasMetricCallback\n",
+ "import keras_nlp\n",
+ "\n",
+ "def get_args():\n",
+ " '''Parses args.'''\n",
+ " parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n",
+ " parser.add_argument(\n",
+ " '--model_name_or_path',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='name of model or path to load into tokenizer and class')\n",
+ " parser.add_argument(\n",
+ " '--train_file',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='train dataset in csv or json format')\n",
+ " parser.add_argument(\n",
+ " '--test_file',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='test dataset in csv or json format')\n",
+ " parser.add_argument(\n",
+ " '--validation_file',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='validation dataset in csv or json format used to calculate ROUGE score')\n",
+ " parser.add_argument(\n",
+ " '--text_column',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='The name of the column in the datasets containing the full texts (for summarization)')\n",
+ " parser.add_argument(\n",
+ " '--summary_column',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='The name of the column in the datasets containing the abstracts or summary of the full text')\n",
+ " parser.add_argument(\n",
+ " '--num_train_epochs',\n",
+ " required=False,\n",
+ " type=int,\n",
+ " default=3,\n",
+ " help='number of complete passes through the training dataset')\n",
+ " parser.add_argument(\n",
+ " '--source_prefix',\n",
+ " required=False,\n",
+ " type=str,\n",
+ " help='A prefix to add before every source text (needed for T5 models)')\n",
+ " parser.add_argument(\n",
+ " '--inputs_max_length',\n",
+ " required=False,\n",
+ " type=int,\n",
+ " default=1024,\n",
+ " help='max token length for model inputs')\n",
+ " parser.add_argument(\n",
+ " '--labels_max_length',\n",
+ " required=False,\n",
+ " type=int,\n",
+ " default=128,\n",
+ " help='max token length for model labels or targets')\n",
+ " parser.add_argument(\n",
+ " '--batch_size',\n",
+ " required=False,\n",
+ " type=int,\n",
+ " default=10,\n",
+ " help='max token length for model labels or targets')\n",
+ " parser.add_argument(\n",
+ " '--output_dir',\n",
+ " required=True,\n",
+ " type=str,\n",
+ " help='bucket to store saved model, include gs://')\n",
+ " \n",
+ " args = parser.parse_args()\n",
+ " return args\n",
+ "\n",
+ "def main():\n",
+ " \n",
+ " args = get_args() \n",
+ " \n",
+ " checkpoint = args.model_name_or_path\n",
+ " \n",
+ " tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
+ " \n",
+ " text = args.text_column\n",
+ " summary = args.summary_column\n",
+ " inputs_max_length = args.inputs_max_length\n",
+ " labels_max_length = args.labels_max_length\n",
+ " prefix = args.source_prefix \n",
+ " \n",
+ " model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint) \n",
+ " \n",
+ " data_files = {'train':args.train_file, 'test':args.test_file, 'validation':args.validation_file}\n",
+ " extension = args.train_file.split(\".\")[-1]\n",
+ " \n",
+ " raw_datasets = load_dataset(\n",
+ " extension,\n",
+ " data_files=data_files)\n",
+ " \n",
+ " raw_datasets = raw_datasets.filter(lambda x: x[text] is not None) \n",
+ " \n",
+ " train = raw_datasets[\"train\"]\n",
+ " test = raw_datasets[\"test\"]\n",
+ " validation = raw_datasets[\"validation\"]\n",
+ " \n",
+ " def preprocess_function(examples):\n",
+ " \n",
+ " inputs = [prefix + doc for doc in examples[text]]\n",
+ " model_inputs = tokenizer(inputs, max_length=inputs_max_length, truncation=True)\n",
+ "\n",
+ " # labels = tokenizer(text_target=examples[\"abstract\"], max_length=128, truncation=True)\n",
+ "\n",
+ " labels = tokenizer(text_target=\n",
+ " examples[summary], max_length=labels_max_length, truncation=True\n",
+ " )\n",
+ "\n",
+ " model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
+ " return model_inputs\n",
+ " \n",
+ " tokenized_train = train.map(preprocess_function, batched=True)\n",
+ " tokenized_test = test.map(preprocess_function, batched=True)\n",
+ " tokenized_validation = validation.map(preprocess_function, batched=True)\n",
+ " \n",
+ " data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors=\"tf\")\n",
+ "\n",
+ " optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)\n",
+ " model.compile(optimizer=optimizer)\n",
+ "\n",
+ " tf_train_set = model.prepare_tf_dataset(\n",
+ " tokenized_train,\n",
+ " shuffle=True,\n",
+ " batch_size=args.batch_size,\n",
+ " collate_fn=data_collator\n",
+ " )\n",
+ "\n",
+ " tf_test_set = model.prepare_tf_dataset(\n",
+ " tokenized_test,\n",
+ " shuffle=False,\n",
+ " batch_size=args.batch_size,\n",
+ " collate_fn=data_collator\n",
+ " )\n",
+ " \n",
+ " tf_validation_set = model.prepare_tf_dataset(\n",
+ " tokenized_validation,\n",
+ " shuffle=False,\n",
+ " batch_size=args.batch_size,\n",
+ " collate_fn=data_collator\n",
+ " ) \n",
+ " \n",
+ " def metric_fn(eval_predictions):\n",
+ " predictions, labels = eval_predictions\n",
+ " decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)\n",
+ " for label in labels:\n",
+ " label[label < 0] = tokenizer.pad_token_id # Replace masked label tokens\n",
+ " decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
+ " result = rouge_l(decoded_labels, decoded_predictions)\n",
+ " # We will print only the F1 score, you can use other aggregation metrics as well\n",
+ " result = {\"RougeL\": result[\"f1_score\"]}\n",
+ "\n",
+ " return result\n",
+ " \n",
+ " rouge_l = keras_nlp.metrics.RougeL()\n",
+ "\n",
+ " metric_callback = KerasMetricCallback(\n",
+ " metric_fn, eval_dataset=tf_validation_set, predict_with_generate=True, use_xla_generation=True)\n",
+ "\n",
+ "\n",
+ " model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=args.num_train_epochs, callbacks=metric_callback)\n",
+ " model.save(f'{args.output_dir}/saved_model_artifacts_tf')\n",
+ " model.save_pretrained(f'{args.output_dir}/saved_model_hf_tf')\n",
+ "\n",
+ "\n",
+ "if __name__ == \"__main__\":\n",
+ " main()\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8775dba1-c47b-4375-9587-6fec561bc5f9",
+ "metadata": {},
+ "source": [
+ "### Hyperparameters (for the training script and custom AI job)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a66d0c47-f6df-4b79-87a6-a637b04ebc87",
+ "metadata": {},
+ "source": [
+ "The first step to training our model other than setting up our datasets is to set our **hyperparameters**. Hyperparameters depend on your training script and for this one we need to identify our model, the location of our train and test files, etc. \n",
+ "\n",
+ "The batch_size, inputs_max_length, num_train_epochs, and labels_max_length already have defualts setting same as the ones we used in the first section of this tutorial!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "764980e6-9bc1-4715-b540-9e254b12f1f3",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2023-11-03 12:32:26.151679: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
+ "2023-11-03 12:32:26.151738: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
+ "2023-11-03 12:32:26.151777: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
+ "2023-11-03 12:32:26.161962: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
+ "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+ "Traceback (most recent call last):\n",
+ " File \"/home/jupyter/autopkg-summarizer/trainer/task.py\", line 4, in \n",
+ " import evaluate\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/evaluate/__init__.py\", line 29, in \n",
+ " from .evaluation_suite import EvaluationSuite\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/evaluate/evaluation_suite/__init__.py\", line 10, in \n",
+ " from ..evaluator import evaluator\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/evaluate/evaluator/__init__.py\", line 17, in \n",
+ " from transformers.pipelines import SUPPORTED_TASKS as SUPPORTED_PIPELINE_TASKS\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/transformers/pipelines/__init__.py\", line 72, in \n",
+ " from .table_question_answering import TableQuestionAnsweringArgumentHandler, TableQuestionAnsweringPipeline\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/transformers/pipelines/table_question_answering.py\", line 26, in \n",
+ " import tensorflow_probability as tfp\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/__init__.py\", line 20, in \n",
+ " from tensorflow_probability import substrates\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/substrates/__init__.py\", line 17, in \n",
+ " from tensorflow_probability.python.internal import all_util\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/__init__.py\", line 138, in \n",
+ " dir(globals()[pkg_name]) # Forces loading the package from its lazy loader.\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/internal/lazy_loader.py\", line 57, in __dir__\n",
+ " module = self._load()\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/internal/lazy_loader.py\", line 40, in _load\n",
+ " module = importlib.import_module(self.__name__)\n",
+ " File \"/opt/conda/lib/python3.10/importlib/__init__.py\", line 126, in import_module\n",
+ " return _bootstrap._gcd_import(name[level:], package, level)\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/experimental/__init__.py\", line 31, in \n",
+ " from tensorflow_probability.python.experimental import bayesopt\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/experimental/bayesopt/__init__.py\", line 17, in \n",
+ " from tensorflow_probability.python.experimental.bayesopt import acquisition\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/experimental/bayesopt/acquisition/__init__.py\", line 17, in \n",
+ " from tensorflow_probability.python.experimental.bayesopt.acquisition.acquisition_function import AcquisitionFunction\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/experimental/bayesopt/acquisition/acquisition_function.py\", line 22, in \n",
+ " from tensorflow_probability.python.internal import prefer_static as ps\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/internal/prefer_static.py\", line 361, in \n",
+ " ones_like = _copy_docstring(tf.ones_like, _ones_like)\n",
+ " File \"/opt/conda/lib/python3.10/site-packages/tensorflow_probability/python/internal/prefer_static.py\", line 84, in _copy_docstring\n",
+ " raise ValueError(\n",
+ "ValueError: Arg specs do not match: original=FullArgSpec(args=['input', 'dtype', 'name', 'layout'], varargs=None, varkw=None, defaults=(None, None, None), kwonlyargs=[], kwonlydefaults=None, annotations={}), new=FullArgSpec(args=['input', 'dtype', 'name'], varargs=None, varkw=None, defaults=(None, None), kwonlyargs=[], kwonlydefaults=None, annotations={}), fn=\n"
+ ]
+ }
+ ],
+ "source": [
+ "#to view options and defaults you can run the command below\n",
+ "!python autopkg-summarizer/trainer/task.py --help"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 258,
+ "id": "b21c8c79-1709-4052-8522-ae332cfec934",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Parameters for task.py script\n",
+ "CHECKPOINT = \"google/flan-t5-small\"\n",
+ "train_file=training_input_path\n",
+ "test_file=test_input_path\n",
+ "validation_file=validation_input_path\n",
+ "text_column=\"article\"\n",
+ "summary_column=\"abstract\"\n",
+ "source_prefix=\"summarize: \" \n",
+ "output_dir= f'gs://{BUCKET}'"
+ ]
+ },
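+ {
+ "cell_type": "markdown",
+ "id": "5b7c9d20-2e3f-4a40-9c5d-889900112233",
+ "metadata": {},
+ "source": [
+ "Optionally, before submitting the custom job you can smoke-test the script locally on the notebook instance with the same parameters. This is a sketch (IPython expands the `{...}` variables into the shell command); a full local run will be slow on CPU, so one epoch is plenty:\n",
+ "\n",
+ "```\n",
+ "!python autopkg-summarizer/trainer/task.py \\\n",
+ "    --model_name_or_path {CHECKPOINT} \\\n",
+ "    --train_file {train_file} \\\n",
+ "    --test_file {test_file} \\\n",
+ "    --validation_file {validation_file} \\\n",
+ "    --text_column {text_column} \\\n",
+ "    --summary_column {summary_column} \\\n",
+ "    --source_prefix '{source_prefix}' \\\n",
+ "    --num_train_epochs 1 \\\n",
+ "    --output_dir {output_dir}\n",
+ "```"
+ ]
+ },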
+ {
+ "cell_type": "markdown",
+ "id": "e830bb4a-854e-412d-93ec-3059faf603d6",
+ "metadata": {},
+ "source": [
+ "For custom AI we need to set the machine type, the accelerator for GPUs, and prebuilt docker image that will run our training. See here for more available containers: https://cloud.google.com/vertex-ai/docs/training/pre-built-containers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "09392ddd-aa9d-4358-95a6-3e64fa1692ad",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Parameters for custom AI job\n",
+ "display_name='flan-t5-training-tf'\n",
+ "BASE_GPU_IMAGE_tf='us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest'\n",
+ "machine_type='n1-standard-4'\n",
+ "accelerator_type='NVIDIA_TESLA_V100'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a6a7856-624d-4229-8a5d-cfd263a84033",
+ "metadata": {},
+ "source": [
+ "### Submit Custom AI Training Job"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16c88abb-2a4f-4475-ad58-1d69ca31c449",
+ "metadata": {},
+ "source": [
+ "Finally we can submit our training via a custom job! It will first deploy the container that we specified and then submit our model for training. This custom job can take 15 - 20 min using our sample datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 262,
+ "id": "252d8e16-5b3d-409b-bc86-9da0ce996f72",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+ "To disable this warning, you can either:\n",
+ "\t- Avoid using `tokenizers` before the fork if possible\n",
+ "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using endpoint [https://us-central1-aiplatform.googleapis.com/]\n",
+ "/usr/lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/subprocess.py:935: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n",
+ " self.stdin = io.open(p2cwrite, 'wb', bufsize)\n",
+ "/usr/lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/subprocess.py:941: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n",
+ " self.stdout = io.open(c2pread, 'rb', bufsize)\n",
+ "Sending build context to Docker daemon 18.99kB\n",
+ "Step 1/10 : FROM us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest\n",
+ " ---> bd2bbbab7d71\n",
+ "Step 2/10 : RUN mkdir -m 777 -p /usr/app /home\n",
+ " ---> Running in 358dbf3724e8\n",
+ "Removing intermediate container 358dbf3724e8\n",
+ " ---> edf7be7209d7\n",
+ "Step 3/10 : WORKDIR /usr/app\n",
+ " ---> Running in a23be90e59c5\n",
+ "Removing intermediate container a23be90e59c5\n",
+ " ---> c35f2baa964c\n",
+ "Step 4/10 : ENV HOME=/home\n",
+ " ---> Running in 0137537b093b\n",
+ "Removing intermediate container 0137537b093b\n",
+ " ---> 64af9b387e54\n",
+ "Step 5/10 : ENV PYTHONDONTWRITEBYTECODE=1\n",
+ " ---> Running in cc5806ee80a2\n",
+ "Removing intermediate container cc5806ee80a2\n",
+ " ---> dfe914f7ecbc\n",
+ "Step 6/10 : RUN rm -rf /var/sitecustomize\n",
+ " ---> Running in 3e7c5fa57fe2\n",
+ "Removing intermediate container 3e7c5fa57fe2\n",
+ " ---> fa997bc68c88\n",
+ "Step 7/10 : COPY [\"./requirements.txt\", \"./requirements.txt\"]\n",
+ " ---> 7c46da48c940\n",
+ "Step 8/10 : RUN pip3 install --no-cache-dir -r ./requirements.txt\n",
+ " ---> Running in 6502f72390d6\n",
+ "Collecting evaluate (from -r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for evaluate from https://files.pythonhosted.org/packages/70/63/7644a1eb7b0297e585a6adec98ed9e575309bb973c33b394dae66bc35c69/evaluate-0.4.1-py3-none-any.whl.metadata\n",
+ " Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)\n",
+ "Collecting nltk (from -r ./requirements.txt (line 2))\n",
+ " Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 61.9 MB/s eta 0:00:00\n",
+ "Collecting transformers (from -r ./requirements.txt (line 3))\n",
+ " Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/9a/06/e4ec2a321e57c03b7e9345d709d554a52c33760e5015fdff0919d9459af0/transformers-4.35.0-py3-none-any.whl.metadata\n",
+ " Downloading transformers-4.35.0-py3-none-any.whl.metadata (123 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.1/123.1 kB 203.6 MB/s eta 0:00:00\n",
+ "Collecting keras_nlp (from -r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for keras_nlp from https://files.pythonhosted.org/packages/37/d4/dfd85606db811af2138e97fc480eb7ed709042dd96dd453868bede0929fe/keras_nlp-0.6.2-py3-none-any.whl.metadata\n",
+ " Downloading keras_nlp-0.6.2-py3-none-any.whl.metadata (7.2 kB)\n",
+ "Collecting datasets (from -r ./requirements.txt (line 5))\n",
+ " Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/7c/55/b3432f43d6d7fee999bb23a547820d74c48ec540f5f7842e41aa5d8d5f3a/datasets-2.14.6-py3-none-any.whl.metadata\n",
+ " Downloading datasets-2.14.6-py3-none-any.whl.metadata (19 kB)\n",
+ "Collecting rouge_score (from -r ./requirements.txt (line 6))\n",
+ " Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
+ " Preparing metadata (setup.py): started\n",
+ " Preparing metadata (setup.py): finished with status 'done'\n",
+ "Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (1.23.5)\n",
+ "Collecting dill (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for dill from https://files.pythonhosted.org/packages/f5/3a/74a29b11cf2cdfcd6ba89c0cecd70b37cd1ba7b77978ce611eb7a146a832/dill-0.3.7-py3-none-any.whl.metadata\n",
+ " Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)\n",
+ "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (2.0.3)\n",
+ "Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (2.31.0)\n",
+ "Requirement already satisfied: tqdm>=4.62.1 in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (4.65.0)\n",
+ "Collecting xxhash (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for xxhash from https://files.pythonhosted.org/packages/80/8a/1dd41557883b6196f8f092011a5c1f72d4d44cf36d7b67d4a5efe3127949/xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n",
+ "Collecting multiprocess (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for multiprocess from https://files.pythonhosted.org/packages/35/a8/36d8d7b3e46b377800d8dec47891cdf05842d1a2366909ae4a0c89fbc5e6/multiprocess-0.70.15-py310-none-any.whl.metadata\n",
+ " Downloading multiprocess-0.70.15-py310-none-any.whl.metadata (7.2 kB)\n",
+ "Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (2023.6.0)\n",
+ "Collecting huggingface-hub>=0.7.0 (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for huggingface-hub>=0.7.0 from https://files.pythonhosted.org/packages/ef/b5/b6107bd65fa4c96fdf00e4733e2fe5729bb9e5e09997f63074bb43d3ab28/huggingface_hub-0.18.0-py3-none-any.whl.metadata\n",
+ " Downloading huggingface_hub-0.18.0-py3-none-any.whl.metadata (13 kB)\n",
+ "Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from evaluate->-r ./requirements.txt (line 1)) (23.1)\n",
+ "Collecting responses<0.19 (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
+ "Requirement already satisfied: click in /opt/conda/lib/python3.10/site-packages (from nltk->-r ./requirements.txt (line 2)) (8.1.6)\n",
+ "Requirement already satisfied: joblib in /opt/conda/lib/python3.10/site-packages (from nltk->-r ./requirements.txt (line 2)) (1.3.1)\n",
+ "Collecting regex>=2021.8.3 (from nltk->-r ./requirements.txt (line 2))\n",
+ " Obtaining dependency information for regex>=2021.8.3 from https://files.pythonhosted.org/packages/8f/3e/4b8b40eb3c80aeaf360f0361d956d129bb3d23b2a3ecbe3a04a8f3bdd6d3/regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.9/40.9 kB 178.9 MB/s eta 0:00:00\n",
+ "Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from transformers->-r ./requirements.txt (line 3)) (3.12.2)\n",
+ "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from transformers->-r ./requirements.txt (line 3)) (6.0.1)\n",
+ "Collecting tokenizers<0.15,>=0.14 (from transformers->-r ./requirements.txt (line 3))\n",
+ " Obtaining dependency information for tokenizers<0.15,>=0.14 from https://files.pythonhosted.org/packages/a7/7b/c1f643eb086b6c5c33eef0c3752e37624bd23e4cbc9f1332748f1c6252d1/tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n",
+ "Collecting safetensors>=0.3.1 (from transformers->-r ./requirements.txt (line 3))\n",
+ " Obtaining dependency information for safetensors>=0.3.1 from https://files.pythonhosted.org/packages/20/4e/878b080dbda92666233ec6f316a53969edcb58eab1aa399a64d0521cf953/safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)\n",
+ "Collecting keras-core (from keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for keras-core from https://files.pythonhosted.org/packages/95/f7/b8dcff937ea64f822f0d3fe8c6010793406b82d14467cd0e9eecea458a40/keras_core-0.1.7-py3-none-any.whl.metadata\n",
+ " Downloading keras_core-0.1.7-py3-none-any.whl.metadata (4.3 kB)\n",
+ "Requirement already satisfied: absl-py in /opt/conda/lib/python3.10/site-packages (from keras_nlp->-r ./requirements.txt (line 4)) (1.4.0)\n",
+ "Requirement already satisfied: rich in /opt/conda/lib/python3.10/site-packages (from keras_nlp->-r ./requirements.txt (line 4)) (13.5.1)\n",
+ "Requirement already satisfied: dm-tree in /opt/conda/lib/python3.10/site-packages (from keras_nlp->-r ./requirements.txt (line 4)) (0.1.8)\n",
+ "Collecting tensorflow-text (from keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for tensorflow-text from https://files.pythonhosted.org/packages/0b/5f/8b301d2d0cea8334c22aaeb8880ce115ec34d7eba20f7b08c64202011a85/tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)\n",
+ "Requirement already satisfied: pyarrow>=8.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets->-r ./requirements.txt (line 5)) (12.0.1)\n",
+ "Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets->-r ./requirements.txt (line 5)) (3.8.5)\n",
+ "Requirement already satisfied: six>=1.14.0 in /opt/conda/lib/python3.10/site-packages (from rouge_score->-r ./requirements.txt (line 6)) (1.16.0)\n",
+ "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (23.1.0)\n",
+ "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (3.2.0)\n",
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (6.0.4)\n",
+ "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (4.0.2)\n",
+ "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (1.9.2)\n",
+ "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (1.4.0)\n",
+ "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets->-r ./requirements.txt (line 5)) (1.3.1)\n",
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.10/site-packages (from huggingface-hub>=0.7.0->evaluate->-r ./requirements.txt (line 1)) (4.7.1)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate->-r ./requirements.txt (line 1)) (3.4)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate->-r ./requirements.txt (line 1)) (1.26.16)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate->-r ./requirements.txt (line 1)) (2023.7.22)\n",
+ "Collecting huggingface-hub>=0.7.0 (from evaluate->-r ./requirements.txt (line 1))\n",
+ " Obtaining dependency information for huggingface-hub>=0.7.0 from https://files.pythonhosted.org/packages/aa/f3/3fc97336a0e90516901befd4f500f08d691034d387406fdbde85bea827cc/huggingface_hub-0.17.3-py3-none-any.whl.metadata\n",
+ " Downloading huggingface_hub-0.17.3-py3-none-any.whl.metadata (13 kB)\n",
+ "Collecting namex (from keras-core->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Downloading namex-0.0.7-py3-none-any.whl (5.8 kB)\n",
+ "Requirement already satisfied: h5py in /opt/conda/lib/python3.10/site-packages (from keras-core->keras_nlp->-r ./requirements.txt (line 4)) (3.9.0)\n",
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate->-r ./requirements.txt (line 1)) (2.8.2)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate->-r ./requirements.txt (line 1)) (2023.3)\n",
+ "Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate->-r ./requirements.txt (line 1)) (2023.3)\n",
+ "Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from rich->keras_nlp->-r ./requirements.txt (line 4)) (3.0.0)\n",
+ "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.10/site-packages (from rich->keras_nlp->-r ./requirements.txt (line 4)) (2.15.1)\n",
+ "Collecting tensorflow-hub>=0.13.0 (from tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for tensorflow-hub>=0.13.0 from https://files.pythonhosted.org/packages/6e/1a/fbae76f4057b9bcdf9468025d7a8ca952dec14bfafb9fc0b1e4244ce212f/tensorflow_hub-0.15.0-py2.py3-none-any.whl.metadata\n",
+ " Downloading tensorflow_hub-0.15.0-py2.py3-none-any.whl.metadata (1.3 kB)\n",
+ "Collecting tensorflow<2.15,>=2.14.0 (from tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for tensorflow<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/e2/7a/c7762c698fb1ac41a7e3afee51dc72aa3ec74ae8d2f57ce19a9cded3a4af/tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata\n",
+ " Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)\n",
+ "Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->keras_nlp->-r ./requirements.txt (line 4)) (0.1.2)\n",
+ "Requirement already satisfied: astunparse>=1.6.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (1.6.3)\n",
+ "Requirement already satisfied: flatbuffers>=23.5.26 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (23.5.26)\n",
+ "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.4.0)\n",
+ "Requirement already satisfied: google-pasta>=0.1.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.2.0)\n",
+ "Requirement already satisfied: libclang>=13.0.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (16.0.6)\n",
+ "Requirement already satisfied: ml-dtypes==0.2.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.2.0)\n",
+ "Requirement already satisfied: opt-einsum>=2.3.2 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (3.3.0)\n",
+ "Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 from https://files.pythonhosted.org/packages/ae/32/45b1cf0c5d4a3ba881f5164c26af877c0dabfe6de0019d426aa0e5cf6806/protobuf-4.25.0-cp37-abi3-manylinux2014_x86_64.whl.metadata\n",
+ " Downloading protobuf-4.25.0-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)\n",
+ "Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (68.0.0)\n",
+ "Requirement already satisfied: termcolor>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (2.3.0)\n",
+ "Requirement already satisfied: wrapt<1.15,>=1.11.0 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (1.14.1)\n",
+ "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.32.0)\n",
+ "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /opt/conda/lib/python3.10/site-packages (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (1.56.2)\n",
+ "Collecting tensorboard<2.15,>=2.14 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for tensorboard<2.15,>=2.14 from https://files.pythonhosted.org/packages/73/a2/66ed644f6ed1562e0285fcd959af17670ea313c8f331c46f79ee77187eb9/tensorboard-2.14.1-py3-none-any.whl.metadata\n",
+ " Downloading tensorboard-2.14.1-py3-none-any.whl.metadata (1.7 kB)\n",
+ "Collecting tensorflow-estimator<2.15,>=2.14.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for tensorflow-estimator<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/d1/da/4f264c196325bb6e37a6285caec5b12a03def489b57cc1fdac02bb6272cd/tensorflow_estimator-2.14.0-py2.py3-none-any.whl.metadata\n",
+ " Downloading tensorflow_estimator-2.14.0-py2.py3-none-any.whl.metadata (1.3 kB)\n",
+ "Collecting keras<2.15,>=2.14.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4))\n",
+ " Obtaining dependency information for keras<2.15,>=2.14.0 from https://files.pythonhosted.org/packages/fe/58/34d4d8f1aa11120c2d36d7ad27d0526164b1a8ae45990a2fede31d0e59bf/keras-2.14.0-py3-none-any.whl.metadata\n",
+ " Downloading keras-2.14.0-py3-none-any.whl.metadata (2.4 kB)\n",
+ "Requirement already satisfied: wheel<1.0,>=0.23.0 in /opt/conda/lib/python3.10/site-packages (from astunparse>=1.6.0->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.41.0)\n",
+ "Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (2.22.0)\n",
+ "Requirement already satisfied: google-auth-oauthlib<1.1,>=0.5 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (1.0.0)\n",
+ "Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (3.4.4)\n",
+ "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.7.1)\n",
+ "Requirement already satisfied: werkzeug>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (2.3.6)\n",
+ "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (5.3.1)\n",
+ "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.3.0)\n",
+ "Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (4.9)\n",
+ "Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.10/site-packages (from google-auth-oauthlib<1.1,>=0.5->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (1.3.1)\n",
+ "Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/conda/lib/python3.10/site-packages (from werkzeug>=1.0.1->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (2.1.3)\n",
+ "Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /opt/conda/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (0.5.0)\n",
+ "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.10/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard<2.15,>=2.14->tensorflow<2.15,>=2.14.0->tensorflow-text->keras_nlp->-r ./requirements.txt (line 4)) (3.2.2)\n",
+ "Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 198.5 MB/s eta 0:00:00\n",
+ "Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 124.2 MB/s eta 0:00:00\n",
+ "Downloading keras_nlp-0.6.2-py3-none-any.whl (590 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 590.1/590.1 kB 225.3 MB/s eta 0:00:00\n",
+ "Downloading datasets-2.14.6-py3-none-any.whl (493 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 493.7/493.7 kB 214.8 MB/s eta 0:00:00\n",
+ "Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.3/115.3 kB 230.1 MB/s eta 0:00:00\n",
+ "Downloading regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 773.9/773.9 kB 233.9 MB/s eta 0:00:00\n",
+ "Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 228.3 MB/s eta 0:00:00\n",
+ "Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 223.3 MB/s eta 0:00:00\n",
+ "Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.0/295.0 kB 240.5 MB/s eta 0:00:00\n",
+ "Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 950.8/950.8 kB 236.1 MB/s eta 0:00:00\n",
+ "Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 216.1 MB/s eta 0:00:00\n",
+ "Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 158.6 MB/s eta 0:00:00\n",
+ "Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 223.4 MB/s eta 0:00:00\n",
+ "Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (489.8 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 489.8/489.8 MB 212.8 MB/s eta 0:00:00\n",
+ "Downloading tensorflow_hub-0.15.0-py2.py3-none-any.whl (85 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.4/85.4 kB 185.0 MB/s eta 0:00:00\n",
+ "Downloading keras-2.14.0-py3-none-any.whl (1.7 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 219.8 MB/s eta 0:00:00\n",
+ "Downloading protobuf-4.25.0-cp37-abi3-manylinux2014_x86_64.whl (294 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.4/294.4 kB 217.2 MB/s eta 0:00:00\n",
+ "Downloading tensorboard-2.14.1-py3-none-any.whl (5.5 MB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 212.0 MB/s eta 0:00:00\n",
+ "Downloading tensorflow_estimator-2.14.0-py2.py3-none-any.whl (440 kB)\n",
+ " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 440.7/440.7 kB 223.9 MB/s eta 0:00:00\n",
+ "Building wheels for collected packages: rouge_score\n",
+ " Building wheel for rouge_score (setup.py): started\n",
+ " Building wheel for rouge_score (setup.py): finished with status 'done'\n",
+ " Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=7fb2b5092b892710a8c128f5633d6f5f22dc260df119b78067900b8c74e972a4\n",
+ " Stored in directory: /tmp/pip-ephem-wheel-cache-sagd5q__/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
+ "Successfully built rouge_score\n",
+ "Installing collected packages: namex, xxhash, tensorflow-estimator, safetensors, regex, protobuf, keras, dill, tensorflow-hub, responses, nltk, multiprocess, huggingface-hub, tokenizers, rouge_score, keras-core, transformers, tensorboard, datasets, tensorflow, evaluate, tensorflow-text, keras_nlp\n",
+ " Attempting uninstall: tensorflow-estimator\n",
+ " Found existing installation: tensorflow-estimator 2.12.0\n",
+ " Uninstalling tensorflow-estimator-2.12.0:\n",
+ " Successfully uninstalled tensorflow-estimator-2.12.0\n",
+ " Attempting uninstall: protobuf\n",
+ " Found existing installation: protobuf 3.20.1\n",
+ " Uninstalling protobuf-3.20.1:\n",
+ " Successfully uninstalled protobuf-3.20.1\n",
+ " Attempting uninstall: keras\n",
+ " Found existing installation: keras 2.12.0\n",
+ " Uninstalling keras-2.12.0:\n",
+ " Successfully uninstalled keras-2.12.0\n",
+ " Attempting uninstall: tensorboard\n",
+ " Found existing installation: tensorboard 2.12.3\n",
+ " Uninstalling tensorboard-2.12.3:\n",
+ " Successfully uninstalled tensorboard-2.12.3\n",
+ " Attempting uninstall: tensorflow\n",
+ " Found existing installation: tensorflow 2.12.0\n",
+ " Uninstalling tensorflow-2.12.0:\n",
+ " Successfully uninstalled tensorflow-2.12.0\n",
+ "Successfully installed datasets-2.14.6 dill-0.3.7 evaluate-0.4.1 huggingface-hub-0.17.3 keras-2.14.0 keras-core-0.1.7 keras_nlp-0.6.2 multiprocess-0.70.15 namex-0.0.7 nltk-3.8.1 protobuf-4.25.0 regex-2023.10.3 responses-0.18.0 rouge_score-0.1.2 safetensors-0.4.0 tensorboard-2.14.1 tensorflow-2.14.0 tensorflow-estimator-2.14.0 tensorflow-hub-0.15.0 tensorflow-text-2.14.0 tokenizers-0.14.1 transformers-4.35.0 xxhash-3.4.1\n",
+ "\u001b[91mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
+ "google-cloud-datastore 1.15.5 requires protobuf<4.0.0dev, but you have protobuf 4.25.0 which is incompatible.\n",
+ "\u001b[0m\u001b[91mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\n",
+ "\u001b[0mRemoving intermediate container 6502f72390d6\n",
+ " ---> 97a4b7990a59\n",
+ "Step 9/10 : COPY [\"trainer\", \"trainer\"]\n",
+ " ---> dce93f89c146\n",
+ "Step 10/10 : ENTRYPOINT [\"python3\", \"-m\", \"trainer.task\"]\n",
+ " ---> Running in beccb40ff5ce\n",
+ "Removing intermediate container beccb40ff5ce\n",
+ " ---> 6be133543c75\n",
+ "Successfully built 6be133543c75\n",
+ "Successfully tagged gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.17.39.12.779660\n",
+ "\n",
+ "A custom container image is built locally.\n",
+ "\n",
+ "/usr/lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/subprocess.py:935: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n",
+ " self.stdin = io.open(p2cwrite, 'wb', bufsize)\n",
+ "/usr/lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/subprocess.py:941: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used\n",
+ " self.stdout = io.open(c2pread, 'rb', bufsize)\n",
+ "The push refers to repository [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3]\n",
+ "de565aa0e952: Preparing\n",
+ "8027f564cadd: Preparing\n",
+ "5bd43a783137: Preparing\n",
+ "c2cec13eda62: Preparing\n",
+ "73c814d198fd: Preparing\n",
+ "e42695c7b436: Preparing\n",
+ "e42695c7b436: Preparing\n",
+ "7e34967c8575: Preparing\n",
+ "19c1ff49a1a3: Preparing\n",
+ "724eb7d1e386: Preparing\n",
+ "e7df186da59e: Preparing\n",
+ "e7df186da59e: Preparing\n",
+ "d9e5455afa58: Preparing\n",
+ "a4f1c7b5b5c5: Preparing\n",
+ "1eeca563762d: Preparing\n",
+ "b3f8d9df367e: Preparing\n",
+ "29e2658ae6ea: Preparing\n",
+ "228616cf4f10: Preparing\n",
+ "ae32b7336b96: Preparing\n",
+ "ae32b7336b96: Preparing\n",
+ "ea7b0ccc272e: Preparing\n",
+ "01d4173a3960: Preparing\n",
+ "c235d251a607: Preparing\n",
+ "f2833e4d69b4: Preparing\n",
+ "49fc5a524f1f: Preparing\n",
+ "e175e85d3600: Preparing\n",
+ "55bfb3527de7: Preparing\n",
+ "ee67859f37c6: Preparing\n",
+ "ed7e041f0699: Preparing\n",
+ "0235cf47cbae: Preparing\n",
+ "724eb7d1e386: Waiting\n",
+ "2971cdbb4b45: Preparing\n",
+ "8374b2bc65e7: Preparing\n",
+ "3b93a6feba89: Preparing\n",
+ "b15400eb0fa7: Preparing\n",
+ "29ecaf0c2ae0: Preparing\n",
+ "a4f1c7b5b5c5: Waiting\n",
+ "41e673079fce: Preparing\n",
+ "e7df186da59e: Waiting\n",
+ "1eeca563762d: Waiting\n",
+ "cda9215846ee: Preparing\n",
+ "d9e5455afa58: Waiting\n",
+ "c5eafb4bee8f: Preparing\n",
+ "b3f8d9df367e: Waiting\n",
+ "29e2658ae6ea: Waiting\n",
+ "81182eb0608d: Preparing\n",
+ "f2baf76d88ee: Preparing\n",
+ "228616cf4f10: Waiting\n",
+ "01d4173a3960: Waiting\n",
+ "cdd7c7392317: Preparing\n",
+ "ae32b7336b96: Waiting\n",
+ "c235d251a607: Waiting\n",
+ "ea7b0ccc272e: Waiting\n",
+ "e175e85d3600: Waiting\n",
+ "f2833e4d69b4: Waiting\n",
+ "b15400eb0fa7: Waiting\n",
+ "29ecaf0c2ae0: Waiting\n",
+ "2971cdbb4b45: Waiting\n",
+ "49fc5a524f1f: Waiting\n",
+ "41e673079fce: Waiting\n",
+ "55bfb3527de7: Waiting\n",
+ "ee67859f37c6: Waiting\n",
+ "cda9215846ee: Waiting\n",
+ "3b93a6feba89: Waiting\n",
+ "8374b2bc65e7: Waiting\n",
+ "ed7e041f0699: Waiting\n",
+ "c5eafb4bee8f: Waiting\n",
+ "0235cf47cbae: Waiting\n",
+ "81182eb0608d: Waiting\n",
+ "cdd7c7392317: Waiting\n",
+ "f2baf76d88ee: Waiting\n",
+ "e42695c7b436: Waiting\n",
+ "7e34967c8575: Waiting\n",
+ "19c1ff49a1a3: Waiting\n",
+ "73c814d198fd: Pushed\n",
+ "5bd43a783137: Pushed\n",
+ "c2cec13eda62: Pushed\n",
+ "de565aa0e952: Pushed\n",
+ "e42695c7b436: Layer already exists\n",
+ "7e34967c8575: Layer already exists\n",
+ "19c1ff49a1a3: Layer already exists\n",
+ "e7df186da59e: Layer already exists\n",
+ "724eb7d1e386: Layer already exists\n",
+ "d9e5455afa58: Layer already exists\n",
+ "a4f1c7b5b5c5: Layer already exists\n",
+ "1eeca563762d: Layer already exists\n",
+ "b3f8d9df367e: Layer already exists\n",
+ "228616cf4f10: Layer already exists\n",
+ "29e2658ae6ea: Layer already exists\n",
+ "ae32b7336b96: Layer already exists\n",
+ "ea7b0ccc272e: Layer already exists\n",
+ "01d4173a3960: Layer already exists\n",
+ "c235d251a607: Layer already exists\n",
+ "f2833e4d69b4: Layer already exists\n",
+ "49fc5a524f1f: Layer already exists\n",
+ "e175e85d3600: Layer already exists\n",
+ "55bfb3527de7: Layer already exists\n",
+ "ee67859f37c6: Layer already exists\n",
+ "ed7e041f0699: Layer already exists\n",
+ "0235cf47cbae: Layer already exists\n",
+ "2971cdbb4b45: Layer already exists\n",
+ "8374b2bc65e7: Layer already exists\n",
+ "3b93a6feba89: Layer already exists\n",
+ "b15400eb0fa7: Layer already exists\n",
+ "41e673079fce: Layer already exists\n",
+ "29ecaf0c2ae0: Layer already exists\n",
+ "c5eafb4bee8f: Layer already exists\n",
+ "cda9215846ee: Layer already exists\n",
+ "81182eb0608d: Layer already exists\n",
+ "f2baf76d88ee: Layer already exists\n",
+ "cdd7c7392317: Layer already exists\n",
+ "8027f564cadd: Pushed\n",
+ "20231103.17.39.12.779660: digest: sha256:1240e61185c933e273e7bc6b5112358d85942e1f8bcb2cf076b3a144e5b748eb size: 8901\n",
+ "\n",
+ "Custom container image [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.17.39.12.779660] is created for your custom job.\n",
+ "\n",
+ "CustomJob [projects/144763482491/locations/us-central1/customJobs/6207308081613766656] is submitted successfully.\n",
+ "\n",
+ "Your job is still active. You may view the status of your job with the command\n",
+ "\n",
+ " $ gcloud ai custom-jobs describe projects/144763482491/locations/us-central1/customJobs/6207308081613766656\n",
+ "\n",
+ "or continue streaming the logs with the command\n",
+ "\n",
+ " $ gcloud ai custom-jobs stream-logs projects/144763482491/locations/us-central1/customJobs/6207308081613766656\n"
+ ]
+ }
+ ],
+ "source": [
+ "!gcloud ai custom-jobs create \\\n",
+ "--region=us-central1 \\\n",
+ "--display-name=$display_name \\\n",
+ "--args=--model_name_or_path=$CHECKPOINT \\\n",
+ "--args=--train_file=$train_file \\\n",
+ "--args=--test_file=$test_file \\\n",
+ "--args=--validation_file=$validation_file \\\n",
+ "--args=--text_column=$text_column \\\n",
+ "--args=--summary_column=$summary_column \\\n",
+ "--args=--output_dir=gs://$BUCKET \\\n",
+ "--args=--source_prefix=$source_prefix \\\n",
+ "--worker-pool-spec=machine-type=$machine_type,replica-count=1,accelerator-type=$accelerator_type,executor-image-uri=$BASE_GPU_IMAGE_tf,local-package-path=autopkg-summarizer,python-module=trainer.task"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "403fe77f-990c-4518-ad3a-0aac3d2c8b92",
+ "metadata": {},
+ "source": [
+ "Once you start training the output from the command line should show you the command to use to view the progress of your training via the command `gcloud ai custom-jobs stream-logs <`. You can also monitor and view logs on the console by going to `Vertex AI > Training > Custom Jobs`\n",
+ "select your custom job and click on \"View Logs\""
+ ]
+ },
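+  {
+   "cell_type": "markdown",
+   "id": "5e0a7c21-93b4-47a1-9c2d-1f6a8f0e2b11",
+   "metadata": {},
+   "source": [
+    "As a minimal sketch (assuming the `google-cloud-aiplatform` SDK is installed, and using a hypothetical job resource name copied from the output above), you can also poll the state of the custom job from Python:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f1b8d32-a4c5-48b2-8d3e-2a7b9f1c3d22",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal sketch: poll the state of a custom job from Python.\n",
+    "# The resource name is a hypothetical placeholder; copy yours from the\n",
+    "# `gcloud ai custom-jobs create` output above.\n",
+    "from google.cloud import aiplatform\n",
+    "\n",
+    "job = aiplatform.CustomJob.get(\n",
+    "    'projects/<PROJECT_NUMBER>/locations/us-central1/customJobs/<JOB_ID>'\n",
+    ")\n",
+    "print(job.state)  # e.g. JobState.JOB_STATE_RUNNING"
+   ]
+  },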
+ {
+ "cell_type": "markdown",
+ "id": "dc7b81af-d424-427d-a37a-ac5da197567e",
+ "metadata": {},
+ "source": [
+ "## Deploy the Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d772dc95-95e9-40f2-a5c5-dc782d6f7e14",
+ "metadata": {},
+ "source": [
+ "### Upload the Model to Vertex AI's Model Registry"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4dff127a-4d14-4fa1-a22a-75662eccce02",
+ "metadata": {},
+ "source": [
+ "Once our model is done training you should see a model_save.pd file in your bucket. We will need this inorder to upload our model to the Model Registry. Here we are specifiying a prebuilt docker image that will run our predictions, the name of our model and the directory in our bucket that holds our **model_save.pd** file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "26042230-dc95-4b6c-bd32-bf3596e5de52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "TF_PREDICTION_IMAGE_URI_RUNTIME = 'us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "f5a3ef3b-8080-4e2d-bb8f-7e2f22c59e05",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Creating Model\n",
+ "Create Model backing LRO: projects/144763482491/locations/us-central1/models/3296764669607280640/operations/1237604172191236096\n",
+ "Model created. Resource name: projects/144763482491/locations/us-central1/models/3296764669607280640@1\n",
+ "To use this Model in another session:\n",
+ "model = aiplatform.Model('projects/144763482491/locations/us-central1/models/3296764669607280640@1')\n"
+ ]
+ }
+ ],
+ "source": [
+ "from google.cloud import aiplatform as vertexai\n",
+ "from google.cloud import aiplatform\n",
+ "\n",
+ "#give your model a name\n",
+ "MODEL_DISPLAY_NAME = \"summarizer-tf-runtime\"\n",
+ "MODEL_DESCRIPTION = \"summarizes scientific texts and pdfs\" #optional\n",
+ "\n",
+ "#add your project ID and location\n",
+ "project=''\n",
+ "location=''\n",
+ "\n",
+ "vertexai.init(project=project, location=location, staging_bucket=BUCKET)\n",
+ "\n",
+ "\n",
+ "model = aiplatform.Model.upload(\n",
+ " display_name=MODEL_DISPLAY_NAME,\n",
+ " description=MODEL_DESCRIPTION,\n",
+ " serving_container_image_uri=TF_PREDICTION_IMAGE_URI_RUNTIME,\n",
+ " serving_container_args=[\"--allow_precompilation\", \"--allow_compression\", \"--use_tfrt\"],\n",
+ " artifact_uri=f'gs://{BUCKET}/saved_model_artifacts_tf', #directory where our artifacts are in our bucket\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d9772a7-f00d-4633-aa87-3861fa5dec79",
+ "metadata": {},
+ "source": [
+ "### Create a Endpoint and Deploy it to our Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26122187-da1c-4d26-b1a0-ec1bd403cb19",
+ "metadata": {},
+ "source": [
+ "A **endpoint** is how the user of the model can communicate with the model. A single model endpoint responds by returning a single inference from at least one model. It can take 20 min or more to establish a endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "74a2c3dd-0e34-4049-804b-940c9a440570",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Creating Endpoint\n",
+ "Create Endpoint backing LRO: projects/144763482491/locations/us-central1/endpoints/5468832298092724224/operations/884634551396073472\n",
+ "Endpoint created. Resource name: projects/144763482491/locations/us-central1/endpoints/5468832298092724224\n",
+ "To use this Endpoint in another session:\n",
+ "endpoint = aiplatform.Endpoint('projects/144763482491/locations/us-central1/endpoints/5468832298092724224')\n",
+ "Deploying model to Endpoint : projects/144763482491/locations/us-central1/endpoints/5468832298092724224\n",
+ "Deploy Endpoint model backing LRO: projects/144763482491/locations/us-central1/endpoints/5468832298092724224/operations/5601029261159825408\n",
+ "Endpoint model deployed. Resource name: projects/144763482491/locations/us-central1/endpoints/5468832298092724224\n"
+ ]
+ }
+ ],
+ "source": [
+ "ENDPOINT_DISPLAY_NAME = \"summarizer-endpoint\" \n",
+ "endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_DISPLAY_NAME)\n",
+ "\n",
+ "model_endpoint = model.deploy(\n",
+ " endpoint=endpoint,\n",
+ " deployed_model_display_name=MODEL_DISPLAY_NAME,\n",
+ " machine_type=\"n1-standard-8\",\n",
+ " accelerator_type=\"NVIDIA_TESLA_V100\",\n",
+ " accelerator_count=1,\n",
+ " traffic_percentage=100,\n",
+ " deploy_request_timeout=1200,\n",
+ " sync=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4dee7811-559f-4bdc-b56e-b932b0831c0f",
+ "metadata": {},
+ "source": [
+ "Here we are creating a endpoint and deploying our model to said endpoint. We are deploying our endpoint using 1 GPU which can take 20min to run, feel free to try out other machine types that utilize more GPUs."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1544dbe2-8f06-43e8-9b2c-e9d57332e00e",
+ "metadata": {},
+ "source": [
+ "## Delete All Resources"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b7a6d9a-3d0c-425e-a8f3-cb3160f1ee3b",
+ "metadata": {},
+ "source": [
+ "**Warning:** Once you are done don't forget to delete your endpoint, model, buckets, and shutdown or delete your Vertex AI notebook to avoid additional charges!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24d0fea3-fd4a-4735-b6d5-b910239b5ffa",
+ "metadata": {},
+ "source": [
+ "First we will delete our custom job. The command below will list custom jobs allowing you to gather the job id from the field called **'name:projects//locations/us-central1/customJobs/'**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "9721f52d-040f-4dc7-808e-8d1ffb5efb4a",
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using endpoint [https://us-central1-aiplatform.googleapis.com/]\n",
+ "---\n",
+ "createTime: '2023-11-03T17:43:15.502041Z'\n",
+ "displayName: flan-t5-training-tf3\n",
+ "endTime: '2023-11-03T18:03:29Z'\n",
+ "jobSpec:\n",
+ " workerPoolSpecs:\n",
+ " - containerSpec:\n",
+ " args:\n",
+ " - --model_name_or_path=google/flan-t5-small\n",
+ " - --train_file=gs://flan-t5-model-resources/train.csv\n",
+ " - --test_file=gs://flan-t5-model-resources/test.csv\n",
+ " - --validation_file=gs://flan-t5-model-resources/validation.csv\n",
+ " - --text_column=article\n",
+ " - --summary_column=abstract\n",
+ " - --output_dir=gs://flan-t5-model-resources/\n",
+ " - '--source_prefix=summarize:'\n",
+ " imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.17.39.12.779660\n",
+ " diskSpec:\n",
+ " bootDiskSizeGb: 100\n",
+ " bootDiskType: pd-ssd\n",
+ " machineSpec:\n",
+ " acceleratorCount: 1\n",
+ " acceleratorType: NVIDIA_TESLA_V100\n",
+ " machineType: n1-standard-4\n",
+ " replicaCount: '1'\n",
+ "name: projects/144763482491/locations/us-central1/customJobs/6207308081613766656\n",
+ "startTime: '2023-11-03T17:48:23Z'\n",
+ "state: JOB_STATE_SUCCEEDED\n",
+ "updateTime: '2023-11-03T18:03:44.992454Z'\n",
+ "---\n",
+ "createTime: '2023-11-02T04:29:33.732327Z'\n",
+ "displayName: flan-t5-training-tf3\n",
+ "endTime: '2023-11-02T04:34:24Z'\n",
+ "error:\n",
+ " code: 3\n",
+ " message: 'The replica workerpool0-0 exited with a non-zero status of 1. To find\n",
+ " out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=144763482491&resource=ml_job%2Fjob_id%2F2998009561996066816&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%222998009561996066816%22'\n",
+ "jobSpec:\n",
+ " workerPoolSpecs:\n",
+ " - containerSpec:\n",
+ " args:\n",
+ " - --model_name_or_path=google/flan-t5-small\n",
+ " - --train_file=gs://flan-t5-model-resources/train.csv\n",
+ " - --test_file=gs://flan-t5-model-resources/test.csv\n",
+ " - --validation_file=gs://flan-t5-model-resources/validation.csv\n",
+ " - --text_column=article\n",
+ " - --summary_column=abstract\n",
+ " - --output_dir=gs://flan-t5-model-resources\n",
+ " - '--source_prefix=summarize:'\n",
+ " imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231102.04.26.29.256583\n",
+ " diskSpec:\n",
+ " bootDiskSizeGb: 100\n",
+ " bootDiskType: pd-ssd\n",
+ " machineSpec:\n",
+ " acceleratorCount: 1\n",
+ " acceleratorType: NVIDIA_TESLA_V100\n",
+ " machineType: n1-standard-4\n",
+ " replicaCount: '1'\n",
+ "name: projects/144763482491/locations/us-central1/customJobs/2998009561996066816\n",
+ "startTime: '2023-11-02T04:33:54Z'\n",
+ "state: JOB_STATE_FAILED\n",
+ "updateTime: '2023-11-02T04:34:28.106045Z'\n",
+ "---\n",
+ "createTime: '2023-10-30T11:24:17.577560Z'\n",
+ "displayName: flan-t5-training-tf\n",
+ "endTime: '2023-10-30T11:44:47Z'\n",
+ "jobSpec:\n",
+ " workerPoolSpecs:\n",
+ " - containerSpec:\n",
+ " args:\n",
+ " - --job_dir=gs://flan-t5-model-resources\n",
+ " imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf:20231030.11.21.47.379363\n",
+ " diskSpec:\n",
+ " bootDiskSizeGb: 100\n",
+ " bootDiskType: pd-ssd\n",
+ " machineSpec:\n",
+ " acceleratorCount: 1\n",
+ " acceleratorType: NVIDIA_TESLA_V100\n",
+ " machineType: n1-standard-4\n",
+ " replicaCount: '1'\n",
+ "name: projects/144763482491/locations/us-central1/customJobs/612998417047617536\n",
+ "startTime: '2023-10-30T11:29:12Z'\n",
+ "state: JOB_STATE_SUCCEEDED\n",
+ "updateTime: '2023-10-30T11:45:16.382233Z'\n",
+ "---\n",
+ "createTime: '2023-10-30T10:53:26.358002Z'\n",
+ "displayName: flan-t5-training-tf\n",
+ "endTime: '2023-10-30T11:12:59Z'\n",
+ "error:\n",
+ " code: 3\n",
+ " message: 'The replica workerpool0-0 exited with a non-zero status of 1. To find\n",
+ " out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=144763482491&resource=ml_job%2Fjob_id%2F6864276174814576640&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%226864276174814576640%22'\n",
+ "jobSpec:\n",
+ " workerPoolSpecs:\n",
+ " - containerSpec:\n",
+ " args:\n",
+ " - --job_dir=gs://flan-t5-model-resources\n",
+ " imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf:20231030.10.50.08.545796\n",
+ " diskSpec:\n",
+ " bootDiskSizeGb: 100\n",
+ " bootDiskType: pd-ssd\n",
+ " machineSpec:\n",
+ " acceleratorCount: 1\n",
+ " acceleratorType: NVIDIA_TESLA_V100\n",
+ " machineType: n1-standard-4\n",
+ " replicaCount: '1'\n",
+ "name: projects/144763482491/locations/us-central1/customJobs/6864276174814576640\n",
+ "startTime: '2023-10-30T10:57:55Z'\n",
+ "state: JOB_STATE_FAILED\n",
+ "updateTime: '2023-10-30T11:13:29.896168Z'\n",
+ "---\n",
+ "createTime: '2023-10-26T21:28:18.991136Z'\n",
+ "displayName: flan-t5-training\n",
+ "endTime: '2023-10-26T21:53:59Z'\n",
+ "jobSpec:\n",
+ " workerPoolSpecs:\n",
+ " - containerSpec:\n",
+ " args:\n",
+ " - --per_device_train_batch_size=2\n",
+ " - --per_device_eval_batch_size=4\n",
+ " - --model_name_or_path=google/flan-t5-small\n",
+ " - --train_file=gs://flan-t5-model-resources/datasets/train.csv\n",
+ " - --test_file=gs://flan-t5-model-resources/datasets/test.csv\n",
+ " - --text_column=article\n",
+ " - --summary_column=abstract\n",
+ " - --do_train=True\n",
+ " - --do_eval=False\n",
+ " - --do_predict=True\n",
+ " - --predict_with_generate=True\n",
+ " - --output_dir=gs://flan-t5-model-resources/model_output\n",
+ " - --num_train_epochs=3\n",
+ " - --learning_rate=5e-5\n",
+ " - --seed=7\n",
+ " - --fp16=True\n",
+ " imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training:20231026.21.27.25.218708\n",
+ " diskSpec:\n",
+ " bootDiskSizeGb: 100\n",
+ " bootDiskType: pd-ssd\n",
+ " machineSpec:\n",
+ " acceleratorCount: 1\n",
+ " acceleratorType: NVIDIA_TESLA_V100\n",
+ " machineType: n1-standard-4\n",
+ " replicaCount: '1'\n",
+ "name: projects/144763482491/locations/us-central1/customJobs/8666538460460351488\n",
+ "startTime: '2023-10-26T21:33:52Z'\n",
+ "state: JOB_STATE_SUCCEEDED\n",
+ "updateTime: '2023-10-26T21:54:18.730721Z'\n"
+ ]
+ }
+ ],
+ "source": [
+ "!gcloud ai custom-jobs list --project=$project --region=$location"
+ ]
+ },
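+  {
+   "cell_type": "markdown",
+   "id": "9d4e6f75-0b8c-4f1d-a365-9c4a5b6f7e88",
+   "metadata": {},
+   "source": [
+    "As a small sketch, you can pull the job ID out of the **name** field like this (the resource name below is an example copied from the listing above):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e5f7a86-1c9d-4a2e-b476-0d5b6c7a8f99",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal sketch: extract the numeric job ID from a full resource name.\n",
+    "name = 'projects/144763482491/locations/us-central1/customJobs/6207308081613766656'\n",
+    "custom_job_id = name.split('/')[-1]\n",
+    "print(custom_job_id)"
+   ]
+  },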
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "f6e10933-3d84-41fd-8785-fa801b97bfb0",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Long running operation: projects/144763482491/locations/us-central1/operations/3654348322228928512\n",
+ "delete_custom_job_response: \n"
+ ]
+ }
+ ],
+ "source": [
+ "from google.cloud import aiplatform\n",
+ "custom_job_id=''\n",
+ "\n",
+ "def delete_custom_job_sample(custom_job_id: str,\n",
+ " project: str = project,\n",
+ " location: str = location,\n",
+ " api_endpoint: str = f'{location}-aiplatform.googleapis.com',\n",
+ " timeout: int = 300,\n",
+ "):\n",
+ " # The AI Platform services require regional API endpoints.\n",
+ " client_options = {\"api_endpoint\": api_endpoint}\n",
+ " # Initialize client that will be used to create and send requests.\n",
+ " # This client only needs to be created once, and can be reused for multiple requests.\n",
+ " client = aiplatform.gapic.JobServiceClient(client_options=client_options)\n",
+ " name = client.custom_job_path(\n",
+ " project=project, location=location, custom_job=custom_job_id\n",
+ " )\n",
+ " response = client.delete_custom_job(name=name)\n",
+ " print(\"Long running operation:\", response.operation.name)\n",
+ " delete_custom_job_response = response.result(timeout=timeout)\n",
+ " print(\"delete_custom_job_response:\", delete_custom_job_response)\n",
+ " \n",
+ "delete_custom_job_sample(custom_job_id)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a02ba4b-1c8f-4fcc-93b9-3e2a6121b59b",
+ "metadata": {},
+ "source": [
+ "Now we will undeploy our model, delete endpoints, and delete finally our model!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dfc2276e-8ab2-4c80-9721-26153ea80d63",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_endpoint.undeploy_all()\n",
+ "model_endpoint.delete()\n",
+ "model.delete()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8fc98dd1-b4e5-4ab2-a6ce-e46b5a23ab5d",
+ "metadata": {},
+ "source": [
+ "Delete custom container stored in Custom Registry or Artifacr Registry. List the images to gather the tag id."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "id": "588af684-4c7e-43d5-a1f5-5510157aa40f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Listed 0 items.\n",
+ "DIGEST TAGS TIMESTAMP\n",
+ "1240e61185c9 20231103.17.39.12.779660 2023-11-03T17:42:05\n",
+ "ca99b71c4661 20231103.16.13.42.102563 2023-11-03T16:21:43\n"
+ ]
+ }
+ ],
+ "source": [
+ "#list the containers\n",
+ "!gcloud container images list-tags gcr.io/$project/cloudai-autogenerated/$display_name"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "id": "635cc519-230d-48a0-b9b7-c350c2d62ac4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Save the tag ID\n",
+ "tag_id=''"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "id": "946fa12b-ad77-4d19-a556-c926309a14c4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[1;33mWARNING:\u001b[0m Successfully resolved tag to sha256, but it is recommended to use sha256 directly.\n",
+ "Digests:\n",
+ "- gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3@sha256:ca99b71c466168f467152e04791710a9e269e767985b22a6cd1702e4fac2f691\n",
+ " Associated tags:\n",
+ " - 20231103.16.13.42.102563\n",
+ "Tags:\n",
+ "- gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.16.13.42.102563\n",
+ "Deleted [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.16.13.42.102563].\n",
+ "Deleted [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3@sha256:ca99b71c466168f467152e04791710a9e269e767985b22a6cd1702e4fac2f691].\n"
+ ]
+ }
+ ],
+ "source": [
+ "#delete \n",
+ "!gcloud container images delete gcr.io/$project/cloudai-autogenerated/$display_name:$tag_id --force-delete-tags --quiet"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd4a6cfd-9567-4da3-8d1b-6a4207442680",
+ "metadata": {},
+ "source": [
+ "And finally delete our bucket"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "85824daa-66a5-4303-8b17-2565863a2844",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!gcloud storage rm --recursive gs://$BUCKET/"
+ ]
+ }
+ ],
+ "metadata": {
+ "environment": {
+ "kernel": "python3",
+ "name": "tf2-gpu.2-12.m112",
+ "type": "gcloud",
+ "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-12:m112"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb b/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb
new file mode 100644
index 0000000..7a3e657
--- /dev/null
+++ b/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb
@@ -0,0 +1,1151 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "2edc6187-82ae-44e2-852f-2ad2712c93aa",
+ "metadata": {},
+ "source": [
+ "# Creating a PubMed Chatbot on GCP"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3ecea2ad-7c65-4367-87e1-b021167c3a1d",
+ "metadata": {},
+ "source": [
+ "For this tutorial we are creating a PubMed chatbot that will answer questions by gathering information from documents we have provided via an index. The model we will be using today is a pretrained 'text-bison@001' model from GCP.\n",
+ "\n",
+ "This tutorial will go over the following topics:\n",
+ "- Introduce langchain\n",
+ "- Explain the differences between zero-shot, one-shot, and few-shot prompting\n",
+ "- Practice using different document retrievers"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4d01e74b-b5b4-4be9-b16e-ec55419318ef",
+ "metadata": {},
+ "source": [
+ "### Optional: Deploy the Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9dbd13e7-afc9-416b-94dc-418a93e14587",
+ "metadata": {},
+ "source": [
+ "In this tutorial we will be using Google PaLM2 LLM **test-bison@001** which doesn't need to be deployed but if you would like to use another model you choose one from the **Model Garden** using the console which will allow you to add a model to your model registry, create an endpoint (or use an existing one), and deploy the model all in one step."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f3e3ab1-5f7e-4028-a66f-9619926a2afd",
+ "metadata": {},
+ "source": [
+ "## PubMed API vs RAG with Vertex AI Vector Search"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5a820eea-1538-4f40-86c4-eb14fe09e127",
+ "metadata": {},
+ "source": [
+ "Our chatbot will rely on documents to answer our questions to do so we are supplying it a **vector index**. A vector index or index is a data structure that enables fast and accurate search and retrieval of vector embeddings from a large dataset of objects. We will be working with two options for our index: PubMed API vs RAG Vertex AI Vector Search method."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7314b115-9433-460d-b275-78aa50f0a858",
+ "metadata": {},
+ "source": [
+ "**What is the difference?**\n",
+ "\n",
+ "The **PubMed API** is provided free by langchain to connect your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. The langchain package for PubMed is already a retriever meaning that just simply using this tool will our chatbot beable to retrieve documents to refer to. \n",
+ "\n",
+ "**Vertex AI Vector Search** (formally known as Matching Engine) is a vector store from GCP that allows the user more **security and control** on which documents you wish to supply to your model. Vector Search, formerly known as Vertex AI Matching Engine, is a vector store or database that stores the **embeddings** of your documents and the metadata. Because this is not a retriever we have to make it so for our model to send back an output that also tells us which documents it is referencing, this is where RAG comes in. **RAG** stands for **Retrieval-augmented generation** it is a method or technique that **indexes documents** by first loading them in, splitting them into chucks (making it easier for our model to search for relevant splits), embedding the splits, then storing them in a vector store. The next steps in RAG are based on the question you ask your chatbot. If we were to ask it \"What is a cell?\" the vector store will be searched by a retriever to find relevant splits that have to do with our question, thus **retrieving relevant documents**. And finally our chatbot will **generate an answer** that makes sense of what a cell is, as part of the answer it will also point out which source documents it used to create the answer.\n",
+ "\n",
+ "We will be exploring both methods!"
+ ]
+ },
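+  {
+   "cell_type": "markdown",
+   "id": "3c8f2a10-6d4e-4b7a-9f21-5e0c1d2b3a44",
+   "metadata": {},
+   "source": [
+    "As a quick taste of the first option, here is a minimal sketch (assuming the `PubMedRetriever` class in the version of LangChain installed in this environment) of retrieving PubMed documents for a query:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d9a3b21-7e5f-4c8b-a032-6f1d2e3c4b55",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal sketch: fetch PubMed documents relevant to a query through\n",
+    "# LangChain's PubMed retriever (assumed available in this environment).\n",
+    "from langchain.retrievers import PubMedRetriever\n",
+    "\n",
+    "retriever = PubMedRetriever(top_k_results=3)\n",
+    "docs = retriever.get_relevant_documents('What is a cell?')\n",
+    "for doc in docs:\n",
+    "    # metadata keys such as 'Title' depend on the retriever version\n",
+    "    print(doc.metadata.get('Title'))\n",
+    "    print(doc.page_content[:200], '\\n')"
+   ]
+  },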
+ {
+ "cell_type": "markdown",
+ "id": "bcf1690d-e93d-4cd3-89c6-8d06b5a071a8",
+ "metadata": {},
+ "source": [
+ "## Setting up Vertex AI Vector Search"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6330ddf-7972-4451-9fcb-98cf83f5d118",
+ "metadata": {},
+ "source": [
+ "If you choose to use the RAG method with Vertex AI RAG Vector Search to supply documents to your model follow the instructions below:\n",
+ "\n",
+ "Set your project id, location, and bucket variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fb694dc4-9e76-4091-9ddf-cd4eca816851",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "project_id=''\n",
+ "location=' (e.g.us-east4)'\n",
+ "bucket = ''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02053f4d-fad7-44ab-a7c3-cfa1c218240f",
+ "metadata": {},
+ "source": [
+ "### Gathering our Docs For our Vector Store"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1d1c9de7-4a06-4f85-b9ff-c8c9e51f8c70",
+ "metadata": {},
+ "source": [
+ "AWS marketplace has PubMed database named **PubMed Central® (PMC)** that contains free full-text archive of biomedical and life sciences journal article at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will be subsetting this database to add documents to our Vertex AI Vector Search Index. Ensure that you have the correct permissions to allow your environment to connect to buckets and Vertex AI.\n",
+ "\n",
+ "The first step will be to create a bucket that we will later use as our data source for our index."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99d49432-cf03-4f19-aa82-ef7f8bad5bde",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#make bucket\n",
+ "!gsutil mb -l {location} gs://{bucket}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6ad30ba-cee8-47f9-bc1e-ece8961ac66a",
+ "metadata": {},
+ "source": [
+ "We will then download the metadata file from the PMC index directory, this will list all of the articles within the PMC bucket and their paths. We will use this to subset the database into our own bucket. Here we are using curl to connect to the public AWS s3 bucket where the metadata and documents are originally stored."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b395e34-062d-4f77-afee-3601d471954a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#download the metadata file\n",
+ "!curl -O http://pmc-oa-opendata.s3.amazonaws.com/oa_comm/txt/metadata/csv/oa_comm.filelist.csv"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93a8595a-767f-4cad-9273-62d8e2cf60d1",
+ "metadata": {},
+ "source": [
+ "We only want the metadata of the first 100 files."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c26b0f29-2b07-43a6-800d-4aa5e957fe52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "#import the file as a dataframe\n",
+ "import pandas as pd\n",
+ "\n",
+ "df = pd.read_csv('oa_comm.filelist.csv')\n",
+ "#first 100 files\n",
+ "first_100=df[0:100]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "abd1ae93-450e-4c79-83cc-ea46a1b507c1",
+ "metadata": {},
+ "source": [
+ "Lets look at our metadata! We can see that the bucket path to the files are under the **Key** column this is what we will use to loop through the PMC bucket and copy the first 100 files to our bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff77b2aa-ed1b-4d27-8163-fdaa7a304582",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "first_100"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84e5f36a-239c-4c15-80ab-f896d45849d3",
+ "metadata": {},
+ "source": [
+ "The following commands will gather the location of each document with in AWS s3 bucket, output the text from the docs as bytes and save the bytes to our bucket in the form of a text file in a directory named \"docs\". This will all be done using curl."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7d63a7e2-dbf1-49ec-bc84-b8c2c8bde62d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "from io import BytesIO\n",
+ "#gather path to files in bucket\n",
+ "for i in first_100['Key']:\n",
+ " doc_name=i.split(r'/')[-1]\n",
+ " os.system(f'curl http://pmc-oa-opendata.s3.amazonaws.com/{i} | curl -T - -v -H \"Authorization: Bearer `gcloud auth print-access-token`\" \"https://storage.googleapis.com/{bucket}/docs/{doc_name} \"')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1b396c8-baa9-44d6-948c-2326dc514839",
+ "metadata": {},
+ "source": [
+ "### Creating an Index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bb6fa941-bf59-4cae-9aa8-2f2741f3a1b1",
+ "metadata": {},
+ "source": [
+ "To create our vector store index, we will first start by creating a dummy embeddings file. An index holds a set of records so our dummy data will be the first record and then later we will add our PubMed docs to the same index. Inorder for Vector Search to find our dummy embeddings file it too must be in our bucket and we will add it to the subdirectory 'init_index'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6cf5092c-23f3-4f28-9308-f34b8d90c62b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import uuid\n",
+ "import numpy as np\n",
+ "import json\n",
+ "init_embedding = {\"id\": str(uuid.uuid4()), \"embedding\": list(np.zeros(768))}\n",
+ "\n",
+ "# dump embedding to a local file\n",
+ "with open(\"embeddings_0.json\", \"w\") as f:\n",
+ " json.dump(init_embedding, f)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e8a4c42-dc17-48a3-a0bb-0cbea527ee7f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#move inital embeddings file to bucket\n",
+ "!gsutil cp embeddings_0.json gs://{bucket}/init_index/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f4d1a3cd-4f89-4271-b025-71af2bf25095",
+ "metadata": {},
+ "source": [
+ "Now we can make our index, this can take up to 30min to 1hr. \n",
+ "\n",
+ "Please note that the dimensions depend on what text embedding model you are using for this tutorial we are using **Vertex AI's embedding model** which uses 768 dimensions. If you chose to change your model choose a embedding model that is compatible with it or you can use Tensorflow's Universal Sentence Encoder. For more information see [here](https://python.langchain.com/docs/integrations/vectorstores/matchingengine#using-tensorflow-universal-sentence-encoder-as-an-embedder)."
+ ]
+ },
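+  {
+   "cell_type": "markdown",
+   "id": "7b2c4d53-8f6a-4d9c-b143-7a2e3f4d5c66",
+   "metadata": {},
+   "source": [
+    "To see where the 768 comes from, here is a small sketch (assuming LangChain's `VertexAIEmbeddings` wrapper around Vertex AI's text embedding model is available) that embeds a sentence and prints its dimensionality:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8c3d5e64-9a7b-4eac-c254-8b3f4a5e6d77",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal sketch: check the dimensionality of Vertex AI's embedding model\n",
+    "# (assumes LangChain's VertexAIEmbeddings wrapper is installed).\n",
+    "from langchain.embeddings import VertexAIEmbeddings\n",
+    "\n",
+    "embeddings = VertexAIEmbeddings()\n",
+    "vector = embeddings.embed_query('What is a cell?')\n",
+    "print(len(vector))  # expected: 768"
+   ]
+  },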
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "39aa7bba-3d15-4a3f-86c2-59d2c92a95ef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from google.cloud import aiplatform\n",
+ "# create Index\n",
+ "index = aiplatform.MatchingEngineIndex.create_tree_ah_index(\n",
+ " display_name = f\"pubmed_vector_index\",\n",
+ " contents_delta_uri = f\"gs://{bucket}/init_index\",\n",
+ " dimensions = 768,\n",
+ " approximate_neighbors_count = 150,\n",
+ " distance_measure_type=\"DOT_PRODUCT_DISTANCE\",\n",
+ " location=location\n",
+ " \n",
+ ")\n",
+ "\n",
+ "#save index id\n",
+ "index_id=index.name"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce65b0cc-cff3-47d6-af8c-7c39b2418ecb",
+ "metadata": {},
+ "source": [
+ "### Creating a Endpoint and Deploying our Index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9b3aa4a2-1145-475a-bd04-33bf69551751",
+ "metadata": {},
+ "source": [
+ "We will create a public endpoint for our vector store, you can also create a private one by setting up a VPC and specifying the VPC id for the params 'network'. Documentation for creating a VPC can be found [here](https://python.langchain.com/docs/integrations/vectorstores/matchingengine#imports-constants-and-configs)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "55596202-13b9-4e35-8099-0602a2b13e72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Create the endpoint\n",
+ "index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(\n",
+ " display_name = \"pubmed_vector_endpoint\",\n",
+ " public_endpoint_enabled = True,\n",
+ " location = location\n",
+ ")\n",
+ "\n",
+ "#save endpoint id\n",
+ "endpoint_id = endpoint.name"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "51412f2f-f32b-44a9-93bc-3e2f6185cada",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#deploy our index to our endpoint\n",
+ "deployed_index_id=\"deployed_pubmed_vector_index\"\n",
+ "index_endpoint = index_endpoint.deploy_index(\n",
+ " index=index, deployed_index_id=deployed_index_id\n",
+ ")\n",
+ "index_endpoint.deployed_indexes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "613cef7d-d0aa-42a8-a46e-7fd1f5c48c3b",
+ "metadata": {},
+ "source": [
+ "### Adding Metadata to Our Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2fa34e7b-99c7-4a2e-b73b-146636a98285",
+ "metadata": {},
+ "source": [
+ "After we have our documents stored in our bucket we can start to load our files back. This step is necessary though redundant because we will need to embed our docs for our vector store and we can attach metadata for each document. The first step of adding our metadata to the docs will be to remove the 'Key' column because this is no longer the location of our documents. Next, we'll convert the rest of the columns into a dictionary form."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b9016f15-db02-4073-b4c7-288d919bbb55",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "#Remove the Key column to be replaced later\n",
+ "first_100.pop('Key')\n",
+ "#convert the metadata to dict\n",
+ "first_100_dict = first_100.to_dict('records')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eb80dce6-dc5b-4a73-8591-572be35c092a",
+ "metadata": {},
+ "source": [
+ "Lets look at our metadata now!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "69ce004e-ab8d-4b9c-91d8-9320e1679fcd",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "first_100_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2a607a48-31b8-4081-a347-bb1528f8e725",
+ "metadata": {},
+ "source": [
+ "Now we can load in our documents, add in the location of our docs in our bucket and the document name to our metadata, and finally attach that metadata to our documents. At the end we should have 100 documents before splitting the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "47170e83-3e9e-48e6-ab0f-cabdd39507e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#add metadata\n",
+ "from langchain.document_loaders import GCSDirectoryLoader\n",
+ "print(f\"Processing documents from {bucket}\")\n",
+ "loader = GCSDirectoryLoader(\n",
+ " project_name=project_id, bucket=bucket, prefix='docs'\n",
+ ")\n",
+ "documents = loader.load()\n",
+ "\n",
+ "# loop through docs to add metadata to each one\n",
+ "for i in range(len(documents)):\n",
+ " doc_md = documents[i].metadata\n",
+ " document_name = doc_md[\"source\"].split(\"/\")[-1]\n",
+ " source = f\"{bucket}/docs/{document_name}\"\n",
+ " # Add document name and source to the metadata\n",
+ " documents[i].metadata = {\"source\": source, \"document_name\": document_name}\n",
+ " documents[i].metadata.update(first_100_dict[i])# attached other metadata to doc\n",
+ "print(f\"# of documents loaded (pre-chunking) = {len(documents)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2cb96ea-dd0c-47b7-9556-05e25c3efb1d",
+ "metadata": {},
+ "source": [
+ "Lets take a look at our metadata!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8673a445-7c2e-4650-91fa-4b0b38196e2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "documents[0].metadata"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "613b65c0-fa38-456f-acb3-d406803ef204",
+ "metadata": {},
+ "source": [
+ "### Splitting our Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6812ecaf-979f-4537-b420-071022a7b917",
+ "metadata": {},
+ "source": [
+ "Splitting our data into chucks will help our vector store parse through our data faster and efficiently. We'll then add the chuck number to our metadata."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6e6503cf-02e5-4352-a6b1-13ef4e01c019",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "# split the documents into chunks\n",
+ "text_splitter = RecursiveCharacterTextSplitter(\n",
+ " chunk_size=1000,\n",
+ " chunk_overlap=50,\n",
+ " separators=[\"\\n\\n\", \"\\n\", \".\", \"!\", \"?\", \",\", \" \", \"\"],\n",
+ ")\n",
+ "doc_splits = text_splitter.split_documents(documents)\n",
+ "\n",
+ "# Add chunk number to metadata\n",
+ "for idx, split in enumerate(doc_splits):\n",
+ " split.metadata[\"chunk\"] = idx\n",
+ "\n",
+ "print(f\"# of documents = {len(doc_splits)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e1fb202a-6122-4083-81e4-ddcb33499e64",
+ "metadata": {},
+ "source": [
+ "After splitting our data we now have 7620 documents. And looking at our metadata we can see that the chunk number is the last entry."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f1036b8e-6c7f-43be-83b7-5b9e61628003",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "doc_splits[0].metadata"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "055d5b85-950e-4a44-b3fa-a2dcec7df036",
+ "metadata": {},
+ "source": [
+ "### Embedding our Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a68b14e-0cf5-4973-90c1-0eee0c8bc8c9",
+ "metadata": {},
+ "source": [
+ "Now we can embed our text into **numerical vectors** that will help our model find similar objects like documents that hold similar texts or find similar photos based on the numbers assigned to the object. Depending on the model you choose you have to find an embedder that is compatible to our model. Since we are using a PaLM2 model (text-bison) we can use the embedding model from Vertex AI that defaults to using **'textembedding-gecko'**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "50a4a98c-a332-469f-9a24-ce5abff23b15",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.vectorstores import MatchingEngine\n",
+ "from langchain.embeddings import VertexAIEmbeddings\n",
+ "embeddings = VertexAIEmbeddings()\n",
+ "\n",
+ "# initialize vector store\n",
+ "vector_store = MatchingEngine.from_components(\n",
+ " project_id=project_id,\n",
+ " region=location,\n",
+ " gcs_bucket_name=bucket,\n",
+ " embedding=embeddings,\n",
+ " index_id=index_id,\n",
+ " endpoint_id=endpoint_id,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4e3bfb5b-a3a6-4156-bca3-394774a94565",
+ "metadata": {},
+ "source": [
+ "For our split documents to be read by our embedding model we need to make tuple called **Document** that contains **page content** and **metadata**. The code below loops through the split docs and assigns them to the label page_content and the same is done for all parts of our metadata under the label metadata."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9cda4699-5c46-49bb-97e3-059199254bba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Store docs as embeddings in Matching Engine index\n",
+ "# It may take a while since API is rate limited\n",
+ "texts = [doc.page_content for doc in doc_splits]\n",
+ "metadatas = [\n",
+ " [\n",
+ " {\"namespace\": \"source\", \"allow_list\": [doc.metadata[\"source\"]]},\n",
+ " {\"namespace\": \"document_name\", \"allow_list\": [doc.metadata[\"document_name\"]]},\n",
+ " {\"namespace\": \"ETag\", \"allow_list\": [doc.metadata[\"ETag\"]]},\n",
+ " {\"namespace\": \"Article Citation\", \"allow_list\": [doc.metadata[\"Article Citation\"]]},\n",
+ " {\"namespace\": \"AccessionID\", \"allow_list\": [doc.metadata[\"AccessionID\"]]},\n",
+ " {\"namespace\": \"Last Updated UTC (YYYY-MM-DD HH:MM:SS)\", \"allow_list\": [doc.metadata[\"Last Updated UTC (YYYY-MM-DD HH:MM:SS)\"]]},\n",
+ " {\"namespace\": \"PMID\", \"allow_list\": [str(doc.metadata[\"PMID\"])]},\n",
+ " {\"namespace\": \"License\", \"allow_list\": [doc.metadata[\"License\"]]},\n",
+ " {\"namespace\": \"Retracted\", \"allow_list\": [doc.metadata[\"Retracted\"]]},\n",
+ " {\"namespace\": \"chunk\", \"allow_list\": [str(doc.metadata[\"chunk\"])]}\n",
+ " \n",
+ " ]\n",
+ " for doc in doc_splits\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "216f6ef6-b488-45d3-ac4c-2aca0d6eab56",
+ "metadata": {},
+ "source": [
+ "lets look at our Document tuple!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c1af269-fdbb-4db5-9c1b-41e21d304b9d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "doc_splits[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2aeb0bba-8ccd-4828-a7a2-6f34b03a03b9",
+ "metadata": {},
+ "source": [
+ "Now we can add our split documents and their metadata to our vector store. This is the longest step of the tutorial and can take up 1hr to complete. As you wait you can read up on Creating a Inference Script section of this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b3c2f21b-06ab-470e-8807-638548d50f77",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "doc_ids = vector_store.add_texts(texts=texts, metadatas=metadatas)"
+ ]
+ },
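  {
   "cell_type": "markdown",
   "id": "added-batched-add-texts-sketch",
   "metadata": {},
   "source": [
    "Because the API is rate limited, a hedged alternative to the single call above is to add the texts in smaller batches. The batch size here is an assumption you can tune to your quota:\n",
    "\n",
    "```python\n",
    "batch_size = 100  # assumption: tune to your quota\n",
    "doc_ids = []\n",
    "for i in range(0, len(texts), batch_size):\n",
    "    doc_ids += vector_store.add_texts(\n",
    "        texts=texts[i : i + batch_size],\n",
    "        metadatas=metadatas[i : i + batch_size],\n",
    "    )\n",
    "```"
   ]
  },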
+ {
+ "cell_type": "markdown",
+ "id": "03b90f92-f223-42e0-9e5b-accd3fdfbeea",
+ "metadata": {},
+ "source": [
+ "Test whether search from vector store is working"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b6cd9aab-7f08-4a69-b7e4-9cd1d8f9110f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results=vector_store.similarity_search_with_score(\"brain\")"
+ ]
+ },
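  {
   "cell_type": "markdown",
   "id": "added-inspect-search-results",
   "metadata": {},
   "source": [
    "`similarity_search_with_score` returns a list of (Document, score) pairs, so you can inspect what came back with a short loop like this sketch:\n",
    "\n",
    "```python\n",
    "for doc, score in results:\n",
    "    # each result pairs a matched chunk with its similarity score\n",
    "    print(round(score, 3), doc.metadata.get(\"document_name\"))\n",
    "```"
   ]
  },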
+ {
+ "cell_type": "markdown",
+ "id": "07b3bc6b-8c43-476f-a662-abda830dc2da",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Creating a Inference Script "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3ba2291e-109e-4120-ad10-5dbfd341a07b",
+ "metadata": {},
+ "source": [
+ "Inorder for us to fluidly send input and receive outputs from our chatbot we need to create a **inference script** that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. We will also be supplying instructions to the chatbot through the script.\n",
+ "\n",
+ "Our script will utilize **langchain** tools and packages to enable our model to:\n",
+ "- **Connect to sources of context** (e.g. providing our model with tasks and examples)\n",
+ "- **Rely on reason** (e.g. instruct our model how to answer based on provided context)\n",
+ "\n",
+ "**Warning**: The following tools must be installed via your terminal `pip install \"langchain\" \"xmltodict\"` and the over all inference script must be run on the terminal via the command `python YOUR_SCRIPT.py`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ad374085-c4b1-4083-85a5-90cba35846d6",
+ "metadata": {},
+ "source": [
+ "The first part of our script will be to list all the tools that are required. \n",
+ "- **PubMedRetriever:** Utilizes the langchain retriever tool to specifically retrieve PubMed documents from the PubMed API.\n",
+ "- **MatchingEngine:** Connects to Vertex AI Vector Search to be used as a langchain retriever tool to specifically retrieve embedded documents stored in your bucket. \n",
+ "- **ConversationalRetrievalChain:** Allows the user to construct a conversation with the model and retrieves the outputs while sending inputs to the model.\n",
+ "- **PromptTemplate:** Allows the user to prompt the model to provide instructions, best method for zero and few shot prompting\n",
+ "- **VertexAIEmbeddings:** Text embedding model used before to convert text to numerical vectors.\n",
+ "- **VertexAI**: Package used to import Google PaLM2 LLMs models (e.g. text-bison@001, code-bison). \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f0ad48d-c6c8-421a-a48b-88e979d15b57",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "```python\n",
+ "from langchain.retrievers import PubMedRetriever\n",
+ "from langchain.vectorstores import MatchingEngine\n",
+ "#from langchain.llms import VertexAIModelGarden #uncomment if utilizing models from Model Garden\n",
+ "from langchain.chains import ConversationalRetrievalChain\n",
+ "from langchain.prompts import PromptTemplate\n",
+ "from langchain.embeddings import VertexAIEmbeddings\n",
+ "from langchain.llms import VertexAI\n",
+ "import sys\n",
+ "import json\n",
+ "import os\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "900f4c31-71cd-4f39-8bfc-de098bdbaafc",
+ "metadata": {},
+ "source": [
+ "Second will build a class that will hold the functions we need to send inputs and retrieve outputs from our model. For the beginning of our class we will establish some colors to our text conversation with our chatbot which we will utilize later."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "decbb901-f811-4b8e-a956-4c8c7f914ae2",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "```python\n",
+ "class bcolors:\n",
+ " HEADER = '\\033[95m'\n",
+ " OKBLUE = '\\033[94m'\n",
+ " OKCYAN = '\\033[96m'\n",
+ " OKGREEN = '\\033[92m'\n",
+ " WARNING = '\\033[93m'\n",
+ " FAIL = '\\033[91m'\n",
+ " ENDC = '\\033[0m'\n",
+ " BOLD = '\\033[1m'\n",
+ " UNDERLINE = '\\033[4m'\n",
+ "```"
+ ]
+ },
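  {
   "cell_type": "markdown",
   "id": "added-bcolors-usage-note",
   "metadata": {},
   "source": [
    "These are standard ANSI escape codes; wrapping a string between a color code and `ENDC` colors it in the terminal. A quick illustration (not part of the script):\n",
    "\n",
    "```python\n",
    "print(bcolors.OKGREEN + \"This prints in green\" + bcolors.ENDC)\n",
    "```"
   ]
  },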
+ {
+ "cell_type": "markdown",
+ "id": "ba36d057-5189-4075-a243-18996c6fc932",
+ "metadata": {},
+ "source": [
+ "If you are using Vector Search instead of the PubMed API we need to create a function that will gather the necessary information to connect to our model, which will be the:\n",
+ "- Project ID\n",
+ "- Location of bucket and vector store (they should be in the same location)\n",
+ "- Bucket name\n",
+ "- Vector Store Index ID\n",
+ "- Vector Store Endpoint ID"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f7a244a-7e71-40d3-ae78-8e166dd3c7ee",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ "def build_chain():\n",
+ " PROJECT_ID = os.environ[\"PROJECT_ID\"]\n",
+ " LOCATION_ID = os.environ[\"LOCATION_ID\"]\n",
+ " #ENDPOINT_ID = os.environ[\"ENDPOINT_ID\"] #uncomment if utilizing model from Model Garden\n",
+ " BUCKET = os.environ[\"BUCKET\"]\n",
+ " VC_INDEX_ID = os.environ[\"VC_INDEX_ID\"]\n",
+ " VC_ENDPOINT_ID = os.environ[\"VC_ENDPOINT_ID\"]\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dab1012f-ed20-47b9-9162-924e03e836d5",
+ "metadata": {},
+ "source": [
+ "Now we can define our Google PaLM2 model being `text-bison@001` and other parameters:\n",
+ "\n",
+ "- Max Output Tokens: Limit of tokens outputted by the model.\n",
+ "- Temperature: Controls randomness, higher values increase diversity meaning a more unique response make the model to think harder. Must be a number from 0 to 1, 0 being less unique.\n",
+ "- Top_p (nucleus): The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Must be a number from 0 to 1.\n",
+ "- Top_k: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens. This means the model choses the most probable words. Lower values eliminate fewer coherent words.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8cadb1af-2c46-4ab1-92f9-6e0861f83324",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "```python\n",
+ "llm = VertexAI(\n",
+ " model_name=\"text-bison@001\",\n",
+ " max_output_tokens=1024,\n",
+ " temperature=0.2,\n",
+ " top_p=0.8,\n",
+ " top_k=40,\n",
+ " verbose=True,\n",
+ " \n",
+ " \n",
+ "#if using a model from the Model Garden uncomment\n",
+ "#llm = VertexAIModelGarden(project=PROJECT_ID, endpoint_id=ENDPOINT_ID, location=LOCATION_ID)\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c44b4f91-0c64-459b-a6e9-8a955c0797c7",
+ "metadata": {},
+ "source": [
+ "We specify what our retriever both the PubMed and Vector Search retriever are listed, please only add one per script.\n",
+ "\n",
+ "If using Vector Search we need to initialize our vector store as we did before when we added our split documents and metadata to it. Then we set the vector store as a **retriever** with the search type being **'similarity'** meaning it will find texts that are similar to each other depending on the question you ask the model. We also set **'k'** to 3 meaning that our retriever will retrieve 3 documents that are similar."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "21c61724-23d3-4b49-8c72-cbd208bdb5df",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "```python\n",
+ "retriever= PubMedRetriever()\n",
+ "\n",
+ "#only if using Vector Search as a retriever\n",
+ "\n",
+ "embeddings = VertexAIEmbeddings() #Make sure embedding model is compatible with model\n",
+ "\n",
+ " vector_store = MatchingEngine.from_components(\n",
+ " project_id=PROJECT_ID,\n",
+ " region=LOCATION_ID,\n",
+ " gcs_bucket_name=BUCKET,\n",
+ " embedding=embeddings,\n",
+ " index_id=VC_INDEX_ID,\n",
+ " endpoint_id=VC_ENDPOINT_ID\n",
+ " )\n",
+ "retriever = vector_store.as_retriever(\n",
+ " search_type=\"similarity\",\n",
+ " search_kwargs={\"k\":3}\n",
+ " )\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec8e464a-0931-444a-aa58-09ee0c4c9884",
+ "metadata": {},
+ "source": [
+ "Here we are constructing our **prompt_template**, this is where we can try zero-shot or few-shot prompting. Only add one method per script."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4431051e-0e84-408e-9821-f50a9b88c9c1",
+ "metadata": {},
+ "source": [
+ "#### Zero-shot prompting\n",
+ "\n",
+ "Zero-shot prompting does not require any additional training more so it gives a pre-trained language model a task or query to generate text (our output). The model relies on its general language understanding and the patterns it has learned during its training to produce relevant output. In our script we have connect our model to a **retriever** to make sure it gathers information from that retriever (this can be the PubMed API or Vector Search). \n",
+ "\n",
+ "See below that the task is more like instructions notifying our model they will be asked questions which it will answer based on the info of the scientific documents provided from the index provided (this can be the PubMed API or Vector Search index). All of this information is established as a **prompt template** for our model to receive."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c0316dc5-6274-4a5e-92e4-3d266ed6a4df",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "```python\n",
+ "prompt_template = \"\"\"\n",
+ " Ignore everything before.\n",
+ " \n",
+ " Instruction:\n",
+ " Instructions:\n",
+ " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. \n",
+ " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n",
+ " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n",
+ " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.\n",
+ " \n",
+ " {question} Answer \"don't know\" if not present in the document. \n",
+ " {context}\n",
+ " Solution:\"\"\"\n",
+ " PROMPT = PromptTemplate(\n",
+ " template=prompt_template, input_variables=[\"context\", \"question\"],\n",
+ " )\n",
+ "```"
+ ]
+ },
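  {
   "cell_type": "markdown",
   "id": "added-prompt-format-sketch",
   "metadata": {},
   "source": [
    "To see how the chain will fill in the template, you can format it by hand. The context and question strings here are hypothetical placeholders:\n",
    "\n",
    "```python\n",
    "print(PROMPT.format(\n",
    "    context=\"<retrieved document text>\",\n",
    "    question=\"What genes are associated with glioblastoma?\",\n",
    "))\n",
    "```"
   ]
  },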
+ {
+ "cell_type": "markdown",
+ "id": "edbe7032-8507-4d07-baab-1b3bf0e92074",
+ "metadata": {},
+ "source": [
+ "#### One-shot and Few-shot Prompting"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5614ea04-e1f8-4941-ae16-4359f718f98f",
+ "metadata": {},
+ "source": [
+ "One and few shot prompting are similar to one-shot prompting, in addition to giving our model a task just like before we have also supplied an example of how the our model structure our output.\n",
+ "\n",
+ "See below that we have implemented one-shot prompting to our script. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5ffb9669-5b77-4d9b-9f4e-a0d3a18b0fae",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ "prompt_template = \"\"\"\n",
+ " Instructions:\n",
+ " I will provide you with research papers on a specific topic in English, and you will create a cumulative summary. \n",
+ " The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic. \n",
+ " You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers. \n",
+ " Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end. \n",
+ " Examples:\n",
+ " Question: What is a cell?\n",
+ " Answer: '''\n",
+ " Cell, in biology, the basic membrane-bound unit that contains the fundamental molecules of life and of which all living things are composed. \n",
+ " Sources: \n",
+ " Chow, Christopher , Laskey, Ronald A. , Cooper, John A. , Alberts, Bruce M. , Staehelin, L. Andrew , \n",
+ " Stein, Wilfred D. , Bernfield, Merton R. , Lodish, Harvey F. , Cuffe, Michael and Slack, Jonathan M.W.. \n",
+ " \"cell\". Encyclopedia Britannica, 26 Sep. 2023, https://www.britannica.com/science/cell-biology. Accessed 9 November 2023.\n",
+ " '''\n",
+ " \n",
+ " {question} Answer \"don't know\" if not present in the document. \n",
+ " {context}\n",
+ " \n",
+ "\n",
+ " \n",
+ " Solution:\"\"\"\n",
+ " PROMPT = PromptTemplate(\n",
+ " template=prompt_template, input_variables=[\"context\", \"question\"],\n",
+ " )\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "82c66d53-97b2-46dc-a466-70a3d3bee4a7",
+ "metadata": {},
+ "source": [
+ "The following set of commands control the chat history essentially telling the model to expect another question after it finishes answering the previous one. Follow up questions can contain references to past chat history so the **ConversationalRetrievalChain** combines the chat history and the followup question into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.\n",
+ "\n",
+ "All of these pieces such as our conversational chain, prompt, and chat history are passed through a function called **run_chain** so that our model can return is response. We have also set the length of our chat history to one meaning that our model can only refer to the pervious conversation as a reference."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fda4d33b-60f2-4462-a8e6-bbce7f8a7b07",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ "condense_qa_template = \"\"\"\n",
+ " Chat History:\n",
+ " {chat_history}\n",
+ " Here is a new question for you: {question}\n",
+ " Standalone question:\"\"\"\n",
+ " standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)\n",
+ " \n",
+ " qa = ConversationalRetrievalChain.from_llm(\n",
+ " llm=llm, \n",
+ " retriever=retriever, \n",
+ " condense_question_prompt=standalone_question_prompt, \n",
+ " return_source_documents=True, \n",
+ " combine_docs_chain_kwargs={\"prompt\":PROMPT},\n",
+ " )\n",
+ " return qa\n",
+ "\n",
+ "def run_chain(chain, prompt: str, history=[]):\n",
+ " print(prompt)\n",
+ " return chain({\"question\": prompt, \"chat_history\": history})\n",
+ "\n",
+ "MAX_HISTORY_LENGTH = 1 #increase to refer to more pervious chats\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8f1ef8d-66fe-4f84-933b-af2d730bd114",
+ "metadata": {},
+ "source": [
+ "The final part of our script utilizes our class and incorporates colors to add a bit of flare to our conversation with our model. The model when first initialized should greet the user asking **\"Hello! How can I help you?\"** then instructs the user to ask a question or exit the session **\"Ask a question, start a New search: or CTRL-D to exit.\"**. With every question submitted to the model it is labeled as a **new search** we then run the run_chain function to get the models response or answer and add the response to the **chat history**. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1aa6ef65-ced4-445e-875c-7fee3483b81d",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ "if __name__ == \"__main__\":\n",
+ " chat_history = []\n",
+ " qa = build_chain()\n",
+ " print(bcolors.OKBLUE + \"Hello! How can I help you?\" + bcolors.ENDC)\n",
+ " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n",
+ " print(\">\", end=\" \", flush=True)\n",
+ " for query in sys.stdin:\n",
+ " if (query.strip().lower().startswith(\"new search:\")):\n",
+ " query = query.strip().lower().replace(\"new search:\",\"\")\n",
+ " chat_history = []\n",
+ " elif (len(chat_history) == MAX_HISTORY_LENGTH):\n",
+ " chat_history.pop(0)\n",
+ " result = run_chain(qa, query, chat_history)\n",
+ " chat_history.append((query, result[\"answer\"]))\n",
+ " print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC) \n",
+ " if 'source_documents' in result: \n",
+ " print(bcolors.OKGREEN + 'Sources:')\n",
+ " for idx, ref in enumerate(result[\"source_documents\"]):\n",
+ " print(ref.page_content) #Use this for Vector store\n",
+ " #print(\"PubMed UID: \"+ref.metadata[\"uid\"])#Use this for PubMed retriever\n",
+ " print(bcolors.ENDC)\n",
+ " print(bcolors.OKCYAN + \"Ask a question, start a New search: or CTRL-D to exit.\" + bcolors.ENDC)\n",
+ " print(\">\", end=\" \", flush=True)\n",
+ " print(bcolors.OKBLUE + \"Bye\" + bcolors.ENDC)\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1abcbd48-bb84-4310-b8eb-ad87850a8649",
+ "metadata": {},
+ "source": [
+ "Running our script in the terminal will require us to export the following global variables before using the command `python NAME_OF_SCRIPT.py`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ba97df23-6893-438d-8a67-cb7dbf83e407",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "#retreive our index and endpoint id\n",
+ "print(index_id)\n",
+ "print(endpoint_id)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7eab00a3-54ff-4873-8d25-eaf8bd18a2e6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#enter the global variables in your terminal\n",
+ "export PROJECT_ID='' \\\n",
+ "export LOCATION_ID='' \\\n",
+ "#export ENDPOINT_ID='' \\ #Uncomment if using model from Model Garden\n",
+ "export BUCKET='' \\\n",
+ "export VC_INDEX_ID='' \\\n",
+ "export VC_ENDPOINT_ID='VECTOR_SEARCH_ENDPOINT_ID>'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbe127e6-c0b1-4e07-ad56-38c30a9bf858",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "You should see similar results on the terminal. In this example we ask the chatbot to explain brain cancer!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "80c8fb4b-e74f-4e8d-892b-0f913eff747d",
+ "metadata": {},
+ "source": [
+ "![PubMed Chatbot Results](../../../images/GCP_chatbot_results.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a178c1c6-368a-48c5-8beb-278443b685a2",
+ "metadata": {},
+ "source": [
+ "### Clean Up"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ec06a34-dc47-453f-b519-424804fa2748",
+ "metadata": {},
+ "source": [
+ "**Warning:** Dont forget to delete the resources we just made to avoid accruing additional costs!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c307bb17-757a-4579-a0d8-698eb1bb3f2e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Undeploy index\n",
+ "!gcloud ai index-endpoints undeploy-index {endpoint_id} \\\n",
+ " --deployed-index-id={deployed_index_id} \\\n",
+ " --project={project_id} \\\n",
+ " --region={location}\n",
+ "\n",
+ "\n",
+ "#Delete index and endpoint\n",
+ "!gcloud ai indexes delete {index_id} \\\n",
+ " --project={project_id} \\\n",
+ " --region={location} --quiet\n",
+ "\n",
+ "!gcloud ai index-endpoints delete {endpoint_id} \\\n",
+ " --project={project_id} \\\n",
+ " --region={location} --quiet"
+ ]
+ },
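  {
   "cell_type": "markdown",
   "id": "added-sdk-cleanup-sketch",
   "metadata": {},
   "source": [
    "If you prefer the Python SDK over gcloud, an equivalent cleanup sketch (assuming the `index_id`, `endpoint_id`, and `deployed_index_id` values from earlier cells) is:\n",
    "\n",
    "```python\n",
    "from google.cloud import aiplatform\n",
    "\n",
    "endpoint = aiplatform.MatchingEngineIndexEndpoint(\n",
    "    index_endpoint_name=endpoint_id, location=location\n",
    ")\n",
    "endpoint.undeploy_index(deployed_index_id=deployed_index_id)\n",
    "endpoint.delete()\n",
    "\n",
    "aiplatform.MatchingEngineIndex(index_name=index_id, location=location).delete()\n",
    "```"
   ]
  },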
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "280cea0a-a8fc-494e-8ce4-afb65847a222",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Delete bucket\n",
+ "!gcloud storage rm --recursive gs://{bucket}/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6928d95d-d7ec-43f6-9135-79fcfc9520d9",
+ "metadata": {},
+ "source": [
+ "If you have imported a model and deployed it don't forget to delete the model from the Model Registry and delete the endpoint."
+ ]
+ }
+ ],
+ "metadata": {
+ "environment": {
+ "kernel": "python3",
+ "name": "common-cpu.m113",
+ "type": "gcloud",
+ "uri": "gcr.io/deeplearning-platform-release/base-cpu:m113"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py b/tutorials/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py
new file mode 100644
index 0000000..1756c90
--- /dev/null
+++ b/tutorials/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py
@@ -0,0 +1,101 @@
+from langchain.retrievers import PubMedRetriever
+from langchain.chains import ConversationalRetrievalChain
+from langchain.prompts import PromptTemplate
+#from langchain.llms import VertexAIModelGarden
+from langchain.llms import VertexAI
+import sys
+import json
+import os
+
+
+class bcolors:
+ HEADER = '\033[95m'
+ OKBLUE = '\033[94m'
+ OKCYAN = '\033[96m'
+ OKGREEN = '\033[92m'
+ WARNING = '\033[93m'
+ FAIL = '\033[91m'
+ ENDC = '\033[0m'
+ BOLD = '\033[1m'
+ UNDERLINE = '\033[4m'
+
+MAX_HISTORY_LENGTH = 1
+
+def build_chain():
+    #if using a model from Model Garden, uncomment the following
+ #PROJECT_ID = os.environ["PROJECT_ID"]
+ #LOCATION_ID = os.environ["LOCATION_ID"]
+ #ENDPOINT_ID = os.environ["ENDPOINT_ID"]
+
+ #llm = VertexAIModelGarden(project=PROJECT_ID, endpoint_id=ENDPOINT_ID, location=LOCATION_ID)
+
+ llm = VertexAI(
+ model_name="text-bison@001",
+ max_output_tokens=1024,
+ temperature=0.2,
+ top_p=0.8,
+ top_k=40,
+ verbose=True,
+    )
+
+    retriever = PubMedRetriever()
+
+ prompt_template = """
+ Ignore everything before.
+ Instructions:
+ I will provide you with research papers on a specific topic in English, and you will create a cumulative summary.
+ The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic.
+ You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers.
+ Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.
+ {question} Answer "don't know" if not present in the document.
+ {context}
+ Solution:"""
+
+
+ PROMPT = PromptTemplate(
+ template=prompt_template, input_variables=["context", "question"],
+ )
+
+ condense_qa_template = """
+ Chat History:
+ {chat_history}
+ Here is a new question for you: {question}
+ Standalone question:"""
+ standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)
+
+ qa = ConversationalRetrievalChain.from_llm(
+ llm=llm,
+ retriever=retriever,
+ condense_question_prompt=standalone_question_prompt,
+ return_source_documents=True,
+ combine_docs_chain_kwargs={"prompt":PROMPT},
+ )
+ return qa
+
+def run_chain(chain, prompt: str, history=[]):
+ print(prompt)
+ return chain({"question": prompt, "chat_history": history})
+
+if __name__ == "__main__":
+ chat_history = []
+ qa = build_chain()
+ print(bcolors.OKBLUE + "Hello! How can I help you?" + bcolors.ENDC)
+ print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
+ print(">", end=" ", flush=True)
+ for query in sys.stdin:
+ if (query.strip().lower().startswith("new search:")):
+ query = query.strip().lower().replace("new search:","")
+ chat_history = []
+ elif (len(chat_history) == MAX_HISTORY_LENGTH):
+ chat_history.pop(0)
+ result = run_chain(qa, query, chat_history)
+ chat_history.append((query, result["answer"]))
+ print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC)
+ if 'source_documents' in result:
+ print(bcolors.OKGREEN + 'Sources:')
+ for idx, ref in enumerate(result["source_documents"]):
+ print("PubMed UID: "+ref.metadata["uid"])
+ print(bcolors.ENDC)
+ print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
+ print(">", end=" ", flush=True)
+ print(bcolors.OKBLUE + "Bye" + bcolors.ENDC)
diff --git a/tutorials/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py b/tutorials/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py
new file mode 100644
index 0000000..bcc8acb
--- /dev/null
+++ b/tutorials/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py
@@ -0,0 +1,120 @@
+from langchain.chains import ConversationalRetrievalChain
+from langchain.prompts import PromptTemplate
+#from langchain.llms import VertexAIModelGarden
+from langchain.embeddings import VertexAIEmbeddings
+from langchain.vectorstores import MatchingEngine
+from langchain.llms import VertexAI
+import sys
+import json
+import os
+
+
+class bcolors:
+ HEADER = '\033[95m'
+ OKBLUE = '\033[94m'
+ OKCYAN = '\033[96m'
+ OKGREEN = '\033[92m'
+ WARNING = '\033[93m'
+ FAIL = '\033[91m'
+ ENDC = '\033[0m'
+ BOLD = '\033[1m'
+ UNDERLINE = '\033[4m'
+
+MAX_HISTORY_LENGTH = 1
+
+def build_chain():
+    #if using a model from Model Garden, uncomment the following
+
+ PROJECT_ID = os.environ["PROJECT_ID"]
+ LOCATION_ID = os.environ["LOCATION_ID"]
+ #ENDPOINT_ID = os.environ["ENDPOINT_ID"]
+ BUCKET = os.environ["BUCKET"]
+ VC_INDEX_ID = os.environ["VC_INDEX_ID"]
+ VC_ENDPOINT_ID = os.environ["VC_ENDPOINT_ID"]
+
+
+ #llm = VertexAIModelGarden(project=PROJECT_ID, endpoint_id=ENDPOINT_ID, location=LOCATION_ID)
+ llm = VertexAI(
+ model_name="text-bison@001",
+ max_output_tokens=1024,
+ temperature=0.2,
+ top_p=0.8,
+ top_k=40,
+ verbose=True,
+    )
+ embeddings = VertexAIEmbeddings()
+
+ vector_store = MatchingEngine.from_components(
+ project_id=PROJECT_ID,
+ region=LOCATION_ID,
+ gcs_bucket_name=BUCKET,
+ embedding=embeddings,
+ index_id=VC_INDEX_ID,
+ endpoint_id=VC_ENDPOINT_ID
+ )
+
+ retriever = vector_store.as_retriever(
+ search_type="similarity",
+ search_kwargs={"k":3}
+ )
+
+ prompt_template = """
+ Ignore everything before.
+ Instructions:
+ I will provide you with research papers on a specific topic in English, and you will create a cumulative summary.
+ The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic.
+ You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers.
+ Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.
+ {question} Answer "don't know" if not present in the document.
+ {context}
+ Solution:"""
+
+
+ PROMPT = PromptTemplate(
+ template=prompt_template, input_variables=["context", "question"],
+ )
+
+ condense_qa_template = """
+ Chat History:
+ {chat_history}
+ Here is a new question for you: {question}
+ Standalone question:"""
+ standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)
+
+ qa = ConversationalRetrievalChain.from_llm(
+ llm=llm,
+ retriever=retriever,
+ condense_question_prompt=standalone_question_prompt,
+ return_source_documents=True,
+ combine_docs_chain_kwargs={"prompt":PROMPT},
+ )
+ return qa
+
+def run_chain(chain, prompt: str, history=[]):
+ print(prompt)
+ return chain({"question": prompt, "chat_history": history})
+
+if __name__ == "__main__":
+ chat_history = []
+ qa = build_chain()
+ print(bcolors.OKBLUE + "Hello! How can I help you?" + bcolors.ENDC)
+ print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
+ print(">", end=" ", flush=True)
+ for query in sys.stdin:
+ if (query.strip().lower().startswith("new search:")):
+ query = query.strip().lower().replace("new search:","")
+ chat_history = []
+ elif (len(chat_history) == MAX_HISTORY_LENGTH):
+ chat_history.pop(0)
+ result = run_chain(qa, query, chat_history)
+ chat_history.append((query, result["answer"]))
+ print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC)
+ if 'source_documents' in result:
+ print(bcolors.OKGREEN + 'Sources:')
+ for idx, ref in enumerate(result["source_documents"]):
+ print(ref.page_content)
+ print(bcolors.ENDC)
+ print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
+ print(">", end=" ", flush=True)
+ print(bcolors.OKBLUE + "Bye" + bcolors.ENDC)