Commit

update changes 10/23/2024

cetrarom2 committed Oct 23, 2024
2 parents 8e1a044 + 51dec5d commit 3c481a2

Showing 4 changed files with 128 additions and 1,442 deletions.
@@ -33,7 +33,7 @@
"source": [
"Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n",
"\n",
"In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools."
"In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools. The agent connects to BigQuery, runs the query, and then sends the results to the model. The model processes the data and generates the final output that it is displayed for the user.\n"
]
},
{
@@ -53,7 +53,7 @@
"\n",
"In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n",
"\n",
"Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role.\n",
"Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role. Additionally, to create datasets and tables in BigQuery, the user must have the __bigquery.datasets.create__ and __bigquery.tables.create__ permissions.\n",
"\n",
"We are using a e2-standard-4 (Efficient Instance: 4 vCPUs, 16 GB RAM) for this tutorial. "
]
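},
{
"cell_type": "markdown",
"id": "3a7c9b21-5f4e-4d2a-9c1b-7e8f6a0d2b4c",
"metadata": {},
"source": [
"One way to grant these permissions is through a predefined role such as __roles/bigquery.dataEditor__, which includes both. Below is a minimal sketch; the project ID and user email are placeholders, not values from this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b2d4f6a-1c3e-4a5b-8d7c-9e0f1a2b3c4d",
"metadata": {},
"outputs": [],
"source": [
"# Grant a role that carries the dataset- and table-creation permissions (placeholders below).\n",
"!gcloud projects add-iam-policy-binding <Project ID> --member=\"user:<User Email>\" --role=\"roles/bigquery.dataEditor\""
]
},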
@@ -155,7 +155,7 @@
"id": "df0d9eda-ee54-4c4c-86e6-9166b2b82e2d",
"metadata": {},
"source": [
"Create a bucket where the data used for tutorial is going to be placed."
"Create a bucket where the data used for tutorial is going to be placed. It is important to keep in mind that the name of the bucket has to be unique."
]
},
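{
"cell_type": "markdown",
"id": "5e6f7a8b-9c0d-4e1f-a2b3-c4d5e6f7a8b9",
"metadata": {},
"source": [
"A minimal sketch of the bucket-creation step with the Cloud Storage client; the bucket name and location below are placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f2e3d4c-5b6a-4798-8f9e-0a1b2c3d4e5f",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"\n",
"storage_client = storage.Client()\n",
"# Bucket names share a global namespace, so this must not collide with an existing bucket.\n",
"new_bucket = storage_client.create_bucket(\"<Bucket Name>\", location=\"us-central1\")\n",
"print(f\"Created bucket {new_bucket.name}\")"
]
},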
{
@@ -174,7 +174,9 @@
"id": "5781a07c-84ed-416c-a782-b1dc881e5c18",
"metadata": {},
"source": [
"Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. It is important to keep in mind that the name of the bucket has to be unique."
"Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. \n",
"\n",
"We are using the [health screening](https://www.kaggle.com/datasets/drateendrajha/health-screening-data) dataset from kaggle for this tutorial."
]
},
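{
"cell_type": "markdown",
"id": "7c8d9e0f-1a2b-4c3d-9e5f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"A minimal sketch of the upload step with the Cloud Storage client; the bucket name and file paths below are placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e-6f7a-4b8c-9d0e-1f2a3b4c5d6e",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.bucket(\"<Bucket Name>\")\n",
"# Upload the local CSV into the bucket under the chosen object name.\n",
"blob = bucket.blob(\"<File Name>.csv\")\n",
"blob.upload_from_filename(\"<Local Path>/<File Name>.csv\")"
]
},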
{
@@ -243,7 +245,7 @@
"client = bigquery.Client()\n",
"\n",
"table_name = \"<Table Name>\"\n",
"table_id = f'{project_id}.{dataset_id}.{table_name}'\n",
"table_id = f'{project_id}.{dataset_name}.{table_name}'\n",
"\n",
"schema = [\n",
" bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n",
@@ -273,6 +275,37 @@
")"
]
},
{
"cell_type": "markdown",
"id": "4e18540d-defa-414b-8b78-b86bce3dd46a",
"metadata": {},
"source": [
"As an option, the following code will auto detect and upload the data to the table in one step."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "671aa2a3-d259-42e3-8cc3-8ab6b1e4a7d1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"table_id = f'{dataset_id}.{table_name}'\n",
"\n",
"job_config = bigquery.LoadJobConfig(\n",
" autodetect=True, source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1,\n",
")\n",
"uri = f\"gs://{bucket}/ds_salaries.csv\"\n",
"load_job = client.load_table_from_uri(\n",
" uri, table_id, job_config=job_config\n",
") # Make an API request.\n",
"load_job.result() # Waits for the job to complete.\n",
"destination_table = client.get_table(table_id)\n",
"print(\"Loaded {} rows.\".format(destination_table.num_rows))"
]
},
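{
"cell_type": "markdown",
"id": "9d0e1f2a-3b4c-4d5e-8f6a-7b8c9d0e1f2a",
"metadata": {},
"source": [
"As a quick sanity check (a sketch reusing `client` and `table_id` from the cells above), count the rows that landed in the table."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e5f6a7b-8c9d-4e0f-a1b2-c3d4e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"# Count rows to confirm the load succeeded.\n",
"query = f\"SELECT COUNT(*) AS row_count FROM `{table_id}`\"\n",
"print(client.query(query).to_dataframe())"
]
},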
{
"cell_type": "markdown",
"id": "6e97b4c6-429c-4d36-a4c1-3963b079af6a",
@@ -299,6 +332,30 @@
" print(\"{} project does not contain any datasets.\".format(project))"
]
},
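{
"cell_type": "markdown",
"id": "6a7b8c9d-0e1f-4a2b-8c3d-4e5f6a7b8c9d",
"metadata": {},
"source": [
"For reference, a minimal end-to-end sketch of the dataset listing, following the standard BigQuery client pattern; the collapsed cell above may differ in detail."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e1f2a3b-4c5d-4e6f-9a7b-8c9d0e1f2a3b",
"metadata": {},
"outputs": [],
"source": [
"# List every dataset in the current project.\n",
"datasets = list(client.list_datasets())\n",
"project = client.project\n",
"\n",
"if datasets:\n",
"    print(\"Datasets in project {}:\".format(project))\n",
"    for dataset in datasets:\n",
"        print(\"\\t{}\".format(dataset.dataset_id))\n",
"else:\n",
"    print(\"{} project does not contain any datasets.\".format(project))"
]
},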
{
"cell_type": "markdown",
"id": "0c87a0f1-de43-4cce-a5e2-8952ce2283a0",
"metadata": {},
"source": [
"We load our dataset located in our bucket to a pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2486d4e3-b462-42de-85ab-3c6d5a8ba363",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"path = \"<gsutil URI>\"\n",
"df = pd.read_csv(path)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "ad42d610-daa6-4d92-ac14-2d90349dafe1",
@@ -316,10 +373,6 @@
"source": [
"df.to_gbq(table_id, project_id=project_id)\n",
"\n",
"from google.cloud import storage\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"client.load_table_from_dataframe(df, table_id).result()"
]
},
@@ -328,7 +381,7 @@
"id": "4b16a292-afdc-415c-a8a7-071bc90281c3",
"metadata": {},
"source": [
"![image.png](../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)"
"![image.png](../../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)"
]
},
{
@@ -404,7 +457,7 @@
"from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n",
"from langchain.sql_database import SQLDatabase\n",
"\n",
"sqlalchemy_url = f'bigquery://{project_id}/{dataset}'\n",
"sqlalchemy_url = f'bigquery://{project_id}/{dataset_name}'\n",
"print(sqlalchemy_url)"
]
},
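{
"cell_type": "markdown",
"id": "8c9d0e1f-2a3b-4c4d-8e5f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"To complete the picture, a minimal sketch of wiring the toolkit into a SQL agent and asking a question. The model name and the sample question are assumptions, and the __sqlalchemy-bigquery__ package must be installed for the bigquery:// URL to work."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b4c5d6e-7f8a-4b9c-8d0e-1f2a3b4c5d6e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents import create_sql_agent\n",
"from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n",
"from langchain.sql_database import SQLDatabase\n",
"from langchain_google_vertexai import ChatVertexAI\n",
"\n",
"db = SQLDatabase.from_uri(sqlalchemy_url)\n",
"llm = ChatVertexAI(model_name=\"gemini-1.5-pro\")\n",
"toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n",
"\n",
"# The agent turns the question into SQL, runs it against BigQuery,\n",
"# and hands the results back to the model to compose the final answer.\n",
"agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)\n",
"agent_executor.invoke({\"input\": \"How many rows does the table contain?\"})"
]
},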