diff --git a/images/.ipynb_checkpoints/gcp_rag_structure_data_01-checkpoint.png b/images/.ipynb_checkpoints/gcp_rag_structure_data_01-checkpoint.png new file mode 100644 index 0000000..a9cedd8 Binary files /dev/null and b/images/.ipynb_checkpoints/gcp_rag_structure_data_01-checkpoint.png differ diff --git a/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data-checkpoint.ipynb b/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data-checkpoint.ipynb new file mode 100644 index 0000000..54d4614 --- /dev/null +++ b/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data-checkpoint.ipynb @@ -0,0 +1,598 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e38b3f2-2b29-4788-a29a-f9e5d10f8365", + "metadata": {}, + "source": [ + "##### __Skill Level:__ Intermediate" + ] + }, + { + "cell_type": "markdown", + "id": "6c0533e3-4fbd-47b1-8458-fd193bcb263f", + "metadata": {}, + "source": [ + "# Creating a Chatbot for Structure Data using a RAG" + ] + }, + { + "cell_type": "markdown", + "id": "76cf4b02-6422-4a48-882c-c00315e34c8a", + "metadata": { + "tags": [] + }, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "id": "7b7870dd-09e0-4f6a-91a9-7c30529af720", + "metadata": {}, + "source": [ + "Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n", + "\n", + "In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools." + ] + }, + { + "cell_type": "markdown", + "id": "d513a9fe-1ac4-4610-b53c-6696e461f828", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "id": "fbdb09aa-73d6-4cee-b3e7-5d6b243e3d58", + "metadata": {}, + "source": [ + "We assume you have access to Vertex AI and have enabled the necessary APIs.\n", + "\n", + "In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n", + "\n", + "Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role.\n", + "\n", + "We are using a e2-standard-4 (Efficient Instance: 4 vCPUs, 16 GB RAM) for this tutorial. 
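\n",
+    "\n",
+    "If you prefer the command line, the short sketch below enables the APIs this tutorial relies on; it assumes the gcloud CLI is installed and already authenticated against your project.\n",
+    "\n",
+    "```python\n",
+    "# Optional sketch: enable the Vertex AI, BigQuery, and Cloud Storage APIs (assumes an authenticated gcloud CLI).\n",
+    "!gcloud services enable aiplatform.googleapis.com bigquery.googleapis.com storage.googleapis.com\n",
+    "```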
" + ] + }, + { + "cell_type": "markdown", + "id": "eeb24716-0db8-462f-8faa-0b8ad9d61feb", + "metadata": { + "tags": [] + }, + "source": [ + "## Learning objectives" + ] + }, + { + "cell_type": "markdown", + "id": "f5216a43-31ee-4c3e-821a-4b81f4be8e54", + "metadata": {}, + "source": [ + "In this tutorial you will learn about:\n", + "- How to set up a BigQuery dataset and a table.\n", + "- How to load data to a BigQuery table.\n", + "- How to use the langchain ChatVertexAI agent to extract information from the table. \n" + ] + }, + { + "cell_type": "markdown", + "id": "348910aa-9802-4318-8fcb-86603158923b", + "metadata": { + "tags": [] + }, + "source": [ + "## Pricing" + ] + }, + { + "cell_type": "markdown", + "id": "0dca31b2-880d-4cbc-bf76-ede96998e371", + "metadata": {}, + "source": [ + "If you are following this tutorial in one sitting it will cost $23.89 per month. Completing the process in multiple sessions or using a method different from the tutorial may result in increased costs." + ] + }, + { + "cell_type": "markdown", + "id": "c7d9e3ee-1c7e-4baa-a818-c13330c665dd", + "metadata": {}, + "source": [ + "## Get Start" + ] + }, + { + "cell_type": "markdown", + "id": "137defcd-fd2b-421d-9b2e-fe2b878fb325", + "metadata": {}, + "source": [ + "### Install Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c14a38-ba51-466d-90b5-1512f5a7fae0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install google-cloud-bigquery\n", + "!pip install gcloud\n", + "!pip install langchain\n", + "!pip install -U langchain-community\n", + "!pip install -qU langchain_google_vertexai\n", + "!pip install pandas-gbq" + ] + }, + { + "cell_type": "markdown", + "id": "64e22773-9a03-47de-9b6d-7946abcba722", + "metadata": {}, + "source": [ + "Set your project id, location, and bucket variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7193a9ac-3169-45c1-9c5d-ba5e2f31458f", + "metadata": {}, + "outputs": [], + "source": [ + "project_id=''\n", + "location=' (e.g.us-east4)'\n", + "bucket = ''" + ] + }, + { + "cell_type": "markdown", + "id": "df0d9eda-ee54-4c4c-86e6-9166b2b82e2d", + "metadata": {}, + "source": [ + "Create a bucket where the data used for tutorial is going to be placed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b583234-2d92-4e5b-841d-c7d79dd87d45", + "metadata": {}, + "outputs": [], + "source": [ + "# make bucket\n", + "!gsutil mb -l {location} gs://{bucket}" + ] + }, + { + "cell_type": "markdown", + "id": "5781a07c-84ed-416c-a782-b1dc881e5c18", + "metadata": {}, + "source": [ + "Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. It is important to keep in mind that the name of the bucket has to be unique." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4829094e-61a6-4dd7-854f-6b21dc19b1b7", + "metadata": {}, + "outputs": [], + "source": [ + "!gsutil cp '' gs://{bucket}" + ] + }, + { + "cell_type": "markdown", + "id": "7a00dc83-8713-4ea5-9345-08a9f3d785d0", + "metadata": {}, + "source": [ + "The next step is to create a database in BigQuery. To accomplish this, we need to define a dataset_id and construct a dataset object that will be sent to the API for creation. 
Note that in BigQuery, a dataset is analogous to a database." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1750e13d-ec05-4335-82a2-ffe54c5b5f32", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "dataset_name = ''\n", + "# Set dataset_id to the ID of the dataset to create.\n", + "dataset_id = f'{project_id}.{dataset_name}'\n", + "\n", + "# Construct a full Dataset object to send to the API.\n", + "dataset = bigquery.Dataset(dataset_id)\n", + "\n", + "# Specify the geographic location where the dataset should reside.\n", + "dataset.location = location\n", + "\n", + "# Send the dataset to the API for creation, with an explicit timeout.\n", + "# Raises google.api_core.exceptions.Conflict if the Dataset already\n", + "# exists within the project.\n", + "dataset = client.create_dataset(dataset, timeout=30) # Make an API request.\n", + "print(\"Created dataset {}.{}\".format(client.project, dataset.dataset_id))" + ] + }, + { + "cell_type": "markdown", + "id": "544c191c-2bc0-4a47-a0a7-cea81035597a", + "metadata": {}, + "source": [ + "Once the dataset is created, the next step is to create a table. First, we define a table ID, followed by outlining the table's schema. Finally, we send an API request to create the table within the dataset established in the previous step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68206b14-1b5e-49c3-9f07-52ba6c70a93e", + "metadata": {}, + "outputs": [], + "source": [ + "# Set table_id to the ID of the table to create.\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "table_name = \"\"\n", + "table_id = f'{project_id}.{dataset_id}.{table_name}'\n", + "\n", + "schema = [\n", + " bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"age\", \"INTEGER\", mode=\"NULLABLE\"), \n", + " bigquery.SchemaField(\"gender\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"height\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"weight\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_hi\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_lo\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cholesterol\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"gluc\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"smoke\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"alco\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"active\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cardio\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ageinyr\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmi\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmicat\", \"STRING\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"agegroup\", \"STRING\", mode=\"NULLABLE\"),\n", + "\n", + "]\n", + "\n", + "table = bigquery.Table(table_id, schema=schema)\n", + "table = client.create_table(table) # Make an API request.\n", + "print(\n", + " \"Created table {}.{}.{}\".format(table.project, table.dataset_id, table.table_id)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6e97b4c6-429c-4d36-a4c1-3963b079af6a", + "metadata": {}, + "source": [ + "This step is optional. We can make an API request to verify whether the dataset has been successfully created under our project." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72f11cfd-8b20-48bf-b157-aaf791cd3a9f", + "metadata": {}, + "outputs": [], + "source": [ + "datasets = list(client.list_datasets()) # Make an API request.\n", + "project = project_id\n", + "\n", + "if datasets:\n", + " print(\"Datasets in project {}:\".format(project))\n", + " for dataset in datasets:\n", + " print(\"\\t{}\".format(dataset.dataset_id))\n", + "else:\n", + " print(\"{} project does not contain any datasets.\".format(project))" + ] + }, + { + "cell_type": "markdown", + "id": "ad42d610-daa6-4d92-ac14-2d90349dafe1", + "metadata": {}, + "source": [ + "We need to use the \"to_gbq\" function from the pandas-gbq library, which allows us to write a Pandas DataFrame to a Google BigQuery table. This enables us to populate our BigQuery table with the data from the DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cfd1d66-a0fb-403d-a9b1-27250cc81f91", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_gbq(table_id, project_id=project_id)\n", + "\n", + "from google.cloud import storage\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "client.load_table_from_dataframe(df, table_id).result()" + ] + }, + { + "cell_type": "markdown", + "id": "4b16a292-afdc-415c-a8a7-071bc90281c3", + "metadata": {}, + "source": [ + "![image.png](../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)" + ] + }, + { + "cell_type": "markdown", + "id": "22b4dd65-9bd9-48ef-9c98-85e3e9a82701", + "metadata": {}, + "source": [ + "This step is also optional. We can execute a simple query on the BigQuery table to count the number of records." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29e91fc1-735f-4d6c-9fcf-3a2d745363c1", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT count(gender) as cgender\n", + " FROM `project_id.dataset_id.table_id`\n", + "\"\"\"\n", + "\n", + "# Execute the query\n", + "query_job = client.query(query)\n", + "\n", + "# Fetch the results\n", + "results = query_job.result()\n", + "\n", + "# Iterate through the rows\n", + "for row in results:\n", + " #print(f\"gender: {row.name}, age: {row.age}\")\n", + " print(f\"cgender: {row.cgender}\")\n", + "\n", + "try:\n", + " query_job = client.query(query)\n", + " results = query_job.result()\n", + " print(results)\n", + "except Exception as e:\n", + " print(f\"Error executing query: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ce65ead3-694d-4d46-9376-38d35b8d0852", + "metadata": {}, + "source": [ + "cgender: 139920\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "b6fb53b8-9f93-4f37-8730-33c5871720fc", + "metadata": {}, + "source": [ + "To interact with the BigQuery table using a Pythonic domain language, we utilize SQLAlchemy. SQLAlchemy is a Python SQL toolkit that enables developers to access and manage SQL databases, allowing users to write queries as strings or chain Python objects for similar queries. 
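\n",
+    "\n",
+    "As a quick illustration, the sketch below runs one query directly through a SQLAlchemy engine. It assumes `project_id` and `dataset_name` from the earlier cells, the table created above (here called `screening`), and that the `sqlalchemy-bigquery` dialect is installed (`pip install sqlalchemy-bigquery`).\n",
+    "\n",
+    "```python\n",
+    "from sqlalchemy import create_engine, text\n",
+    "\n",
+    "# Build an engine for the dataset created earlier.\n",
+    "engine = create_engine(f'bigquery://{project_id}/{dataset_name}')\n",
+    "with engine.connect() as conn:\n",
+    "    # 'screening' is the table created above; substitute your own table name.\n",
+    "    rows = conn.execute(text('SELECT COUNT(*) AS n FROM screening')).fetchall()\n",
+    "print(rows)\n",
+    "```\n",
+    "\n",
+    "Running a direct query like this is a quick way to confirm the connection string works before handing it to the agent.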
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1d29b90-7bcd-46b8-82b9-dc5e58f9edee", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "from sqlalchemy import *\n", + "from sqlalchemy.engine import create_engine\n", + "from sqlalchemy.schema import *\n", + "import os\n", + "from langchain_community.agent_toolkits import create_sql_agent\n", + "from langchain.agents import create_sql_agent\n", + "from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n", + "from langchain.sql_database import SQLDatabase\n", + "\n", + "sqlalchemy_url = f'bigquery://{project_id}/{dataset}'\n", + "print(sqlalchemy_url)" + ] + }, + { + "cell_type": "markdown", + "id": "9194cc1d-7ef1-4c4e-8e65-e7d9ada5be24", + "metadata": {}, + "source": [ + "Next, we import the __ChatVertexAI__ agent from Langchain and configure it with the appropriate hyperparameters. In this instance, we are using the __gemini-1.5-pro__ LLM model, after which we create the SQL agent to enable querying the BigQuery table using a natural language string as a prompt. Temperature regulates randomness, with higher temperatures resulting in more varied and unpredictable outputs. Top-k sampling selects from the k most probable next tokens at each step, where a lower k emphasizes higher-probability tokens. The max tokens hyperparameter specifies the maximum number of tokens in the response from the large language model. Max retries indicates how many responses we will receive from the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee1899a2-dd79-4cf4-a02d-c7c35b99040b", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_google_vertexai import VertexAI, ChatVertexAI\n", + "\n", + "llm = ChatVertexAI(\n", + " model=\"gemini-1.5-pro\",\n", + " temperature=0,\n", + " max_tokens=8190,\n", + " timeout=None,\n", + " max_retries=2,\n", + ")\n", + "db = SQLDatabase.from_uri(sqlalchemy_url)\n", + "toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n", + "agent_executor = create_sql_agent(\n", + " llm=llm,\n", + " toolkit=toolkit,\n", + " verbose=True,\n", + " top_k=100000\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3551426-8644-48ba-b8c1-bf7d5eb843b3", + "metadata": {}, + "source": [ + "For our initial query, we check the number of rows to compare the result with what we obtained in a previous step using a simple SQL query against the BigQuery table. We can confirm that we received the same number of rows: 139,920." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9753533-e82b-47af-936a-a401d512308a", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many rows are in the table screening? \")" + ] + }, + { + "cell_type": "markdown", + "id": "099154a1-34e2-4f6b-b146-cec6f5943cf7", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'The table screening has 139920 rows.'" + ] + }, + { + "cell_type": "markdown", + "id": "f2b62012-b2d2-48c1-ae69-6f8816065551", + "metadata": { + "tags": [] + }, + "source": [ + "Next, we pose a question that requires applying a filter, and we successfully obtain the accurate number of obese individuals in the table." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5563d317-81ab-4a37-8225-e1b0dce20b36", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many obese are in the table screen?\")" + ] + }, + { + "cell_type": "markdown", + "id": "40e1846d-1114-4ba1-b35a-65b47a838468", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 37204 obese people in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "b96ea414-3f2d-4e66-ab0b-a034dfe507b0", + "metadata": {}, + "source": [ + "As a more complex query, we inquired about the number of female smokers, and the SQL agent accurately returned the answer of 1,626 female smokers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a456ff90-a3a5-4ee4-81da-fba4fcdf7f2d", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many smokers are female in the table screen? the value 1 in smoke means that is a smoker and the value 2 in gender means that is a female\")" + ] + }, + { + "cell_type": "markdown", + "id": "7ea7810f-9701-4cb0-9430-181a57929884", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 1626 female smokers in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "ca0fb657-19e0-4488-b9f6-25ca3971064f", + "metadata": { + "tags": [] + }, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "7f85e560-231f-4841-b1d0-334326e4b347", + "metadata": {}, + "source": [ + "You have learned how to create a dataset and a table in BigQuery, as well as how to set up an agent that enables you to query the table using natural language instead of SQL queries, allowing you to obtain the results you need." + ] + }, + { + "cell_type": "markdown", + "id": "41573388-7feb-44b6-9e8a-0511cc8ae62e", + "metadata": {}, + "source": [ + "## Clean Up" + ] + }, + { + "cell_type": "markdown", + "id": "1f0edc0b-f2cc-478c-855f-9acff3792f4d", + "metadata": {}, + "source": [ + "Please remember to delete or stop your Jupyter notebook and delete your BigQuery dataset and table to prevent incurring charges. And if you have created any other services like buckets, please remember to delete them as well." 
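,
+    "\n",
+    "A minimal cleanup sketch, assuming `client`, `dataset_id`, and `bucket` are still defined from the earlier cells:\n",
+    "\n",
+    "```python\n",
+    "# Delete the BigQuery dataset together with the table it contains.\n",
+    "client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)\n",
+    "\n",
+    "# Remove the bucket that held the CSV source file.\n",
+    "!gsutil rm -r gs://{bucket}\n",
+    "```"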
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe73dde2-056a-46a3-bacb-0a6671f7128f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cpu:m125" + }, + "kernelspec": { + "display_name": "Python 3 (Local)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data_old-checkpoint.ipynb b/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data_old-checkpoint.ipynb new file mode 100644 index 0000000..a7d082e --- /dev/null +++ b/notebooks/GenAI/.ipynb_checkpoints/GCP_RAG_for_Structure_Data_old-checkpoint.ipynb @@ -0,0 +1,710 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e38b3f2-2b29-4788-a29a-f9e5d10f8365", + "metadata": {}, + "source": [ + "##### __Skill Level:__ Intermediate" + ] + }, + { + "cell_type": "markdown", + "id": "6c0533e3-4fbd-47b1-8458-fd193bcb263f", + "metadata": {}, + "source": [ + "# Creating a Chatbot for Structure Data using a RAG" + ] + }, + { + "cell_type": "markdown", + "id": "76cf4b02-6422-4a48-882c-c00315e34c8a", + "metadata": { + "tags": [] + }, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "id": "7b7870dd-09e0-4f6a-91a9-7c30529af720", + "metadata": {}, + "source": [ + "Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n", + "\n", + "In this tutorial, we present a technique that leverages SQL databases. By formulating a query based on the user's request, the model submits this query to the database, providing comprehensive results. This approach not only ensures users receive all pertinent information but also reduces the likelihood of exceeding token limits." + ] + }, + { + "cell_type": "markdown", + "id": "d513a9fe-1ac4-4610-b53c-6696e461f828", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "id": "fbdb09aa-73d6-4cee-b3e7-5d6b243e3d58", + "metadata": {}, + "source": [ + "We assume you have access to Vertex AI and have enabled the necessary APIs.\n", + "\n", + "In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. 
Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n", + "\n", + "Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role." + ] + }, + { + "cell_type": "markdown", + "id": "eeb24716-0db8-462f-8faa-0b8ad9d61feb", + "metadata": { + "tags": [] + }, + "source": [ + "## Learning objectives" + ] + }, + { + "cell_type": "markdown", + "id": "f5216a43-31ee-4c3e-821a-4b81f4be8e54", + "metadata": {}, + "source": [ + "In this tutorial you will learn about:\n", + "- How to set up a BigQuery dataset and a table.\n", + "- How to load data to a BigQuery table.\n", + "- How to use the langchain ChatVertexAI agent to extract information froim the table. \n" + ] + }, + { + "cell_type": "markdown", + "id": "348910aa-9802-4318-8fcb-86603158923b", + "metadata": { + "tags": [] + }, + "source": [ + "## Pricing" + ] + }, + { + "cell_type": "markdown", + "id": "0dca31b2-880d-4cbc-bf76-ede96998e371", + "metadata": {}, + "source": [ + "
Estimated monthly cost breakdown:\n",
+    "\n",
+    "- BigQuery Standard, 1 TB storage.\n",
+    "- Chat model: 5 requests per day, with an average of 5,000 input and output characters.\n",
+    "\n",
+    "Total: $23.89 per month.
" + ] + }, + { + "cell_type": "markdown", + "id": "c7d9e3ee-1c7e-4baa-a818-c13330c665dd", + "metadata": {}, + "source": [ + "## Get Start" + ] + }, + { + "cell_type": "markdown", + "id": "137defcd-fd2b-421d-9b2e-fe2b878fb325", + "metadata": {}, + "source": [ + "### Install Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c14a38-ba51-466d-90b5-1512f5a7fae0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install google-cloud-bigquery\n", + "!pip install gcloud\n", + "!pip install langchain\n", + "!pip install -U langchain-community\n", + "!pip install -qU langchain_google_vertexai\n", + "!pip install pandas-gbq" + ] + }, + { + "cell_type": "markdown", + "id": "64e22773-9a03-47de-9b6d-7946abcba722", + "metadata": {}, + "source": [ + "Set your project id, location, and bucket variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7193a9ac-3169-45c1-9c5d-ba5e2f31458f", + "metadata": {}, + "outputs": [], + "source": [ + "project_id=''\n", + "location=' (e.g.us-east4)'\n", + "bucket = ''" + ] + }, + { + "cell_type": "markdown", + "id": "df0d9eda-ee54-4c4c-86e6-9166b2b82e2d", + "metadata": {}, + "source": [ + "Create a bucket where the data used for tutorial is going to be placed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b583234-2d92-4e5b-841d-c7d79dd87d45", + "metadata": {}, + "outputs": [], + "source": [ + "# make bucket\n", + "!gsutil mb -l {location} gs://{bucket}" + ] + }, + { + "cell_type": "markdown", + "id": "5781a07c-84ed-416c-a782-b1dc881e5c18", + "metadata": {}, + "source": [ + "Provide the names of the bucket name, source file name path, and destination blob name to upload the CSV source file to the bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4829094e-61a6-4dd7-854f-6b21dc19b1b7", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import storage\n", + "\n", + "# The ID of your GCS bucket\n", + "bucket_name = bucket\n", + " # The path to your file to upload\n", + "source_file_name = \"\"\n", + " # The ID of your GCS object\n", + "destination_blob_name = \"\"\n", + "\n", + "storage_client = storage.Client()\n", + "bucket = storage_client.bucket(bucket_name)\n", + "blob = bucket.blob(destination_blob_name)\n", + "blob.upload_from_filename(source_file_name)\n", + "\n", + "print(\n", + " \"File {} uploaded to {}.\".format(\n", + " source_file_name, destination_blob_name\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "7a00dc83-8713-4ea5-9345-08a9f3d785d0", + "metadata": {}, + "source": [ + "The next step is to create a database in BigQuery. To accomplish this, we need to define a dataset_id and construct a dataset object that will be sent to the API for creation. Note that in BigQuery, a dataset is analogous to a database." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1750e13d-ec05-4335-82a2-ffe54c5b5f32", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "# Set dataset_id to the ID of the dataset to create.\n", + "dataset_id = project_id.''\n", + "\n", + "# Construct a full Dataset object to send to the API.\n", + "dataset = bigquery.Dataset(dataset_id)\n", + "\n", + "# Specify the geographic location where the dataset should reside.\n", + "dataset.location = location\n", + "\n", + "# Send the dataset to the API for creation, with an explicit timeout.\n", + "# Raises google.api_core.exceptions.Conflict if the Dataset already\n", + "# exists within the project.\n", + "dataset = client.create_dataset(dataset, timeout=30) # Make an API request.\n", + "print(\"Created dataset {}.{}\".format(client.project, dataset.dataset_id))" + ] + }, + { + "cell_type": "markdown", + "id": "544c191c-2bc0-4a47-a0a7-cea81035597a", + "metadata": {}, + "source": [ + "Once the dataset is created, the next step is to create a table. First, we define a table ID, followed by outlining the table's schema. Finally, we send an API request to create the table within the dataset established in the previous step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68206b14-1b5e-49c3-9f07-52ba6c70a93e", + "metadata": {}, + "outputs": [], + "source": [ + "# Set table_id to the ID of the table to create.\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "table_id = project_id.dataset_id.\"
\"\n", + "\n", + "schema = [\n", + " bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"age\", \"INTEGER\", mode=\"NULLABLE\"), \n", + " bigquery.SchemaField(\"gender\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"height\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"weight\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_hi\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_lo\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cholesterol\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"gluc\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"smoke\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"alco\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"active\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cardio\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ageinyr\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmi\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmicat\", \"STRING\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"agegroup\", \"STRING\", mode=\"NULLABLE\"),\n", + "\n", + "]\n", + "\n", + "table = bigquery.Table(table_id, schema=schema)\n", + "table = client.create_table(table) # Make an API request.\n", + "print(\n", + " \"Created table {}.{}.{}\".format(table.project, table.dataset_id, table.table_id)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6e97b4c6-429c-4d36-a4c1-3963b079af6a", + "metadata": {}, + "source": [ + "This step is optional. We can make an API request to verify whether the dataset has been successfully created under our project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72f11cfd-8b20-48bf-b157-aaf791cd3a9f", + "metadata": {}, + "outputs": [], + "source": [ + "datasets = list(client.list_datasets()) # Make an API request.\n", + "project = project_id\n", + "\n", + "if datasets:\n", + " print(\"Datasets in project {}:\".format(project))\n", + " for dataset in datasets:\n", + " print(\"\\t{}\".format(dataset.dataset_id))\n", + "else:\n", + " print(\"{} project does not contain any datasets.\".format(project))" + ] + }, + { + "cell_type": "markdown", + "id": "dad4f96d-ae52-43f2-8290-c04b25960ea8", + "metadata": {}, + "source": [ + "We can perform a similar check to confirm that the table we created is located within the appropriate dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80ea826c-9d70-4268-b4db-a657d5cdcaf0", + "metadata": {}, + "outputs": [], + "source": [ + "# Set dataset_id to the ID of the dataset to fetch.\n", + "\n", + "dataset = client.get_dataset(dataset_id) # Make an API request.\n", + "\n", + "full_dataset_id = \"{}.{}\".format(dataset.project, dataset.dataset_id)\n", + "friendly_name = dataset.friendly_name\n", + "print(\n", + " \"Got dataset '{}' with friendly_name '{}'.\".format(\n", + " full_dataset_id, friendly_name\n", + " )\n", + ")\n", + "\n", + "# View dataset properties.\n", + "print(\"Description: {}\".format(dataset.description))\n", + "print(\"Labels:\")\n", + "labels = dataset.labels\n", + "if labels:\n", + " for label, value in labels.items():\n", + " print(\"\\t{}: {}\".format(label, value))\n", + "else:\n", + " print(\"\\tDataset has no labels defined.\")\n", + "\n", + "# View tables in dataset.\n", + "print(\"Tables:\")\n", + "tables = list(client.list_tables(dataset)) # Make an API request(s).\n", + "if tables:\n", + " for table in tables:\n", + " print(\"\\t{}\".format(table.table_id))\n", + "else:\n", + " print(\"\\tThis dataset does not contain any tables.\")" + ] + }, + { + "cell_type": "markdown", + "id": "736080b9-0cd9-44fd-9580-dadb10a22a59", + "metadata": { + "tags": [] + }, + "source": [ + "To upload the data from our file stored in the bucket, we first need to create a DataFrame. In this example, we will be using a file that contains health screening data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b614cc1a-b318-4364-8d47-1dea9700c792", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import storage\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "# Initialize a client\n", + "client = storage.Client()\n", + "my_bucket = bucket\n", + "storage_client = storage.Client()\n", + "bucket = storage_client.get_bucket(my_bucket)\n", + "blob = bucket.blob(destination_blob_name)\n", + "path = \"\"\n", + "df = pd.read_csv(path)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ad42d610-daa6-4d92-ac14-2d90349dafe1", + "metadata": {}, + "source": [ + "We need to use the \"to_gbq\" function from the pandas-gbq library, which allows us to write a Pandas DataFrame to a Google BigQuery table. This enables us to populate our BigQuery table with the data from the DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cfd1d66-a0fb-403d-a9b1-27250cc81f91", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_gbq(table_id, project_id=project_id)\n", + "\n", + "from google.cloud import storage\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "client.load_table_from_dataframe(df, table_id).result()" + ] + }, + { + "cell_type": "markdown", + "id": "22b4dd65-9bd9-48ef-9c98-85e3e9a82701", + "metadata": {}, + "source": [ + "This step is also optional. We can execute a simple query on the BigQuery table to count the number of records." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29e91fc1-735f-4d6c-9fcf-3a2d745363c1", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT count(gender) as cgender\n", + " FROM `project_id.dataset_id.table_id`\n", + "\"\"\"\n", + "\n", + "# Execute the query\n", + "query_job = client.query(query)\n", + "\n", + "# Fetch the results\n", + "results = query_job.result()\n", + "\n", + "# Iterate through the rows\n", + "for row in results:\n", + " #print(f\"gender: {row.name}, age: {row.age}\")\n", + " print(f\"cgender: {row.cgender}\")\n", + "\n", + "try:\n", + " query_job = client.query(query)\n", + " results = query_job.result()\n", + " print(results)\n", + "except Exception as e:\n", + " print(f\"Error executing query: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ce65ead3-694d-4d46-9376-38d35b8d0852", + "metadata": {}, + "source": [ + "cgender: 139920\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "b6fb53b8-9f93-4f37-8730-33c5871720fc", + "metadata": {}, + "source": [ + "To interact with the BigQuery table using a Pythonic domain language, we utilize SQLAlchemy. SQLAlchemy is a Python SQL toolkit that enables developers to access and manage SQL databases, allowing users to write queries as strings or chain Python objects for similar queries. However, to do this, we need specific credentials, which can be accessed through a service account file. For more information on how to create a service account file, please visit the following link: [Service Account Creation](https://cloud.google.com/iam/docs/service-accounts-create#python)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1d29b90-7bcd-46b8-82b9-dc5e58f9edee", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "from sqlalchemy import *\n", + "from sqlalchemy.engine import create_engine\n", + "from sqlalchemy.schema import *\n", + "import os\n", + "from langchain_community.agent_toolkits import create_sql_agent\n", + "from langchain.agents import create_sql_agent\n", + "from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n", + "from langchain.sql_database import SQLDatabase\n", + "service_account_file = \"\" # Change to where your service account key file is located\n", + "project_id = project_id\n", + "dataset = \"\"\n", + "table = \"
\"\n", + "sqlalchemy_url = f'bigquery://{project_id}/{dataset}?credentials_path={service_account_file}'\n", + "print(sqlalchemy_url)" + ] + }, + { + "cell_type": "markdown", + "id": "9194cc1d-7ef1-4c4e-8e65-e7d9ada5be24", + "metadata": {}, + "source": [ + "Next, we import the __ChatVertexAI__ agent from Langchain and configure it with the appropriate hyperparameters. In this instance, we are using the __gemini-1.5-pro__ LLM model, after which we create the SQL agent to enable querying the BigQuery table using a natural language string as a prompt." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee1899a2-dd79-4cf4-a02d-c7c35b99040b", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_google_vertexai import VertexAI, ChatVertexAI\n", + "\n", + "llm = ChatVertexAI(\n", + " model=\"gemini-1.5-pro\",\n", + " temperature=0,\n", + " max_tokens=8190,\n", + " timeout=None,\n", + " max_retries=2,\n", + ")\n", + "db = SQLDatabase.from_uri(sqlalchemy_url)\n", + "toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n", + "agent_executor = create_sql_agent(\n", + " llm=llm,\n", + " toolkit=toolkit,\n", + " verbose=True,\n", + " top_k=100000\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3551426-8644-48ba-b8c1-bf7d5eb843b3", + "metadata": {}, + "source": [ + "For our initial query, we check the number of rows to compare the result with what we obtained in a previous step using a simple SQL query against the BigQuery table. We can confirm that we received the same number of rows: 139,920." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9753533-e82b-47af-936a-a401d512308a", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many rows are in the table screening? \")" + ] + }, + { + "cell_type": "markdown", + "id": "099154a1-34e2-4f6b-b146-cec6f5943cf7", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'The table screening has 139920 rows.'" + ] + }, + { + "cell_type": "markdown", + "id": "65f645ed-8efa-436f-b993-72c45a7aadbe", + "metadata": { + "tags": [] + }, + "source": [ + "We know the output of the model are correct. The image below is a reminder of what the actual results are calculated from." + ] + }, + { + "cell_type": "markdown", + "id": "8160ee6e-f95f-44c3-ab27-4415b44ba461", + "metadata": { + "tags": [] + }, + "source": [ + "![image.png](../../images/gcp_rag_structure_data_01.png)" + ] + }, + { + "cell_type": "markdown", + "id": "f2b62012-b2d2-48c1-ae69-6f8816065551", + "metadata": { + "tags": [] + }, + "source": [ + "Next, we pose a question that requires applying a filter, and we successfully obtain the accurate number of obese individuals in the table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5563d317-81ab-4a37-8225-e1b0dce20b36", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many obese are in the table screen?\")" + ] + }, + { + "cell_type": "markdown", + "id": "40e1846d-1114-4ba1-b35a-65b47a838468", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 37204 obese people in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "b96ea414-3f2d-4e66-ab0b-a034dfe507b0", + "metadata": {}, + "source": [ + "As a more complex query, we inquired about the number of female smokers, and the SQL agent accurately returned the answer of 1,626 female smokers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a456ff90-a3a5-4ee4-81da-fba4fcdf7f2d", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many smokers are female in the table screen? the value 1 in smoke means that is a smoker and the value 2 in gender means that is a female\")" + ] + }, + { + "cell_type": "markdown", + "id": "7ea7810f-9701-4cb0-9430-181a57929884", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 1626 female smokers in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "ca0fb657-19e0-4488-b9f6-25ca3971064f", + "metadata": { + "tags": [] + }, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "7f85e560-231f-4841-b1d0-334326e4b347", + "metadata": {}, + "source": [ + "You have learned how to create a dataset and a table in BigQuery, as well as how to set up an agent that enables you to query the table using natural language instead of SQL queries, allowing you to obtain the results you need." + ] + }, + { + "cell_type": "markdown", + "id": "41573388-7feb-44b6-9e8a-0511cc8ae62e", + "metadata": {}, + "source": [ + "## Clean Up" + ] + }, + { + "cell_type": "markdown", + "id": "1f0edc0b-f2cc-478c-855f-9acff3792f4d", + "metadata": {}, + "source": [ + "Please remember to delete or stop your Jupyter notebook and delete your data store to prevent incurring charges. And if you have created any other services like buckets, please remember to delete them as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe73dde2-056a-46a3-bacb-0a6671f7128f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cpu:m125" + }, + "kernelspec": { + "display_name": "Python 3 (Local)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/GenAI/GCP_RAG_for_Structure_Data.ipynb b/notebooks/GenAI/GCP_RAG_for_Structure_Data.ipynb index a7d082e..54d4614 100644 --- a/notebooks/GenAI/GCP_RAG_for_Structure_Data.ipynb +++ b/notebooks/GenAI/GCP_RAG_for_Structure_Data.ipynb @@ -33,7 +33,7 @@ "source": [ "Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n", "\n", - "In this tutorial, we present a technique that leverages SQL databases. By formulating a query based on the user's request, the model submits this query to the database, providing comprehensive results. This approach not only ensures users receive all pertinent information but also reduces the likelihood of exceeding token limits." 
+ "In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools." ] }, { @@ -53,7 +53,9 @@ "\n", "In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n", "\n", - "Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role." + "Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role.\n", + "\n", + "We are using a e2-standard-4 (Efficient Instance: 4 vCPUs, 16 GB RAM) for this tutorial. " ] }, { @@ -74,7 +76,7 @@ "In this tutorial you will learn about:\n", "- How to set up a BigQuery dataset and a table.\n", "- How to load data to a BigQuery table.\n", - "- How to use the langchain ChatVertexAI agent to extract information froim the table. \n" + "- How to use the langchain ChatVertexAI agent to extract information from the table. \n" ] }, { @@ -92,11 +94,7 @@ "id": "0dca31b2-880d-4cbc-bf76-ede96998e371", "metadata": {}, "source": [ - "
    \n", - "
  • BigQuery Standard 1TB Storage.
  • \n", - "
  • Model for Chat: 5 request per day. Average Input and Output characters 5,000.
  • \n", - "
  • Total $23.89 per month.
  • \n", - "
" + "If you are following this tutorial in one sitting it will cost $23.89 per month. Completing the process in multiple sessions or using a method different from the tutorial may result in increased costs." ] }, { @@ -176,7 +174,7 @@ "id": "5781a07c-84ed-416c-a782-b1dc881e5c18", "metadata": {}, "source": [ - "Provide the names of the bucket name, source file name path, and destination blob name to upload the CSV source file to the bucket." + "Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. It is important to keep in mind that the name of the bucket has to be unique." ] }, { @@ -186,25 +184,7 @@ "metadata": {}, "outputs": [], "source": [ - "from google.cloud import storage\n", - "\n", - "# The ID of your GCS bucket\n", - "bucket_name = bucket\n", - " # The path to your file to upload\n", - "source_file_name = \"\"\n", - " # The ID of your GCS object\n", - "destination_blob_name = \"\"\n", - "\n", - "storage_client = storage.Client()\n", - "bucket = storage_client.bucket(bucket_name)\n", - "blob = bucket.blob(destination_blob_name)\n", - "blob.upload_from_filename(source_file_name)\n", - "\n", - "print(\n", - " \"File {} uploaded to {}.\".format(\n", - " source_file_name, destination_blob_name\n", - " )\n", - ")\n" + "!gsutil cp '' gs://{bucket}" ] }, { @@ -226,8 +206,9 @@ "\n", "client = bigquery.Client()\n", "\n", + "dataset_name = ''\n", "# Set dataset_id to the ID of the dataset to create.\n", - "dataset_id = project_id.''\n", + "dataset_id = f'{project_id}.{dataset_name}'\n", "\n", "# Construct a full Dataset object to send to the API.\n", "dataset = bigquery.Dataset(dataset_id)\n", @@ -261,7 +242,8 @@ "\n", "client = bigquery.Client()\n", "\n", - "table_id = project_id.dataset_id.\"
\"\n", + "table_name = \"
\"\n", + "table_id = f'{project_id}.{dataset_id}.{table_name}'\n", "\n", "schema = [\n", " bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n", @@ -319,105 +301,34 @@ }, { "cell_type": "markdown", - "id": "dad4f96d-ae52-43f2-8290-c04b25960ea8", + "id": "ad42d610-daa6-4d92-ac14-2d90349dafe1", "metadata": {}, "source": [ - "We can perform a similar check to confirm that the table we created is located within the appropriate dataset." + "We need to use the \"to_gbq\" function from the pandas-gbq library, which allows us to write a Pandas DataFrame to a Google BigQuery table. This enables us to populate our BigQuery table with the data from the DataFrame." ] }, { "cell_type": "code", "execution_count": null, - "id": "80ea826c-9d70-4268-b4db-a657d5cdcaf0", + "id": "7cfd1d66-a0fb-403d-a9b1-27250cc81f91", "metadata": {}, "outputs": [], "source": [ - "# Set dataset_id to the ID of the dataset to fetch.\n", - "\n", - "dataset = client.get_dataset(dataset_id) # Make an API request.\n", - "\n", - "full_dataset_id = \"{}.{}\".format(dataset.project, dataset.dataset_id)\n", - "friendly_name = dataset.friendly_name\n", - "print(\n", - " \"Got dataset '{}' with friendly_name '{}'.\".format(\n", - " full_dataset_id, friendly_name\n", - " )\n", - ")\n", - "\n", - "# View dataset properties.\n", - "print(\"Description: {}\".format(dataset.description))\n", - "print(\"Labels:\")\n", - "labels = dataset.labels\n", - "if labels:\n", - " for label, value in labels.items():\n", - " print(\"\\t{}: {}\".format(label, value))\n", - "else:\n", - " print(\"\\tDataset has no labels defined.\")\n", + "df.to_gbq(table_id, project_id=project_id)\n", "\n", - "# View tables in dataset.\n", - "print(\"Tables:\")\n", - "tables = list(client.list_tables(dataset)) # Make an API request(s).\n", - "if tables:\n", - " for table in tables:\n", - " print(\"\\t{}\".format(table.table_id))\n", - "else:\n", - " print(\"\\tThis dataset does not contain any tables.\")" - ] - }, - { - "cell_type": "markdown", - "id": "736080b9-0cd9-44fd-9580-dadb10a22a59", - "metadata": { - "tags": [] - }, - "source": [ - "To upload the data from our file stored in the bucket, we first need to create a DataFrame. In this example, we will be using a file that contains health screening data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b614cc1a-b318-4364-8d47-1dea9700c792", - "metadata": {}, - "outputs": [], - "source": [ "from google.cloud import storage\n", "import pandas as pd\n", "from io import StringIO\n", "\n", - "# Initialize a client\n", - "client = storage.Client()\n", - "my_bucket = bucket\n", - "storage_client = storage.Client()\n", - "bucket = storage_client.get_bucket(my_bucket)\n", - "blob = bucket.blob(destination_blob_name)\n", - "path = \"\"\n", - "df = pd.read_csv(path)\n", - "df.head()" + "client.load_table_from_dataframe(df, table_id).result()" ] }, { "cell_type": "markdown", - "id": "ad42d610-daa6-4d92-ac14-2d90349dafe1", + "id": "4b16a292-afdc-415c-a8a7-071bc90281c3", "metadata": {}, "source": [ - "We need to use the \"to_gbq\" function from the pandas-gbq library, which allows us to write a Pandas DataFrame to a Google BigQuery table. This enables us to populate our BigQuery table with the data from the DataFrame." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7cfd1d66-a0fb-403d-a9b1-27250cc81f91", - "metadata": {}, - "outputs": [], - "source": [ - "df.to_gbq(table_id, project_id=project_id)\n", - "\n", - "from google.cloud import storage\n", - "import pandas as pd\n", - "from io import StringIO\n", - "\n", - "client.load_table_from_dataframe(df, table_id).result()" + "![image.png](../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)" ] }, { @@ -473,7 +384,7 @@ "id": "b6fb53b8-9f93-4f37-8730-33c5871720fc", "metadata": {}, "source": [ - "To interact with the BigQuery table using a Pythonic domain language, we utilize SQLAlchemy. SQLAlchemy is a Python SQL toolkit that enables developers to access and manage SQL databases, allowing users to write queries as strings or chain Python objects for similar queries. However, to do this, we need specific credentials, which can be accessed through a service account file. For more information on how to create a service account file, please visit the following link: [Service Account Creation](https://cloud.google.com/iam/docs/service-accounts-create#python)" + "To interact with the BigQuery table using a Pythonic domain language, we utilize SQLAlchemy. SQLAlchemy is a Python SQL toolkit that enables developers to access and manage SQL databases, allowing users to write queries as strings or chain Python objects for similar queries. " ] }, { @@ -492,11 +403,8 @@ "from langchain.agents import create_sql_agent\n", "from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n", "from langchain.sql_database import SQLDatabase\n", - "service_account_file = \"\" # Change to where your service account key file is located\n", - "project_id = project_id\n", - "dataset = \"\"\n", - "table = \"
\"\n", - "sqlalchemy_url = f'bigquery://{project_id}/{dataset}?credentials_path={service_account_file}'\n", + "\n", + "sqlalchemy_url = f'bigquery://{project_id}/{dataset}'\n", "print(sqlalchemy_url)" ] }, @@ -505,7 +413,7 @@ "id": "9194cc1d-7ef1-4c4e-8e65-e7d9ada5be24", "metadata": {}, "source": [ - "Next, we import the __ChatVertexAI__ agent from Langchain and configure it with the appropriate hyperparameters. In this instance, we are using the __gemini-1.5-pro__ LLM model, after which we create the SQL agent to enable querying the BigQuery table using a natural language string as a prompt." + "Next, we import the __ChatVertexAI__ agent from Langchain and configure it with the appropriate hyperparameters. In this instance, we are using the __gemini-1.5-pro__ LLM model, after which we create the SQL agent to enable querying the BigQuery table using a natural language string as a prompt. Temperature regulates randomness, with higher temperatures resulting in more varied and unpredictable outputs. Top-k sampling selects from the k most probable next tokens at each step, where a lower k emphasizes higher-probability tokens. The max tokens hyperparameter specifies the maximum number of tokens in the response from the large language model. Max retries indicates how many responses we will receive from the model." ] }, { @@ -561,26 +469,6 @@ "'The table screening has 139920 rows.'" ] }, - { - "cell_type": "markdown", - "id": "65f645ed-8efa-436f-b993-72c45a7aadbe", - "metadata": { - "tags": [] - }, - "source": [ - "We know the output of the model are correct. The image below is a reminder of what the actual results are calculated from." - ] - }, - { - "cell_type": "markdown", - "id": "8160ee6e-f95f-44c3-ab27-4415b44ba461", - "metadata": { - "tags": [] - }, - "source": [ - "![image.png](../../images/gcp_rag_structure_data_01.png)" - ] - }, { "cell_type": "markdown", "id": "f2b62012-b2d2-48c1-ae69-6f8816065551", @@ -668,7 +556,7 @@ "id": "1f0edc0b-f2cc-478c-855f-9acff3792f4d", "metadata": {}, "source": [ - "Please remember to delete or stop your Jupyter notebook and delete your data store to prevent incurring charges. And if you have created any other services like buckets, please remember to delete them as well." + "Please remember to delete or stop your Jupyter notebook and delete your BigQuery dataset and table to prevent incurring charges. And if you have created any other services like buckets, please remember to delete them as well." ] }, { diff --git a/notebooks/GenAI/GCP_RAG_for_Structure_Data_old.ipynb b/notebooks/GenAI/GCP_RAG_for_Structure_Data_old.ipynb new file mode 100644 index 0000000..a7d082e --- /dev/null +++ b/notebooks/GenAI/GCP_RAG_for_Structure_Data_old.ipynb @@ -0,0 +1,710 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6e38b3f2-2b29-4788-a29a-f9e5d10f8365", + "metadata": {}, + "source": [ + "##### __Skill Level:__ Intermediate" + ] + }, + { + "cell_type": "markdown", + "id": "6c0533e3-4fbd-47b1-8458-fd193bcb263f", + "metadata": {}, + "source": [ + "# Creating a Chatbot for Structure Data using a RAG" + ] + }, + { + "cell_type": "markdown", + "id": "76cf4b02-6422-4a48-882c-c00315e34c8a", + "metadata": { + "tags": [] + }, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "id": "7b7870dd-09e0-4f6a-91a9-7c30529af720", + "metadata": {}, + "source": [ + "Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. 
While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n", + "\n", + "In this tutorial, we present a technique that leverages SQL databases. By formulating a query based on the user's request, the model submits this query to the database, providing comprehensive results. This approach not only ensures users receive all pertinent information but also reduces the likelihood of exceeding token limits." + ] + }, + { + "cell_type": "markdown", + "id": "d513a9fe-1ac4-4610-b53c-6696e461f828", + "metadata": {}, + "source": [ + "## Prerequisites" + ] + }, + { + "cell_type": "markdown", + "id": "fbdb09aa-73d6-4cee-b3e7-5d6b243e3d58", + "metadata": {}, + "source": [ + "We assume you have access to Vertex AI and have enabled the necessary APIs.\n", + "\n", + "In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n", + "\n", + "Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role." + ] + }, + { + "cell_type": "markdown", + "id": "eeb24716-0db8-462f-8faa-0b8ad9d61feb", + "metadata": { + "tags": [] + }, + "source": [ + "## Learning objectives" + ] + }, + { + "cell_type": "markdown", + "id": "f5216a43-31ee-4c3e-821a-4b81f4be8e54", + "metadata": {}, + "source": [ + "In this tutorial you will learn about:\n", + "- How to set up a BigQuery dataset and a table.\n", + "- How to load data to a BigQuery table.\n", + "- How to use the langchain ChatVertexAI agent to extract information froim the table. \n" + ] + }, + { + "cell_type": "markdown", + "id": "348910aa-9802-4318-8fcb-86603158923b", + "metadata": { + "tags": [] + }, + "source": [ + "## Pricing" + ] + }, + { + "cell_type": "markdown", + "id": "0dca31b2-880d-4cbc-bf76-ede96998e371", + "metadata": {}, + "source": [ + "
    \n", + "
Estimated cost breakdown:\n",
+    "\n",
+    "- BigQuery standard storage: 1 TB.\n",
+    "- Chat model: 5 requests per day, with an average of 5,000 input and output characters per request.\n",
+    "- Total: $23.89 per month.\n",
+    "
" + ] + }, + { + "cell_type": "markdown", + "id": "c7d9e3ee-1c7e-4baa-a818-c13330c665dd", + "metadata": {}, + "source": [ + "## Get Start" + ] + }, + { + "cell_type": "markdown", + "id": "137defcd-fd2b-421d-9b2e-fe2b878fb325", + "metadata": {}, + "source": [ + "### Install Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c14a38-ba51-466d-90b5-1512f5a7fae0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install google-cloud-bigquery\n", + "!pip install gcloud\n", + "!pip install langchain\n", + "!pip install -U langchain-community\n", + "!pip install -qU langchain_google_vertexai\n", + "!pip install pandas-gbq" + ] + }, + { + "cell_type": "markdown", + "id": "64e22773-9a03-47de-9b6d-7946abcba722", + "metadata": {}, + "source": [ + "Set your project id, location, and bucket variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7193a9ac-3169-45c1-9c5d-ba5e2f31458f", + "metadata": {}, + "outputs": [], + "source": [ + "project_id=''\n", + "location=' (e.g.us-east4)'\n", + "bucket = ''" + ] + }, + { + "cell_type": "markdown", + "id": "df0d9eda-ee54-4c4c-86e6-9166b2b82e2d", + "metadata": {}, + "source": [ + "Create a bucket where the data used for tutorial is going to be placed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b583234-2d92-4e5b-841d-c7d79dd87d45", + "metadata": {}, + "outputs": [], + "source": [ + "# make bucket\n", + "!gsutil mb -l {location} gs://{bucket}" + ] + }, + { + "cell_type": "markdown", + "id": "5781a07c-84ed-416c-a782-b1dc881e5c18", + "metadata": {}, + "source": [ + "Provide the names of the bucket name, source file name path, and destination blob name to upload the CSV source file to the bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4829094e-61a6-4dd7-854f-6b21dc19b1b7", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import storage\n", + "\n", + "# The ID of your GCS bucket\n", + "bucket_name = bucket\n", + " # The path to your file to upload\n", + "source_file_name = \"\"\n", + " # The ID of your GCS object\n", + "destination_blob_name = \"\"\n", + "\n", + "storage_client = storage.Client()\n", + "bucket = storage_client.bucket(bucket_name)\n", + "blob = bucket.blob(destination_blob_name)\n", + "blob.upload_from_filename(source_file_name)\n", + "\n", + "print(\n", + " \"File {} uploaded to {}.\".format(\n", + " source_file_name, destination_blob_name\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "7a00dc83-8713-4ea5-9345-08a9f3d785d0", + "metadata": {}, + "source": [ + "The next step is to create a database in BigQuery. To accomplish this, we need to define a dataset_id and construct a dataset object that will be sent to the API for creation. Note that in BigQuery, a dataset is analogous to a database." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1750e13d-ec05-4335-82a2-ffe54c5b5f32", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "# Set dataset_id to the ID of the dataset to create.\n", + "dataset_id = project_id.''\n", + "\n", + "# Construct a full Dataset object to send to the API.\n", + "dataset = bigquery.Dataset(dataset_id)\n", + "\n", + "# Specify the geographic location where the dataset should reside.\n", + "dataset.location = location\n", + "\n", + "# Send the dataset to the API for creation, with an explicit timeout.\n", + "# Raises google.api_core.exceptions.Conflict if the Dataset already\n", + "# exists within the project.\n", + "dataset = client.create_dataset(dataset, timeout=30) # Make an API request.\n", + "print(\"Created dataset {}.{}\".format(client.project, dataset.dataset_id))" + ] + }, + { + "cell_type": "markdown", + "id": "544c191c-2bc0-4a47-a0a7-cea81035597a", + "metadata": {}, + "source": [ + "Once the dataset is created, the next step is to create a table. First, we define a table ID, followed by outlining the table's schema. Finally, we send an API request to create the table within the dataset established in the previous step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68206b14-1b5e-49c3-9f07-52ba6c70a93e", + "metadata": {}, + "outputs": [], + "source": [ + "# Set table_id to the ID of the table to create.\n", + "\n", + "client = bigquery.Client()\n", + "\n", + "table_id = project_id.dataset_id.\"
\"\n", + "\n", + "schema = [\n", + " bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"age\", \"INTEGER\", mode=\"NULLABLE\"), \n", + " bigquery.SchemaField(\"gender\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"height\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"weight\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_hi\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ap_lo\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cholesterol\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"gluc\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"smoke\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"alco\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"active\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"cardio\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"ageinyr\", \"INTEGER\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmi\", \"FLOAT\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"bmicat\", \"STRING\", mode=\"NULLABLE\"),\n", + " bigquery.SchemaField(\"agegroup\", \"STRING\", mode=\"NULLABLE\"),\n", + "\n", + "]\n", + "\n", + "table = bigquery.Table(table_id, schema=schema)\n", + "table = client.create_table(table) # Make an API request.\n", + "print(\n", + " \"Created table {}.{}.{}\".format(table.project, table.dataset_id, table.table_id)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6e97b4c6-429c-4d36-a4c1-3963b079af6a", + "metadata": {}, + "source": [ + "This step is optional. We can make an API request to verify whether the dataset has been successfully created under our project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72f11cfd-8b20-48bf-b157-aaf791cd3a9f", + "metadata": {}, + "outputs": [], + "source": [ + "datasets = list(client.list_datasets()) # Make an API request.\n", + "project = project_id\n", + "\n", + "if datasets:\n", + " print(\"Datasets in project {}:\".format(project))\n", + " for dataset in datasets:\n", + " print(\"\\t{}\".format(dataset.dataset_id))\n", + "else:\n", + " print(\"{} project does not contain any datasets.\".format(project))" + ] + }, + { + "cell_type": "markdown", + "id": "dad4f96d-ae52-43f2-8290-c04b25960ea8", + "metadata": {}, + "source": [ + "We can perform a similar check to confirm that the table we created is located within the appropriate dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80ea826c-9d70-4268-b4db-a657d5cdcaf0", + "metadata": {}, + "outputs": [], + "source": [ + "# Set dataset_id to the ID of the dataset to fetch.\n", + "\n", + "dataset = client.get_dataset(dataset_id) # Make an API request.\n", + "\n", + "full_dataset_id = \"{}.{}\".format(dataset.project, dataset.dataset_id)\n", + "friendly_name = dataset.friendly_name\n", + "print(\n", + " \"Got dataset '{}' with friendly_name '{}'.\".format(\n", + " full_dataset_id, friendly_name\n", + " )\n", + ")\n", + "\n", + "# View dataset properties.\n", + "print(\"Description: {}\".format(dataset.description))\n", + "print(\"Labels:\")\n", + "labels = dataset.labels\n", + "if labels:\n", + " for label, value in labels.items():\n", + " print(\"\\t{}: {}\".format(label, value))\n", + "else:\n", + " print(\"\\tDataset has no labels defined.\")\n", + "\n", + "# View tables in dataset.\n", + "print(\"Tables:\")\n", + "tables = list(client.list_tables(dataset)) # Make an API request(s).\n", + "if tables:\n", + " for table in tables:\n", + " print(\"\\t{}\".format(table.table_id))\n", + "else:\n", + " print(\"\\tThis dataset does not contain any tables.\")" + ] + }, + { + "cell_type": "markdown", + "id": "736080b9-0cd9-44fd-9580-dadb10a22a59", + "metadata": { + "tags": [] + }, + "source": [ + "To upload the data from our file stored in the bucket, we first need to create a DataFrame. In this example, we will be using a file that contains health screening data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b614cc1a-b318-4364-8d47-1dea9700c792", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import storage\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "# Initialize a client\n", + "client = storage.Client()\n", + "my_bucket = bucket\n", + "storage_client = storage.Client()\n", + "bucket = storage_client.get_bucket(my_bucket)\n", + "blob = bucket.blob(destination_blob_name)\n", + "path = \"\"\n", + "df = pd.read_csv(path)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ad42d610-daa6-4d92-ac14-2d90349dafe1", + "metadata": {}, + "source": [ + "We need to use the \"to_gbq\" function from the pandas-gbq library, which allows us to write a Pandas DataFrame to a Google BigQuery table. This enables us to populate our BigQuery table with the data from the DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7cfd1d66-a0fb-403d-a9b1-27250cc81f91", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_gbq(table_id, project_id=project_id)\n", + "\n", + "from google.cloud import storage\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "client.load_table_from_dataframe(df, table_id).result()" + ] + }, + { + "cell_type": "markdown", + "id": "22b4dd65-9bd9-48ef-9c98-85e3e9a82701", + "metadata": {}, + "source": [ + "This step is also optional. We can execute a simple query on the BigQuery table to count the number of records." 
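+    ,"\n",
+    "One thing to watch: the `FROM` clause must reference the fully qualified `project.dataset.table` name rather than the literal text `project_id.dataset_id.table_id`. A minimal sketch, assuming the `table_id` variable from the earlier steps already holds that fully qualified name:\n",
+    "\n",
+    "```python\n",
+    "# Interpolate the fully qualified table name into the query string.\n",
+    "query = f\"SELECT COUNT(gender) AS cgender FROM `{table_id}`\"\n",
+    "for row in client.query(query).result():\n",
+    "    print(f\"cgender: {row.cgender}\")\n",
+    "```"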
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29e91fc1-735f-4d6c-9fcf-3a2d745363c1", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT count(gender) as cgender\n", + " FROM `project_id.dataset_id.table_id`\n", + "\"\"\"\n", + "\n", + "# Execute the query\n", + "query_job = client.query(query)\n", + "\n", + "# Fetch the results\n", + "results = query_job.result()\n", + "\n", + "# Iterate through the rows\n", + "for row in results:\n", + " #print(f\"gender: {row.name}, age: {row.age}\")\n", + " print(f\"cgender: {row.cgender}\")\n", + "\n", + "try:\n", + " query_job = client.query(query)\n", + " results = query_job.result()\n", + " print(results)\n", + "except Exception as e:\n", + " print(f\"Error executing query: {str(e)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ce65ead3-694d-4d46-9376-38d35b8d0852", + "metadata": {}, + "source": [ + "cgender: 139920\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "b6fb53b8-9f93-4f37-8730-33c5871720fc", + "metadata": {}, + "source": [ + "To interact with the BigQuery table using a Pythonic domain language, we utilize SQLAlchemy. SQLAlchemy is a Python SQL toolkit that enables developers to access and manage SQL databases, allowing users to write queries as strings or chain Python objects for similar queries. However, to do this, we need specific credentials, which can be accessed through a service account file. For more information on how to create a service account file, please visit the following link: [Service Account Creation](https://cloud.google.com/iam/docs/service-accounts-create#python)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1d29b90-7bcd-46b8-82b9-dc5e58f9edee", + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery\n", + "from sqlalchemy import *\n", + "from sqlalchemy.engine import create_engine\n", + "from sqlalchemy.schema import *\n", + "import os\n", + "from langchain_community.agent_toolkits import create_sql_agent\n", + "from langchain.agents import create_sql_agent\n", + "from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n", + "from langchain.sql_database import SQLDatabase\n", + "service_account_file = \"\" # Change to where your service account key file is located\n", + "project_id = project_id\n", + "dataset = \"\"\n", + "table = \"
\"\n", + "sqlalchemy_url = f'bigquery://{project_id}/{dataset}?credentials_path={service_account_file}'\n", + "print(sqlalchemy_url)" + ] + }, + { + "cell_type": "markdown", + "id": "9194cc1d-7ef1-4c4e-8e65-e7d9ada5be24", + "metadata": {}, + "source": [ + "Next, we import the __ChatVertexAI__ agent from Langchain and configure it with the appropriate hyperparameters. In this instance, we are using the __gemini-1.5-pro__ LLM model, after which we create the SQL agent to enable querying the BigQuery table using a natural language string as a prompt." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee1899a2-dd79-4cf4-a02d-c7c35b99040b", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_google_vertexai import VertexAI, ChatVertexAI\n", + "\n", + "llm = ChatVertexAI(\n", + " model=\"gemini-1.5-pro\",\n", + " temperature=0,\n", + " max_tokens=8190,\n", + " timeout=None,\n", + " max_retries=2,\n", + ")\n", + "db = SQLDatabase.from_uri(sqlalchemy_url)\n", + "toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n", + "agent_executor = create_sql_agent(\n", + " llm=llm,\n", + " toolkit=toolkit,\n", + " verbose=True,\n", + " top_k=100000\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3551426-8644-48ba-b8c1-bf7d5eb843b3", + "metadata": {}, + "source": [ + "For our initial query, we check the number of rows to compare the result with what we obtained in a previous step using a simple SQL query against the BigQuery table. We can confirm that we received the same number of rows: 139,920." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9753533-e82b-47af-936a-a401d512308a", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many rows are in the table screening? \")" + ] + }, + { + "cell_type": "markdown", + "id": "099154a1-34e2-4f6b-b146-cec6f5943cf7", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'The table screening has 139920 rows.'" + ] + }, + { + "cell_type": "markdown", + "id": "65f645ed-8efa-436f-b993-72c45a7aadbe", + "metadata": { + "tags": [] + }, + "source": [ + "We know the output of the model are correct. The image below is a reminder of what the actual results are calculated from." + ] + }, + { + "cell_type": "markdown", + "id": "8160ee6e-f95f-44c3-ab27-4415b44ba461", + "metadata": { + "tags": [] + }, + "source": [ + "![image.png](../../images/gcp_rag_structure_data_01.png)" + ] + }, + { + "cell_type": "markdown", + "id": "f2b62012-b2d2-48c1-ae69-6f8816065551", + "metadata": { + "tags": [] + }, + "source": [ + "Next, we pose a question that requires applying a filter, and we successfully obtain the accurate number of obese individuals in the table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5563d317-81ab-4a37-8225-e1b0dce20b36", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many obese are in the table screen?\")" + ] + }, + { + "cell_type": "markdown", + "id": "40e1846d-1114-4ba1-b35a-65b47a838468", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 37204 obese people in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "b96ea414-3f2d-4e66-ab0b-a034dfe507b0", + "metadata": {}, + "source": [ + "As a more complex query, we inquired about the number of female smokers, and the SQL agent accurately returned the answer of 1,626 female smokers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a456ff90-a3a5-4ee4-81da-fba4fcdf7f2d", + "metadata": {}, + "outputs": [], + "source": [ + "agent_executor.run(\"How many smokers are female in the table screen? the value 1 in smoke means that is a smoker and the value 2 in gender means that is a female\")" + ] + }, + { + "cell_type": "markdown", + "id": "7ea7810f-9701-4cb0-9430-181a57929884", + "metadata": {}, + "source": [ + "> Finished chain.\n", + "'There are 1626 female smokers in the table.'" + ] + }, + { + "cell_type": "markdown", + "id": "ca0fb657-19e0-4488-b9f6-25ca3971064f", + "metadata": { + "tags": [] + }, + "source": [ + "## Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "7f85e560-231f-4841-b1d0-334326e4b347", + "metadata": {}, + "source": [ + "You have learned how to create a dataset and a table in BigQuery, as well as how to set up an agent that enables you to query the table using natural language instead of SQL queries, allowing you to obtain the results you need." + ] + }, + { + "cell_type": "markdown", + "id": "41573388-7feb-44b6-9e8a-0511cc8ae62e", + "metadata": {}, + "source": [ + "## Clean Up" + ] + }, + { + "cell_type": "markdown", + "id": "1f0edc0b-f2cc-478c-855f-9acff3792f4d", + "metadata": {}, + "source": [ + "Please remember to delete or stop your Jupyter notebook and delete your data store to prevent incurring charges. And if you have created any other services like buckets, please remember to delete them as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe73dde2-056a-46a3-bacb-0a6671f7128f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m125", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/base-cpu:m125" + }, + "kernelspec": { + "display_name": "Python 3 (Local)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}