Commit

update changes 10/23/2024

cetrarom2 committed Oct 23, 2024
2 parents 8e1a044 + 51dec5d commit 3c481a2

Showing 4 changed files with 128 additions and 1,442 deletions.
@@ -33,7 +33,7 @@
"source": [
"Generative AI (GenAI) represents a transformative technology capable of producing human-like text, images, code, and various other types of content. While much of the focus has been on unstructured data—such as PDFs, text documents, image files, and websites—many GenAI implementations rely on a parameter called \"top K.\" This algorithm retrieves only the highest-scoring pieces of content relevant to a user's query, which can be limiting. Users seeking insights from structured data formats like CSV and JSON often require access to all relevant occurrences, rather than just a subset.\n",
"\n",
"In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools."
"In this tutorial, we use a RAG agent together with a VertexAI chat model gemini-1.5-pro to query the BigQuery table. A Retrieval Augmented Generation (RAG) agent is a key part of a RAG application that enhances the capabilities of the large language models (LLMs) by integrating external data retrieval. AI agents empower LLMs to interact with the world through actions and tools. The agent connects to BigQuery, runs the query, and then sends the results to the model. The model processes the data and generates the final output that it is displayed for the user.\n"
]
},
{
@@ -53,7 +53,7 @@
"\n",
"In this tutorial, we will be using Google Gemini Pro 1.5, which does not require deployment. However, if you prefer to use a different model, you can select one from the Model Garden via the console. This will allow you to add the model to your registry, create an endpoint (or utilize an existing one), and deploy the model—all in a single step. Here is a link for more information: [model deployment](https://cloud.google.com/vertex-ai/docs/general/deployment).\n",
"\n",
"Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role.\n",
"Before we begin, you'll need to create a Vertex AI RAG Data Service Agent service account. To do this, go to the IAM section of the console. Ensure you check the box for \"Include Google-provided role grant.\" If the role is not listed, click \"Grant Access\" and add \"Vertex AI RAG Data Service Agent\" as a role. Additionally, to create datasets and tables in BigQuery, the user must have the __bigquery.datasets.create__ and __bigquery.tables.create__ permissions.\n",
"\n",
"We are using a e2-standard-4 (Efficient Instance: 4 vCPUs, 16 GB RAM) for this tutorial. "
]
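},
{
"cell_type": "markdown",
"id": "3a7c9b21-5f4e-4d2a-9c1b-7e8f6a0d2b4c",
"metadata": {},
"source": [
"One way to grant these permissions is through a predefined role such as __roles/bigquery.dataEditor__, which includes both. Below is a minimal sketch; the project ID and user email are placeholders, not values from this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b2d4f6a-1c3e-4a5b-8d7c-9e0f1a2b3c4d",
"metadata": {},
"outputs": [],
"source": [
"# Grant a role that carries the dataset- and table-creation permissions (placeholders below).\n",
"!gcloud projects add-iam-policy-binding <Project ID> --member=\"user:<User Email>\" --role=\"roles/bigquery.dataEditor\""
]
},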
@@ -155,7 +155,7 @@
"id": "df0d9eda-ee54-4c4c-86e6-9166b2b82e2d",
"metadata": {},
"source": [
"Create a bucket where the data used for tutorial is going to be placed."
"Create a bucket where the data used for tutorial is going to be placed. It is important to keep in mind that the name of the bucket has to be unique."
]
},
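{
"cell_type": "markdown",
"id": "5e6f7a8b-9c0d-4e1f-a2b3-c4d5e6f7a8b9",
"metadata": {},
"source": [
"A minimal sketch of the bucket-creation step with the Cloud Storage client; the bucket name and location below are placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f2e3d4c-5b6a-4798-8f9e-0a1b2c3d4e5f",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"\n",
"storage_client = storage.Client()\n",
"# Bucket names share a global namespace, so this must not collide with an existing bucket.\n",
"new_bucket = storage_client.create_bucket(\"<Bucket Name>\", location=\"us-central1\")\n",
"print(f\"Created bucket {new_bucket.name}\")"
]
},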
{
@@ -174,7 +174,9 @@
"id": "5781a07c-84ed-416c-a782-b1dc881e5c18",
"metadata": {},
"source": [
"Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. It is important to keep in mind that the name of the bucket has to be unique."
"Once the bucket is created, we need to access the CSV source file. In this tutorial, I transferred the data file to our Jupyter notebook by simply dragging and dropping it from my local folder. Next, we need to specify the bucket name and the path of the data source in order to upload the CSV file to the bucket. \n",
"\n",
"We are using the [health screening](https://www.kaggle.com/datasets/drateendrajha/health-screening-data) dataset from kaggle for this tutorial."
]
},
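{
"cell_type": "markdown",
"id": "7c8d9e0f-1a2b-4c3d-9e5f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"A minimal sketch of the upload step with the Cloud Storage client; the bucket name and file paths below are placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e-6f7a-4b8c-9d0e-1f2a3b4c5d6e",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.bucket(\"<Bucket Name>\")\n",
"# Upload the local CSV into the bucket under the chosen object name.\n",
"blob = bucket.blob(\"<File Name>.csv\")\n",
"blob.upload_from_filename(\"<Local Path>/<File Name>.csv\")"
]
},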
{
@@ -243,7 +245,7 @@
"client = bigquery.Client()\n",
"\n",
"table_name = \"<Table Name>\"\n",
"table_id = f'{project_id}.{dataset_id}.{table_name}'\n",
"table_id = f'{project_id}.{dataset_name}.{table_name}'\n",
"\n",
"schema = [\n",
" bigquery.SchemaField(\"id\", \"INTEGER\", mode=\"NULLABLE\"),\n",
@@ -273,6 +275,37 @@
")"
]
},
{
"cell_type": "markdown",
"id": "4e18540d-defa-414b-8b78-b86bce3dd46a",
"metadata": {},
"source": [
"As an option, the following code will auto detect and upload the data to the table in one step."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "671aa2a3-d259-42e3-8cc3-8ab6b1e4a7d1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"table_id = f'{dataset_id}.{table_name}'\n",
"\n",
"job_config = bigquery.LoadJobConfig(\n",
" autodetect=True, source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1,\n",
")\n",
"uri = f\"gs://{bucket}/ds_salaries.csv\"\n",
"load_job = client.load_table_from_uri(\n",
" uri, table_id, job_config=job_config\n",
") # Make an API request.\n",
"load_job.result() # Waits for the job to complete.\n",
"destination_table = client.get_table(table_id)\n",
"print(\"Loaded {} rows.\".format(destination_table.num_rows))"
]
},
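{
"cell_type": "markdown",
"id": "9d0e1f2a-3b4c-4d5e-8f6a-7b8c9d0e1f2a",
"metadata": {},
"source": [
"As a quick sanity check (a sketch reusing `client` and `table_id` from the cells above), count the rows that landed in the table."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e5f6a7b-8c9d-4e0f-a1b2-c3d4e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"# Count rows to confirm the load succeeded.\n",
"query = f\"SELECT COUNT(*) AS row_count FROM `{table_id}`\"\n",
"print(client.query(query).to_dataframe())"
]
},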
{
"cell_type": "markdown",
"id": "6e97b4c6-429c-4d36-a4c1-3963b079af6a",
@@ -299,6 +332,30 @@
" print(\"{} project does not contain any datasets.\".format(project))"
]
},
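{
"cell_type": "markdown",
"id": "6a7b8c9d-0e1f-4a2b-8c3d-4e5f6a7b8c9d",
"metadata": {},
"source": [
"For reference, a minimal end-to-end sketch of the dataset listing, following the standard BigQuery client pattern; the collapsed cell above may differ in detail."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e1f2a3b-4c5d-4e6f-9a7b-8c9d0e1f2a3b",
"metadata": {},
"outputs": [],
"source": [
"# List every dataset in the current project.\n",
"datasets = list(client.list_datasets())\n",
"project = client.project\n",
"\n",
"if datasets:\n",
"    print(\"Datasets in project {}:\".format(project))\n",
"    for dataset in datasets:\n",
"        print(\"\\t{}\".format(dataset.dataset_id))\n",
"else:\n",
"    print(\"{} project does not contain any datasets.\".format(project))"
]
},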
{
"cell_type": "markdown",
"id": "0c87a0f1-de43-4cce-a5e2-8952ce2283a0",
"metadata": {},
"source": [
"We load our dataset located in our bucket to a pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2486d4e3-b462-42de-85ab-3c6d5a8ba363",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import storage\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"path = \"<gsutil URI>\"\n",
"df = pd.read_csv(path)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "ad42d610-daa6-4d92-ac14-2d90349dafe1",
@@ -316,10 +373,6 @@
"source": [
"df.to_gbq(table_id, project_id=project_id)\n",
"\n",
"from google.cloud import storage\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"client.load_table_from_dataframe(df, table_id).result()"
]
},
@@ -328,7 +381,7 @@
"id": "4b16a292-afdc-415c-a8a7-071bc90281c3",
"metadata": {},
"source": [
"![image.png](../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)"
"![image.png](../../../NIHCloudLabGCP/images/gcp_rag_structure_data_01.png)"
]
},
{
@@ -404,7 +457,7 @@
"from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n",
"from langchain.sql_database import SQLDatabase\n",
"\n",
"sqlalchemy_url = f'bigquery://{project_id}/{dataset}'\n",
"sqlalchemy_url = f'bigquery://{project_id}/{dataset_name}'\n",
"print(sqlalchemy_url)"
]
},
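{
"cell_type": "markdown",
"id": "8c9d0e1f-2a3b-4c4d-8e5f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"To complete the picture, a minimal sketch of wiring the toolkit into a SQL agent and asking a question. The model name and the sample question are assumptions, and the __sqlalchemy-bigquery__ package must be installed for the bigquery:// URL to work."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b4c5d6e-7f8a-4b9c-8d0e-1f2a3b4c5d6e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents import create_sql_agent\n",
"from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n",
"from langchain.sql_database import SQLDatabase\n",
"from langchain_google_vertexai import ChatVertexAI\n",
"\n",
"db = SQLDatabase.from_uri(sqlalchemy_url)\n",
"llm = ChatVertexAI(model_name=\"gemini-1.5-pro\")\n",
"toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n",
"\n",
"# The agent turns the question into SQL, runs it against BigQuery,\n",
"# and hands the results back to the model to compose the final answer.\n",
"agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)\n",
"agent_executor.invoke({\"input\": \"How many rows does the table contain?\"})"
]
},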