Commit

Merge branch 'main' into chat-wrapper
pippo-sci authored Apr 1, 2024
2 parents 26ce4aa + 9e8012d commit 03fbda5
Showing 12 changed files with 5,123 additions and 29 deletions.
37 changes: 12 additions & 25 deletions README.md
@@ -9,16 +9,13 @@ This repository contains scripts for a chatbot that leverages artificial intelligence

- Also contains `tables.json`, which lists the available cubes, with their descriptions, column names, and relevant details.

### 2. **`src/utils/`**
- Houses all the main scripts to run the chatbot.

- **Subfolders:**
1. **`api_data_request/`**
- Core scripts responsible for constructing the API URL. Contains functions for processing query cuts and matching values with their respective IDs.

2. **`data_analysis/`**
- Contains scripts used for data analysis (mainly using [LangChain](https://python.langchain.com/docs/get_started/introduction)).

@@ -31,16 +28,6 @@ This repository contains scripts for a chatbot that leverages artificial intelligence
3. **`helpers/`**
- Stores scripts to ingest cubes and drilldowns into a database. Also contains a script to map the Tesseract schema to the custom `tables.json` format needed to run the chat.

4. **`preprocessors/`**
- Contains scripts that preprocess text (or any other data type as needed).

5. **`table_selection/`**
- All scripts needed to look up and manage the relevant cube that contains the data needed to answer the user's query.

## General Workflow

### 1. Table Selection
@@ -59,9 +46,6 @@ This repository contains scripts for a chatbot that leverages artificial intelligence
- **Option 4: [in progress]**
- Will receive the table name from the wrapper.

2. All of the above functions return the name of the most relevant table. The app currently works with Option 3, as sketched below.
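
For intuition, a minimal sketch of this embedding-based lookup follows. It assumes the **datausa_tables.cubes** table described under "Adding Cubes" below and a pgvector-enabled Postgres; `select_table()` is a hypothetical helper, not the repository's actual code:

```python
# Hedged sketch: pick the most relevant cube by comparing the query's
# embedding against the stored table-description embeddings.
# Assumes datausa_tables.cubes (table_name, table_description,
# embedding vector(384)) on a pgvector-enabled Postgres.
from sentence_transformers import SentenceTransformer

from config import POSTGRES_ENGINE

# Same model used when ingesting table descriptions.
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

def select_table(natural_language_query: str) -> str:
    emb = model.encode([natural_language_query])[0].tolist()
    # pgvector's <=> operator is cosine distance: smaller means more similar.
    row = POSTGRES_ENGINE.execute(
        "SELECT table_name FROM datausa_tables.cubes "
        "ORDER BY embedding <=> %(emb)s LIMIT 1",
        {"emb": str(emb)},
    ).fetchone()
    return row[0]
```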

### 2. API URL Generator & Data Request
@@ -83,16 +67,12 @@ This repository contains scripts for a chatbot that leverages artificial intelligence

4. Instantiates an ApiBuilder object and sets the variables, measures, and cuts provided by the LLM as attributes using the class methods.

5. For the cuts, a similarity search is done over the corresponding dimension members of the cube to extract their IDs from the database (with the `cuts_processing()` function).

6. The API URL (for Mondrian or Tesseract) is built from the processed cuts, drilldowns, and measures obtained in the previous steps by running the `build_url()` method.

7. The data is retrieved from the API using the `fetch_data()` method and stored in a pandas DataFrame (see the sketch below).
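
Putting steps 4 through 7 together, a hypothetical end-to-end call could look like the following sketch. The class and function names come from the steps above, but every signature and value here is an assumption:

```python
# Hypothetical sketch of steps 4-7; ApiBuilder, cuts_processing(),
# build_url() and fetch_data() are the names used above, but their
# exact signatures and the example values are assumptions.
builder = ApiBuilder(cube="Data_USA_House_election")  # assumed constructor
builder.set_drilldowns(["Year", "State"])             # variables chosen by the LLM
builder.set_measures(["Candidate Votes"])             # assumed measure name
builder.set_cuts({"State": "California"})             # raw cut values from the LLM

cuts_processing(builder)    # step 5: map member names to IDs via similarity search
url = builder.build_url()   # step 6: Tesseract- or Mondrian-style API URL
df = builder.fetch_data()   # step 7: pandas DataFrame with the response data
```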


### 3. Data Analysis/Processing

@@ -109,14 +89,22 @@ Currently, the cubes available to be queried by the chatbot are:
- Data_USA_House_election
- [in progress] pums_5

To add a new cube, the steps are:

1. Add the cube to the `tables.json` file. The following fields must be filled:
- name
- api (Tesseract or Mondrian)
- description
- measures, for example:
```json
{
"name": "Millions Of Dollars",
"description": "value in millions of dollars of a shipment"
}
```

- dimensions
- Add each hierarchy separately, filling the following fields for each:
```json
{
    "name": "Time",
    "description": "Periodicity of the data (monthly or annual).",
    "hierarchies": [
        {
            ...
        }
    ]
}
```

2. Add the cube to the database (**datausa_tables.cubes**), filling the following columns (you can use the `cube_to_db.py` script):
- table_name
- table_description
- embedding (the table's description encoded as a 384-dimensional vector with the `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` model)

3. Add drilldown members & ids to the db (**datausa_drilldowns.drilldowns**)
- This process can be initiated by executing the `drilldowns_to_db.py` script. During execution, the code will prompt for the API URL to fetch the drilldown members and IDs. Then, it will request the measure name in order to remove it from the dataframe before loading the data to the database.
- The script then appends a column containing embeddings generated from the drilldown names using the same embedding model mentioned before.
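
For illustration, the loader could also be called directly, as in the sketch below. The endpoint URL and measure name are invented for the example; the real script collects them through its interactive prompts:

```python
# Hypothetical invocation of the loader in drilldowns_to_db.py; the URL
# and measure name are made up for illustration. get_api_params() reads
# the `cube` and `drilldowns` query parameters from the URL.
api_url = (
    "https://api.example.org/tesseract/data.jsonrecords"
    "?cube=Data_USA_House_election"
    "&drilldowns=State"
    "&measures=Candidate+Votes"
)
load_data_to_db(api_url, measure_name="Candidate Votes")
```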
2 changes: 2 additions & 0 deletions api/src/config.py
@@ -36,8 +36,10 @@
TESSERACT_API = getenv("TESSERACT_API")

# Mondrian Connection

MONDRIAN_API = getenv('MONDRIAN_API')

# Files Directories
TABLES_PATH = getenv('TABLES_PATH')
FEW_SHOT_PATH = getenv('FEW_SHOT_PATH')

3 changes: 3 additions & 0 deletions api/src/main.py
@@ -9,6 +9,7 @@

# fastapi instance declaration
app = FastAPI()

# api functions
@app.get("/")
async def root():
@@ -21,6 +22,7 @@ async def root():
async def wrap(query):
return StreamingResponse(Langbot(query, get_api, [], TABLES_PATH), media_type="application/json")


@app.get("/query/{query}")
async def read_item(query: str):
api_url, data, text_response = get_api(query, TABLES_PATH)
@@ -58,3 +60,4 @@ def fn2():
def num():
return StreamingResponse(fn2(), media_type="application/json")


3 changes: 2 additions & 1 deletion api/src/utils/app.py
@@ -34,4 +34,5 @@ def get_api(query, TABLES_PATH):

if __name__ == "__main__":
TABLES_PATH = getenv('TABLES_PATH')
get_api('How much did the CPI of fresh fruits change between 2019 and 2021', TABLES_PATH)

3 changes: 2 additions & 1 deletion api/src/utils/data_analysis/data_analysis.py
@@ -2,6 +2,7 @@
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_community.chat_models import ChatOpenAI


def agent_answer(df, natural_language_query):

prompt = (
Expand All @@ -21,7 +22,7 @@ def agent_answer(df, natural_language_query):
)

llm = ChatOpenAI(model_name='gpt-4-1106-preview', temperature=0, openai_api_key=OPENAI_KEY)

agent = create_pandas_dataframe_agent(llm, df, verbose=True)
response = agent.run(prompt)

1 change: 0 additions & 1 deletion api/src/utils/few_shot_examples.py
@@ -4,7 +4,6 @@

from config import FEW_SHOT_PATH


few_shot_examples = {}
with open(FEW_SHOT_PATH, "r") as f:
few_shot_examples = json.load(f)
40 changes: 40 additions & 0 deletions api/src/utils/helpers/cube_to_db.py
@@ -0,0 +1,40 @@
import pandas as pd

from config import POSTGRES_ENGINE
from sentence_transformers import SentenceTransformer

def embedding(dataframe, column):
"""
Creates embeddings for text in the passed column
"""
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

model_embeddings = model.encode(dataframe[column].to_list())
dataframe['embedding'] = model_embeddings.tolist()

return dataframe


def create_table():
POSTGRES_ENGINE.execute("CREATE TABLE IF NOT EXISTS datausa_tables.cubes (table_name text, table_description text, embedding vector(384))")
return


def load_data_to_db(df):

print(df.head())

df_embeddings = embedding(df, 'table_description')
df_embeddings.to_sql('cubes', con=POSTGRES_ENGINE, if_exists='append', index=False, schema='datausa_tables')

return


# Example registration: a one-row dataframe describing the House election cube.
df = pd.DataFrame()

df["table_name"] = ["Data_USA_House_election"]
df['table_description'] = ["Table 'Data_USA_House_election' contains House election data, including number of votes by candidate, party and state."]

create_table()

load_data_to_db(df)
79 changes: 79 additions & 0 deletions api/src/utils/helpers/drilldowns_to_db.py
@@ -0,0 +1,79 @@
import pandas as pd
import requests
import urllib.parse

from config import POSTGRES_ENGINE
from sentence_transformers import SentenceTransformer

def embedding(dataframe, column):
"""
Creates embeddings for text in the passed column
"""
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

model_embeddings = model.encode(dataframe[column].to_list())
dataframe['embedding'] = model_embeddings.tolist()

return dataframe


def create_table():
POSTGRES_ENGINE.execute("CREATE TABLE IF NOT EXISTS datausa_drilldowns.drilldowns (product_id text, product_name text, cube_name text, drilldown text, embedding vector(384))")
return


def get_data_from_api(api_url):
try:
r = requests.get(api_url)
df = pd.DataFrame.from_dict(r.json()['data'])
except Exception as e: raise ValueError(f'Invalid API url: {api_url}') from e

return df


def get_api_params(api_url):
parsed_url = urllib.parse.urlparse(api_url)
query_params = urllib.parse.parse_qs(parsed_url.query)

cube = query_params.get('cube', [''])[0]
drilldown = query_params.get('drilldowns', [''])[0]

cube_name = cube.replace('+', ' ')
drilldown = drilldown.replace('+', ' ')

return cube_name, drilldown


def load_data_to_db(api_url, measure_name):
cube_name, drilldown = get_api_params(api_url)
df = get_data_from_api(api_url=api_url)

df.rename(columns={f"{drilldown}": "drilldown_name", f"{drilldown} ID": "drilldown_id"}, inplace=True)

df['cube_name'] = f"{cube_name}"
df['drilldown'] = f"{drilldown}"
df.drop(f"{measure_name}", axis=1, inplace=True)

if 'drilldown_id' not in df.columns:
df['drilldown_id'] = df['drilldown']

df.replace('', pd.NA, inplace=True)
df.dropna(subset=['drilldown_name', 'drilldown_id'], how='all', inplace=True)

print(df.head())

#df_embeddings = embedding(df, 'product_name')
#df_embeddings.to_sql('drilldowns', con=POSTGRES_ENGINE, if_exists='append', index=False, schema='datausa_drilldowns')

return


print("Enter API url: ")
api_url = input()
print("Enter measure name: ")
measure_name = input()
#df = pd.read_csv('/Users/alexandrabjanes/Datawheel/CODE/datausa-chat/tables.csv')
#print(df.head())

#create_table()
load_data_to_db(api_url, measure_name = measure_name)
1 change: 1 addition & 0 deletions api/src/utils/helpers/schema_to_json.py
@@ -6,6 +6,7 @@ def parse_xml_to_json(xml_file):
"""
Parses XML schema to custom json format.
"""

tree = ET.parse(xml_file)
root = tree.getroot()

