update implementation notebook
areibman committed Feb 22, 2024
1 parent c761e3f commit 041b65d
Showing 3 changed files with 327 additions and 202 deletions.
327 changes: 327 additions & 0 deletions Example.ipynb
@@ -0,0 +1,327 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import jq\n",
"from jaiqu import validate_schema, translate_schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Desired data format\n",
"\n",
"Create a `jsonschema` dictionary describing the data format you want. Data from your input will be extracted into this format."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"schema = {\n",
"    \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        \"id\": {\n",
"            \"type\": [\"string\", \"null\"],\n",
"            \"description\": \"A unique identifier for the record.\"\n",
"        },\n",
"        \"date\": {\n",
"            \"type\": \"string\",\n",
"            \"description\": \"A string describing the date.\"\n",
"        },\n",
"        \"model\": {\n",
"            \"type\": \"string\",\n",
"            \"description\": \"A text field representing the model used.\"\n",
"        }\n",
"    },\n",
"    \"required\": [\n",
"        \"id\",\n",
"        \"date\"\n",
"    ]\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample input data\n",
"Provide an input JSON dictionary containing the data you want to extract."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"input_json = {\n",
"    \"call.id\": \"123\",\n",
"    \"datetime\": \"2022-01-01\",\n",
"    \"timestamp\": 1640995200,\n",
"    \"Address\": \"123 Main St\",\n",
"    \"user\": {\n",
"        \"name\": \"John Doe\",\n",
"        \"age\": 30,\n",
"        \"contact\": \"[email protected]\"\n",
"    }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Create hints\n",
"The jaiqu agent may not know certain concepts. For example, you might want some keys interpreted a certain way (e.g. interpret \"contact\" as \"email\"). For tricky interpretations, create hints."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"key_hints = \"We are processing outputs containing an id and a date of a user.\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "16260aec353145dbb4055821841a64aa",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Validating schema: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"schema_properties, valid = validate_schema(input_json, schema, key_hints)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify schema\n",
"Verify that the input JSON contains the keys and values requested in your schema."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Schema is valid: True\n",
"----------\n",
"{\n",
" \"id\": {\n",
" \"identified\": true,\n",
" \"key\": \"call.id\",\n",
" \"message\": \"\\\"call.id\\\" | \\\"id\\\" : Even though the names aren't exactly the same, we can infer that both fields aim to serve as identifiers. The embedded '.' in the key could imply a nested property or just a unique notation. Let's consider the likelihood that both fields are for unique identification which aligns with the intended role of \\\"id\\\". Also, looking at the type defined allows string or null which is compatible with the value provided. Extracted key: `call.id`.\",\n",
" \"type\": [\n",
" \"string\",\n",
" \"null\"\n",
" ],\n",
" \"description\": \"A unique identifier for the record.\",\n",
" \"required\": true\n",
" },\n",
" \"date\": {\n",
" \"identified\": true,\n",
" \"key\": \"datetime\",\n",
" \"message\": \"\\\"date\\\" | \\\"datetime\\\" : The field names differ, but \\\"datetime\\\" could potentially contain the date information. Furthermore, the description of \\\"date\\\" suggests it's a string describing the date, which matches the type and content of the \\\"datetime\\\" field. Therefore we can extract the key: `datetime`.\",\n",
" \"type\": \"string\",\n",
" \"description\": \"A string describing the date.\",\n",
" \"required\": true\n",
" },\n",
" \"model\": {\n",
" \"identified\": false,\n",
" \"key\": null,\n",
" \"message\": \"\\\"model\\\" | None: There are no fields in the provided schema that match 'model' or imply a text field representing the model used. Extracted key: `None`\",\n",
" \"type\": \"string\",\n",
" \"description\": \"A text field representing the model used.\",\n",
" \"required\": false\n",
" }\n",
"}\n"
]
}
],
"source": [
"print('Schema is valid:', valid)\n",
"print('-'*10)\n",
"print(json.dumps(schema_properties, indent=2))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "58e46f7bc96c48cba50d4a78ccc4cd65",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Validating schema: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "219652ce841045bdbedf4032e5737faf",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Translating schema: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "465972d7466f48dca0e07ff12688f981",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Retry attempts: 0%| | 0/20 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "639d101d443c4805a34c269b1a36d0f6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Validation attempts: 0%| | 0/20 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"jq_query = translate_schema(input_json, schema, key_hints=key_hints, max_retries=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finalized jq query"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'{ \"id\": (.\"call.id\"? // \"None\"), \"date\": (.datetime? // null) }'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jq_query"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the jq query results"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'id': '123', 'date': '2022-01-01'}]\n"
]
}
],
"source": [
"result = jq.compile(jq_query).input(input_json).all()\n",
"print(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
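
The report printed by the `Verify schema` cell maps each schema property to the input key the agent matched. A minimal sketch of how that report could be summarized follows, assuming the dictionary shape shown in the notebook output (`identified`, `key`, `required`). The helper names `key_mapping` and `missing_required` are hypothetical and not part of jaiqu, and the `message` fields are omitted for brevity.

```python
# Report shape copied from the notebook output ("message" fields omitted).
schema_properties = {
    "id": {"identified": True, "key": "call.id", "required": True},
    "date": {"identified": True, "key": "datetime", "required": True},
    "model": {"identified": False, "key": None, "required": False},
}


def key_mapping(report):
    """Map each identified schema property to the input key it was matched to."""
    return {name: info["key"] for name, info in report.items() if info["identified"]}


def missing_required(report):
    """List required schema properties that were not found in the input."""
    return [
        name
        for name, info in report.items()
        if info["required"] and not info["identified"]
    ]


mapping = key_mapping(schema_properties)
missing = missing_required(schema_properties)
print(mapping)
print(missing)
```

In this run `model` is unmatched but optional, so no required property is missing, which is consistent with the `Schema is valid: True` line in the notebook output.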

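
The finalized query relies on jq's alternative operator `//`, which yields its right-hand side when the left-hand side produces `null` or `false`. The sketch below mirrors that behaviour in plain Python for the query shown in the notebook. `alternative` and `apply_query` are hypothetical helper names, and the code is a hand-written approximation for illustration, not how jaiqu or jq evaluate the query.

```python
def alternative(value, default):
    """Approximate jq's `value // default` for a single value."""
    return default if value is None or value is False else value


def apply_query(record):
    """Hand-written equivalent of the notebook's finalized jq query."""
    return {
        "id": alternative(record.get("call.id"), "None"),
        "date": alternative(record.get("datetime"), None),
    }


# Subset of the notebook's sample input.
input_json = {"call.id": "123", "datetime": "2022-01-01", "timestamp": 1640995200}
result = apply_query(input_json)
empty_result = apply_query({})
print(result)
print(empty_result)
```

Note that the generated query falls back to the string `"None"` for `id`, not to JSON `null`, so a record without `call.id` still carries a string value in that field.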

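
The last code cell prints the query result but does not check it against the target schema. A minimal sketch of such a check is below, using only the standard library. `check_record` is a hypothetical helper that covers just the `required` and `type` keywords used in this notebook, so a full validator such as the `jsonschema` package would be the more complete choice.

```python
# JSON Schema type names mapped to Python types (only those used in the notebook).
JSON_TYPES = {"string": str, "null": type(None)}

# The notebook's schema, with descriptions omitted.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": ["string", "null"]},
        "date": {"type": "string"},
        "model": {"type": "string"},
    },
    "required": ["id", "date"],
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            continue  # fields not listed in the schema are ignored here
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(isinstance(value, JSON_TYPES[t]) for t in allowed):
            problems.append(f"wrong type for field: {name}")
    return problems


result = [{"id": "123", "date": "2022-01-01"}]  # output of the final notebook cell
problems = [check_record(record, schema) for record in result]
bad = check_record({"id": "123", "date": None}, schema)
print(problems)
print(bad)
```

The second call shows why the check matters: the query's `date` fallback is `null`, which the schema does not allow for that field.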