Showing 3 changed files with 327 additions and 202 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,327 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import json\n", | ||
"import jq\n", | ||
"from jaiqu import validate_schema, translate_schema" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Desired data format \n", | ||
"\n", | ||
"Create a `jsonschema` dictionary for the format of data you want. Data from your input will be extracted into this format." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"schema = {\n", | ||
" \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n", | ||
" \"type\": \"object\",\n", | ||
" \"properties\": {\n", | ||
" \"id\": {\n", | ||
" \"type\": [\"string\", \"null\"],\n", | ||
" \"description\": \"A unique identifier for the record.\"\n", | ||
" },\n", | ||
" \"date\": {\n", | ||
" \"type\": \"string\",\n", | ||
" \"description\": \"A string describing the date.\"\n", | ||
" },\n", | ||
" \"model\": {\n", | ||
" \"type\": \"string\",\n", | ||
" \"description\": \"A text field representing the model used.\"\n", | ||
" }\n", | ||
" },\n", | ||
" \"required\": [\n", | ||
" \"id\",\n", | ||
" \"date\"\n", | ||
" ]\n", | ||
"}" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Sample input data\n", | ||
"Provide an input JSON dictionary containing the data you want to extract." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"input_json = {\n", | ||
" \"call.id\": \"123\",\n", | ||
" \"datetime\": \"2022-01-01\",\n", | ||
" \"timestamp\": 1640995200,\n", | ||
" \"Address\": \"123 Main St\",\n", | ||
" \"user\": {\n", | ||
" \"name\": \"John Doe\",\n", | ||
" \"age\": 30,\n", | ||
" \"contact\": \"[email protected]\"\n", | ||
" }\n", | ||
"}" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### (Optional) Create hints\n", | ||
"The jaiqu agent may not know certain concepts. For example, you might want some keys interpreted a certain way (e.g. interpret \"contact\" as \"email\"). For tricky interpretations, create hints." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"key_hints = \"We are processing outputs containing an id and a date of a user.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "16260aec353145dbb4055821841a64aa", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Validating schema: 0%| | 0/3 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"schema_properties, valid = validate_schema(input_json, schema, key_hints)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Verify schema\n", | ||
"Verify that the input JSON contains the keys and values requested in your schema." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Schema is valid: True\n", | ||
"----------\n", | ||
"{\n", | ||
" \"id\": {\n", | ||
" \"identified\": true,\n", | ||
" \"key\": \"call.id\",\n", | ||
" \"message\": \"\\\"call.id\\\" | \\\"id\\\" : Even though the names aren't exactly the same, we can infer that both fields aim to serve as identifiers. The embedded '.' in the key could imply a nested property or just a unique notation. Let's consider the likelihood that both fields are for unique identification which aligns with the intended role of \\\"id\\\". Also, looking at the type defined allows string or null which is compatible with the value provided. Extracted key: `call.id`.\",\n", | ||
" \"type\": [\n", | ||
" \"string\",\n", | ||
" \"null\"\n", | ||
" ],\n", | ||
" \"description\": \"A unique identifier for the record.\",\n", | ||
" \"required\": true\n", | ||
" },\n", | ||
" \"date\": {\n", | ||
" \"identified\": true,\n", | ||
" \"key\": \"datetime\",\n", | ||
" \"message\": \"\\\"date\\\" | \\\"datetime\\\" : The field names differ, but \\\"datetime\\\" could potentially contain the date information. Furthermore, the description of \\\"date\\\" suggests it's a string describing the date, which matches the type and content of the \\\"datetime\\\" field. Therefore we can extract the key: `datetime`.\",\n", | ||
" \"type\": \"string\",\n", | ||
" \"description\": \"A string describing the date.\",\n", | ||
" \"required\": true\n", | ||
" },\n", | ||
" \"model\": {\n", | ||
" \"identified\": false,\n", | ||
" \"key\": null,\n", | ||
" \"message\": \"\\\"model\\\" | None: There are no fields in the provided schema that match 'model' or imply a text field representing the model used. Extracted key: `None`\",\n", | ||
" \"type\": \"string\",\n", | ||
" \"description\": \"A text field representing the model used.\",\n", | ||
" \"required\": false\n", | ||
" }\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Schema is valid:',valid)\n", | ||
"print('-'*10)\n", | ||
"print(json.dumps(schema_properties, indent=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "58e46f7bc96c48cba50d4a78ccc4cd65", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Validating schema: 0%| | 0/3 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "219652ce841045bdbedf4032e5737faf", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Translating schema: 0%| | 0/2 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "465972d7466f48dca0e07ff12688f981", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Retry attempts: 0%| | 0/20 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "639d101d443c4805a34c269b1a36d0f6", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Validation attempts: 0%| | 0/20 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"jq_query = translate_schema(input_json, schema, key_hints=key_hints, max_retries=20)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Finalized jq query" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'{ \"id\": (.\"call.id\"? // \"None\"), \"date\": (.datetime? // null) }'" | ||
] | ||
}, | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"jq_query" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Check the jq query results" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"[{'id': '123', 'date': '2022-01-01'}]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"result = jq.compile(jq_query).input(input_json).all()\n", | ||
"print(result)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
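For readers unfamiliar with jq, the finalized query in the notebook can be read as a plain-Python mapping. The block below is a minimal sketch and not part of jaiqu or the `jq` bindings: the helper names `alt` and `apply_query` are hypothetical, and the sketch assumes the input is a flat dictionary. It relies on jq's `//` operator falling back to the right-hand side when the left-hand side is `null` or `false`.

```python
def alt(value, default):
    """Mimic jq's `//` operator: fall back when value is null (None) or false."""
    return default if value is None or value is False else value


def apply_query(record):
    """Hypothetical Python equivalent of the notebook's finalized jq query:
    { "id": (."call.id"? // "None"), "date": (.datetime? // null) }
    """
    return {
        # ."call.id" is a single top-level key that contains a dot, not a nested path.
        "id": alt(record.get("call.id"), "None"),
        "date": alt(record.get("datetime"), None),
    }


# Top-level keys taken from the notebook's sample input.
input_json = {
    "call.id": "123",
    "datetime": "2022-01-01",
    "timestamp": 1640995200,
    "Address": "123 Main St",
}

result = apply_query(input_json)
print(result)

# With both keys absent, "id" falls back to the string "None" and "date" to None.
print(apply_query({}))
```

Note that the fallback for `id` is the string `"None"` rather than a JSON `null`, exactly as the generated query specifies, so a missing identifier still satisfies the schema's `["string", "null"]` type.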