Add core functionality for LlamaIndex integration #5
Conversation
@@ -2,7 +2,7 @@
 name = "airtrain-py"
 description = "SDK for interacting with https://airtrain.ai"
 version = "0.0.1"
-requires-python = ">=3.8"
+requires-python = ">=3.8.1, <4.0"
Had to add this because the llama-index package has the <4.0 constraint.
documents = GithubRepositoryReader(...).load_data(branch=branch)

# You can upload documents directly. In this case Airtrain will generate embeddings
result = at.upload_from_llama_nodes(
Other APIs to initiate this upload may be added in subsequent PR(s), but upload_from_llama_nodes will serve as a foundation for their implementation.
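For context, a minimal sketch of how such an upload call might look, assuming `import airtrain as at`, llama-index 0.10-style imports, and that `upload_from_llama_nodes` accepts a sequence of nodes plus a dataset name (the `name` keyword is an assumption, not taken from this diff):

```python
import airtrain as at
from llama_index.core.schema import TextNode  # assumed llama-index 0.10-style import path

# Build a couple of nodes by hand just to have something to upload.
nodes = [
    TextNode(text="Airtrain lets you explore datasets of LLM inputs and outputs."),
    TextNode(text="LlamaIndex nodes carry text plus metadata and relationships."),
]

# Hypothetical call shape: the `name` keyword argument is an assumption.
result = at.upload_from_llama_nodes(nodes, name="Example LlamaIndex nodes")
print(result)
```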
Regardless of what integration route we take with LlamaIndex, we will need to be able to upload from their "nodes" abstraction. This PR adds that capability. It turns out some of the types auto-inferred from the objects you load from LlamaIndex document nodes like this can't be written to Parquet, so there is some logic to sanitize the data and remove problematic columns. Additionally, the relationships and metadata attributes contain nested data that is often useful to have as Airtrain columns, to get insights based on them. This PR contains logic to flatten those attributes into multiple columns.
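As an illustration only (not the PR's actual implementation), flattening nested node attributes into flat, dotted column names could look something like this:

```python
from typing import Any, Dict


def flatten(prefix: str, value: Any, out: Dict[str, Any]) -> None:
    """Recursively flatten nested dicts into dotted column names."""
    if isinstance(value, dict):
        for key, inner in value.items():
            flatten(f"{prefix}.{key}", inner, out)
    else:
        out[prefix] = value


row: Dict[str, Any] = {}
# Hypothetical metadata pulled off a LlamaIndex node.
flatten("metadata", {"file_name": "README.md", "url": "https://example.com"}, row)
# row == {"metadata.file_name": "README.md", "metadata.url": "https://example.com"}
```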
Testing

Without embeddings, full documents:
Driver script:
Note: this was run before relationship/metadata unpacking was implemented.
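The actual driver script is not included in this excerpt; a hypothetical equivalent, assuming llama-index 0.10-style imports and placeholder repository details, could look like:

```python
import os

import airtrain as at
from llama_index.readers.github import GithubClient, GithubRepositoryReader

# Placeholder owner/repo/branch values; the real script's targets are not shown here.
github_client = GithubClient(os.environ["GITHUB_TOKEN"])
reader = GithubRepositoryReader(github_client=github_client, owner="some-org", repo="some-repo")
documents = reader.load_data(branch="main")

# Upload the full documents without chunking; Airtrain generates embeddings in this case.
result = at.upload_from_llama_nodes(documents, name="Full documents")  # `name` kwarg is an assumption
print(result)
```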
resulting dataset
With embeddings, after chunking
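Again as a sketch only (the real test setup isn't shown; this assumes llama-index 0.10-style APIs and an OpenAI embedding model), chunking and embedding before upload could look like:

```python
import airtrain as at
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Stand-in documents; in the real test these came from a repository reader.
documents = [Document(text="Some long document text that will be split into chunks...")]

# Chunk documents into nodes.
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

# Attach embeddings so Airtrain does not need to generate them.
embed_model = OpenAIEmbedding()
for node in nodes:
    node.embedding = embed_model.get_text_embedding(node.get_content())

result = at.upload_from_llama_nodes(nodes, name="Chunked nodes with embeddings")  # `name` kwarg is an assumption
print(result)
```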
resulting dataset