
Add core functionality for LlamaIndex integration #5

Merged
merged 11 commits into from
Sep 16, 2024

Conversation

augray
Member

@augray augray commented Sep 10, 2024

Regardless of which integration route we take with LlamaIndex, we will need to be able to upload from their "nodes" abstraction. This PR adds that capability.

It turns out that some of the auto-inferred types on the objects you load from LlamaIndex document nodes can't be written to Parquet, so this PR includes logic to sanitize the data and remove the problematic columns. Additionally, the `relationships` and `metadata` attributes contain nested data that is often useful to have as Airtrain columns, to get insights based on them. This PR contains logic to flatten those attributes into multiple flat columns.
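For illustration, the two transformations described above could be sketched roughly as follows. This is a hypothetical, simplified version, not the PR's actual implementation: the function names and the choice of "Parquet-friendly" scalar types are assumptions.

```python
# Hypothetical sketch of the two steps described above:
#   1. flatten nested dict attributes (e.g. metadata) into dotted column names
#   2. drop columns whose values can't be written to a Parquet-friendly scalar
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested dicts: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def drop_unwritable(
    rows: List[Dict[str, Any]],
    writable=(str, int, float, bool, type(None)),
) -> List[Dict[str, Any]]:
    """Remove any column that holds a non-scalar value in any row."""
    bad = {k for row in rows for k, v in row.items() if not isinstance(v, writable)}
    return [{k: v for k, v in row.items() if k not in bad} for row in rows]


# Toy example: "embedding" holds an arbitrary object and gets dropped;
# the nested "metadata" dict becomes two flat columns.
rows = drop_unwritable(
    [flatten({"text": "hello", "metadata": {"file_path": "docs/a.md", "size": 12}, "embedding": object()})]
)
print(rows)
```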

Testing

Without embeddings, full documents:

Driver script:

import os

from llama_index.readers.github import GithubRepositoryReader, GithubClient

from airtrain import upload_from_llama_nodes


def main() -> None:
    github_token = os.environ.get("GITHUB_TOKEN")
    owner = "sematic-ai"
    repo = "sematic"
    branch = "main"
    github_client = GithubClient(github_token=github_token, verbose=True)
    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_directories=(
            ["docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=(
            [
                ".md",
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
    ).load_data(branch=branch)
    result = upload_from_llama_nodes(
        documents,
        name="Sematic Docs Via Llama Index",
    )
    print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")


if __name__ == "__main__":
    main()

Note: this run was done before relationship/metadata unpacking was implemented.

resulting dataset

With embeddings, after chunking:

Driver script:

import os

from llama_index.readers.github import GithubRepositoryReader, GithubClient
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

from airtrain import upload_from_llama_nodes


def main() -> None:
    github_token = os.environ.get("GITHUB_TOKEN")
    owner = "sematic-ai"
    repo = "sematic"
    branch = "main"
    github_client = GithubClient(github_token=github_token, verbose=True)
    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_directories=(
            ["docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=(
            [
                ".md",
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
    ).load_data(branch=branch)
    embed_model = OpenAIEmbedding()
    splitter = SemanticSplitterNodeParser(
        buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
    )
    nodes = splitter.get_nodes_from_documents(documents)
    result = upload_from_llama_nodes(
        nodes,
        name="Sematic Docs Via Llama Index, split + embed",
    )
    print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")


if __name__ == "__main__":
    main()

resulting dataset

@augray augray changed the base branch from main to augray/basic-integrations September 11, 2024 18:10
@@ -2,7 +2,7 @@
 name = "airtrain-py"
 description = "SDK for interacting with https://airtrain.ai"
 version = "0.0.1"
-requires-python = ">=3.8"
+requires-python = ">=3.8.1, <4.0"
Member Author


Had to add this because the llama-index package has a `<4.0` upper bound on its Python version constraint.

documents = GithubRepositoryReader(...).load_data(branch=branch)

# You can upload documents directly. In this case Airtrain will generate embeddings
result = at.upload_from_llama_nodes(
Member Author


Other APIs to initiate this upload may be added in subsequent PR(s), but `upload_from_llama_nodes` will serve as a foundation for their implementation.

Base automatically changed from augray/basic-integrations to main September 16, 2024 14:54
@augray augray merged commit e277e81 into main Sep 16, 2024
5 checks passed
@augray augray deleted the augray/llama-index-core branch September 16, 2024 14:56