
Add core functionality for LlamaIndex integration #5

Merged
merged 11 commits into from
Sep 16, 2024

Conversation

augray
Member

@augray augray commented Sep 10, 2024

Regardless of which integration route we take with LlamaIndex, we will need to be able to upload from their "nodes" abstraction. This PR adds that capability.

It turns out that some of the auto-inferred types on the objects you load from LlamaIndex document nodes can't be written to Parquet, so this PR includes logic to sanitize the data and remove the problematic columns. Additionally, the `relationships` and `metadata` attributes contain nested data that is often useful to have as Airtrain columns, to get insights based on them. This PR contains logic to flatten those attributes into multiple flat columns.
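For illustration, the two transformations described above could be sketched roughly as follows. This is a hypothetical, simplified version, not the PR's actual implementation: the function names and the choice of "Parquet-friendly" scalar types are assumptions.

```python
# Hypothetical sketch of the two steps described above:
#   1. flatten nested dict attributes (e.g. metadata) into dotted column names
#   2. drop columns whose values can't be written to a Parquet-friendly scalar
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested dicts: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def drop_unwritable(
    rows: List[Dict[str, Any]],
    writable=(str, int, float, bool, type(None)),
) -> List[Dict[str, Any]]:
    """Remove any column that holds a non-scalar value in any row."""
    bad = {k for row in rows for k, v in row.items() if not isinstance(v, writable)}
    return [{k: v for k, v in row.items() if k not in bad} for row in rows]


# Toy example: "embedding" holds an arbitrary object and gets dropped;
# the nested "metadata" dict becomes two flat columns.
rows = drop_unwritable(
    [flatten({"text": "hello", "metadata": {"file_path": "docs/a.md", "size": 12}, "embedding": object()})]
)
print(rows)
```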

Testing

Without embeddings, full documents:

Driver script:

import os

from llama_index.readers.github import GithubRepositoryReader, GithubClient

from airtrain import upload_from_llama_nodes


def main() -> None:
    github_token = os.environ.get("GITHUB_TOKEN")
    owner = "sematic-ai"
    repo = "sematic"
    branch = "main"
    github_client = GithubClient(github_token=github_token, verbose=True)
    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_directories=(
            ["docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=(
            [
                ".md",
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
    ).load_data(branch=branch)
    result = upload_from_llama_nodes(
        documents,
        name="Sematic Docs Via Llama Index",
    )
    print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")


if __name__ == "__main__":
    main()

Note: this run was done before relationship/metadata unpacking was implemented.

resulting dataset

With embeddings, after chunking:

Driver script:

import os

from llama_index.readers.github import GithubRepositoryReader, GithubClient
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

from airtrain import upload_from_llama_nodes


def main() -> None:
    github_token = os.environ.get("GITHUB_TOKEN")
    owner = "sematic-ai"
    repo = "sematic"
    branch = "main"
    github_client = GithubClient(github_token=github_token, verbose=True)
    documents = GithubRepositoryReader(
        github_client=github_client,
        owner=owner,
        repo=repo,
        use_parser=False,
        verbose=False,
        filter_directories=(
            ["docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=(
            [
                ".md",
            ],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
    ).load_data(branch=branch)
    embed_model = OpenAIEmbedding()
    splitter = SemanticSplitterNodeParser(
        buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
    )
    nodes = splitter.get_nodes_from_documents(documents)
    result = upload_from_llama_nodes(
        nodes,
        name="Sematic Docs Via Llama Index, split + embed",
    )
    print(f"Uploaded {result.size} rows to {result.name}. View at: {result.url}")


if __name__ == "__main__":
    main()

resulting dataset

@augray augray changed the base branch from main to augray/basic-integrations September 11, 2024 18:10
@@ -2,7 +2,7 @@
 name = "airtrain-py"
 description = "SDK for interacting with https://airtrain.ai"
 version = "0.0.1"
-requires-python = ">=3.8"
+requires-python = ">=3.8.1, <4.0"
Member Author


Had to add this because the llama-index package has a `<4.0` upper bound on its Python version constraint.

documents = GithubRepositoryReader(...).load_data(branch=branch)

# You can upload documents directly. In this case Airtrain will generate embeddings
result = at.upload_from_llama_nodes(
Member Author


Other APIs to initiate this upload may be added in subsequent PR(s), but `upload_from_llama_nodes` will serve as a foundation for their implementation.

Base automatically changed from augray/basic-integrations to main September 16, 2024 14:54
@augray augray merged commit e277e81 into main Sep 16, 2024
5 checks passed
@augray augray deleted the augray/llama-index-core branch September 16, 2024 14:56