
Use arrow datasets for intermediates #221

Open
shreyashankar opened this issue Dec 2, 2024 · 1 comment
Labels
efficiency (Making docetl operations run faster), good first engineering issue (Engineering-focused issue for newcomers)

Comments

@shreyashankar
Collaborator

JSON is a bit bulky. It's easy to view, but for intermediates, especially when they are used in the UI, it does not make sense to store and query them as JSON.

We should add an option to the docetl config to store intermediates as Arrow datasets, and the UI would then need to query them accordingly.
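A minimal sketch of what that could look like, assuming pyarrow as the backing library; the `intermediate_dir` layout, file naming by operation, and the helper names are illustrative assumptions, not existing docetl behavior:

```python
import pyarrow as pa
import pyarrow.feather as feather

def write_intermediate(records: list[dict], op_name: str, intermediate_dir: str) -> str:
    """Convert a list of document dicts into an Arrow table and write it to disk."""
    table = pa.Table.from_pylist(records)           # infers a unified schema across documents
    path = f"{intermediate_dir}/{op_name}.arrow"
    feather.write_feather(table, path)              # columnar, compressed, memory-mappable
    return path

def read_intermediate(path: str) -> list[dict]:
    """Read an intermediate back; memory_map avoids loading the whole file eagerly."""
    table = feather.read_table(path, memory_map=True)
    return table.to_pylist()
```

Feather/IPC files keep the intermediates columnar and memory-mappable, so the UI could read only the columns it needs instead of parsing whole JSON documents.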

shreyashankar added the efficiency and good first engineering issue labels on Dec 2, 2024
@shreyashankar
Collaborator Author

I am thinking we should have a different storage format altogether.

Motivation

Our current JSON-based workflow is inefficient for handling dynamic schema changes and row-level operations, leading to performance and scalability challenges. The optimizer requires efficient document sampling based on semantic similarity, and future operations may involve retrieval and sampling during execution. A more robust data format is needed to support these requirements seamlessly.

Problem

  • Dynamic Schema Evolution: Each LLM operation adds new keys (columns) to documents, necessitating frequent schema updates without rewriting entire records (see the Arrow sketch after this list).
  • Efficient Row Operations: Operations like unnest and reduce modify rows by expanding or collapsing them, which is currently cumbersome and inefficient.
  • Optimized Sampling: The optimizer will (eventually) sample documents per operation based on semantic similarity, requiring efficient access and, potentially, vector search capabilities.
  • Scalability & Performance: Managing and accessing data through numerous JSON files hampers scalability and slows down read/write operations.
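A minimal sketch of how Arrow could address the first two points, again assuming pyarrow; the column names and data are purely illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "doc_id": [1, 2],
    "text": ["first document", "second document"],
})

# Dynamic schema evolution: an LLM operation adds a new key. Appending a
# column produces a new table without copying the existing column buffers.
table = table.append_column("summary", pa.array(["summary one", "summary two"]))

# Unnest-style row expansion: explode a list column so each element becomes
# its own row, repeating the other columns via take() on parent indices.
table = table.append_column("chunks", pa.array([["a", "b"], ["c"]]))
parent_idx = pc.list_parent_indices(table["chunks"])
exploded = table.drop(["chunks"]).take(parent_idx)
exploded = exploded.append_column("chunk", pc.list_flatten(table["chunks"]))
```

A reduce-style collapse could map onto `Table.group_by(...).aggregate(...)` in the same spirit, keeping both directions of row modification columnar.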

Requirements

  1. Single Row per Document: Maintain one row per document to avoid data duplication.
  2. Dynamic Column Addition: Easily append new columns without shifting existing data.
  3. Efficient Row Modifications: Support unnest and reduce operations without duplicating or collapsing rows inefficiently.
  4. Sampling Integration: Enable efficient document sampling in the optimizer, with future support for retrieval/sampling during execution (a similarity-sampling sketch follows this list).
  5. Scalable Metadata Management: Implement a lightweight metadata system to track schema and row indices.
  6. Concurrency Support: Allow safe concurrent read/write operations to maintain performance and data integrity.
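For requirement 4, a hedged sketch of similarity-based sampling over an embedding column stored in the Arrow table; the "embedding" column name and the brute-force cosine top-k strategy are assumptions for illustration, not a committed design (a real implementation might use a vector index instead):

```python
import numpy as np
import pyarrow as pa

def sample_similar(table: pa.Table, query_emb: np.ndarray, k: int = 5) -> pa.Table:
    """Return the k rows whose embeddings are most cosine-similar to query_emb."""
    embs = np.stack(table["embedding"].to_pylist())            # (n_docs, dim)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # normalize document vectors
    query = query_emb / np.linalg.norm(query_emb)              # normalize query vector
    scores = embs @ query                                      # cosine similarity per document
    top_k = np.argsort(-scores)[:k]                            # indices of the k best matches
    return table.take(pa.array(top_k))
```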
