
Use arrow datasets for intermediates #221

Open
shreyashankar opened this issue Dec 2, 2024 · 1 comment
Labels
efficiency (Making docetl operations run faster), good first engineering issue (Engineering-focused issue for newcomers)

Comments

@shreyashankar
Collaborator

JSON is a bit bulky. It's easy to view, but for intermediates, especially when they are used in the UI, it does not make sense to store and query them as JSON.

We should add an option to the docetl config to store intermediates as Arrow datasets, and the UI would then need to query them accordingly.
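A minimal sketch of what that could look like, assuming pyarrow as the backing library; the `intermediate_dir` layout, file naming by operation, and the helper names are illustrative assumptions, not existing docetl behavior:

```python
import pyarrow as pa
import pyarrow.feather as feather

def write_intermediate(records: list[dict], op_name: str, intermediate_dir: str) -> str:
    """Convert a list of document dicts into an Arrow table and write it to disk."""
    table = pa.Table.from_pylist(records)           # infers a unified schema across documents
    path = f"{intermediate_dir}/{op_name}.arrow"
    feather.write_feather(table, path)              # columnar, compressed, memory-mappable
    return path

def read_intermediate(path: str) -> list[dict]:
    """Read an intermediate back; memory_map avoids loading the whole file eagerly."""
    table = feather.read_table(path, memory_map=True)
    return table.to_pylist()
```

Feather/IPC files keep the intermediates columnar and memory-mappable, so the UI could read only the columns it needs instead of parsing whole JSON documents.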

shreyashankar added the efficiency and good first engineering issue labels on Dec 2, 2024
@shreyashankar
Collaborator Author

I am thinking we should have a different storage format altogether.

Motivation

Our current JSON-based workflow is inefficient for handling dynamic schema changes and row-level operations, leading to performance and scalability challenges. The optimizer requires efficient document sampling based on semantic similarity, and future operations may involve retrieval and sampling during execution. A more robust data format is needed to support these requirements seamlessly.

Problem

  • Dynamic Schema Evolution: Each LLM operation adds new keys (columns) to documents, necessitating frequent schema updates without rewriting entire records (see the Arrow sketch after this list).
  • Efficient Row Operations: Operations like unnest and reduce modify rows by expanding or collapsing them, which is currently cumbersome and inefficient.
  • Optimized Sampling: The optimizer will (eventually) sample documents per operation based on semantic similarity, requiring efficient access and, potentially, vector search capabilities.
  • Scalability & Performance: Managing and accessing data through numerous JSON files hampers scalability and slows down read/write operations.
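A minimal sketch of how Arrow could address the first two points, again assuming pyarrow; the column names and data are purely illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "doc_id": [1, 2],
    "text": ["first document", "second document"],
})

# Dynamic schema evolution: an LLM operation adds a new key. Appending a
# column produces a new table without copying the existing column buffers.
table = table.append_column("summary", pa.array(["summary one", "summary two"]))

# Unnest-style row expansion: explode a list column so each element becomes
# its own row, repeating the other columns via take() on parent indices.
table = table.append_column("chunks", pa.array([["a", "b"], ["c"]]))
parent_idx = pc.list_parent_indices(table["chunks"])
exploded = table.drop(["chunks"]).take(parent_idx)
exploded = exploded.append_column("chunk", pc.list_flatten(table["chunks"]))
```

A reduce-style collapse could map onto `Table.group_by(...).aggregate(...)` in the same spirit, keeping both directions of row modification columnar.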

Requirements

  1. Single Row per Document: Maintain one row per document to avoid data duplication.
  2. Dynamic Column Addition: Easily append new columns without shifting existing data.
  3. Efficient Row Modifications: Support unnest and reduce operations without duplicating or collapsing rows inefficiently.
  4. Sampling Integration: Enable efficient document sampling in the optimizer, with future support for retrieval/sampling during execution (a similarity-sampling sketch follows this list).
  5. Scalable Metadata Management: Implement a lightweight metadata system to track schema and row indices.
  6. Concurrency Support: Allow safe concurrent read/write operations to maintain performance and data integrity.
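For requirement 4, a hedged sketch of similarity-based sampling over an embedding column stored in the Arrow table; the "embedding" column name and the brute-force cosine top-k strategy are assumptions for illustration, not a committed design (a real implementation might use a vector index instead):

```python
import numpy as np
import pyarrow as pa

def sample_similar(table: pa.Table, query_emb: np.ndarray, k: int = 5) -> pa.Table:
    """Return the k rows whose embeddings are most cosine-similar to query_emb."""
    embs = np.stack(table["embedding"].to_pylist())            # (n_docs, dim)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # normalize document vectors
    query = query_emb / np.linalg.norm(query_emb)              # normalize query vector
    scores = embs @ query                                      # cosine similarity per document
    top_k = np.argsort(-scores)[:k]                            # indices of the k best matches
    return table.take(pa.array(top_k))
```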
