-
Perhaps @metesynnada or @ozankabak have some thoughts in this area or can help.
-
Yes, we are building a streaming system using DataFusion -- happy to discuss. However, it is holiday time for our team right now, so we will circle back to you towards the end of the month 🙂
-
@dadepo

```rust
/// Updates the DeltaTable to the most recent state committed to the transaction log.
#[cfg(not(any(feature = "parquet", feature = "parquet2")))]
pub async fn update(&mut self) -> Result<(), DeltaTableError> {
    self.update_incremental(None).await
}
```

In Spark, the delta libs automatically fetch the latest and greatest for you IIRC, but maybe that's not the case for …. In any case, even after calling …, there's a dedicated ….

@houqp, if you're still involved in these, any opinions?
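To make the stale-read behavior concrete, here is a minimal self-contained sketch. These are hypothetical mock types, not the real deltalake API: the point is only that a long-lived table handle keeps serving the snapshot it loaded until something like `update()` re-reads the transaction log.

```rust
// Hypothetical mock, NOT the real deltalake crate. It models why a
// long-lived handle keeps returning the same data between reads.
struct MockTransactionLog {
    latest_version: i64,
}

struct MockDeltaTable {
    // Snapshot of the log this handle was opened against.
    loaded_version: i64,
}

impl MockDeltaTable {
    fn open(log: &MockTransactionLog) -> Self {
        Self { loaded_version: log.latest_version }
    }

    /// Re-reads the log; analogous in spirit to `DeltaTable::update()`.
    fn update(&mut self, log: &MockTransactionLog) {
        self.loaded_version = log.latest_version;
    }

    /// A "read" only ever sees the loaded snapshot.
    fn read_version(&self) -> i64 {
        self.loaded_version
    }
}

fn main() {
    let mut log = MockTransactionLog { latest_version: 1 };
    let mut table = MockDeltaTable::open(&log);
    assert_eq!(table.read_version(), 1);

    // A writer commits new data to the log...
    log.latest_version = 2;

    // ...but the stale handle still reads the old snapshot,
    assert_eq!(table.read_version(), 1);

    // until an explicit update() refreshes it.
    table.update(&log);
    assert_eq!(table.read_version(), 2);
}
```

This matches the symptom described later in the thread: two reads 30 seconds apart return identical data, while restarting the process (i.e. re-opening the handle) returns fresh data.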
-
Although I don't have extensive knowledge about …, taking a page from Kafka's playbook: it maintains group id offsets to keep track of data. In an ideal design, the source should be capable of determining what data will come next. There's ….
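The Kafka-style bookkeeping mentioned above can be sketched in plain Rust. This is an illustrative toy, not a Kafka or DataFusion API; the names (`OffsetTracker`, `next_offset`, `commit`) are invented for the example.

```rust
use std::collections::HashMap;

// Toy version of consumer-group offset tracking: the source itself
// remembers how far each group has read, so it always knows what
// data comes next.
#[derive(Default)]
struct OffsetTracker {
    // group id -> next offset to read
    committed: HashMap<String, u64>,
}

impl OffsetTracker {
    /// Next offset to read for this group; 0 if never committed.
    fn next_offset(&self, group_id: &str) -> u64 {
        self.committed.get(group_id).copied().unwrap_or(0)
    }

    /// Commit after a batch is fully processed, so a restart
    /// resumes here instead of re-reading everything.
    fn commit(&mut self, group_id: &str, last_processed: u64) {
        self.committed.insert(group_id.to_string(), last_processed + 1);
    }
}

fn main() {
    let mut tracker = OffsetTracker::default();
    assert_eq!(tracker.next_offset("my-group"), 0);
    tracker.commit("my-group", 41); // processed offsets 0..=41
    assert_eq!(tracker.next_offset("my-group"), 42);
}
```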
-
This is from the ASF Slack: https://the-asf.slack.com/archives/C01QUFS30TD/p1686751525575309, a conversation between @dadepo and myself.
Dade
Where the data that needs to be queried/transformed is supposed to be a stream
I stumbled on this https://github.com/datafusion-contrib/datafusion-streams but it looks like that is mostly an abandoned effort…
Andrew Lamb
I would recommend looking at "infinite" streams as in https://docs.rs/datafusion/latest/datafusion/datasource/streaming/struct.StreamingTable.html
I think the basics are there, though there is definitely more work to be done, in the realm of documentation and probable features
Dade
Got a couple of questions:
Will I be correct to say that the end result of this is a RecordBatchStream, which is essentially a stream of RecordBatch?
What would be the general approach if I want to make use of a delta lake table for instance as the streaming source?
Does using this mean keeping track of some state? For example, to allow DataFusion to keep track of the part of the stream it has already processed?
Andrew Lamb
I don't think DataFusion can track state of what has been processed at the moment -- that would have to be done out of line.
Dade
Specifically, I am trying to create my own StreamingTable that reads data as a stream from a delta lake table.
I have a Java process that writes the “rate” data (this is a test data source) as a stream to a delta table.
I can confirm I can read this as a stream using Scala from the console.
This continually reads the data and writes it out to the console.
So my first question: to see if I can read this delta table as a stream, I first tried reading it normally, then introduced a sleep, and then read it again. Something like
I noticed that both reads give the same data despite the 30 second delay, but if I stop the process and run it again, it gives fresh data. Any reason why this is the case?
I see the core of having the StreamingTable seems to be the execute function on the PartitionStream trait…and looking for an example implementation I found
https://github.com/apache/arrow-datafusion/blob/01eb72af4ccdc911ba3cfe22e41f2d71389c5eb9/datafusion/core/tests/memory_limit.rs#L259
Which is basically returning the record batch as a stream.
If I am going to provide an implementation of this for a delta table I open manually, does that mean continually executing the query that fetches the data from the table? For starters, that does not seem to even work, based on my previous message… plus it feels ad hoc.
Is there any feature that can be used, for example, to only start reading from where the last batch ended?
Will I have to do these kind of bookkeeping manually?
Or perhaps I am going about this with the wrong design/approach?
The only way I could get a new batch is to basically load and register the table afresh.
I still have doubts about this. It does not feel optimal. And how can I guarantee the next read starts from where the last read ended?
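The "start from where the last read ended" bookkeeping asked about here could be kept outside DataFusion, as suggested earlier in the thread. A minimal sketch, with hypothetical types (this is not a datafusion or deltalake API): a cursor remembers how many rows were consumed, so each poll returns only what was appended since the previous one, instead of re-registering the table afresh.

```rust
// Hypothetical reader keeping the cursor the engine does not keep for us.
// The "table" is modeled as a growing slice of rows.
struct ResumableReader {
    rows_consumed: usize,
}

impl ResumableReader {
    fn new() -> Self {
        Self { rows_consumed: 0 }
    }

    /// Return only the rows appended since the last poll and
    /// advance the cursor to the end of the table.
    fn poll_new<'a>(&mut self, table: &'a [i64]) -> &'a [i64] {
        let start = self.rows_consumed.min(table.len());
        self.rows_consumed = table.len();
        &table[start..]
    }
}

fn main() {
    let mut reader = ResumableReader::new();
    let mut table = vec![1, 2, 3];
    assert_eq!(reader.poll_new(&table), &[1, 2, 3]);

    // Nothing new yet: the second poll yields an empty slice.
    assert_eq!(reader.poll_new(&table), &[] as &[i64]);

    // A writer appends; the next poll starts where the last ended.
    table.extend([4, 5]);
    assert_eq!(reader.poll_new(&table), &[4, 5]);
}
```

For a real delta table, the cursor would more naturally be the last-seen table version from the transaction log rather than a row count, but the shape of the bookkeeping is the same.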