Implement pre-fetching in map() and gen() #521

rlamy · 2024-10-18T21:46:05Z

This adds a prefetch setting which enables async downloading of objects to the cache before running a generator or mapper UDF (see #40). The default is to use 2 workers, but it can be disabled using .setting(prefetch=0). Note that it has no effect if caching isn't enabled (caching is disabled by default).

In order for this to work, AbstractWarehouse.dataset_select_paginated() is now required to be thread-safe, so query result pages are now buffered as a list in that function.

cloudflare-workers-and-pages · 2024-10-18T21:47:11Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`b54545e`
Status:	✅ Deploy successful!
Preview URL:	https://8858a811.datachain-documentation.pages.dev
Branch Preview URL:	https://issue-40.datachain-documentation.pages.dev

View logs

codecov · 2024-10-18T21:52:38Z

Codecov Report

Attention: Patch coverage is 95.83333% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.04%. Comparing base (a9a0649) to head (b54545e).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/lib/dc.py	60.00%	1 Missing and 1 partial ⚠️
src/datachain/lib/udf.py	96.55%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #521      +/-   ##
==========================================
+ Coverage   87.92%   88.04%   +0.11%     
==========================================
  Files         102      102              
  Lines       10044    10085      +41     
  Branches     1363     1373      +10     
==========================================
+ Hits         8831     8879      +48     
+ Misses        871      865       -6     
+ Partials      342      341       -1

Flag	Coverage Δ
datachain	`87.98% <95.83%> (+0.11%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

src/datachain/data_storage/warehouse.py

src/datachain/lib/settings.py

tests/unit/test_asyn.py

shcheklein · 2024-10-19T01:55:33Z

src/datachain/lib/dc.py

@@ -325,6 +325,7 @@ def settings(
        parallel=None,
        workers=None,
        min_task_size=None,
+        prefetch: Optional[int] = None,


q: ~~why int?~~ let's update the docs here (do we have some CI to detect these discrepancies btw (missing docs) cc @skshetry )

tests/unit/test_asyn.py

shcheklein

Looks great. A few question.

One more general question.

Does this implementation mean that we now won't start (at least the very first row) UDF until file is fetched? Before it was doing this "on-demand" I guess - when file is needed. I wonder how big of an issue it can be in certain scenarios (and especially if we decide to do prefetch for batches (agg, batch mapper)).

rlamy · 2024-10-29T16:46:34Z

Does this implementation mean that we now won't start (at least the very first row) UDF until file is fetched? Before it was doing this "on-demand" I guess - when file is needed. I wonder how big of an issue it can be in certain scenarios (and especially if we decide to do prefetch for batches (agg, batch mapper)).

Yes, the File object is only passed to the UDF once prefetching is complete, but note that in this PR you need to specify cache=True to activate prefetching, which you probably wouldn't do in cases where prefetching is suboptimal.

skshetry · 2024-11-26T03:39:30Z

src/datachain/data_storage/warehouse.py

+            # Ensure we're using a thread-local connection
+            with self.clone() as wh:


I guess this clone() should be outside the loop? Otherwise, we'll not be reusing the connection.

rlamy requested review from dreadatour, shcheklein, dtulga, ilongin and dmpetrov October 18, 2024 21:46

rlamy force-pushed the issue-40 branch from d1d1457 to acdc969 Compare October 18, 2024 21:46

rlamy mentioned this pull request Oct 18, 2024

pre_fetch option in additional to cache for lib.File #40

Closed

shcheklein reviewed Oct 18, 2024

View reviewed changes

src/datachain/data_storage/warehouse.py Show resolved Hide resolved

shcheklein reviewed Oct 18, 2024

View reviewed changes

src/datachain/lib/settings.py Show resolved Hide resolved

shcheklein reviewed Oct 19, 2024

View reviewed changes

tests/unit/test_asyn.py Show resolved Hide resolved

shcheklein reviewed Oct 19, 2024

View reviewed changes

shcheklein reviewed Oct 20, 2024

View reviewed changes

tests/unit/test_asyn.py Show resolved Hide resolved

shcheklein reviewed Oct 20, 2024

View reviewed changes

shcheklein assigned rlamy Oct 20, 2024

rlamy force-pushed the issue-40 branch from acdc969 to 4224147 Compare October 29, 2024 13:05

0x2b3bfa0 unassigned rlamy Oct 29, 2024

skshetry force-pushed the issue-40 branch 2 times, most recently from ef79f67 to 9fd3155 Compare November 14, 2024 05:35

skshetry mentioned this pull request Nov 14, 2024

asyncmapper: shutdown producer on generator close #597

Merged

rlamy and others added 4 commits November 20, 2024 09:18

Use threading in AsyncMapper.produce()

d77cd44

Implement prefetching in .gen() and .map()

a0579ae

Avoid user code error in name_len()

59896cd

asyncmapper: shutdown producer on generator close (#597)

b54545e

skshetry force-pushed the issue-40 branch from 56cc2ad to b54545e Compare November 20, 2024 03:33

skshetry approved these changes Nov 20, 2024

View reviewed changes

skshetry merged commit 21857af into main Nov 20, 2024
38 checks passed

skshetry deleted the issue-40 branch November 20, 2024 05:33

skshetry reviewed Nov 26, 2024

View reviewed changes

skshetry mentioned this pull request Nov 26, 2024

dataset_select_paginated: do not clone on each paginated query #629

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement pre-fetching in map() and gen() #521

Implement pre-fetching in map() and gen() #521

Uh oh!

rlamy commented Oct 18, 2024

Uh oh!

cloudflare-workers-and-pages bot commented Oct 18, 2024 •

edited

Loading

Uh oh!

codecov bot commented Oct 18, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shcheklein Oct 19, 2024 •

edited

Loading

Uh oh!

Uh oh!

shcheklein left a comment

Uh oh!

rlamy commented Oct 29, 2024

Uh oh!

Uh oh!

skshetry Nov 26, 2024

Uh oh!

Uh oh!

		# Ensure we're using a thread-local connection
		with self.clone() as wh:

Implement pre-fetching in map() and gen() #521

Implement pre-fetching in map() and gen() #521

Uh oh!

Conversation

rlamy commented Oct 18, 2024

Uh oh!

cloudflare-workers-and-pages bot commented Oct 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying datachain-documentation with Cloudflare Pages

Uh oh!

codecov bot commented Oct 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shcheklein Oct 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shcheklein left a comment

Choose a reason for hiding this comment

Uh oh!

rlamy commented Oct 29, 2024

Uh oh!

Uh oh!

skshetry Nov 26, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cloudflare-workers-and-pages bot commented Oct 18, 2024 •

edited

Loading

codecov bot commented Oct 18, 2024 •

edited

Loading

shcheklein Oct 19, 2024 •

edited

Loading