Adding FSSpec Export for CSV and Parquet #516

dtulga · 2024-10-16T23:52:26Z

This adds the ability for to_csv and to_parquet to export (upload) directly to FSSpec filesystems, such as S3 or Hugging Face. To do so, pass the relevant URL as the path to the export function, for example:
chain.to_parquet("s3://dtulga-datachain-test/test.parquet")
chain.to_csv("hf://datasets/dtulga/datachain-test/test.csv")

This has been tested manually with S3 and Hugging Face, as seen here: https://huggingface.co/datasets/dtulga/datachain-test/tree/main

This is part of #236 and #370.
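
As a rough illustration of how a URL-style path like the ones above can be told apart from a plain local path, here is a minimal sketch. `guess_protocol` is a hypothetical helper written for this note, not datachain's or fsspec's actual code, and the single-letter check (so that a Windows drive letter is not mistaken for a protocol) is an assumption about what such a helper would need.

```python
def guess_protocol(path: str) -> str:
    """Return the URL protocol of an export target, or "file" for local paths.

    Hypothetical helper for illustration only; not part of datachain.
    """
    if "://" in path:
        protocol, _rest = path.split("://", 1)
        # A one-character "protocol" is more likely a Windows drive letter.
        if len(protocol) > 1:
            return protocol.lower()
    return "file"


targets = [
    "s3://dtulga-datachain-test/test.parquet",
    "hf://datasets/dtulga/datachain-test/test.csv",
    "test.csv",
]
protocols = [guess_protocol(t) for t in targets]
```

With a check of this shape, the two URLs from the description would be routed to a remote filesystem, while a bare filename would still be written locally.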

cloudflare-workers-and-pages · 2024-10-17T00:09:08Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`8f56cac`
Status:	✅ Deploy successful!
Preview URL:	https://51313908.datachain-documentation.pages.dev
Branch Preview URL:	https://dtulga-fsspec-export.datachain-documentation.pages.dev

codecov · 2024-10-17T00:14:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.46%. Comparing base (e699c1e) to head (8f56cac).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #516      +/-   ##
==========================================
+ Coverage   87.43%   87.46%   +0.02%     
==========================================
  Files          97       97              
  Lines       10069    10089      +20     
  Branches     1374     1378       +4     
==========================================
+ Hits         8804     8824      +20     
  Misses        908      908              
  Partials      357      357

Flag	Coverage Δ
datachain	`87.42% <95.45%> (+0.01%)`	⬆️

shcheklein · 2024-10-17T19:27:30Z

tests/func/test_datachain.py

+    df1 = dc_from.select("first_name", "age", "city").to_pandas()
+    assert df1.equals(df)
+
+    # Cleanup any written files


qq - are we using real clouds? Do we care about cleanup here? If we do, should it be wrapped in a fixture (so that cleanup runs even if the test fails)?

dtulga · 2024-10-18

This uses simulated clouds provided by pytest-servers, and the cleanup code is necessary to prevent test failures, because the uploaded files are not cleaned up automatically. I moved the cleanup code into a fixture to keep it in one location.
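
To make the "cleanup even on failure" point concrete, here is a minimal standard-library sketch. It is not the fixture from this PR: `cleanup_after` is a hypothetical name, and it removes local files rather than objects on a simulated cloud. A pytest yield fixture has the same shape, with the teardown code placed after the `yield`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def cleanup_after(paths):
    """Yield the given paths and remove them afterwards, even on failure.

    Hypothetical helper for illustration; the real tests clean up remote files.
    """
    try:
        yield paths
    finally:
        for path in paths:
            # Ignore files that were never written.
            with contextlib.suppress(FileNotFoundError):
                os.remove(path)


# Simulate a test that writes an export file and then fails.
target = os.path.join(tempfile.mkdtemp(), "test.parquet")
try:
    with cleanup_after([target]):
        with open(target, "wb") as f:
            f.write(b"placeholder bytes")
        raise AssertionError("simulated test failure")
except AssertionError:
    pass
```

Keeping the removal in one place means every test that writes through the export functions gets the same teardown, whether it passes or fails.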

shcheklein · 2024-10-17T19:31:07Z

src/datachain/lib/dc.py

@@ -1887,6 +1889,7 @@ def to_parquet(
        path: Union[str, os.PathLike[str], BinaryIO],
        partition_cols: Optional[Sequence[str]] = None,
        chunk_size: int = DEFAULT_PARQUET_CHUNK_SIZE,
+        fs_kwargs: Optional[dict[str, Any]] = None,


I have a concern here that we don't use fs_kwargs in all the other places (e.g. anon=True, or from_parquet, which I think reads from a file object - file.get_fs() or something). Can we do a bit of research on that end and either unify this or get rid of the additional kwargs?

dtulga · 2024-10-17

These kwargs are optional and are combined with the Catalog's client_config, which covers any unified use cases such as configuration that applies to both read and write. I added this optional kwargs parameter for users who need a write-only custom configuration, such as an access token used only on write, that does not (or may not) apply on read or to the whole application / chain. For example, a token can be specified for Hugging Face filesystems on write, as described here: https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system#authentication but users may only want to specify this token on write to Hugging Face, not for other clouds or on read. I can rename or change this as desired, but having extra write-only kwargs seems useful in some cases.

shcheklein · 2024-10-18

kk. It seems to me that it should be symmetrical (people might eventually need to provide something extra on reads as well), and then the question is: will we be able to do the same easily and w/o changing some logic in those methods (like from_parquet, etc.)?

Should we update the docs here btw?

dtulga · 2024-10-18

Yes, the docs should be updated - I updated them in the latest commit. Shared (read and write) kwargs can be provided in client_config on Session or Catalog (as well as in environment variables), and these settings are automatically used for both read and write. The fs_kwargs option just gives a way to supply write-only configuration, or to override the shared configuration, only if necessary.
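
The precedence described in this thread (shared client_config first, write-only fs_kwargs layered on top) can be sketched as a plain dictionary merge. This is an assumption about the behavior based on the discussion, not the code in dc.py; `merge_fs_config` is a hypothetical name and the keys and token value are placeholders.

```python
from typing import Any, Optional


def merge_fs_config(
    client_config: Optional[dict[str, Any]],
    fs_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Combine shared config with write-only overrides (overrides win).

    Hypothetical helper for illustration; inputs are left unmodified.
    """
    merged = dict(client_config or {})
    merged.update(fs_kwargs or {})
    return merged


shared = {"anon": True}  # applies to both read and write
write_only = {"anon": False, "token": "placeholder-token"}  # write only
effective = merge_fs_config(shared, write_only)
```

Under this reading, a call without fs_kwargs uses the shared configuration unchanged, which is why the parameter can stay optional.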

rlamy

The code looks good, but I think the docs should be updated to mention that fsspec URLs work as well.

dtulga · 2024-10-18T22:54:51Z

> The code looks good, but I think the docs should be updated to mention that fsspec URLs work as well.

Agreed, updated the docs in the latest commit.

Adding FSSpec Export for CSV and Parquet

726bb28

dtulga self-assigned this Oct 16, 2024

Adding more cleanup code

94dc18c

dtulga added 2 commits October 17, 2024 09:45

Merge from main

dd645f8

Windows Path Fix

75bda33

dtulga marked this pull request as ready for review October 17, 2024 17:13

dtulga requested a review from a team October 17, 2024 17:14

shcheklein reviewed Oct 17, 2024

dtulga added 2 commits October 17, 2024 17:36

Moving cleanup to fixture

c154798

Merge from main

a3ea4f6

shcheklein approved these changes Oct 18, 2024

rlamy approved these changes Oct 18, 2024

Updating documentation

8f56cac

dtulga merged commit 5353966 into main Oct 19, 2024
38 checks passed

dtulga deleted the dtulga/fsspec-export branch October 19, 2024 00:16

dtulga mentioned this pull request Oct 22, 2024

Export to huggingface hub #370

Closed
