Adding FSSpec Export for CSV and Parquet #516

Merged
merged 7 commits into from
Oct 19, 2024

Conversation

dtulga
Collaborator

@dtulga dtulga commented Oct 16, 2024

This adds the ability to export (upload) to fsspec filesystems from to_csv and to_parquet, for example directly to S3 or Hugging Face. To use it, pass the destination URL to the export function:
chain.to_parquet("s3://dtulga-datachain-test/test.parquet")
chain.to_csv("hf://datasets/dtulga/datachain-test/test.csv")

This has been tested manually with S3 and Hugging Face, as seen here: https://huggingface.co/datasets/dtulga/datachain-test/tree/main
This is part of #236 and #370.
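
As a rough illustration of why one path argument can cover both local files and cloud destinations, the sketch below splits a destination string into a protocol and a remainder. The helper name `split_destination` is hypothetical, and this is neither DataChain's nor fsspec's code; fsspec performs its own protocol resolution, and this toy only mirrors the idea.

```python
def split_destination(path: str) -> tuple[str, str]:
    """Split 'protocol://rest' into (protocol, rest); plain paths count as local files."""
    if "://" in path:
        protocol, _, rest = path.partition("://")
        return protocol, rest
    # No protocol prefix: treat the destination as a local file path (an assumption).
    return "file", path


for dest in (
    "s3://dtulga-datachain-test/test.parquet",
    "hf://datasets/dtulga/datachain-test/test.csv",
    "output/test.csv",
):
    print(split_destination(dest))
```

The point of the design is that callers keep using the same export call and only change the string they pass in.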

@dtulga dtulga self-assigned this Oct 16, 2024

cloudflare-workers-and-pages bot commented Oct 17, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 8f56cac
Status: ✅  Deploy successful!
Preview URL: https://51313908.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-fsspec-export.datachain-documentation.pages.dev


codecov bot commented Oct 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.46%. Comparing base (e699c1e) to head (8f56cac).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #516      +/-   ##
==========================================
+ Coverage   87.43%   87.46%   +0.02%     
==========================================
  Files          97       97              
  Lines       10069    10089      +20     
  Branches     1374     1378       +4     
==========================================
+ Hits         8804     8824      +20     
  Misses        908      908              
  Partials      357      357              
Flag Coverage Δ
datachain 87.42% <95.45%> (+0.01%) ⬆️


@dtulga dtulga marked this pull request as ready for review October 17, 2024 17:13
@dtulga dtulga requested a review from a team October 17, 2024 17:14
df1 = dc_from.select("first_name", "age", "city").to_pandas()
assert df1.equals(df)

# Cleanup any written files
Member

qq - are we using real clouds? do we care about cleanup here? if we care - should it be wrapped into a fixture (so that we do this even in case of a failure)

Collaborator Author

This uses simulated clouds provided by pytest-servers - and the cleanup code is necessary to prevent test failures (as these uploaded files are not automatically cleaned up). I moved the cleanup code into a fixture to keep it in one location.
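
To make the reviewer's point about cleanup on failure concrete, here is a minimal sketch using only the standard library. It is not the fixture from this PR: `cleaned_up` and `file_survives_failed_test` are hypothetical names, and a local temporary file stands in for an uploaded object. In a pytest suite the same shape would normally be a yield fixture, where the code after `yield` runs as teardown even when the test fails.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def cleaned_up(paths):
    """Yield to the test body, then remove any of `paths` that were written."""
    try:
        yield paths
    finally:
        # Runs whether the body passed or raised, like fixture teardown after `yield`.
        for path in paths:
            if os.path.exists(path):
                os.remove(path)


def file_survives_failed_test() -> bool:
    """Write a file inside the guarded block, fail on purpose, and report leftovers."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "test.csv")
        try:
            with cleaned_up([target]):
                with open(target, "w") as f:
                    f.write("first_name,age,city\n")
                raise AssertionError("simulated test failure")
        except AssertionError:
            pass
        return os.path.exists(target)


print(file_survives_failed_test())
```

Keeping the removal logic in one place means each test only declares which files it writes.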

@@ -1887,6 +1889,7 @@ def to_parquet(
path: Union[str, os.PathLike[str], BinaryIO],
partition_cols: Optional[Sequence[str]] = None,
chunk_size: int = DEFAULT_PARQUET_CHUNK_SIZE,
fs_kwargs: Optional[dict[str, Any]] = None,
Member

I have a concern here that we don't use fs_kwargs in all other places (e.g. anon=True, or from_parquet I think reads from a file object - file.get_fs() or something). can we do a bit of research on that end and unify or get rid of this additional kwargs?

Collaborator Author

These kwargs are optional and are combined with the Catalog's client_config for any unified use cases, such as configuration that applies to both read and write. I added this optional parameter for users who need a write-only custom configuration, such as an access token used only on write, that does not (or may not) apply on read or to the whole application / chain. For example, a token can be specified for Hugging Face filesystems on write (see https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system#authentication), but users may want to specify this token only when writing to Hugging Face, not for other clouds or on read. I can rename or change this as desired, but extra write-only kwargs seem useful in some cases.

Member

kk. It seems to me that it should be symmetrical (people might eventually need to provide something extra on reads as well), and then the question is: will we be able to do the same easily and w/o changing some logic in those methods (like from_parquet, etc.)?

should we update the docs here btw?

Collaborator Author

Yes, the docs should be updated, and I did so in the latest commit. Shared (read and write) kwargs can be provided in client_config on Session or Catalog (as well as in environment variables), and these settings are used automatically for both read and write. The fs_kwargs option is only a way to supply write-only configuration, or to override the shared configuration, when necessary.
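
The precedence described in this thread (shared client_config first, write-only fs_kwargs overriding it) can be sketched as a plain dictionary merge. This is an illustration under that stated assumption, not the code merged in this PR: `merge_fs_config` is a hypothetical helper, and the `anon` and `token` values are placeholders.

```python
from typing import Any, Optional


def merge_fs_config(
    client_config: Optional[dict[str, Any]],
    fs_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Combine shared config with write-only overrides; fs_kwargs wins on conflicts."""
    # Later keys replace earlier ones, so write-only settings take precedence.
    return {**(client_config or {}), **(fs_kwargs or {})}


shared = {"anon": False}                    # placeholder shared (read and write) setting
write_only = {"token": "hf_example_token"}  # placeholder write-only credential
print(merge_fs_config(shared, write_only))
```

Building a new dictionary rather than updating `client_config` in place keeps the shared configuration unchanged for later reads.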

Member

@rlamy rlamy left a comment

The code looks good, but I think the docs should be updated to mention that fsspec URLs work as well.

Collaborator Author

dtulga commented Oct 18, 2024

The code looks good, but I think the docs should be updated to mention that fsspec URLs work as well.

Agreed, updated the docs in the latest commit.

@dtulga dtulga merged commit 5353966 into main Oct 19, 2024
38 checks passed
@dtulga dtulga deleted the dtulga/fsspec-export branch October 19, 2024 00:16
@dtulga dtulga mentioned this pull request Oct 22, 2024