[DATA] Implement zero-copied string dtype and accelerate shuffle. #149

Merged (1 commit) on May 29, 2023

francktcheng (Collaborator):

  1. Implement a zero-copied approach to read string data from Arrow to TF.
  2. Accelerate the shuffle operation of string type in ParquetDataset.
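
The implementation itself is not shown in this conversation. As a rough illustration of the general idea only, the sketch below assumes an Arrow-style string layout (an offsets array plus one contiguous byte buffer): a zero-copy reader hands out views into the shared buffer, and a shuffle permutes row indices instead of moving string bytes. `StringColumn` and `shuffled_views` are hypothetical names for this example, not code from the PR.

```python
import random
from array import array


class StringColumn:
    """Toy Arrow-style string column: int offsets + one contiguous byte buffer."""

    def __init__(self, values):
        encoded = [v.encode("utf-8") for v in values]
        self.data = b"".join(encoded)          # all string bytes, back to back
        self.offsets = array("i", [0])         # offsets[i]..offsets[i+1] bounds row i
        for chunk in encoded:
            self.offsets.append(self.offsets[-1] + len(chunk))
        self._view = memoryview(self.data)

    def __len__(self):
        return len(self.offsets) - 1

    def view(self, i):
        # Slicing a memoryview references the same underlying bytes (no copy).
        return self._view[self.offsets[i]:self.offsets[i + 1]]


def shuffled_views(column, seed=0):
    """Shuffle by permuting row indices; the byte buffer is never rearranged."""
    order = list(range(len(column)))
    random.Random(seed).shuffle(order)
    return [column.view(i) for i in order]
```

Every view returned here still points at `column.data`, so the cost of a shuffle depends on the number of rows rather than the total string length. Whether the PR's Arrow-to-TF path works this way is an assumption.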

Preliminary benchmarking results:

  • col=300, batch_size=1000
  • Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz with 128 logical cores.

| Dataset            | list type | shuffling | throughput (samples/s) | speedup over TFRecord |
| ---                | ---       | ---       | ---                    | ---                   |
| TFRecord           | N         | N         | 1404.23                | 1.0                   |
| HbParquet          | N         | N         | 41137.53               | 29.3                  |
| HbParquet-ZeroCopy | N         | N         | 51335.40               | 36.56                 |
| TFRecord           | N         | Y         | 1343.10                | 1.0                   |
| HbParquet          | N         | Y         | 6629.60                | 4.9                   |
| HbParquet-ZeroCopy | N         | Y         | 10941.25               | 8.1                   |
| TFRecord           | Y         | N         | 1352.05                | 1.0                   |
| HbParquet          | Y         | N         | 2307.33                | 1.71                  |
| HbParquet-ZeroCopy | Y         | N         | 2869.98                | 2.12                  |
| TFRecord           | Y         | Y         | 1367.96                | 1.0                   |
| HbParquet          | Y         | Y         | 1080.03                | 0.79                  |
| HbParquet-ZeroCopy | Y         | Y         | 1454.02                | 1.06                  |
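
The speedup column is each throughput divided by the TFRecord throughput for the same list type and shuffling setting. A small sketch that recomputes it from the reported numbers, and also derives the gain of HbParquet-ZeroCopy over plain HbParquet, follows; the dictionary layout and the `speedup_over` helper are illustrative, not part of the PR.

```python
# Throughput in samples/s, copied from the benchmark table.
# Keys are (list type, shuffling); this layout is only for illustration.
THROUGHPUT = {
    ("N", "N"): {"TFRecord": 1404.23, "HbParquet": 41137.53, "HbParquet-ZeroCopy": 51335.40},
    ("N", "Y"): {"TFRecord": 1343.10, "HbParquet": 6629.60, "HbParquet-ZeroCopy": 10941.25},
    ("Y", "N"): {"TFRecord": 1352.05, "HbParquet": 2307.33, "HbParquet-ZeroCopy": 2869.98},
    ("Y", "Y"): {"TFRecord": 1367.96, "HbParquet": 1080.03, "HbParquet-ZeroCopy": 1454.02},
}


def speedup_over(baseline, results=THROUGHPUT):
    """Return {(config, dataset): throughput / baseline throughput in that config}."""
    return {
        (config, name): value / row[baseline]
        for config, row in results.items()
        for name, value in row.items()
    }


if __name__ == "__main__":
    vs_tfrecord = speedup_over("TFRecord")
    vs_parquet = speedup_over("HbParquet")
    for config in THROUGHPUT:
        name = "HbParquet-ZeroCopy"
        print(config,
              f"vs TFRecord: {vs_tfrecord[(config, name)]:.2f}x,",
              f"vs HbParquet: {vs_parquet[(config, name)]:.2f}x")
```

By these numbers the zero-copy path is roughly 1.25x, 1.65x, 1.24x, and 1.35x faster than plain HbParquet across the four configurations, with the largest relative gain in the shuffled, non-list case.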

github-actions bot commented May 25, 2023

Test Results

  • 48 files ±0, 48 suites ±0, duration 1m 53s (-2s)
  • 52 tests (-1): 52 passed (-1), 0 skipped (±0), 0 failed (±0)
  • 156 runs (-3): 131 passed (-3), 25 skipped (±0), 0 failed (±0)

Results for commit c477fae. ± Comparison against base commit 0545159.

This pull request removes 1 test.
ParquetDatasetStringTest ‑ test_unbatch_and_to_sparse


@francktcheng francktcheng requested a deployment to deeprec-py3.6-cu114-ubuntu18.04 May 25, 2023 09:19 — with GitHub Actions Abandoned
@francktcheng francktcheng requested a deployment to tf1.15-py3.8-cu121-ubuntu20.04 May 25, 2023 09:19 — with GitHub Actions Abandoned
@francktcheng francktcheng had a problem deploying to tf1.15-py3.6-manylinux_2_24 May 25, 2023 09:19 — with GitHub Actions Failure
1. Implement a zero-copied approach to read string data from Arrow to TF.
2. Accelerate the shuffle operation of string type in ParquetDataset.

preliminary benchmarking results
- col=300, `batch_size`=1000
- `Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz` with 128 logical cores.

| Dataset            | list type | shuffling | throughput (samples/s) | speedup over TFRecord |
| ---                | ---       | ---       | ---                    | ---                   |
| TFRecord           | N         | N         | 1404.23                | 1.0                   |
| HbParquet          | N         | N         | 41137.53               | 29.3                  |
| HbParquet-ZeroCopy | N         | N         | 51335.40               | 36.56                 |
| TFRecord           | N         | Y         | 1343.10                | 1.0                   |
| HbParquet          | N         | Y         | 6629.60                | 4.9                   |
| HbParquet-ZeroCopy | N         | Y         | 10941.25               | 8.1                   |
| TFRecord           | Y         | N         | 1352.05                | 1.0                   |
| HbParquet          | Y         | N         | 2307.33                | 1.71                  |
| HbParquet-ZeroCopy | Y         | N         | 2869.98                | 2.12                  |
| TFRecord           | Y         | Y         | 1367.96                | 1.0                   |
| HbParquet          | Y         | Y         | 1080.03                | 0.79                  |
| HbParquet-ZeroCopy | Y         | Y         | 1454.02                | 1.06                  |

Signed-off-by: langshi.cls <[email protected]>
@francktcheng francktcheng force-pushed the features/parquet_string_acc branch from 5278580 to c477fae Compare May 29, 2023 03:42
@francktcheng francktcheng temporarily deployed to tf1.15-py3.8-cu121-ubuntu20.04 May 29, 2023 03:42 — with GitHub Actions Inactive
@francktcheng francktcheng temporarily deployed to tf1.15-py3.6-manylinux_2_24 May 29, 2023 03:42 — with GitHub Actions Inactive
@francktcheng francktcheng temporarily deployed to deeprec-py3.6-cu114-ubuntu18.04 May 29, 2023 03:42 — with GitHub Actions Inactive
@francktcheng francktcheng merged commit 02a714b into main May 29, 2023
@francktcheng francktcheng deleted the features/parquet_string_acc branch May 29, 2023 03:59
Linked issue (may be closed by merging this pull request): Throughput is lower than TFRecords when there are many strings in Parquets file