apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207

Merged
merged 50 commits into from
Jan 14, 2025
Changes from 1 commit
Commits
a73bb02
apply formatting after iter_arrow
alex-hh Oct 8, 2024
4a761a9
add support for formatting to map iteration
alex-hh Oct 8, 2024
3b65d99
formatted iterator for filter
alex-hh Oct 8, 2024
d906b9f
fix filtered formatting
alex-hh Oct 8, 2024
421917d
option to disable formatting for outputs of map
alex-hh Oct 8, 2024
e7b67c3
remove format_outputs kwarg
alex-hh Oct 9, 2024
a4f9700
rename batched_examples_iterator -> inputs_iterator
alex-hh Oct 9, 2024
a465abd
support arbitrary input formatting in filtered examples iterable iter…
alex-hh Oct 9, 2024
1863f8c
preserve formatting on filtered shuffle
alex-hh Oct 9, 2024
205e0d6
pass token_per_repo_id to python_feature_decoder in formatters
alex-hh Oct 9, 2024
42dc44f
implement FormattedExamplesIterator
alex-hh Oct 9, 2024
4a8fed5
fix formatted examples iterable
alex-hh Oct 9, 2024
8cdf6a6
Merge branch 'main' into iterable-map-with-format
alex-hh Oct 9, 2024
2ddaa7d
restore is_typed property
alex-hh Oct 9, 2024
dcd5017
pass formatting config to formatted examples iterable
alex-hh Oct 9, 2024
1ae947e
fix formatter init
alex-hh Oct 9, 2024
20330e8
Merge branch 'main' into iterable-map-with-format
alex-hh Oct 9, 2024
8f6845f
map examples iterable expects to receive rebatchedarrowexamplesiterab…
alex-hh Oct 9, 2024
3a91aac
only apply features if they exist
alex-hh Oct 9, 2024
84fcf74
fix shuffle and shard
alex-hh Oct 9, 2024
4fac60a
remove formatting from FilteredExamplesIterable
alex-hh Oct 10, 2024
afa78aa
run pre commit
alex-hh Oct 10, 2024
5a8389b
filtered iter_arrow always allowed if available
alex-hh Oct 10, 2024
c97f02e
filtered examples iterable needs formatting when iter_arrow enabled
alex-hh Oct 10, 2024
76e09a1
only iter arrow on filter if formatting is set
alex-hh Oct 10, 2024
ee45f7f
add features property to support feature inference
alex-hh Oct 10, 2024
b828575
fix features property
alex-hh Oct 10, 2024
f76701b
dont re-encode featuers
alex-hh Oct 10, 2024
15a8cfe
avoid re-encoding outputs of map
alex-hh Oct 10, 2024
884bba1
map should not preserve formatting
alex-hh Oct 10, 2024
d979672
update comment
alex-hh Oct 10, 2024
190d062
update map features property
alex-hh Oct 10, 2024
85b7d4d
return bool for mapped ex iterable is typed
alex-hh Oct 11, 2024
3129274
pass return features to mapped exampels iterable constructor
alex-hh Oct 11, 2024
45f55b4
don't iter arrow with formatted filter to avoid re formatting
alex-hh Oct 11, 2024
5e31fe0
avoid re-formatting data
alex-hh Oct 12, 2024
49a84fe
rename return features -> features
alex-hh Oct 14, 2024
002f5b4
update refs to return_features
alex-hh Oct 14, 2024
2479264
decode features in batched map
alex-hh Oct 14, 2024
68bfa39
preserve formatting in with_format
alex-hh Oct 15, 2024
38f78d2
fix features (mapped ex iterable
alex-hh Oct 16, 2024
f59a8e6
Merge branch 'main' into iterable-map-with-format
alex-hh Oct 31, 2024
ca2deb4
update shard
alex-hh Oct 31, 2024
4efcf11
remove formatted examples iterable from with_format
alex-hh Nov 2, 2024
f997f8c
avoid reapplying features when chaining filter, map
alex-hh Nov 2, 2024
bd8bbd3
preserve formatting in map
alex-hh Nov 11, 2024
7d1c48d
Merge branch 'main' into iterable-map-with-format
lhoestq Jan 13, 2025
441a95b
fix tests
lhoestq Jan 13, 2025
9a0e112
style
lhoestq Jan 13, 2025
66d59c7
fix tests
lhoestq Jan 14, 2025
6 changes: 3 additions & 3 deletions src/datasets/iterable_dataset.py
@@ -926,7 +926,7 @@ def __init__(

     @property
     def iter_arrow(self):
-        if self.formatting and self.formatting.format_type == "arrow":
+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
Member

not sure about this change though, since it will do

Original data in Arrow -> Map function in NumPy -> Arrow (since it has iter_arrow and Arrow is preferred) -> Python objects

While it could be

Original data in Arrow -> Map function in NumPy -> Python objects -> Python objects

Contributor Author

agreed, I wasn't really understanding the flow - have updated with a more direct attempt
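
The two flows compared in this thread differ only in whether the map output is converted back to Arrow before it is decoded to Python objects. A rough, stdlib-only sketch of that difference follows. `ToyTable`, `to_table`, `to_python` and the `conversions` log are hypothetical stand-ins for illustration and are not part of `datasets` or `pyarrow`.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only; none of these names exist in
# `datasets` or `pyarrow`. ToyTable plays the role of an Arrow table.


@dataclass
class ToyTable:
    columns: dict  # column name -> list of values


conversions = []  # records every conversion so the two flows can be compared


def to_table(batch: dict) -> ToyTable:
    """Encode a dict of columns into the table stand-in."""
    conversions.append("python->table")
    return ToyTable({name: list(values) for name, values in batch.items()})


def to_python(table: ToyTable) -> dict:
    """Decode the table stand-in into plain Python lists."""
    conversions.append("table->python")
    return {name: list(values) for name, values in table.columns.items()}


def map_fn(batch: dict) -> dict:
    """A user map function that works on already-decoded values."""
    return {"x": [value * 2 for value in batch["x"]]}


def flow_via_table(source: ToyTable) -> dict:
    """Decode, map, re-encode to a table, then decode again."""
    mapped = map_fn(to_python(source))
    return to_python(to_table(mapped))


def flow_direct(source: ToyTable) -> dict:
    """Decode once, map, and pass the mapped values through unchanged."""
    return map_fn(to_python(source))


source = ToyTable({"x": [1, 2, 3]})

conversions.clear()
result_via_table = flow_via_table(source)
steps_via_table = list(conversions)

conversions.clear()
result_direct = flow_direct(source)
steps_direct = list(conversions)
```

Both flows produce the same rows in this toy model; the first one simply pays for an extra encode and decode, which is the overhead the review comment is worried about.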

             return self._iter_arrow

     def _init_state_dict(self) -> dict:
@@ -939,8 +939,8 @@ def _init_state_dict(self) -> dict:
         return self._state_dict

     def __iter__(self):
-        if self.formatting and self.formatting.format_type == "arrow":
-            formatter = PythonFormatter()
+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
+            formatter = get_formatter(self.formatting.format_type)
Member

oh this might be a good idea

             for key, pa_table in self._iter_arrow(max_chunksize=1):
                 yield key, formatter.format_row(pa_table)
         else:
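
The `__iter__` change above replaces a hard-coded formatter with one looked up from the configured format type, then formats one-row tables as they are iterated. A minimal sketch of that pattern, assuming dicts of lists in place of Arrow tables: `ToyPythonFormatter`, `ToyStringFormatter`, `toy_get_formatter`, `iter_one_row_tables` and the `"string"` format name are all hypothetical and do not mirror the real `datasets.formatting` implementation.

```python
from typing import Iterator, Optional

# Hypothetical stand-ins; the real formatters operate on pyarrow tables.
# Here a "table" is just a dict mapping column names to lists of values.


class ToyPythonFormatter:
    def format_row(self, table: dict) -> dict:
        # Take the first (and only) row of a one-row table.
        return {name: values[0] for name, values in table.items()}


class ToyStringFormatter:
    def format_row(self, table: dict) -> dict:
        # Same as above, but render every value as a string.
        return {name: str(values[0]) for name, values in table.items()}


_FORMATTERS = {
    None: ToyPythonFormatter,
    "python": ToyPythonFormatter,
    "string": ToyStringFormatter,  # made-up format type for this sketch
}


def toy_get_formatter(format_type: Optional[str]):
    """Look up a formatter by name instead of hard-coding one."""
    return _FORMATTERS[format_type]()


def iter_one_row_tables(table: dict) -> Iterator[tuple]:
    """Mimic iterating with max_chunksize=1: yield (key, one-row table)."""
    num_rows = len(next(iter(table.values())))
    for index in range(num_rows):
        yield index, {name: [values[index]] for name, values in table.items()}


def iter_formatted(table: dict, format_type: Optional[str]) -> Iterator[tuple]:
    """Format each one-row table with the formatter chosen by format_type."""
    formatter = toy_get_formatter(format_type)
    for key, one_row in iter_one_row_tables(table):
        yield key, formatter.format_row(one_row)


rows_python = list(iter_formatted({"x": [1, 2]}, "python"))
rows_string = list(iter_formatted({"x": [1, 2]}, "string"))
```

Looking the formatter up by type means the same iteration loop serves every configured format, rather than always decoding to Python objects.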