apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207
Merged
+239
−74
Changes from 1 commit
50 commits
a73bb02 apply formatting after iter_arrow (alex-hh)
4a761a9 add support for formatting to map iteration (alex-hh)
3b65d99 formatted iterator for filter (alex-hh)
d906b9f fix filtered formatting (alex-hh)
421917d option to disable formatting for outputs of map (alex-hh)
e7b67c3 remove format_outputs kwarg (alex-hh)
a4f9700 rename batched_examples_iterator -> inputs_iterator (alex-hh)
a465abd support arbitrary input formatting in filtered examples iterable iter… (alex-hh)
1863f8c preserve formatting on filtered shuffle (alex-hh)
205e0d6 pass token_per_repo_id to python_feature_decoder in formatters (alex-hh)
42dc44f implement FormattedExamplesIterator (alex-hh)
4a8fed5 fix formatted examples iterable (alex-hh)
8cdf6a6 Merge branch 'main' into iterable-map-with-format (alex-hh)
2ddaa7d restore is_typed property (alex-hh)
dcd5017 pass formatting config to formatted examples iterable (alex-hh)
1ae947e fix formatter init (alex-hh)
20330e8 Merge branch 'main' into iterable-map-with-format (alex-hh)
8f6845f map examples iterable expects to receive rebatchedarrowexamplesiterab… (alex-hh)
3a91aac only apply features if they exist (alex-hh)
84fcf74 fix shuffle and shard (alex-hh)
4fac60a remove formatting from FilteredExamplesIterable (alex-hh)
afa78aa run pre commit (alex-hh)
5a8389b filtered iter_arrow always allowed if available (alex-hh)
c97f02e filtered examples iterable needs formatting when iter_arrow enabled (alex-hh)
76e09a1 only iter arrow on filter if formatting is set (alex-hh)
ee45f7f add features property to support feature inference (alex-hh)
b828575 fix features property (alex-hh)
f76701b dont re-encode featuers (alex-hh)
15a8cfe avoid re-encoding outputs of map (alex-hh)
884bba1 map should not preserve formatting (alex-hh)
d979672 update comment (alex-hh)
190d062 update map features property (alex-hh)
85b7d4d return bool for mapped ex iterable is typed (alex-hh)
3129274 pass return features to mapped exampels iterable constructor (alex-hh)
45f55b4 don't iter arrow with formatted filter to avoid re formatting (alex-hh)
5e31fe0 avoid re-formatting data (alex-hh)
49a84fe rename return features -> features (alex-hh)
002f5b4 update refs to return_features (alex-hh)
2479264 decode features in batched map (alex-hh)
68bfa39 preserve formatting in with_format (alex-hh)
38f78d2 fix features (mapped ex iterable (alex-hh)
f59a8e6 Merge branch 'main' into iterable-map-with-format (alex-hh)
ca2deb4 update shard (alex-hh)
4efcf11 remove formatted examples iterable from with_format (alex-hh)
f997f8c avoid reapplying features when chaining filter, map (alex-hh)
bd8bbd3 preserve formatting in map (alex-hh)
7d1c48d Merge branch 'main' into iterable-map-with-format (lhoestq)
441a95b fix tests (lhoestq)
9a0e112 style (lhoestq)
66d59c7 fix tests (lhoestq)
@@ -926,7 +926,7 @@ def __init__(

     @property
     def iter_arrow(self):
-        if self.formatting and self.formatting.format_type == "arrow":
+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
             return self._iter_arrow

     def _init_state_dict(self) -> dict:
@@ -939,8 +939,8 @@ def _init_state_dict(self) -> dict:
         return self._state_dict

     def __iter__(self):
-        if self.formatting and self.formatting.format_type == "arrow":
-            formatter = PythonFormatter()
+        if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
+            formatter = get_formatter(self.formatting.format_type)
             for key, pa_table in self._iter_arrow(max_chunksize=1):
                 yield key, formatter.format_row(pa_table)
         else:

Review comment on the line formatter = get_formatter(self.formatting.format_type):
oh this might be a good idea
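The effect of the changed condition in the diff above can be sketched in isolation. This is a minimal model, not the library's code: `FormattingConfig`, `ExIterable`, `uses_arrow_old` and `uses_arrow_new` are hypothetical stand-ins, and only the two boolean conditions are copied from the diff. It assumes, as the diff suggests, that `ex_iterable.iter_arrow` is falsy when the wrapped iterable cannot iterate over Arrow tables.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class FormattingConfig:
    """Hypothetical stand-in for the formatting config; only format_type is modeled."""
    format_type: Optional[str]


class ExIterable:
    """Hypothetical stand-in for a wrapped examples iterable."""

    def __init__(self, supports_arrow: bool):
        # Assumption: iter_arrow is None (falsy) when Arrow iteration is unsupported.
        self.iter_arrow: Optional[Callable[[], Iterator]] = (
            self._iter_arrow if supports_arrow else None
        )

    def _iter_arrow(self) -> Iterator:
        yield from ()


def uses_arrow_old(formatting, ex_iterable) -> bool:
    # Condition before the change: only the "arrow" format takes the Arrow path.
    return bool(formatting and formatting.format_type == "arrow")


def uses_arrow_new(formatting, ex_iterable) -> bool:
    # Condition after the change: any format takes the Arrow path as long as
    # the wrapped iterable can iterate over Arrow tables.
    return bool(
        formatting
        and (formatting.format_type == "arrow" or ex_iterable.iter_arrow)
    )


cases = [
    (FormattingConfig("arrow"), ExIterable(supports_arrow=False)),
    (FormattingConfig("numpy"), ExIterable(supports_arrow=True)),
    (FormattingConfig("numpy"), ExIterable(supports_arrow=False)),
    (None, ExIterable(supports_arrow=True)),
]
results = [(uses_arrow_old(f, e), uses_arrow_new(f, e)) for f, e in cases]
```

In this model only the second case changes: a non-Arrow format such as "numpy" on top of an Arrow-capable iterable now takes the Arrow path and formats each row afterwards, which is the behavior the PR title describes.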
not sure about this change though, since it will do

Original data in Arrow -> Map function in NumPy -> Arrow (since it has iter_arrow and Arrow is preferred) -> Python objects

While it could be

Original data in Arrow -> Map function in NumPy -> Python objects -> Python objects
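The concern about the extra round trip can be illustrated with a toy model. This sketch does not use pyarrow or NumPy: a dict of lists stands in for an Arrow table, plain dicts stand in for formatted rows, and every function name here is hypothetical. It only counts layout conversions along the two flows described in the comment.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Table = Dict[str, List[int]]  # columnar stand-in for an Arrow table


def table_to_rows(table: Table, log: List[str]) -> List[Row]:
    """Convert the columnar stand-in to a list of row dicts, logging the hop."""
    log.append("columnar -> rows")
    n = len(next(iter(table.values()), []))
    return [{name: col[i] for name, col in table.items()} for i in range(n)]


def rows_to_table(rows: List[Row], log: List[str]) -> Table:
    """Convert row dicts back to the columnar stand-in, logging the hop."""
    log.append("rows -> columnar")
    return {name: [row[name] for row in rows] for name in rows[0]} if rows else {}


def flow_via_arrow(table: Table, fn: Callable[[Row], Row], log: List[str]) -> List[Row]:
    # First flow: map outputs are converted back to the columnar layout
    # and only then turned into row objects again.
    mapped = [fn(row) for row in table_to_rows(table, log)]
    return table_to_rows(rows_to_table(mapped, log), log)


def flow_direct(table: Table, fn: Callable[[Row], Row], log: List[str]) -> List[Row]:
    # Second flow: map outputs are passed on as they are.
    return [fn(row) for row in table_to_rows(table, log)]


def double(row: Row) -> Row:
    return {"x": row["x"] * 2}


table = {"x": [1, 2, 3]}
log_a: List[str] = []
log_b: List[str] = []
out_a = flow_via_arrow(table, double, log_a)
out_b = flow_direct(table, double, log_b)
```

Both flows yield the same rows in this model, but the first logs three conversions and the second logs one. With real Arrow tables and NumPy arrays the actual cost of the additional hops would depend on the data types involved.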
agreed, I wasn't really understanding the flow - have updated with a more direct attempt