Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207

Merged
merged 50 commits into from
Jan 14, 2025

Conversation

alex-hh
Copy link
Contributor

@alex-hh alex-hh commented Oct 8, 2024

I got to this by hacking around a bit but it seems to solve #7206

I have no idea if this approach makes sense or would break something else?

Could maybe work on a full pr if this looks reasonable @lhoestq ? I imagine the same issue might affect other iterable dataset methods?

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting ! Let's validate if we get the same speed on those 2 ?

  1. Dataset.to_iterable_dataset -> numpy format
  2. Dataset.to_iterable_dataset -> numpy format -> batched map identity (batched could be quite important)

Comment on lines 942 to 943
if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
formatter = get_formatter(self.formatting.format_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh this might me a good idea

@@ -926,7 +926,7 @@ def __init__(

@property
def iter_arrow(self):
if self.formatting and self.formatting.format_type == "arrow":
if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this change though, since it will do

Original data in Arrow -> Map function in NumPy -> Arrow (since it has iter_arrow and Arrow is preferred) -> Python objects

While it could be

Original data in Arrow -> Map function in NumPy -> Python objects -> Python objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I wasn't really understanding the flow - have updated with a more direct attempt

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 8, 2024

I think the problem is that the underlying ex_iterable will not use iter_arrow unless the formatting type is arrow, which leads to conversion from arrow -> python -> numpy in this case rather than arrow -> numpy.

Idea of updated fix is to use the ex_iterable's iter_arrow in any case where it's available and any formatting is specified. The formatter then works directly on arrow tables; the outputs of the formatter get passed to the function to be mapped.

With updated version:

import numpy as np
import time
from datasets import Dataset, Features, Array3D

features=Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None,10,10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,10,10), dtype=np.float32) for x in [2000,1000]*25] for i in range(2)}, features=features)
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=True, batch_size=10)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Total time: < 0.01s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=False)
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02 s (~30s on main)

ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

Time: ~0.02s

@alex-hh alex-hh marked this pull request as ready for review October 8, 2024 22:04
@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 8, 2024

also now working for filter with similar performance improvements:

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: [arr.shape[0]==2000 for arr in x["array0"]], batch_size=10, batched=True)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.01s vs 50s on main

filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: x["array0"].shape[0]==2000, batched=False)
t0 = time.time()
for ex in ds:
    filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25

0.04s vs 50s on main

@alex-hh alex-hh force-pushed the iterable-map-with-format branch from 7119cd2 to d906b9f Compare October 8, 2024 22:36
@alex-hh alex-hh changed the title apply formatting after iter_arrow to speed up format -> map for iterable datasets apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets Oct 8, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -2211,6 +2250,7 @@ def map(
remove_columns: Optional[Union[str, List[str]]] = None,
features: Optional[Features] = None,
fn_kwargs: Optional[dict] = None,
format_outputs: bool = True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this useful ?

Copy link
Contributor Author

@alex-hh alex-hh Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was useful for optimising performance in my case -

idea is it allows the user to determine the formatting of the examples returned by map (within the mapped function) rather than via self.formatting, which determines the formatting of the inputs to the map function

if map returns lists / arbitrary types then re-applying the formatter to the outputs can be expensive

However, haven't thought about how to handle preserving a consistent self._formatting FormattingConfig for downstream dataset transformations...might require something better than this kwarg

Copy link
Member

@lhoestq lhoestq Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still do ds = ds.map(...).with_format(None) and no formatting will be applied to the output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect - will remove the kwarg

@lhoestq
Copy link
Member

lhoestq commented Oct 9, 2024

(the distributed tests failing in the CI are unrelated)

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 9, 2024

There also appears to be a separate? issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)

I don't have a good minimal example atm

)
batched_examples_iterator = formatted_arrow_examples_iterator(ex_iterable, formatter, batched=self.batched)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe name it input_iterator or something like that since it's not necessarily batched ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@lhoestq
Copy link
Member

lhoestq commented Oct 9, 2024

issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably)

Maybe related to this issue ?

ds = Dataset.from_dict({"a": range(10)}).to_iterable_dataset()
ds = ds.with_format("arrow").map(lambda x: x, features=Features({"a": Value("string")})).with_format(None)
print(list(ds))  # yields integers instead of strings

@lhoestq
Copy link
Member

lhoestq commented Oct 9, 2024

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

btw you can pass features= in get_formatter() to get a formatter that does the feature conversion at the same time as formatting

(edit:

except maybe the arrow formatter doesn't use features yet, we can fix it like this if it's really needed

class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]):
    def format_row(self, pa_table: pa.Table) -> pa.Table:
-         return self.simple_arrow_extractor().extract_row(pa_table)
+         pa_table = self.simple_arrow_extractor().extract_row(pa_table)
+.        return cast_table_to_features(pa_table, self.features) if self.features else pa_table
            

)

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 9, 2024

I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with formatted_python_examples_iterator and formatted_arrow_examples_iterator

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

@lhoestq
Copy link
Member

lhoestq commented Oct 9, 2024

Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better?

Yes this is surely the way to go actually !

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 9, 2024

ok i've fixed the chaining issue with my last two commits.

Will see if I can refactor into a FormattedExampleIterable

The other issue you posted seems to be unrelated (maybe something to do with feature decoding?)

Comment on lines 1268 to 1270
# if self.formatting is not None, we format, filter, then return arrow iterator
# if formatting is unset, we don't, as that would require filter fn to accept arrow format
if self.formatting and self.ex_iterable.iter_arrow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to not return iter_arrow if the formatting is not "arrow", this way we decode the data only once in the pipeline (e.g. if it contains image/audio data)

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 12, 2024

Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

@lhoestq
Copy link
Member

lhoestq commented Oct 14, 2024

Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do)

For ArrayND they already implement to_pylist to decode arrow data and it can be updated to return a numpy array (see the ArrayExtensionArray class for more details)

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments, this is so close to be ready !

info=self._info.copy(),
split=self._split,
formatting=FormattingConfig(format_type=type),
formatting=None, # formatting is applied in ex_iterable
Copy link
Member

@lhoestq lhoestq Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally we tried to keep formatting as an attribute of the IterableDataset so that people can change the formatting without having to create nested examples iterables that would apply a different formatting every time.

What is the rationale for removing it from here ?

Note that applying a formatting on data already in the right format is cheap as you also noted (but no need to do is_formatted indeed)

Copy link
Contributor Author

@alex-hh alex-hh Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.f. comment #7207 (comment)

my thinking was that this way the formatting is never `lost'; it only gets overridden if the user overrides it.

there could definitely be some nested cases that i didn't consider though, can you think of some?

I guess also that hopefully having an explicit FormattedExamplesIterable makes some of the concern about nesting at least handled in a more transparent way: I guess now formatting is now really the responsibilyt of the FormattedExamplesIterable, and only happens when it gets iterated over (unless formatting is never set in which case it is still currently handled in IterableDataset._iter_ . I guess that formatting could probably also be handled by FormattedExamplesIterable with a bit more refactoring (although would somewhat prefer not to add to this pr unless seems essential)?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By nested I actually meant consecutive, I had in mind that this code would be a no-op:

format = ds.format
ds = ds.with_format("np")
ds = ds.with_format(**format)

(actually just noticed IterableDataset.format is not implemented yet, though it should be and work like Dataset.format, but anyway the point is successive with_format would override the previous ones since it's an operation that is done "on top" of the dataset rather than a processing function)

Copy link
Contributor Author

@alex-hh alex-hh Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take the point on the consecutive formatting

For me it does kind of seem that a with_format().map() should not be assumed to preserve format state especially in the absence of any solution to disable output formatting; this is my main motivation I think.

I suppose you could imagine a solution where _formatting is preserved as an attribute by all transformations until map is applied, and only map removes it? Then the wrapping with FormattedExamplesIterable could happen in iter for non map transforms but in map directly?

Or map could just accept an output_formatting kwarg? Maybe with False to set self._formatting to None? Or with_format(False) could set self._formatting to None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this case so with_format.with_format can still no op

)
info = self.info.copy()
info.features = features
return IterableDataset(
ex_iterable=ex_iterable,
info=info,
split=self._split,
formatting=self._formatting,
formatting=None, # formatting is applied in mapped ex_iterable; no need to re-apply it in IterableDataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean formatting is already applied to the input, but not to the output afaik ?

Copy link
Contributor Author

@alex-hh alex-hh Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - i guess what i was thinking is that if a user calls

ds.with_format(format_type).map(function)

then they are responsible for choosing the format applied to the mapped function outputs:

if the mapped function does nothing, then the output formatting will still be be format_type.

it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).

e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter

Copy link
Member

@lhoestq lhoestq Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).
e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter

You can keep in self._formatting because np.asarray(obj) is cheap on an object that is already a numpy array (zero copy). And if someone does this

ds.with_format(format_type).map(function).with_format(another_type)

then np.asarray would not be called on the output since we can just replace the IterableDataset._formatting with the new one (similarly to the logic in my comment #7207 (comment))

Copy link
Contributor Author

@alex-hh alex-hh Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my question around this was that there is no way to just prevent any formatting happening on the outputs of map, because the user can't set _formatting to None (not what with_format(None) does).

So if map already takes care of formatting (including formatting to some object type not compatible with the standard formatting options provided by datasets) then keeping self._formatting risks introducing an expensive re-formatting step, even if user calls a new with_format.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe with_format(None) should imply no formatting (contrary to the current logic which is python formatting) ? would that work ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?

Realise I haven't been clear about this in my head so far, and know your comment above was somewhat related but wanted to be sure I understand!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, this is not as easy as it seems though you've been doing great ! Anyway, to answer your questions:

is it right that formatting doesn't get re-applied to output of Dataset.map?

MappedExamplesIterable doesn't apply formatting, but the IterableDataset does apply formatting if the users did set one.

e.g. would it potentially be possible to completely remove format_dict in IterableDataset.iter in the case that ex_iterable.iter_arrow evaluates to False, since in this case extraction+formatting has presumably already occurred?

the output of map can be anything in MappedExamplesIterable, it's the IterableDataset's job to format it if the user asks for a specific format. If there is no format set by the user, we yield the example as is (the data can even contain arbitrary python objects). E.g. if a user wants to do some processing from numpy arrays and return arbitrary python objects they should do ds = ds.with_format("np").map(fn).with_format(None)

So no I don't think we can remove the format_dict in IterableDataset (or we would have to add FormattedExamplesIterable after each MappedExamplesIterable but this doesn't allow disabling formatting after a map)

@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?

the second with_format sets the format of the data yielded when iterating on the IterableDataset - it doens't change the input to map

Copy link
Contributor Author

@alex-hh alex-hh Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, just to be sure, the default behaviour of with_format().map() should be to apply the specified formatting to both the inputs and the outputs of map?

I know this goes back to previous discussions but asking because it looked like for the non-iterable arrow_dataset.Dataset.map formatting only applies to the inputs of the mapped function and not the outputs of the mapped function in which case I wondered if we could do the same here - but I wasn't sure I was understanding that correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the output of Dataset.map is written as Arrow data on disk, but when the data is accessed it's formatted using the Dataset format type (map doesn't change the format of the dataset)

that's why a code likes this still returns numpy arrays:

>>> ds = with_format("np").map(...)
>>> ds[0]  # dict of numpy arrays

and it can be reset to python objects with

>>> ds = with_format("np").map(...).with_format("python")
>>> ds[0]  # dict of python objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok got it thanks!

src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved
src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved
@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 25, 2024

@lhoestq im no longer sure my specific concern about with_format(None) was well-founded - I didn't appreciate that the python formatter tries to do nothing to python objects including numpy arrays, so the existing with_format(None) should I think do what I want. Do you think with_format(None) is ok as is after all? If so think this is hopefully ready for final review!

@alex-hh
Copy link
Contributor Author

alex-hh commented Oct 31, 2024

@lhoestq I've updated to make compatible with latest changes on main, and think the current with_format None behaviour is probably fine - please let me know if there's anything else I can do!

@lhoestq
Copy link
Member

lhoestq commented Nov 5, 2024

Hi Alex, I will be less available from today and for a week. I'll review your PR and play with it once I come back if you don't mind !

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty to add a test and do slight changes to fix other tests :) it looks all good now ! IterableDataset is much faster now than before your changes :) This PR was also needed to further improve polars/pandas/arrow support

@lhoestq lhoestq merged commit 75e61d1 into huggingface:main Jan 14, 2025
14 checks passed
@alex-hh
Copy link
Contributor Author

alex-hh commented Jan 14, 2025

thanks for the reviews and extensions, happy to see this merged :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants