-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets #7207
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting ! Let's validate if we get the same speed on those 2 ?
- Dataset.to_iterable_dataset -> numpy format
- Dataset.to_iterable_dataset -> numpy format -> batched map identity (batched could be quite important)
src/datasets/iterable_dataset.py
Outdated
if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow): | ||
formatter = get_formatter(self.formatting.format_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh this might me a good idea
src/datasets/iterable_dataset.py
Outdated
@@ -926,7 +926,7 @@ def __init__( | |||
|
|||
@property | |||
def iter_arrow(self): | |||
if self.formatting and self.formatting.format_type == "arrow": | |||
if self.formatting and (self.formatting.format_type == "arrow" or self.ex_iterable.iter_arrow): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about this change though, since it will do
Original data in Arrow -> Map function in NumPy -> Arrow (since it has iter_arrow
and Arrow is preferred) -> Python objects
While it could be
Original data in Arrow -> Map function in NumPy -> Python objects -> Python objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, I wasn't really understanding the flow - have updated with a more direct attempt
I think the problem is that the underlying ex_iterable will not use iter_arrow unless the formatting type is arrow, which leads to conversion from arrow -> python -> numpy in this case rather than arrow -> numpy. Idea of updated fix is to use the ex_iterable's iter_arrow in any case where it's available and any formatting is specified. The formatter then works directly on arrow tables; the outputs of the formatter get passed to the function to be mapped. With updated version: import numpy as np
import time
from datasets import Dataset, Features, Array3D
features=Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None,10,10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,10,10), dtype=np.float32) for x in [2000,1000]*25] for i in range(2)}, features=features) ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=True, batch_size=10)
t0 = time.time()
for ex in ds:
pass
t1 = time.time() Total time: < 0.01s (~30s on main) ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").map(lambda x: x, batched=False)
t0 = time.time()
for ex in ds:
pass
t1 = time.time() Time: ~0.02 s (~30s on main) ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy")
t0 = time.time()
for ex in ds:
pass
t1 = time.time() Time: ~0.02s |
also now working for filter with similar performance improvements: filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: [arr.shape[0]==2000 for arr in x["array0"]], batch_size=10, batched=True)
t0 = time.time()
for ex in ds:
filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25 0.01s vs 50s on main filtered_examples = []
ds = dataset.to_iterable_dataset()
ds = ds.with_format("numpy").filter(lambda x: x["array0"].shape[0]==2000, batched=False)
t0 = time.time()
for ex in ds:
filtered_examples.append(ex)
t1 = time.time()
assert len(filtered_examples) == 25 0.04s vs 50s on main |
7119cd2
to
d906b9f
Compare
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
src/datasets/iterable_dataset.py
Outdated
@@ -2211,6 +2250,7 @@ def map( | |||
remove_columns: Optional[Union[str, List[str]]] = None, | |||
features: Optional[Features] = None, | |||
fn_kwargs: Optional[dict] = None, | |||
format_outputs: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this useful ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was useful for optimising performance in my case -
idea is it allows the user to determine the formatting of the examples returned by map (within the mapped function) rather than via self.formatting, which determines the formatting of the inputs to the map function
if map returns lists / arbitrary types then re-applying the formatter to the outputs can be expensive
However, haven't thought about how to handle preserving a consistent self._formatting FormattingConfig for downstream dataset transformations...might require something better than this kwarg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can still do ds = ds.map(...).with_format(None)
and no formatting will be applied to the output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect - will remove the kwarg
(the distributed tests failing in the CI are unrelated) |
There also appears to be a separate? issue with chaining filter and map bc filter iter_arrow only returns _iter_arrow if arrow formatting is applied (and vv presumably) I don't have a good minimal example atm |
src/datasets/iterable_dataset.py
Outdated
) | ||
batched_examples_iterator = formatted_arrow_examples_iterator(ex_iterable, formatter, batched=self.batched) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe name it input_iterator
or something like that since it's not necessarily batched ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Maybe related to this issue ? ds = Dataset.from_dict({"a": range(10)}).to_iterable_dataset()
ds = ds.with_format("arrow").map(lambda x: x, features=Features({"a": Value("string")})).with_format(None)
print(list(ds)) # yields integers instead of strings |
I feel like we could get rid of TypedExampleIterable altogether and apply formatting with feature conversion with btw you can pass (edit: except maybe the arrow formatter doesn't use class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]):
def format_row(self, pa_table: pa.Table) -> pa.Table:
- return self.simple_arrow_extractor().extract_row(pa_table)
+ pa_table = self.simple_arrow_extractor().extract_row(pa_table)
+. return cast_table_to_features(pa_table, self.features) if self.features else pa_table
) |
Oh nice didn't know about the feature support in get_formatter. Haven't thought through whether this works but would a FormattedExampleIterable (with feature conversion) be able to solve this and fit the API better? |
Yes this is surely the way to go actually ! |
ok i've fixed the chaining issue with my last two commits. Will see if I can refactor into a FormattedExampleIterable The other issue you posted seems to be unrelated (maybe something to do with feature decoding?) |
src/datasets/iterable_dataset.py
Outdated
# if self.formatting is not None, we format, filter, then return arrow iterator | ||
# if formatting is unset, we don't, as that would require filter fn to accept arrow format | ||
if self.formatting and self.ex_iterable.iter_arrow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to not return iter_arrow if the formatting is not "arrow", this way we decode the data only once in the pipeline (e.g. if it contains image/audio data)
Thinking about this in the context of #7210 - am wondering if it would make sense for Features to define their own extraction arrow->object logic? e.g. Arrays should always be extracted with NumpyArrowExtractor, not only in case with_format is set to numpy (which a user can easily forget or not know to do) |
For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more comments, this is so close to be ready !
src/datasets/iterable_dataset.py
Outdated
info=self._info.copy(), | ||
split=self._split, | ||
formatting=FormattingConfig(format_type=type), | ||
formatting=None, # formatting is applied in ex_iterable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally we tried to keep formatting
as an attribute of the IterableDataset
so that people can change the formatting without having to create nested examples iterables that would apply a different formatting every time.
What is the rationale for removing it from here ?
Note that applying a formatting on data already in the right format is cheap as you also noted (but no need to do is_formatted indeed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c.f. comment #7207 (comment)
my thinking was that this way the formatting is never `lost'; it only gets overridden if the user overrides it.
there could definitely be some nested cases that i didn't consider though, can you think of some?
I guess also that hopefully having an explicit FormattedExamplesIterable makes some of the concern about nesting at least handled in a more transparent way: I guess now formatting is now really the responsibilyt of the FormattedExamplesIterable, and only happens when it gets iterated over (unless formatting is never set in which case it is still currently handled in IterableDataset._iter_ . I guess that formatting could probably also be handled by FormattedExamplesIterable with a bit more refactoring (although would somewhat prefer not to add to this pr unless seems essential)?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By nested I actually meant consecutive, I had in mind that this code would be a no-op:
format = ds.format
ds = ds.with_format("np")
ds = ds.with_format(**format)
(actually just noticed IterableDataset.format
is not implemented yet, though it should be and work like Dataset.format
, but anyway the point is successive with_format
would override the previous ones since it's an operation that is done "on top" of the dataset rather than a processing function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take the point on the consecutive formatting
For me it does kind of seem that a with_format().map() should not be assumed to preserve format state especially in the absence of any solution to disable output formatting; this is my main motivation I think.
I suppose you could imagine a solution where _formatting is preserved as an attribute by all transformations until map is applied, and only map removes it? Then the wrapping with FormattedExamplesIterable could happen in iter for non map transforms but in map directly?
Or map could just accept an output_formatting kwarg? Maybe with False to set self._formatting to None? Or with_format(False) could set self._formatting to None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted this case so with_format.with_format can still no op
src/datasets/iterable_dataset.py
Outdated
) | ||
info = self.info.copy() | ||
info.features = features | ||
return IterableDataset( | ||
ex_iterable=ex_iterable, | ||
info=info, | ||
split=self._split, | ||
formatting=self._formatting, | ||
formatting=None, # formatting is applied in mapped ex_iterable; no need to re-apply it in IterableDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean formatting is already applied to the input, but not to the output afaik ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - i guess what i was thinking is that if a user calls
ds.with_format(format_type).map(function)
then they are responsible for choosing the format applied to the mapped function outputs:
if the mapped function does nothing, then the output formatting will still be be format_type.
it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).
e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't seem ideal to re-format the user's outputs (providing them with no way of overriding this).
e.g if i do with_format("numpy").map(fn) where fn extracts some python object from numpy arrays, I don't want to have np.asarray(obj) being called on the resulting object by the numpy formatter
You can keep in self._formatting
because np.asarray(obj)
is cheap on an object that is already a numpy array (zero copy). And if someone does this
ds.with_format(format_type).map(function).with_format(another_type)
then np.asarray would not be called on the output since we can just replace the IterableDataset._formatting
with the new one (similarly to the logic in my comment #7207 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my question around this was that there is no way to just prevent any formatting happening on the outputs of map, because the user can't set _formatting to None (not what with_format(None) does).
So if map already takes care of formatting (including formatting to some object type not compatible with the standard formatting options provided by datasets) then keeping self._formatting risks introducing an expensive re-formatting step, even if user calls a new with_format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe with_format(None)
should imply no formatting (contrary to the current logic which is python formatting) ? would that work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?
Realise I haven't been clear about this in my head so far, and know your comment above was somewhat related but wanted to be sure I understand!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, this is not as easy as it seems though you've been doing great ! Anyway, to answer your questions:
is it right that formatting doesn't get re-applied to output of Dataset.map?
MappedExamplesIterable doesn't apply formatting, but the IterableDataset does apply formatting if the users did set one.
e.g. would it potentially be possible to completely remove format_dict in IterableDataset.iter in the case that ex_iterable.iter_arrow evaluates to False, since in this case extraction+formatting has presumably already occurred?
the output of map can be anything in MappedExamplesIterable, it's the IterableDataset's job to format it if the user asks for a specific format. If there is no format set by the user, we yield the example as is (the data can even contain arbitrary python objects). E.g. if a user wants to do some processing from numpy arrays and return arbitrary python objects they should do ds = ds.with_format("np").map(fn).with_format(None)
So no I don't think we can remove the format_dict in IterableDataset (or we would have to add FormattedExamplesIterable after each MappedExamplesIterable but this doesn't allow disabling formatting after a map)
@lhoestq what is the intended chaining logic? Should with_format.map.with_format override the original with_format and change the formatting of the inputs to map? Or is it intended to apply to the outputs?
the second with_format sets the format of the data yielded when iterating on the IterableDataset - it doens't change the input to map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, just to be sure, the default behaviour of with_format().map() should be to apply the specified formatting to both the inputs and the outputs of map?
I know this goes back to previous discussions but asking because it looked like for the non-iterable arrow_dataset.Dataset.map formatting only applies to the inputs of the mapped function and not the outputs of the mapped function in which case I wondered if we could do the same here - but I wasn't sure I was understanding that correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the output of Dataset.map is written as Arrow data on disk, but when the data is accessed it's formatted using the Dataset format type (map doesn't change the format of the dataset)
that's why a code likes this still returns numpy arrays:
>>> ds = with_format("np").map(...)
>>> ds[0] # dict of numpy arrays
and it can be reset to python objects with
>>> ds = with_format("np").map(...).with_format("python")
>>> ds[0] # dict of python objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok got it thanks!
@lhoestq im no longer sure my specific concern about with_format(None) was well-founded - I didn't appreciate that the python formatter tries to do nothing to python objects including numpy arrays, so the existing with_format(None) should I think do what I want. Do you think with_format(None) is ok as is after all? If so think this is hopefully ready for final review! |
@lhoestq I've updated to make compatible with latest changes on main, and think the current with_format None behaviour is probably fine - please let me know if there's anything else I can do! |
Hi Alex, I will be less available from today and for a week. I'll review your PR and play with it once I come back if you don't mind ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the liberty to add a test and do slight changes to fix other tests :) it looks all good now ! IterableDataset is much faster now than before your changes :) This PR was also needed to further improve polars/pandas/arrow support
thanks for the reviews and extensions, happy to see this merged :) |
I got to this by hacking around a bit but it seems to solve #7206
I have no idea if this approach makes sense or would break something else?
Could maybe work on a full pr if this looks reasonable @lhoestq ? I imagine the same issue might affect other iterable dataset methods?