Support dask dataframes as input to lens.summarise #12

zblz · 2017-08-15T14:28:23Z

Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset that can be analysed: it must fit in the memory available on the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, this initial requirement prevents lens from scaling.

Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.DataFrame as arguments. In most cases we should be able to use the dask dataframe API directly, but for some metrics it will be necessary to access the individual chunks and reduce the per-chunk results appropriately.
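As a rough illustration of the "access the individual chunks and reduce" case, here is a library-independent sketch in plain Python. The names `Partial`, `summarise_chunk` and `combine` are hypothetical and are not part of lens. The chunks are plain lists standing in for dataframe partitions, and the sketch assumes every chunk is non-empty. With dask, the per-chunk step could presumably be applied through something like `map_partitions`, but that wiring is not shown here.

```python
from typing import Iterable, NamedTuple, Sequence


class Partial(NamedTuple):
    """Per-chunk summary that can be merged without revisiting the data."""

    count: int
    total: float
    minimum: float
    maximum: float


def summarise_chunk(values: Sequence[float]) -> Partial:
    """Compute a small summary for one chunk (assumed non-empty)."""
    return Partial(len(values), sum(values), min(values), max(values))


def combine(partials: Iterable[Partial]) -> dict:
    """Reduce the per-chunk summaries into a single result."""
    partials = list(partials)
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return {
        "count": count,
        # The global mean comes from summed totals and counts,
        # not from averaging the per-chunk means.
        "mean": total / count,
        "min": min(p.minimum for p in partials),
        "max": max(p.maximum for p in partials),
    }


# Three chunks standing in for the partitions of a larger column.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
summary = combine(summarise_chunk(chunk) for chunk in chunks)
```

Only the small `Partial` tuples need to be held together at the reduction step, which is what would let a metric of this kind run on data larger than memory. Metrics that cannot be merged this way (exact medians, for example) would need a different approach.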

Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.

zblz added enhancement feature and removed enhancement labels Aug 15, 2017

zblz mentioned this issue Sep 14, 2018

WIP: Allow dask dataframes as input to lens.summarise #37

Open
