Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dask dataframes as input to lens.summarise #12

Open
zblz opened this issue Aug 15, 2017 · 0 comments
Open

Support dask dataframes as input to lens.summarise #12

zblz opened this issue Aug 15, 2017 · 0 comments
Labels

Comments

@zblz
Copy link
Member

zblz commented Aug 15, 2017

Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.

Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.Dataframe as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.

Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

1 participant