You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.
Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.Dataframe as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.
Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.
The text was updated successfully, but these errors were encountered:
Currently, lens requires a pandas dataframe as input to the
lens.summarise
method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement preventslens
from scaling.Ideally,
lens.summarise
should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions inlens.metrics
, given that all of them currently take eitherpd.Series
orpd.Dataframe
as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.
The text was updated successfully, but these errors were encountered: