
[python-package] merge multiple data files into one #6151

aslongaspossible opened this issue Oct 23, 2023 · 10 comments

@aslongaspossible

Summary

I hope to be able to merge multiple data files (in the LightGBM Dataset binary format) into one big file.

Motivation

I found that the most memory-consuming step is constructing the LightGBM Dataset. I wanted to train a model with a large pandas DataFrame, and memory usage always doubles (or even quadruples) at the beginning, sometimes causing a MemoryError; the same thing happens when I call save_binary. (That's why I suspect it's the Dataset-construction step.) If I instead start training from a LightGBM Dataset binary file, memory usage looks fine. So if it were possible to generate several small LightGBM Datasets first and then merge them into one, perhaps I could train on larger data.

@shiyu1994
Collaborator

@aslongaspossible Thanks for using LightGBM.

Yes. It is possible. Please check the method add_features_from:
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset.add_features_from

The method allows the user to merge features from another Dataset into an existing one. So, in your case, you may want to first split the data by features, convert each piece into a binary Dataset, and then merge them using this method.
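
For illustration, here is a rough sketch of that column-wise workflow, under the assumption that you have already split the raw data into column blocks and saved each block as a binary Dataset (the file names are placeholders):

import lightgbm as lgb

# Hypothetical binary Dataset files, each holding a different block of columns
# for the same rows, produced earlier with Dataset.save_binary().
merged = lgb.Dataset("features_block_0.bin").construct()
for path in ["features_block_1.bin", "features_block_2.bin"]:
    # Both Datasets must be constructed before calling add_features_from().
    other = lgb.Dataset(path).construct()
    merged.add_features_from(other)

# "merged" now contains the union of all feature columns and can be saved again.
merged.save_binary("merged_features.bin")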

@jameslamb
Collaborator

In the Python package, it's also possible to construct a LightGBM Dataset incrementally via the Sequence interface.

Here's an example where the data are partitioned into multiple HDF5 files on disk: https://github.com/microsoft/LightGBM/blob/fcf76bceb902a69e1af7cdeac289d31916ddc3f5/examples/python-guide/dataset_from_multi_hdf5.py

Here's another example of implementing that interface:

import numbers

import lightgbm as lgb


class NumpySequence(lgb.Sequence):
    def __init__(self, ndarray, batch_size):
        self.ndarray = ndarray
        self.batch_size = batch_size

    def __getitem__(self, idx):
        # The simple implementation is just a single "return self.ndarray[idx]".
        # The following is for demo and testing purposes.
        if isinstance(idx, numbers.Integral):
            return self.ndarray[idx]
        elif isinstance(idx, slice):
            if not (idx.step is None or idx.step == 1):
                raise NotImplementedError("No need to implement, caller will not set step by now")
            return self.ndarray[idx.start:idx.stop]
        elif isinstance(idx, list):
            return self.ndarray[idx]
        else:
            raise TypeError(f"Sequence index must be an integer/list/slice, got {type(idx).__name__}")

    def __len__(self):
        return len(self.ndarray)
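
And a hypothetical usage of that class, just to show how it plugs into Dataset construction (the array shape, labels, and batch size are placeholders):

import numpy as np

X = np.random.rand(100_000, 20)
y = np.random.randint(0, 2, size=100_000)

seq = NumpySequence(X, batch_size=4096)
train_data = lgb.Dataset(seq, label=y)
train_data.save_binary("train.bin")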

Because the LightGBM Dataset discretizes continuous features into histogram bins, it's much smaller in memory than the corresponding pandas representation of the data. So the combination of the following should result in lower peak memory usage and an identical LightGBM Dataset:

  • split all the raw data into multiple files (CSV, Parquet, pickle, npy... whatever you want)
    • each should have identical columns, but a non-overlapping subset of the full dataset's rows
  • implement the lightgbm.Sequence interface, where each __getitem__() call returns a numpy array of rows read from one of those files
  • pass that Sequence (or a list of them) as data to lightgbm.Dataset() (see the sketch after this list)
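
As a rough sketch of those steps (loosely adapted from the linked example; the file names and the "X"/"y" HDF5 dataset keys are assumptions, not anything LightGBM requires):

import h5py
import lightgbm as lgb
import numpy as np


class HDF5Sequence(lgb.Sequence):
    def __init__(self, h5_dataset, batch_size=10_000):
        # Keep only a handle; rows are read from disk on demand in __getitem__().
        self.data = h5_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


files = [h5py.File(p, "r") for p in ["part-0.h5", "part-1.h5", "part-2.h5"]]
seqs = [HDF5Sequence(f["X"]) for f in files]
label = np.concatenate([f["y"][:] for f in files])

# Dataset() accepts a list of Sequence objects and reads them batch by batch,
# so only one batch of raw rows needs to be in memory at a time.
train_data = lgb.Dataset(seqs, label=label)
train_data.save_binary("train.bin")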

If you try that approach, @shiyu1994's suggestion, or both, please do post back here to help others.

We also acknowledge that the Sequence interface isn't very well documented right now... we'd welcome additional improvements to the documentation that you think might make this easier for others in the future.

@jameslamb changed the title from "merge multiple data files into one" to "[python-package] merge multiple data files into one" on Oct 27, 2023
@aslongaspossible
Author

It seems that add_features_from merges files split by columns, while the Sequence interface merges files split by rows? Will there be any difference in the final histograms (i.e. the LightGBM Dataset) between these two methods? If so, which one gives better predictive ability?

@aslongaspossible
Author

@aslongaspossible Thanks for using LightGBM.

Yes. It is possible. Please check the method add_features_from: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset.add_features_from

The method allows the user to merge features from another Dataset into an existing one. So, in your case, you may want to first split the data by features, convert each piece into a binary Dataset, and then merge them using this method.

Will the discretization result depend on the order in which features are added? Will features that are added first carry higher weight than those added later?

@jameslamb
Collaborator

It seems that add_features_from merges files split by columns, while the Sequence interface merges files split by rows?

Yes.

Will there be any difference in the final histograms (i.e. the LightGBM Dataset) between these two methods?

If there are, it'd be considered a bug.

Will the discretization result depend on the order in which features are added? Will features that are added first carry higher weight than those added later?

No and no. The order of features doesn't affect LightGBM's training process...all splits from all included features are considered each time a new node is added to a tree.

@aslongaspossible
Author

aslongaspossible commented Oct 29, 2023

In the Python package, it's also possible to construct a LightGBM Dataset incrementally via the Sequence interface.

Here's an example where the data are partitioned into multiple HDF5 files on disk: https://github.com/microsoft/LightGBM/blob/fcf76bceb902a69e1af7cdeac289d31916ddc3f5/examples/python-guide/dataset_from_multi_hdf5.py

It seems that the Sequence interface still has to load all training data into memory before constructing? So is add_features_from an essential step when the whole training dataset is larger than available memory?

@jameslamb
Collaborator

jameslamb commented Oct 30, 2023

It seems that the Sequence interface still has to load all training data into memory before constructing?

What evidence do you have of that? Providing a reproducible example or linking to supporting evidence would cut out some question-and-answer cycles here and reduce the effort required to help you...we'd really appreciate that.

From what I recall, the Python package iterates over a Sequence one item at a time and does not hold all of those items in memory at once. That was one of the original motivations for that interface when it was first introduced in #4089.

But it's been a while since I've looked closely at that code. You could help us by describing exactly the evidence supporting that claim.

@aslongaspossible
Author

Sorry for misunderstanding that Sequence example. I tried HDF5 and the memory usage is reduced. However, it seems slow to construct. (It has been constructing for several hours without output, and I'm not sure whether I should terminate it.) Is there any way to speed up the construction? What's the best batch size?

@jameslamb
Collaborator

Is there any method to speed up the constructing?

  • use fewer, larger files to reduce the total time spent on disk I/O
  • ensure the environment variable OMP_NUM_THREADS and/or the Dataset parameter num_threads is set to the number of physical CPU cores on your machine
    • and, if you're in an environment where this is possible, switch to a machine with more CPUs to take even more advantage of parallelism
  • reduce the Dataset parameter max_bin or bin_construct_sample_cnt (see the sketch after this list)
  • manually remove low-variance features from the raw data before presenting the data to LightGBM
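
For example, a hypothetical set of Dataset parameters along those lines, reusing seqs and label from the earlier sketch (the specific values are only illustrative and should be tuned for your data and hardware):

import lightgbm as lgb

# Set OMP_NUM_THREADS in the shell before starting Python, e.g.
#   export OMP_NUM_THREADS=16
params = {
    "num_threads": 16,                   # number of physical CPU cores
    "max_bin": 63,                       # fewer bins -> faster Dataset construction (default is 255)
    "bin_construct_sample_cnt": 50000,   # fewer sampled rows for choosing bin boundaries (default is 200000)
}
train_data = lgb.Dataset(seqs, label=label, params=params)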

What's the best batch size?

This is dependent on the size of your data and the environment you're running lightgbm in. In general, fewer + larger files should mean better compression on disk and therefore less disk I/O. But larger files increase the risk of out-of-memory errors.

You will have to determine this for yourself through experimentation.

@aslongaspossible
Author

I have difficulty creating validation data with add_features_from. I created each split of the validation dataset with the reference parameter pointing to the corresponding split of the training data, and then merged both the validation and training data with add_features_from. When I tried to train with these data, it raised the error "Cannot add validation data, since it has different bin mappers with training data". Aren't the bin mappers preserved after add_features_from? Is there a way to merge validation data?
