
[python-package] merge multiple data files into one #6151

aslongaspossible opened this issue Oct 23, 2023 · 10 comments

@aslongaspossible

Summary

I hope to be able to merge multiple data files (in the LightGBM Dataset binary format) into one big file.

Motivation

I found that the most memory-consuming step is constructing the LightGBM Dataset. I wanted to train a model with a large pandas DataFrame, and memory usage always doubles (or even quadruples) at the beginning, sometimes causing a MemoryError; the same thing happens when I call save_binary. (That's why I suspect it's the Dataset-construction step.) If I instead start training from a LightGBM Dataset binary file, memory usage looks fine. So if it were possible to generate several small LightGBM Datasets first and then merge them into one, perhaps I could train on larger data.

@shiyu1994
Collaborator

@aslongaspossible Thanks for using LightGBM.

Yes. It is possible. Please check the method add_features_from:
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset.add_features_from

The method allows the user to merge features from another Dataset into an existing one. So, in your case, you may want to first split the data by features, convert each piece into a binary Dataset, and then merge them using this method.
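
For illustration, here is a rough sketch of that column-wise workflow, under the assumption that you have already split the raw data into column blocks and saved each block as a binary Dataset (the file names are placeholders):

import lightgbm as lgb

# Hypothetical binary Dataset files, each holding a different block of columns
# for the same rows, produced earlier with Dataset.save_binary().
merged = lgb.Dataset("features_block_0.bin").construct()
for path in ["features_block_1.bin", "features_block_2.bin"]:
    # Both Datasets must be constructed before calling add_features_from().
    other = lgb.Dataset(path).construct()
    merged.add_features_from(other)

# "merged" now contains the union of all feature columns and can be saved again.
merged.save_binary("merged_features.bin")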

@jameslamb
Collaborator

In the Python package, it's also possible to construct a LightGBM Dataset incrementally via the Sequence interface.

Here's an example where the data are partitioned into multiple HDF5 files on disk: https://github.com/microsoft/LightGBM/blob/fcf76bceb902a69e1af7cdeac289d31916ddc3f5/examples/python-guide/dataset_from_multi_hdf5.py

Here's another example of implementing that interface:

import numbers

import lightgbm as lgb


class NumpySequence(lgb.Sequence):
    def __init__(self, ndarray, batch_size):
        self.ndarray = ndarray
        self.batch_size = batch_size

    def __getitem__(self, idx):
        # The simple implementation is just a single "return self.ndarray[idx]".
        # The following is for demo and testing purposes.
        if isinstance(idx, numbers.Integral):
            return self.ndarray[idx]
        elif isinstance(idx, slice):
            if not (idx.step is None or idx.step == 1):
                raise NotImplementedError("No need to implement, caller will not set step by now")
            return self.ndarray[idx.start:idx.stop]
        elif isinstance(idx, list):
            return self.ndarray[idx]
        else:
            raise TypeError(f"Sequence index must be an integer/list/slice, got {type(idx).__name__}")

    def __len__(self):
        return len(self.ndarray)
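
And a hypothetical usage of that class, just to show how it plugs into Dataset construction (the array shape, labels, and batch size are placeholders):

import numpy as np

X = np.random.rand(100_000, 20)
y = np.random.randint(0, 2, size=100_000)

seq = NumpySequence(X, batch_size=4096)
train_data = lgb.Dataset(seq, label=y)
train_data.save_binary("train.bin")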

Because the LightGBM Dataset discretizes continuous features into histogram bins, it's much smaller in memory than the corresponding pandas representation of the data. So the combination of the following should result in lower peak memory usage and an identical LightGBM Dataset:

  • split all the raw data into multiple files (CSV, Parquet, pickle, npy... whatever you want)
    • each should have identical columns, but a non-overlapping subset of the full dataset's rows
  • implement the lightgbm.Sequence interface, where each __getitem__() call returns a numpy array of rows read from one of those files
  • pass that Sequence (or a list of them) as data to lightgbm.Dataset() (see the sketch after this list)
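
As a rough sketch of those steps (loosely adapted from the linked example; the file names and the "X"/"y" HDF5 dataset keys are assumptions, not anything LightGBM requires):

import h5py
import lightgbm as lgb
import numpy as np


class HDF5Sequence(lgb.Sequence):
    def __init__(self, h5_dataset, batch_size=10_000):
        # Keep only a handle; rows are read from disk on demand in __getitem__().
        self.data = h5_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


files = [h5py.File(p, "r") for p in ["part-0.h5", "part-1.h5", "part-2.h5"]]
seqs = [HDF5Sequence(f["X"]) for f in files]
label = np.concatenate([f["y"][:] for f in files])

# Dataset() accepts a list of Sequence objects and reads them batch by batch,
# so only one batch of raw rows needs to be in memory at a time.
train_data = lgb.Dataset(seqs, label=label)
train_data.save_binary("train.bin")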

If you try that approach, @shiyu1994's suggestion, or both, please do post back here to help others.

We also acknowledge that the Sequence interface isn't very well documented right now... we'd welcome additional improvements to the documentation that you think might make this easier for others in the future.

@jameslamb changed the title from "merge multiple data files into one" to "[python-package] merge multiple data files into one" on Oct 27, 2023
@aslongaspossible
Author

It seems that add_features_from merges files split by columns, while the Sequence interface merges files split by rows? Will there be any difference in the final histograms (i.e. the LightGBM Dataset) between these two methods? If so, which one gives better predictive ability?

@aslongaspossible
Author

@aslongaspossible Thanks for using LightGBM.

Yes. It is possible. Please check the method add_features_from: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset.add_features_from

The method allows the user to merge features from another Dataset into an existing one. So, in your case, you may want to first split the data by features, convert each piece into a binary Dataset, and then merge them using this method.

Will the discretization result depend on the order in which features are added? Will features that are added first carry higher weight than those added later?

@jameslamb
Collaborator

It seems that add_features_from merges files split by columns, while the Sequence interface merges files split by rows?

Yes.

Will there be any difference in the final histograms (i.e. the LightGBM Dataset) between these two methods?

If there are, it'd be considered a bug.

Will the discretization result depend on the order in which features are added? Will features that are added first carry higher weight than those added later?

No and no. The order of features doesn't affect LightGBM's training process...all splits from all included features are considered each time a new node is added to a tree.

@aslongaspossible
Author

aslongaspossible commented Oct 29, 2023

In the Python package, it's also possible to construct a LightGBM Dataset incrementally via the Sequence interface.

Here's an example where the data are partitioned into multiple HDF5 files on disk: https://github.com/microsoft/LightGBM/blob/fcf76bceb902a69e1af7cdeac289d31916ddc3f5/examples/python-guide/dataset_from_multi_hdf5.py

It seems that the Sequence interface still has to load all training data into memory before constructing? So is add_features_from an essential step when the whole training dataset is larger than available memory?

@jameslamb
Collaborator

jameslamb commented Oct 30, 2023

It seems that the Sequence interface still has to load all training data into memory before constructing?

What evidence do you have of that? Providing a reproducible example or linking to supporting evidence would cut out some question-and-answer cycles here and reduce the effort required to help you...we'd really appreciate that.

From what I recall, the Python package iterates over a Sequence one item at a time and does not hold all of those items in memory at once. That was one of the original motivations for that interface when it was first introduced in #4089.

But it's been a while since I've looked closely at that code. You could help us by describing exactly the evidence supporting that claim.

@aslongaspossible
Author

Sorry for misunderstanding that Sequence example. I tried HDF5 and the memory usage is reduced. However, it seems slow to construct. (It has been constructing for several hours without output, and I'm not sure whether I should terminate it.) Is there any way to speed up the construction? What's the best batch size?

@jameslamb
Collaborator

Is there any method to speed up the constructing?

  • use fewer, larger files to reduce the total time spent on disk I/O
  • ensure the environment variable OMP_NUM_THREADS and/or the Dataset parameter num_threads is set to the number of physical CPU cores on your machine
    • and, if you're in an environment where this is possible, switch to a machine with more CPUs to take even more advantage of parallelism
  • reduce the Dataset parameter max_bin or bin_construct_sample_cnt (see the sketch after this list)
  • manually remove low-variance features from the raw data before presenting the data to LightGBM
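
For example, a hypothetical set of Dataset parameters along those lines, reusing seqs and label from the earlier sketch (the specific values are only illustrative and should be tuned for your data and hardware):

import lightgbm as lgb

# Set OMP_NUM_THREADS in the shell before starting Python, e.g.
#   export OMP_NUM_THREADS=16
params = {
    "num_threads": 16,                   # number of physical CPU cores
    "max_bin": 63,                       # fewer bins -> faster Dataset construction (default is 255)
    "bin_construct_sample_cnt": 50000,   # fewer sampled rows for choosing bin boundaries (default is 200000)
}
train_data = lgb.Dataset(seqs, label=label, params=params)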

What's the best batch size?

This is dependent on the size of your data and the environment you're running lightgbm in. In general, fewer + larger files should mean better compression on disk and therefore less disk I/O. But larger files increase the risk of out-of-memory errors.

You will have to determine this for yourself through experimentation.

@aslongaspossible
Author

I have difficulty creating validation data with add_features_from. I created each split of the validation dataset with the reference parameter pointing to the corresponding split of the training data, and then merged both the validation and training data with add_features_from. When I tried to train with these data, it raised the error "Cannot add validation data, since it has different bin mappers with training data". Aren't the bin mappers preserved after add_features_from? Is there a way to merge validation data?
