[python-package] merge multiple data files into one #6151
@aslongaspossible Thanks for using LightGBM. Yes, it is possible. Please check the `Dataset.add_features_from` method. The method allows users to merge features from another dataset into an existing one. So, in your case, you may want to first split the dataset by features, convert each part into a binary dataset, and merge them using this method.
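A minimal sketch of that suggestion, assuming `Dataset.add_features_from` is the method being referred to (the data shapes, column split, and training parameters below are illustrative, not from this thread):

```python
import numpy as np
import lightgbm as lgb

# Illustrative data: 100 features split into two column blocks.
X = np.random.rand(10_000, 100)
y = np.random.rand(10_000)

# Keep raw data (free_raw_data=False) so the constructed Datasets can still be merged.
part_a = lgb.Dataset(X[:, :50], label=y, free_raw_data=False).construct()
part_b = lgb.Dataset(X[:, 50:], free_raw_data=False).construct()

# Merge the features of part_b into part_a; part_a now holds all 100 features.
part_a.add_features_from(part_b)

booster = lgb.train({"objective": "regression"}, part_a)
```

Note that `add_features_from` merges feature columns for the same set of rows; it is not a way to append additional rows.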
In the Python package, it's also possible to construct a LightGBM Dataset from a list of objects implementing the `lightgbm.Sequence` interface. Here's an example where the data are partitioned into multiple pieces and combined into a single Dataset. Here's another example of implementing that interface: LightGBM/tests/python_package_test/test_basic.py, lines 98 to 118 in fcf76bc
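A self-contained sketch of that approach, assuming the `lightgbm.Sequence` interface mentioned above (the class name `NumpySequence`, the batch size, and the array shapes are made up for illustration):

```python
import numpy as np
import lightgbm as lgb

class NumpySequence(lgb.Sequence):
    """Toy Sequence backed by an in-memory array; a real one might read rows from disk."""

    def __init__(self, data, batch_size=4096):
        self.data = data
        self.batch_size = batch_size  # rows pulled per batch during Dataset construction

    def __getitem__(self, idx):
        # Must support integer and slice indexing over rows.
        return self.data[idx]

    def __len__(self):
        return len(self.data)

# Partition the raw data, wrap each partition, and build a single Dataset from the list.
parts = [np.random.rand(5_000, 20) for _ in range(3)]
y = np.random.rand(15_000)
ds = lgb.Dataset([NumpySequence(p) for p in parts], label=y)
ds.save_binary("merged_train.bin")
```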
Because the LightGBM
If you do that or @shiyu1994's suggestion or both, please do post about it here to help others. We also acknowledge that the
Seems that
Will the discretization result depend on the order in which features are added? Will features added first be given higher weight than those added later?
Yes.
If there are, it'd be considered a bug.
No and no. The order of features doesn't affect LightGBM's training process: all splits from all included features are considered each time a new node is added to a tree.
Seems that
What evidence do you have of that? Providing a reproducible example or linking to supporting evidence would cut out some question-and-answer cycles here and reduce the effort required to help you; we'd really appreciate that. From what I recall, the Python package iterates over a `Sequence` in batches, but it's been a while since I've looked closely at that code. You could help us by describing exactly the evidence supporting that claim.
Sorry for misunderstanding that.
This is dependent on the size of your data and the environment you're running in. You will have to determine this for yourself through experimentation.
I have difficulty creating validation data by
Summary
I hope to be able to merge multiple data files (in the LightGBM Dataset binary format) into one big one.
Motivation
I found that the most memory-consuming step is generating the LightGBM Dataset. I wanted to train a model with a large pandas DataFrame, and the memory usage always doubles (or even quadruples) at the beginning and sometimes causes a MemoryError; the same thing happens when I call save_binary. (That's why I guess it's the data-generating step.) If I just start training from a LightGBM Dataset binary file, the memory usage looks fine. So if it were possible to generate several small LightGBM Datasets first and then merge them into one, perhaps I could train with larger data.
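As a rough sketch of the workflow described above (building the Dataset once, saving it with `save_binary`, and then training directly from the binary file in a later run), with made-up shapes and parameters:

```python
import numpy as np
import lightgbm as lgb

# One-off, memory-heavy step: build the Dataset and save it in LightGBM's binary format.
X = np.random.rand(100_000, 50)
y = np.random.rand(100_000)
lgb.Dataset(X, label=y).save_binary("train.bin")

# Later run (fresh process): train directly from the binary file,
# without holding the raw DataFrame/array in memory again.
train_data = lgb.Dataset("train.bin")
booster = lgb.train({"objective": "regression"}, train_data)
```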