Query: performance prospects on massive data sets (curse of dimensionality?) #513
Comments
Hi @tylerjereddy -- Glad to hear you've been finding that EBMs perform well on many of your datasets, and hopefully we can make EBMs work on this one too. We can currently process about 1,000,000,000 sample * feature items per day (for classification). Your problem, being 900 * 860,000 = 774,000,000, would be expected to take about 1 day under this formula; however, there are two caveats:
Please let us know how this works out for you. It's useful for us to get this kind of feedback.
On the question of memory leaks: I think what you're observing is due to the normal memory fragmentation that you'd expect to find in this kind of process. We run valgrind on our nightly build, and we've only once (to my knowledge) had a memory leak that survived a few days in the code. Obviously, memory leaks are an area where there can be surprises, but you're running with mostly default parameters, which is an area that should be fairly well tested.
One more tip: We run the outer bags on separate cores, and we leave one core free by default on the machine. If you have 8 cores, then setting outer_bags to 7 will be ideal in terms of CPU utilization. The 0.5.1 release increases the default number of outer bags to 14, so reducing the number of outer bags would improve the speed unless you have more than 14 cores. If you have a big machine with more cores, your model can benefit a little bit from the available hardware by setting outer_bags to the number of cores minus one.
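As a rough illustration of that tip (a minimal sketch, not code from this thread; it assumes interpret's `ExplainableBoostingClassifier` and its `outer_bags` parameter):

```python
# Minimal sketch: set outer_bags to the core count minus one, leaving one core free.
# The values here are illustrative, not a recommendation taken from the maintainers.
import os

from interpret.glassbox import ExplainableBoostingClassifier

n_cores = os.cpu_count() or 2
ebm = ExplainableBoostingClassifier(
    outer_bags=max(1, n_cores - 1),  # e.g. 7 on an 8-core machine
)
# ebm.fit(X, y)
```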
There are some nice articles about how EBMs gain the best of both worlds (performance and explainability), and generally I've found that to be true. However, we've been working on an exceptionally high-dimensionality data set in the bioinformatics domain (shape: ~900 records x ~860,000 float64 features/dimensions). Are there any published results describing acceptable/reasonable performance in this kind of scenario? Conversely, are there any descriptions of practical limits on the number of features (dimensions)?
What about prospects for improvement in the future? It would be really neat to be able to assess feature importance on enormous design matrices that are refractory to many feature importance techniques. For example, with almost a million features and roughly a third of them highly correlated with each other, approaches like random forest feature importance might seem appealing (and certainly can be for performance), but the randomness in feature selection can also dilute the importance of correlated features.
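As a toy illustration of that dilution effect (a hypothetical, made-up example, not data from this problem): duplicating one informative feature a few times splits its impurity-based importance across the copies.

```python
# Hypothetical demo of importance dilution among correlated features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
y = (signal > 0).astype(int)
# five near-duplicate copies of the informative feature, plus five pure-noise features
X = np.column_stack(
    [signal + 0.01 * rng.normal(size=500) for _ in range(5)]
    + [rng.normal(size=500) for _ in range(5)]
)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_.round(3))  # the signal's importance is spread across the 5 correlated columns
```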
As a more concrete description of what we're seeing: if I wait for 4.5 hours on a compute node with 6 TB of memory, I'll see a gradual (but quite slow compared to, say, a serious memory leak) increase in memory footprint to 700 GiB of RAM, but no indication of progress (I don't think there's a verbose mode?), and I only have a single tree for benchmarking purposes:
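A hypothetical sketch of how such a single-tree benchmark might be configured with interpret's EBM parameters (the specific knobs below are assumptions for illustration, not the configuration actually used):

```python
# Hypothetical benchmark configuration -- the parameters are assumptions, not the author's script.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(
    max_rounds=1,    # a single boosting round, as a lower-bound timing probe
    interactions=0,  # skip pairwise interaction detection
)
# ebm.fit(X, y)  # X: ~900 x ~860,000 float64 design matrix
```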
Anyway, not complaining, just wondering if this is something that is tractable or just unreasonable even in the long term. For random forest (sklearn), it is about 6-7 minutes for 10,000 estimators, though closer to an hour if using concurrent OOB estimates for sanity checking. While I'd obviously expect a parallel ensemble technique to be faster than a sequential one, in this case I've reduced the sequence of trees to length 1 (minus any internal aggregating at each level I may not understand). Looks like we're using interpret 0.5.1.
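For reference, the random forest baseline being compared against would look roughly like the following (a sketch based on the numbers quoted above; the actual script isn't part of this thread):

```python
# Sketch of the scikit-learn baseline described above; the timings in the comments
# are those quoted in the thread, not reproduced measurements, and will vary with hardware.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=10_000,  # ~6-7 minutes reported for this data
    oob_score=True,       # concurrent OOB estimate for sanity checking (~1 hour reported)
    n_jobs=-1,            # build trees in parallel across all cores
)
# rf.fit(X, y)  # X: ~900 x ~860,000 float64 design matrix
```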