Option to use MMIO for large Datasets #310
Comments
Yes, the current dataset representation isn't as memory efficient as we'd like it to be, particularly when deserializing from protobufs. The protobuf deserialization path doesn't deduplicate the feature name strings on the way through (unlike […]).

Could you provide a little detail on the size and shape of the datasets you want to work with? Are they sparse or dense? How many features & examples are there?

Tribuo isn't particularly designed for very large datasets, as most of the training methods create a copy of the data to get it into a more compute-friendly representation, and the original dataset will still have a reference on the stack during a training call, meaning it can't be garbage collected. We're investigating online learning support, which will allow some models to scale much further with respect to data size, as you'll only need a portion of the data in memory while it is being used for training, but that won't be possible for all model types.
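A minimal sketch of the kind of deduplication being described, assuming Examples are rebuilt from protos one at a time and each feature name is canonicalized through a shared map (the `NameDeduplicator` class is illustrative only, not Tribuo API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Canonicalizes feature name strings so that repeated names share a single
 * String instance, instead of each deserialized Example holding its own copy.
 */
public final class NameDeduplicator {
    private final Map<String, String> canonical = new ConcurrentHashMap<>();

    /** Returns a canonical String equal to {@code name}. */
    public String dedupe(String name) {
        // putIfAbsent-style canonicalization; cheaper and more controllable
        // than String.intern().
        String prior = canonical.putIfAbsent(name, name);
        return prior != null ? prior : name;
    }
}
```

With many examples repeating the same feature names, this reduces per-name string storage from one copy per example to one copy total.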
My current use case has about 80k examples, with about 1000 dense features each. (Short phrases that have been run through BERT or another embedding, plus a few fistfuls of contextual features.) I anticipate that my number of examples is going to grow exponentially. Online learning could be made to work (and would be a fantastic addition for other use cases), but I'd much prefer being able to work with the existing `Dataset`.
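(For scale: 80,000 examples × 1,000 features × 8 bytes per double is roughly 640 MB for the raw values alone; an object-per-feature representation with unshared name strings can multiply that several times over, which is why deduplication and a dense representation matter here.)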
Ok, sounds like the dense example will help you quite a bit. Moving to memory-mapped IO as a supported Tribuo `Dataset` […]

If you're happy with writing your own dataset, then the protobuf serialization mechanisms will accept other classes that implement […]
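For the roll-your-own route, the core MMIO piece is small with plain JDK APIs. A rough sketch under stated assumptions (the `MmapDenseStore` class is hypothetical, not Tribuo API; a custom dataset could materialize dense Examples from rows like these on demand):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Memory-maps a file of fixed-width dense rows (numFeatures doubles per
 * example). The OS pages rows in on demand, so the JVM heap only holds
 * the rows currently being materialized.
 */
public final class MmapDenseStore implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    private final int numFeatures;

    public MmapDenseStore(Path file, int numFeatures) throws IOException {
        this.numFeatures = numFeatures;
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        // Note: a single MappedByteBuffer is limited to 2 GB; larger files
        // would need multiple mappings.
        this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    /** Copies example {@code index}'s feature values into a fresh array. */
    public double[] getRow(int index) {
        double[] row = new double[numFeatures];
        int base = index * numFeatures * Double.BYTES;
        for (int i = 0; i < numFeatures; i++) {
            row[i] = buffer.getDouble(base + i * Double.BYTES);
        }
        return row;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```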
**Is your feature request related to a problem? Please describe.**
I'm frequently frustrated by OOMEs when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).
**Describe the solution you'd like**
I'd like to see an option in (or around) `org.tribuo.Dataset` to use mapped memory for storing Examples rather than keeping them on the heap.
**Describe alternatives you've considered**
I've considered subclassing Dataset and reimplementing everything that makes use of the `data` member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshal the Examples to/from storage (see the sketch below). I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is. The reason I've not yet done either is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.
I've also considered buying a ton of physical memory. ;)
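For concreteness, a minimal sketch of the MapDB alternative described above, assuming Examples are stored as protobuf-serialized byte arrays (MapDB 3 API; the store layout and key scheme are illustrative only):

```java
import java.util.Map;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public final class MapDbExampleStore {
    public static void main(String[] args) {
        // Open (or create) a file-backed MapDB store using memory-mapped IO.
        DB db = DBMaker.fileDB("examples.db")
                .fileMmapEnableIfSupported() // MMIO where the platform supports it
                .make();

        // Examples stored as protobuf-serialized byte arrays keyed by index;
        // the bytes live in the mapped file, not on the JVM heap.
        Map<Long, byte[]> examples = db
                .treeMap("examples", Serializer.LONG, Serializer.BYTE_ARRAY)
                .createOrOpen();

        examples.put(0L, new byte[]{ /* protobuf-serialized Example bytes */ });
        db.close();
    }
}
```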