Description
Is your feature request related to a problem? Please describe.
I'm frequently frustrated by OutOfMemoryErrors (OOMEs) when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).
Describe the solution you'd like
I'd like to see an option in (or around) org.tribuo.Dataset to store Examples in memory-mapped files rather than on the heap.
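As a rough illustration of the kind of storage meant here (a sketch only, not existing Tribuo API): the hypothetical `MappedRecordStore` class below appends length-prefixed byte records to a memory-mapped file using plain `java.nio`, and keeps only the record offsets on the heap. It assumes each Example has already been turned into a `byte[]` (for instance via the existing protobuf serialization), that everything fits in a single mapping of under 2 GB, and that the offset index does not need to be persisted. A real implementation would need multiple mapped segments and an on-disk index.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only store: record bytes live in a mapped file, only offsets stay on-heap. */
public class MappedRecordStore implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    // Start offset of each record within the mapped region.
    private final List<Integer> offsets = new ArrayList<>();

    public MappedRecordStore(Path file, int capacityBytes) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        // Mapping beyond the current file size grows the file to capacityBytes.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
    }

    /** Appends one length-prefixed record and returns its index. */
    public int add(byte[] record) {
        offsets.add(buffer.position());
        buffer.putInt(record.length);
        buffer.put(record);
        return offsets.size() - 1;
    }

    /** Copies the record at the given index out of the mapped region. */
    public byte[] get(int index) {
        int offset = offsets.get(index);
        byte[] out = new byte[buffer.getInt(offset)];
        // Read through a duplicate so the write position of the main buffer is untouched.
        ByteBuffer view = buffer.duplicate();
        view.position(offset + Integer.BYTES);
        view.get(out);
        return out;
    }

    public int size() {
        return offsets.size();
    }

    @Override
    public void close() throws IOException {
        buffer.force(); // flush dirty pages to the file
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("examples", ".bin");
        try (MappedRecordStore store = new MappedRecordStore(tmp, 1 << 20)) {
            // Stand-in payloads; in practice these would be serialized Examples.
            store.add("example-0".getBytes(StandardCharsets.UTF_8));
            store.add("example-1".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(store.get(1), StandardCharsets.UTF_8));
        }
    }
}
```

With this layout the operating system pages record data in and out on demand, so heap usage grows only with the number of records (one offset each) rather than with their total size.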
Describe alternatives you've considered
I've considered subclassing Dataset and reimplementing everything that makes use of the data member, replacing it with an instance of Jan Kotek's MapDB and using the existing protobuf implementations to marshal the Examples to and from storage. I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is.
The reason I've not yet done these is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.
I've also considered buying a ton of physical memory. ;)