Option to use MMIO for large Datasets #310
Comments
Yes, the current dataset representation isn't as memory efficient as we'd like it to be, particularly when deserializing from protobufs. The protobuf deserialization path doesn't deduplicate the feature name strings on the way through (unlike […]).

Could you provide a little detail on the size and shape of the datasets you want to work with? Are they sparse or dense? How many features & examples are there?

Tribuo isn't particularly designed for very large datasets, as most of the training methods create a copy of the data to get it into a more compute-friendly representation, and the original dataset will still have a reference on the stack during a training call, meaning it can't be garbage collected. We're investigating online learning support, which will allow some models to scale much further with respect to data size, as you'll only need a portion of the data in memory while it is being used for training, but that won't be possible for all model types.
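A minimal sketch of the kind of deduplication being described, assuming Examples are rebuilt from protos one at a time and each feature name is canonicalized through a shared map (the `NameDeduplicator` class is illustrative only, not Tribuo API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Canonicalizes feature name strings so that repeated names share a single
 * String instance, instead of each deserialized Example holding its own copy.
 */
public final class NameDeduplicator {
    private final Map<String, String> canonical = new ConcurrentHashMap<>();

    /** Returns a canonical String equal to {@code name}. */
    public String dedupe(String name) {
        // putIfAbsent-style canonicalization; cheaper and more controllable
        // than String.intern().
        String prior = canonical.putIfAbsent(name, name);
        return prior != null ? prior : name;
    }
}
```

With many examples repeating the same feature names, this reduces per-name string storage from one copy per example to one copy total.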
My current use case has about 80k examples, with about 1000 dense features each. (Short phrases that have been run through BERT or another embedding, plus a few fistfuls of contextual features.) I anticipate that my number of examples is going to grow exponentially. Online learning could be made to work (and would be a fantastic addition for other use cases), but I'd much prefer being able to work with the existing `Dataset`.
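(For scale: 80,000 examples × 1,000 features × 8 bytes per double is roughly 640 MB for the raw values alone; an object-per-feature representation with unshared name strings can multiply that several times over, which is why deduplication and a dense representation matter here.)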
Ok, sounds like the dense example will help you quite a bit. Moving to memory-mapped IO as a supported Tribuo `Dataset` […]

If you're happy with writing your own dataset, then the protobuf serialization mechanisms will accept other classes that implement […]
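For the roll-your-own route, the core MMIO piece is small with plain JDK APIs. A rough sketch under stated assumptions (the `MmapDenseStore` class is hypothetical, not Tribuo API; a custom dataset could materialize dense Examples from rows like these on demand):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Memory-maps a file of fixed-width dense rows (numFeatures doubles per
 * example). The OS pages rows in on demand, so the JVM heap only holds
 * the rows currently being materialized.
 */
public final class MmapDenseStore implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    private final int numFeatures;

    public MmapDenseStore(Path file, int numFeatures) throws IOException {
        this.numFeatures = numFeatures;
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        // Note: a single MappedByteBuffer is limited to 2 GB; larger files
        // would need multiple mappings.
        this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    /** Copies example {@code index}'s feature values into a fresh array. */
    public double[] getRow(int index) {
        double[] row = new double[numFeatures];
        int base = index * numFeatures * Double.BYTES;
        for (int i = 0; i < numFeatures; i++) {
            row[i] = buffer.getDouble(base + i * Double.BYTES);
        }
        return row;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```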
**Is your feature request related to a problem? Please describe.**
I'm frequently frustrated by OOMEs when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).
**Describe the solution you'd like**
I'd like to see an option in (or around) `org.tribuo.Dataset` to use mapped memory for storing Examples rather than keeping them on the heap.
**Describe alternatives you've considered**
I've considered subclassing Dataset and reimplementing everything that makes use of the `data` member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshal the Examples to/from storage (see the sketch below). I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is. The reason I've not yet done either is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.
I've also considered buying a ton of physical memory. ;)
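For concreteness, a minimal sketch of the MapDB alternative described above, assuming Examples are stored as protobuf-serialized byte arrays (MapDB 3 API; the store layout and key scheme are illustrative only):

```java
import java.util.Map;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public final class MapDbExampleStore {
    public static void main(String[] args) {
        // Open (or create) a file-backed MapDB store using memory-mapped IO.
        DB db = DBMaker.fileDB("examples.db")
                .fileMmapEnableIfSupported() // MMIO where the platform supports it
                .make();

        // Examples stored as protobuf-serialized byte arrays keyed by index;
        // the bytes live in the mapped file, not on the JVM heap.
        Map<Long, byte[]> examples = db
                .treeMap("examples", Serializer.LONG, Serializer.BYTE_ARRAY)
                .createOrOpen();

        examples.put(0L, new byte[]{ /* protobuf-serialized Example bytes */ });
        db.close();
    }
}
```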