Option to use MMIO for large Datasets #310

Open
@handshape

Description

Is your feature request related to a problem? Please describe.
I'm frequently frustrated by OutOfMemoryErrors (OOMEs) when building (or deserializing) Datasets that don't fit in heap memory (or even in physical memory).

Describe the solution you'd like
I'd like to see an option in (or around) org.tribuo.Dataset to store Examples in memory-mapped storage rather than on the heap.
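
To show the shape of what I mean, a purely hypothetical usage (DatasetStorage and the extra constructor argument don't exist in Tribuo today; they're invented here for illustration):

```java
// Hypothetical, not current Tribuo API: a storage option supplied at
// construction time so Examples live in a memory-mapped file rather
// than on the heap.
MutableDataset<Label> train = new MutableDataset<>(
        trainSource,                                       // any DataSource<Label>
        DatasetStorage.memoryMapped(Path.of("train.db")));  // invented option
```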

Describe alternatives you've considered
I've considered subclassing Dataset and reimplementing everything that makes use of the data member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshal the Examples to/from storage. I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is.
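
To make that alternative concrete, here's a rough sketch of the storage piece only (MmapExampleStore is a made-up name; the encoder/decoder functions stand in for the protobuf marshalling of Examples, and all the Dataset plumbing, such as provenance and the feature/output maps, is omitted):

```java
import java.nio.file.Path;
import java.util.function.Function;

import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

/**
 * Sketch: keeps serialized Examples in a memory-mapped MapDB file instead
 * of on the heap. E would be org.tribuo.Example<T>; the encoder/decoder
 * would wrap Tribuo's protobuf (de)serialization.
 */
public final class MmapExampleStore<E> implements AutoCloseable {
    private final DB db;
    private final BTreeMap<Long, byte[]> examples;
    private final Function<E, byte[]> encoder;
    private final Function<byte[], E> decoder;

    public MmapExampleStore(Path file,
                            Function<E, byte[]> encoder,
                            Function<byte[], E> decoder) {
        // fileMmapEnable() is what moves storage off-heap and out of the
        // JVM's hands; the OS pages it in and out as needed.
        this.db = DBMaker.fileDB(file.toFile())
                         .fileMmapEnable()
                         .make();
        this.examples = db
                .treeMap("examples", Serializer.LONG, Serializer.BYTE_ARRAY)
                .createOrOpen();
        this.encoder = encoder;
        this.decoder = decoder;
    }

    /** Serializes the example and writes it to the mapped file. */
    public void put(long index, E example) {
        examples.put(index, encoder.apply(example));
    }

    /** Reads the bytes back from the mapped file and deserializes. */
    public E get(long index) {
        return decoder.apply(examples.get(index));
    }

    public long size() {
        return examples.sizeLong();
    }

    @Override
    public void close() {
        db.close();
    }
}
```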

The reason I've not yet done these is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.

I've also considered buying a ton of physical memory. ;)
