
implement zarr-based caching for major classes #28


Open
aalok-sathe opened this issue Apr 5, 2022 · 2 comments
Labels

  • enhancement (New feature or request)
  • lbs:data (related to the part of the library handling datasets)
  • lbs:encoders (related to the encoder part of the library)
  • lbs:mapping (related to the mapping part of the library)

Comments

@aalok-sathe
Contributor

We need reliable state-caching for most classes to persist results to disk for later analysis and reuse in pipelines.
If cached results exist, they may be reused based on a flag (e.g. overwrite_cache=False).
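A minimal sketch of what the flag-gated reuse could look like, assuming results live in xarray objects; the function name and the cache_dir/key parameters are hypothetical, not existing library API:

```python
from pathlib import Path

import xarray as xr


def cached_to_zarr(data: xr.Dataset, cache_dir: str, key: str,
                   overwrite_cache: bool = False) -> xr.Dataset:
    """Persist `data` to a zarr store named by `key`, or reuse an existing store."""
    store = Path(cache_dir) / f"{key}.zarr"
    if store.exists() and not overwrite_cache:
        # a cached result exists and reuse is allowed: load it instead of recomputing
        return xr.open_zarr(store)
    data.to_zarr(store, mode="w")
    return data
```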

@aalok-sathe aalok-sathe added the enhancement, lbs:encoders, lbs:mapping, and lbs:data labels Apr 5, 2022
@aalok-sathe
Contributor Author

aalok-sathe commented Apr 5, 2022

Proposal: make the __repr__ method of each Cacheable class uniquely identify that instance.
E.g., repr(BrainScore()) should contain information about the Mapping, the Metric, and the encoders (all of this can come from calls to those objects' own repr methods); see the sketch after the list below.

The list below is in the form:

  • Object to repr()

    • entity it depends on
  • BrainScore

    • Mapping
    • Metric
    • Encoder1 outputs
    • Encoder2 outputs (should we create a class EncoderOutput, for more logical dependency in cache handling?) @lipkinb @gretatuckute
  • Mapping

    • str algorithm
    • hparams? tbd
  • Metric

    • str algorithm
  • EncoderOutput (?)

    • Encoder
    • Dataset
  • HFEncoder

    • str algorithm (pretrained_model_name_or_path)
    • str aggregation choices
    • Dataset
  • BrainEncoder

    • Dataset
  • Dataset

    • str path to the data
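A rough sketch of the repr-based identification, with hypothetical class signatures mirroring the list above (deriving a store name by hashing the repr is an assumption, not something decided in this issue):

```python
import hashlib


class Cacheable:
    """Mixin sketch: __repr__ composes the reprs of an instance's dependencies,
    and a cache key is derived from it."""

    def __repr__(self) -> str:
        # nested Cacheable attributes contribute their own reprs recursively
        params = ", ".join(f"{k}={v!r}" for k, v in sorted(vars(self).items()))
        return f"{type(self).__name__}({params})"

    @property
    def cache_key(self) -> str:
        # hash the repr so it can double as a filesystem-safe zarr store name
        return hashlib.sha1(repr(self).encode()).hexdigest()


class Mapping(Cacheable):
    def __init__(self, algorithm: str, **hparams):
        self.algorithm = algorithm
        self.hparams = hparams


class Metric(Cacheable):
    def __init__(self, algorithm: str):
        self.algorithm = algorithm


class BrainScore(Cacheable):
    def __init__(self, mapping: Mapping, metric: Metric):
        # repr(BrainScore(...)) then includes repr(mapping) and repr(metric)
        self.mapping = mapping
        self.metric = metric
```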

@aalok-sathe
Contributor Author

zarr is unable to cache xarrays that contain dtype object. Somehow dtype object is bleeding in from somewhere; once that is corrected to a string dtype, this issue disappears.
This issue is referenced here: pydata/xarray#3476
It is partially solved by commits in #34
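A possible workaround until the root cause is fixed (a hypothetical helper, not part of the library): cast any object-dtype variables and coordinates to strings before writing to zarr.

```python
import xarray as xr


def stringify_object_dtypes(ds: xr.Dataset) -> xr.Dataset:
    """Cast object-dtype data variables and coordinates to unicode strings
    so zarr can serialize them (see pydata/xarray#3476)."""
    ds = ds.assign(
        {name: var.astype(str) for name, var in ds.data_vars.items()
         if var.dtype == object}
    )
    ds = ds.assign_coords(
        {name: var.astype(str) for name, var in ds.coords.items()
         if var.dtype == object}
    )
    return ds
```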
