
Cache embeddings transformations independently from other preprocessing #9

Merged
merged 4 commits into main from cache-embedding-preprocessing on Aug 28, 2024

Conversation

raphaelschwinger

@raphaelschwinger raphaelschwinger commented Aug 26, 2024

This PR:

  • reverts the embeddings_datamodule changes concerning disk_save_path and saving to disk
  • first creates embeddings and then k_samples
  • adds a property embeddings_save_path to embeddings_datamodule
  • saves the embedding transformation to disk and loads it again
    • dataset, embeddings_model_name, average, sample_rate, and max_length have to match to use the saved cache; otherwise the embeddings get computed again

TODO

- [ ] check caching with fingerprint from HF.dataset
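The cache-key idea from the description above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name `embeddings_save_path` is taken from the description, but its signature and the hashing scheme are assumptions. The point is that every parameter influencing the embeddings feeds into the cache path, so a mismatch in any of them leads to a cache miss and recomputation.

```python
import hashlib
import json
from pathlib import Path

def embeddings_save_path(cache_dir, dataset, embeddings_model_name,
                         average, sample_rate, max_length):
    """Derive a cache path from every parameter that affects the embeddings.

    Hypothetical sketch: if any parameter changes, the digest (and thus the
    path) changes, so the stale cache is simply not found and the embeddings
    are recomputed.
    """
    key = json.dumps({
        "dataset": dataset,
        "embeddings_model_name": embeddings_model_name,
        "average": average,
        "sample_rate": sample_rate,
        "max_length": max_length,
    }, sort_keys=True)
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return Path(cache_dir) / f"embeddings_{digest}"
```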

@raphaelschwinger raphaelschwinger self-assigned this Aug 26, 2024
Base automatically changed from ghani-test to main August 27, 2024 08:07
@raphaelschwinger raphaelschwinger marked this pull request as draft August 27, 2024 08:08
@raphaelschwinger
Author

I checked whether we could also use Hugging Face's caching feature, but I did not find a way to load the cache automatically. So I decided to stick with the manual caching of the embeddings.

@raphaelschwinger raphaelschwinger marked this pull request as ready for review August 27, 2024 09:26
@XgamerTV

Looks good. The only thing is that _save_dataset_to_disk() shouldn't be called in prepare_data, as it creates a second file then. If a file already exists, only k_samples is needed as far as I can tell; all other methods (_load_data, _preprocess_data, _create_splits, _save_dataset_to_disk) could be skipped. So maybe we should add an if to prepare_data that checks whether the file exists and then just calls k_samples 🤔
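The control flow suggested in this review could look roughly like the sketch below. It is an assumed illustration, not the repository's code: the class name and the stub method bodies are hypothetical, while the method names (_load_data, _preprocess_data, _create_splits, _save_dataset_to_disk, k_samples) come from the comment above.

```python
from pathlib import Path

class EmbeddingsDataModule:
    """Hypothetical sketch of the suggested prepare_data control flow."""

    def __init__(self, embeddings_save_path):
        self.embeddings_save_path = Path(embeddings_save_path)

    def prepare_data(self):
        if self.embeddings_save_path.exists():
            # Cached embeddings found: skip loading, preprocessing, splitting,
            # and saving; only sampling is still needed.
            dataset = self._load_dataset_from_disk()
            self._k_samples(dataset)
        else:
            dataset = self._load_data()
            dataset = self._preprocess_data(dataset)
            splits = self._create_splits(dataset)
            self._save_dataset_to_disk(splits)
            self._k_samples(splits)

    # Stubs standing in for the real methods referenced in the review.
    def _load_dataset_from_disk(self):
        return "cached"

    def _load_data(self):
        return "raw"

    def _preprocess_data(self, dataset):
        return dataset

    def _create_splits(self, dataset):
        return dataset

    def _save_dataset_to_disk(self, splits):
        self.embeddings_save_path.write_text("saved")

    def _k_samples(self, dataset):
        self.last = dataset
```

On the second call, the file written by _save_dataset_to_disk exists, so only the cached-load branch runs.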

@raphaelschwinger
Author

@XgamerTV I could not fix the problem described in #10. I moved the cache retrieval to the prepare_dataset function as you suggested.

@raphaelschwinger raphaelschwinger merged commit 4d41071 into main Aug 28, 2024
@raphaelschwinger raphaelschwinger deleted the cache-embedding-preprocessing branch August 28, 2024 12:23