-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ENH] Generative benchmarking chroma cloud support #4601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas |
dcaae07
to
87f092d
Compare
87f092d
to
2f99ab1
Compare
Add Chroma Cloud Support to Generative Benchmarking and Improve Usability This PR introduces significant changes to the generative benchmarking sample application by adding seamless Chroma Cloud support to the Jupyter notebook workflow. Users can now benchmark either local or Chroma Cloud collections by providing the appropriate environment variables, with automated detection of cloud/local backends. The notebook was fully refactored for clarity, splitting out local data upload/setup logic into a new load_data.py script and making the data import structure more flexible (user-provided IDs no longer required). Substantial usability improvements were made, including better prompts, reorganization of setup and dataset configuration steps, clearer output, and code cleanup. Utility functions were added/expanded for environment detection and name normalization. Key Changes: Affected Areas: Potential Impact: Functionality: Significantly improves onboarding and experience for users benchmarking on Chroma (cloud or local), allowing easier switching between backends and supporting new user workflows. Performance: Minimal direct performance impact outside of user data upload and embedding, which are now more automated. Potential for larger/batch uploads may pose resource considerations. Security: Introduces handling of environment-based API keys for cloud; these should be kept confidential by users. No additional custom security checks added. Scalability: Supports scaling to Chroma Cloud platforms more easily, able to accommodate cloud collections as dataset grows; concerns may remain for very large local/corpus uploads. Review Focus: Testing Needed• Run the notebook with both local Chroma and Chroma Cloud (with correct environment variables) to verify automatic backend switching. Code Quality Assessmentsample_apps/generative_benchmarking/generate_benchmark.ipynb: Significantly improved; clearer structure, stepwise logic, reduced manual intervention, modular cell grouping. Some blocks may still be quite long. sample_apps/generative_benchmarking/load_data.py: Simple, clear logic. Follows best practices for dotenv and API key use, auto-generates IDs. Minor: no custom error handling if data file is malformed. sample_apps/generative_benchmarking/functions/utils.py: Utility functions are concise and have improved type annotations and edge case handling; single-responsibility structure. Some opportunity for even deeper docstring coverage. Best PracticesNotebook Structure: Python Scripts: Data Handling: Environment: Potential Issues• Relies on correct user environment variable setup; unclear errors possible if misconfigured. This summary was automatically generated by @propel-code-bot |
@kylediaz Itai and I just reviewed the notebook and made some comments:
let us know if you have any questions! |
Thanks you for the feedback. I'll get back to you soon :) |
2f99ab1
to
ced9ed2
Compare
|
||
|
||
def env_var_provided(val: str | None) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Consider using a more descriptive type hint for the val
parameter. The Union type syntax str | None
is Python 3.10+ syntax. For better compatibility, you could use Optional[str]
from typing instead, which would make the code compatible with older Python versions if needed.
Description of changes
Partially addressed issues in https://github.com/chroma-core/hosted-chroma/issues/2806
Test plan
Room for improvement
chroma install
is currently not able to upload data to a Chroma collection. This needs to be addressed in a separate PR.