# EvalForge Project

EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets, using Weights & Biases (wandb) and Weave for tracking and tracing.

## Setup

1. **Environment Variables**

   Create a `.env` file in the project root with the following variables (a sketch for loading them in your own scripts follows these steps):

   ```
   WANDB_EMAIL=your_wandb_email
   WANDB_API_KEY=your_wandb_api_key
   OPENAI_API_KEY=your_openai_api_key
   ```

2. **Install Dependencies**

   Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Or, install directly via pip:

   ```bash
   pip install git+https://github.com/wandb/evalforge.git
   ```

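If you run EvalForge from your own script or notebook rather than the CLI, you may need to load the `.env` yourself. A minimal sketch, assuming the third-party `python-dotenv` package (not a listed dependency):

```python
# Minimal sketch: load .env into the process environment.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Fail fast if a required key is missing.
for key in ("WANDB_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set")
```
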
## Quick Start with the Command-Line Interface

EvalForge includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.

### Basic Usage

```bash
evalforge.forge --data path/to/data.json
```

### Available Arguments

You can customize EvalForge using command-line arguments corresponding to the `EvalForge` class attributes:

- `--data`: *(Required)* Path to the training data file (JSON or CSV).
- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
- `--alignment_threshold`: Threshold for selecting the best criteria (default: `0.4`).
- `--num_criteria`: Number of best criteria to select (default: `3`).
- `--batch_size`: Batch size for data processing (default: `4`).
- *And more...*

Use the `--help` flag to see all available options:

```bash
evalforge.forge --help
```

### Example

```bash
evalforge.forge \
    --data train_data.json \
    --llm_model "gpt-4" \
    --num_criteria_to_generate 5 \
    --batch_size 2
```

## Data Format

EvalForge expects data in JSON or CSV format. Each data point should include at least the following fields:

- `input` or `input_data`: Input provided to the model.
- `output` or `output_data`: Output generated by the model.
- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
- `note`: *(Optional)* Additional context or notes about the data point.
- `human_description`: *(Optional)* Human-provided task description.

**JSON Example:**

```json
[
  {
    "input": "What is 2 + 2?",
    "output": "4",
    "annotation": 1,
    "note": "Simple arithmetic",
    "human_description": "Basic math questions"
  },
  {
    "input": "Translate 'Hello' to French.",
    "output": "Bonjour",
    "annotation": 1,
    "note": "Basic translation",
    "human_description": "Simple language translation"
  }
]
```

**CSV Example:**

```csv
input,output,annotation,note,human_description
"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
```

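Both formats load the same way. A small sketch, assuming `load_data` accepts either file type (as the CLI's `--data` flag suggests):

```python
# Sketch: load annotated examples for EvalForge.
# Assumes load_data infers JSON vs. CSV from the file extension.
from evalforge.data_utils import load_data

train_data = load_data("train_data.json")  # or "train_data.csv"
print(f"Loaded {len(train_data)} annotated examples")
```
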
## Running the Annotation App

To run the annotation app:

```bash
python main.py
```

This will launch a web interface for annotating your dataset.

## Creating an LLM Judge Programmatically

You can create an LLM judge programmatically using the `EvalForge` class.

### Example Usage

```python
import asyncio

from evalforge.forge import EvalForge
from evalforge.data_utils import load_data

# Load data
train_data = load_data('path/to/train_data.json')

# Create an EvalForge instance with custom configurations
forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)

# Run the fit method asynchronously
asyncio.run(forge.fit(train_data))
```

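`fit` also returns the forged judges. Mirroring the key structure used in `forge_mini.py` below, you can capture the return value and grab the judge from it:

```python
# Key structure taken from forge_mini.py (see below);
# fit returns the forged judges among its results.
results = asyncio.run(forge.fit(train_data))
forged_judge = results["forged_judges"]["judge"]
```
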
## Running the Generated Judge

To load and run the generated judge:

1. Open `run_forged_judge.ipynb` in a Jupyter environment.
2. Run all cells in the notebook.

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.

Alternatively, you can use the `forge_mini.py` script as an example:

```python
# forge_mini.py
import asyncio

from evalforge.utils import logger
from evalforge.forge import EvalForge
from evalforge.data_utils import DataPoint
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics

# Annotated examples used to forge the judge
train_ds_formatted = [
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "2"},
        annotation=1,
        note="Correct summation",
    ),
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect summation",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
]

# Held-out examples used to check the judge against human annotations
eval_ds_formatted = [
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect square root",
    ),
]

LLM_MODEL = "gpt-4"

# Fit a judge on the training annotations
forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
results = asyncio.run(forger.fit(train_ds_formatted))
forged_judge = results["forged_judges"]["judge"]

logger.rule("Running assertions and calculating metrics", color="blue")

async def run_assertions_and_calculate_metrics(forger, judge, data):
    # Run the judge's assertions on the data and report how well
    # its verdicts align with the human annotations.
    assertion_results = await forger.run_assertions(judge, data)
    metrics = calculate_alignment_metrics(assertion_results)
    format_alignment_metrics(metrics)

asyncio.run(run_assertions_and_calculate_metrics(forger, forged_judge, eval_ds_formatted))
```

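The alignment metrics compare the judge's verdicts against the human `annotation` labels on the held-out examples; the `--alignment_threshold` CLI flag applies the same notion of alignment when selecting criteria.
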
## Key Components

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
- `cli.py`: Command-line interface for EvalForge
- `evalforge/`: Core library code
- `forge_mini.py`: Example script demonstrating programmatic usage

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.

## What's New

- **Modular Codebase**: Refactored the `EvalForge` class and added helper methods for better modularity.
- **Command-Line Interface**: Added `cli.py` using `simple_parsing` for easy configuration via the CLI.
- **Flexible Data Loading**: Enhanced the `DataPoint` class and `load_data` function to handle various data formats.
- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
- **Error Handling**: Improved exception handling in both the CLI and core classes.

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository on GitHub.
2. Create a new branch for your feature or bug fix.
3. Make your changes and ensure that tests pass.
4. Submit a pull request with a detailed description of your changes.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

For questions or suggestions, feel free to reach out to the authors:

- **Alex Volkov**: [[email protected]](mailto:[email protected])
- **Anish Shah**: [[email protected]](mailto:[email protected])
- **Thomas Capelle**: [[email protected]](mailto:[email protected])

## Acknowledgments

- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
- [Weave](https://github.com/wandb/weave) for providing modeling infrastructure.
- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
- All contributors and users who provided feedback and suggestions.

---

Happy evaluating!