Commit 2d93a5e: improve readme
tcapelle committed Nov 7, 2024 (1 parent: 58ab3bd)
Showing 1 changed file with 222 additions and 15 deletions: README.md

# πŸš€ EvalForge Project

EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.

## πŸ› οΈ Setup

1. **Environment Variables**

Create a `.env` file in the project root with the following variables (a sketch for loading this file in Python appears after these setup steps):

```
WANDB_EMAIL=your_wandb_email
WANDB_API_KEY=your_wandb_api_key
OPENAI_API_KEY=your_openai_api_key
```

2. **Install Dependencies**

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Or, install directly via pip:

```bash
pip install git+https://github.com/wandb/evalforge.git
```
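
The wandb and OpenAI clients typically read these keys from the environment. If you want to load the `.env` file from step 1 explicitly in your own scripts, here is a minimal sketch; it assumes `python-dotenv`, which is not listed in this README and would need to be installed separately.

```python
import os

from dotenv import load_dotenv  # assumed helper: pip install python-dotenv

# Populate os.environ from the .env file created in step 1
load_dotenv()

for key in ("WANDB_EMAIL", "WANDB_API_KEY", "OPENAI_API_KEY"):
    assert os.environ.get(key), f"{key} is missing from the environment"
```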

## πŸƒβ€β™‚οΈ Quick Start with Command-Line Interface

EvalForge now includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.

### Basic Usage

```bash
evalforge.forge --data path/to/data.json
```

### Available Arguments

You can customize EvalForge using various command-line arguments corresponding to the `EvalForge` class attributes:

- `--data`: *(Required)* Path to the training data file (JSON or CSV).
- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
- `--alignment_threshold`: Threshold for selecting best criteria (default: `0.4`).
- `--num_criteria`: Number of best criteria to select (default: `3`).
- `--batch_size`: Batch size for data processing (default: `4`).
- *And more...*

Use the `--help` flag to see all available options:

```bash
evalforge.forge --help
```

### Example

```bash
evalforge.forge \
--data train_data.json \
--llm_model "gpt-4" \
--num_criteria_to_generate 5 \
--batch_size 2
```
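
These flags correspond to `EvalForge` class attributes, so the same run can be configured in Python. A minimal sketch, using only constructor arguments that appear elsewhere in this README:

```python
from evalforge.forge import EvalForge

# Same configuration as the CLI example above
forge = EvalForge(
    llm_model="gpt-4",
    num_criteria_to_generate=5,
    batch_size=2,
)
```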

## πŸ“„ Data Format

EvalForge expects data in JSON or CSV format. Each data point should include at least the following fields:

- `input` or `input_data`: Input provided to the model.
- `output` or `output_data`: Output generated by the model.
- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
- `note`: *(Optional)* Additional context or notes about the data point.
- `human_description`: *(Optional)* Human-provided task description.

**JSON Example:**

```json
[
  {
    "input": "What is 2 + 2?",
    "output": "4",
    "annotation": 1,
    "note": "Simple arithmetic",
    "human_description": "Basic math questions"
  },
  {
    "input": "Translate 'Hello' to French.",
    "output": "Bonjour",
    "annotation": 1,
    "note": "Basic translation",
    "human_description": "Simple language translation"
  }
]
```

**CSV Example:**

```csv
input,output,annotation,note,human_description
"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
```
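
Files in either format can then be loaded with the `load_data` helper shown in the programmatic example below. A minimal sketch; the CSV call assumes `load_data` dispatches on the file extension, which this README does not state explicitly:

```python
from evalforge.data_utils import load_data

# JSON, as used in the programmatic example below
train_data = load_data("train_data.json")

# CSV is listed as a supported format; assuming the same helper handles it
eval_data = load_data("eval_data.csv")
```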

## πŸƒβ€β™‚οΈ Running the Annotation App

```bash
python main.py
```

This will launch a web interface for annotating your dataset.

## 🧠 Creating an LLM Judge Programmatically

You can create an LLM judge programmatically using the `EvalForge` class.

Alternatively, you can open `forge_evaluation_judge.ipynb` in a Jupyter environment and run all cells in the notebook.

### Example Usage

```python
import asyncio

from evalforge.forge import EvalForge
from evalforge.data_utils import load_data

# Load data
train_data = load_data('path/to/train_data.json')

# Create an EvalForge instance with custom configurations
forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)

# Run the fit method asynchronously
asyncio.run(forge.fit(train_data))
```

This will generate a judge like the one in `forged_judge`.

## πŸ” Running the Generated Judge

To load and run the generated judge, open `run_forged_judge.ipynb` in a Jupyter environment and run all cells.

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.

Alternatively, you can use the `forge_mini.py` script as an example:

```python
# forge_mini.py
import asyncio

from evalforge.utils import logger
from evalforge.forge import EvalForge
from evalforge.data_utils import DataPoint
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics

# Annotated training examples used to forge the judge
train_ds_formatted = [
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "2"},
        annotation=1,
        note="Correct summation",
    ),
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect summation",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
]

# Held-out examples used to check how well the judge aligns with human annotations
eval_ds_formatted = [
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect square root",
    ),
]

LLM_MODEL = "gpt-4"

# Forge a judge from the training data
forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
results = asyncio.run(forger.fit(train_ds_formatted))
forged_judge = results["forged_judges"]["judge"]

logger.rule("Running assertions and calculating metrics", color="blue")


async def run_assertions_and_calculate_metrics(forger, judge, data):
    # Run the forged judge's assertions on the evaluation data
    assertion_results = await forger.run_assertions(judge, data)
    # Compare the judge's verdicts against the human annotations
    metrics = calculate_alignment_metrics(assertion_results)
    format_alignment_metrics(metrics)
    return metrics


asyncio.run(
    run_assertions_and_calculate_metrics(forger, forged_judge, eval_ds_formatted)
)
```
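
Assuming the dependencies are installed and the `.env` variables from the setup section are set, the script can be run directly:

```bash
python forge_mini.py
```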

## πŸ“Š Key Components

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
- `cli.py`: Command-line interface for EvalForge
- `evalforge/`: Core library code
- `forge_mini.py`: Example script demonstrating programmatic usage

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.

Happy evaluating! πŸŽ‰

## πŸ“ What's New

- **Modular Codebase**: Refactored `EvalForge` class and added helper methods for better modularity.
- **Command-Line Interface**: Added `cli.py` using `simple_parsing` for easy configuration via CLI.
- **Flexible Data Loading**: Enhanced `DataPoint` class and `load_data` function to handle various data formats.
- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
- **Error Handling**: Improved exception handling in both the CLI and core classes.

## πŸ› οΈ Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository on GitHub.
2. Create a new branch for your feature or bug fix.
3. Make your changes and ensure that tests pass.
4. Submit a pull request with a detailed description of your changes.

## πŸ“„ License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## πŸ“ž Contact

For questions or suggestions, feel free to reach out to the authors:

- **Alex Volkov**: [[email protected]](mailto:[email protected])
- **Anish Shah**: [[email protected]](mailto:[email protected])
- **Thomas Capelle**: [[email protected]](mailto:[email protected])

## πŸ™ Acknowledgments

- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
- [Weave](https://github.com/wandb/weave) for providing modeling infrastructure.
- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
- All contributors and users who provided feedback and suggestions.

