diff --git a/README.md b/README.md
index 8601467..706c314 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,109 @@
-# 🚀 EvalGen Project
+# 🚀 EvalForge Project
 
-This project allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.
+EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.
 
 ## 🛠️ Setup
 
-1. Create a `.env` file in the project root with the following variables:
+1. **Environment Variables**
+
+   Create a `.env` file in the project root with the following variables:
+
+   ```
+   WANDB_EMAIL=your_wandb_email
+   WANDB_API_KEY=your_wandb_api_key
+   OPENAI_API_KEY=your_openai_api_key
+   ```
+
+2. **Install Dependencies**
+
+   Install the required dependencies:
+
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+   Or install the package directly from GitHub:
+
+   ```bash
+   pip install git+https://github.com/wandb/evalforge.git
+   ```
+
+## 🏃‍♂️ Quick Start with the Command-Line Interface
+
+EvalForge includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.
+
+### Basic Usage
+
+```bash
+evalforge.forge --data path/to/data.json
 ```
-WANDB_EMAIL=your_wandb_email
-WANDB_API_KEY=your_wandb_api_key
-OPENAI_API_KEY=your_openai_api_key
+
+### Available Arguments
+
+You can customize EvalForge using command-line arguments corresponding to the `EvalForge` class attributes:
+
+- `--data`: *(Required)* Path to the training data file (JSON or CSV).
+- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
+- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
+- `--alignment_threshold`: Alignment threshold for selecting the best criteria (default: `0.4`).
+- `--num_criteria`: Number of best-aligned criteria to select (default: `3`).
+- `--batch_size`: Batch size for data processing (default: `4`).
+- *And more...*
+
+Use the `--help` flag to see all available options:
+
+```bash
+evalforge.forge --help
+```
+
+### Example
+
+```bash
+evalforge.forge \
+    --data train_data.json \
+    --llm_model "gpt-4" \
+    --num_criteria_to_generate 5 \
+    --batch_size 2
+```
+
+## 📄 Data Format
+
+EvalForge expects data in JSON or CSV format. Each data point should include at least the following fields:
+
+- `input` or `input_data`: Input provided to the model.
+- `output` or `output_data`: Output generated by the model.
+- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
+- `note`: *(Optional)* Additional context or notes about the data point.
+- `human_description`: *(Optional)* Human-provided task description.
+
+**JSON Example:**
+
+```json
+[
+  {
+    "input": "What is 2 + 2?",
+    "output": "4",
+    "annotation": 1,
+    "note": "Simple arithmetic",
+    "human_description": "Basic math questions"
+  },
+  {
+    "input": "Translate 'Hello' to French.",
+    "output": "Bonjour",
+    "annotation": 1,
+    "note": "Basic translation",
+    "human_description": "Simple language translation"
+  }
+]
 ```
 
-1. Install the required dependencies.
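+
+Records like these can also be constructed in code as `DataPoint` objects (a sketch mirroring the `DataPoint` usage in the `forge_mini.py` listing further down, where `input_data` and `output_data` are dictionaries):
+
+```python
+from evalforge.data_utils import DataPoint
+
+# One annotated example, equivalent to the first JSON record above.
+point = DataPoint(
+    input_data={"text": "What is 2 + 2?"},
+    output_data={"text": "4"},
+    annotation=1,  # 1 = correct, 0 = incorrect
+    note="Simple arithmetic",
+)
+```
+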
+**CSV Example:**
+
+```csv
+input,output,annotation,note,human_description
+"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
+"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
+```
 
 ## 🏃‍♂️ Running the Annotation App
 
@@ -24,14 +115,26 @@ python main.py
 This will launch a web interface for annotating your dataset.
 
-## 🧠 Creating an LLM Judge
+## 🧠 Creating an LLM Judge Programmatically
 
-To programmatically create an LLM judge from your wandb dataset annotations:
+You can create an LLM judge programmatically using the `EvalForge` class.
 
-1. Open `forge_evaluation_judge.ipynb` in a Jupyter environment.
-2. Run all cells in the notebook.
+### Example Usage
+
+```python
+import asyncio
+from evalforge.forge import EvalForge
+from evalforge.data_utils import load_data
+
+# Load data
+train_data = load_data('path/to/train_data.json')
+
+# Create an EvalForge instance with custom configurations
+forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)
 
-This will generate a judge like the one in `forged_judge`.
+# Fit the judge asynchronously; the results include the forged judges
+results = asyncio.run(forge.fit(train_data))
+```
 
 ## 🔍 Running the Generated Judge
 
@@ -42,12 +145,116 @@ To load and run the generated judge:
 This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.
 
+Alternatively, you can use the `forge_mini.py` script as an example:
+
+```python
+# forge_mini.py
+import asyncio
+from evalforge.utils import logger
+from evalforge.forge import EvalForge
+from evalforge.data_utils import DataPoint
+from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics
+
+train_ds_formatted = [
+    DataPoint(
+        input_data={"text": "1+1="},
+        output_data={"text": "2"},
+        annotation=1,
+        note="Correct summation",
+    ),
+    DataPoint(
+        input_data={"text": "1+1="},
+        output_data={"text": "3"},
+        annotation=0,
+        note="Incorrect summation",
+    ),
+    DataPoint(
+        input_data={"text": "What is the square root of 16?"},
+        output_data={"text": "4"},
+        annotation=1,
+        note="Correct square root",
+    ),
+]
+
+eval_ds_formatted = [
+    DataPoint(
+        input_data={"text": "What is the square root of 16?"},
+        output_data={"text": "4"},
+        annotation=1,
+        note="Correct square root",
+    ),
+    DataPoint(
+        input_data={"text": "What is the square root of 16?"},
+        output_data={"text": "3"},
+        annotation=0,
+        note="Incorrect square root",
+    ),
+]
+
+LLM_MODEL = "gpt-4"
+
+forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
+results = asyncio.run(forger.fit(train_ds_formatted))
+forged_judge = results["forged_judges"]["judge"]
+
+logger.rule("Running assertions and calculating metrics", color="blue")
+
+async def run_assertions_and_calculate_metrics(forger, judge, data):
+    all_data_forged_judge_assertion_results = await forger.run_assertions(judge, data)
+    all_data_metrics = calculate_alignment_metrics(all_data_forged_judge_assertion_results)
+    format_alignment_metrics(all_data_metrics)
+    return
+
+asyncio.run(run_assertions_and_calculate_metrics(
+    forger, forged_judge, eval_ds_formatted))
+```
+
 ## 📊 Key Components
 
 - `main.py`: Annotation app
-- `forge_evaluation_judge.ipynb`: Judge creation notebook
-- `run_forged_judge.ipynb`: Judge execution notebook
+- `cli.py`: Command-line interface for EvalForge
+- `evalforge/`: Core library code
+- `forge_mini.py`: Example script demonstrating programmatic usage
 
 All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.
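+
+To send those traces to a specific wandb project, you can initialize Weave explicitly before running EvalForge. A minimal sketch (assuming EvalForge does not already initialize Weave for you; `"evalforge-demo"` is a placeholder project name):
+
+```python
+import weave
+
+# Route traced calls to a wandb project of your choice.
+weave.init("evalforge-demo")  # placeholder project name
+```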
 
-Happy evaluating! 🎉
\ No newline at end of file
+## 📝 What's New
+
+- **Modular Codebase**: Refactored the `EvalForge` class and added helper methods for better modularity.
+- **Command-Line Interface**: Added `cli.py`, using `simple_parsing`, for easy configuration from the command line.
+- **Flexible Data Loading**: Enhanced the `DataPoint` class and the `load_data` function to handle various data formats.
+- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
+- **Error Handling**: Improved exception handling in both the CLI and the core classes.
+
+## 🛠️ Contributing
+
+Contributions are welcome! Please follow these steps:
+
+1. Fork the repository on GitHub.
+2. Create a new branch for your feature or bug fix.
+3. Make your changes and ensure that tests pass.
+4. Submit a pull request with a detailed description of your changes.
+
+## 📄 License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+## 📞 Contact
+
+For questions or suggestions, feel free to reach out to the authors:
+
+- **Alex Volkov**: [alex.volkov@wandb.com](mailto:alex.volkov@wandb.com)
+- **Anish Shah**: [anish@wandb.com](mailto:anish@wandb.com)
+- **Thomas Capelle**: [tcapelle@pm.me](mailto:tcapelle@pm.me)
+
+## 🙏 Acknowledgments
+
+- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
+- [Weave](https://github.com/wandb/weave) for tracking and tracing infrastructure.
+- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
+- All contributors and users who provided feedback and suggestions.
+
+Happy evaluating! 🎉