# EvalForge Project

EvalForge is a tool designed to evaluate and improve your Language Model (LLM) applications. It allows you to create and run LLM judges based on annotated datasets, using Weights & Biases (wandb) and Weave for tracking and tracing.

## Setup

1. **Environment Variables**

   Create a `.env` file in the project root with the following variables (a sketch for loading them in your own scripts follows these steps):

   ```
   WANDB_EMAIL=your_wandb_email
   WANDB_API_KEY=your_wandb_api_key
   OPENAI_API_KEY=your_openai_api_key
   ```

2. **Install Dependencies**

   Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Or, install directly via pip:

   ```bash
   pip install git+https://github.com/wandb/evalforge.git
   ```

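If you run EvalForge from your own script or notebook rather than the CLI, you may need to load the `.env` yourself. A minimal sketch, assuming the third-party `python-dotenv` package (not a listed dependency):

```python
# Minimal sketch: load .env into the process environment.
# Assumes python-dotenv is installed (pip install python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Fail fast if a required key is missing.
for key in ("WANDB_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set")
```
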
## Quick Start with the Command-Line Interface

EvalForge includes a command-line interface (CLI) powered by `simple_parsing`, allowing you to run evaluations directly from the terminal.

### Basic Usage

```bash
evalforge.forge --data path/to/data.json
```

### Available Arguments

You can customize EvalForge using command-line arguments corresponding to the `EvalForge` class attributes:

- `--data`: *(Required)* Path to the training data file (JSON or CSV).
- `--llm_model`: Language model to use (default: `"gpt-3.5-turbo"`).
- `--num_criteria_to_generate`: Number of evaluation criteria to generate (default: `3`).
- `--alignment_threshold`: Threshold for selecting the best criteria (default: `0.4`).
- `--num_criteria`: Number of best criteria to select (default: `3`).
- `--batch_size`: Batch size for data processing (default: `4`).
- *And more...*

Use the `--help` flag to see all available options:

```bash
evalforge.forge --help
```

### Example

```bash
evalforge.forge \
    --data train_data.json \
    --llm_model "gpt-4" \
    --num_criteria_to_generate 5 \
    --batch_size 2
```

## Data Format

EvalForge expects data in JSON or CSV format. Each data point should include at least the following fields:

- `input` or `input_data`: Input provided to the model.
- `output` or `output_data`: Output generated by the model.
- `annotation`: Binary annotation (`1` for correct, `0` for incorrect).
- `note`: *(Optional)* Additional context or notes about the data point.
- `human_description`: *(Optional)* Human-provided task description.

**JSON Example:**

```json
[
  {
    "input": "What is 2 + 2?",
    "output": "4",
    "annotation": 1,
    "note": "Simple arithmetic",
    "human_description": "Basic math questions"
  },
  {
    "input": "Translate 'Hello' to French.",
    "output": "Bonjour",
    "annotation": 1,
    "note": "Basic translation",
    "human_description": "Simple language translation"
  }
]
```

**CSV Example:**

```csv
input,output,annotation,note,human_description
"Define Newton's First Law","An object in motion stays in motion unless acted upon by an external force.",1,"Physics question","Physics definitions"
"What is the capital of France?","Paris",1,"Geography question","Capitals of countries"
```

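Both formats load the same way. A small sketch, assuming `load_data` accepts either file type (as the CLI's `--data` flag suggests):

```python
# Sketch: load annotated examples for EvalForge.
# Assumes load_data infers JSON vs. CSV from the file extension.
from evalforge.data_utils import load_data

train_data = load_data("train_data.json")  # or "train_data.csv"
print(f"Loaded {len(train_data)} annotated examples")
```
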
## Running the Annotation App

To run the annotation app:

```bash
python main.py
```

This will launch a web interface for annotating your dataset.

## Creating an LLM Judge Programmatically

You can create an LLM judge programmatically using the `EvalForge` class.

### Example Usage

```python
import asyncio

from evalforge.forge import EvalForge
from evalforge.data_utils import load_data

# Load data
train_data = load_data('path/to/train_data.json')

# Create an EvalForge instance with custom configurations
forge = EvalForge(llm_model="gpt-4", num_criteria_to_generate=5)

# Run the fit method asynchronously
asyncio.run(forge.fit(train_data))
```

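`fit` also returns the forged judges. Mirroring the key structure used in `forge_mini.py` below, you can capture the return value and grab the judge from it:

```python
# Key structure taken from forge_mini.py (see below);
# fit returns the forged judges among its results.
results = asyncio.run(forge.fit(train_data))
forged_judge = results["forged_judges"]["judge"]
```
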
## Running the Generated Judge

To load and run the generated judge:

1. Open `run_forged_judge.ipynb` in a Jupyter environment.
2. Run all cells in the notebook.

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.

Alternatively, you can use the `forge_mini.py` script as an example:

```python
# forge_mini.py
import asyncio

from evalforge.utils import logger
from evalforge.forge import EvalForge
from evalforge.data_utils import DataPoint
from evalforge.alignment import calculate_alignment_metrics, format_alignment_metrics

# Annotated examples used to forge the judge
train_ds_formatted = [
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "2"},
        annotation=1,
        note="Correct summation",
    ),
    DataPoint(
        input_data={"text": "1+1="},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect summation",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
]

# Held-out examples used to check the judge against human annotations
eval_ds_formatted = [
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "4"},
        annotation=1,
        note="Correct square root",
    ),
    DataPoint(
        input_data={"text": "What is the square root of 16?"},
        output_data={"text": "3"},
        annotation=0,
        note="Incorrect square root",
    ),
]

LLM_MODEL = "gpt-4"

# Fit a judge on the training annotations
forger = EvalForge(batch_size=1, num_criteria_to_generate=1, llm_model=LLM_MODEL)
results = asyncio.run(forger.fit(train_ds_formatted))
forged_judge = results["forged_judges"]["judge"]

logger.rule("Running assertions and calculating metrics", color="blue")

async def run_assertions_and_calculate_metrics(forger, judge, data):
    # Run the judge's assertions on the data and report how well
    # its verdicts align with the human annotations.
    assertion_results = await forger.run_assertions(judge, data)
    metrics = calculate_alignment_metrics(assertion_results)
    format_alignment_metrics(metrics)

asyncio.run(run_assertions_and_calculate_metrics(forger, forged_judge, eval_ds_formatted))
```

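The alignment metrics compare the judge's verdicts against the human `annotation` labels on the held-out examples; the `--alignment_threshold` CLI flag applies the same notion of alignment when selecting criteria.
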
## Key Components

- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
- `cli.py`: Command-line interface for EvalForge
- `evalforge/`: Core library code
- `forge_mini.py`: Example script demonstrating programmatic usage

All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.

## What's New

- **Modular Codebase**: Refactored the `EvalForge` class and added helper methods for better modularity.
- **Command-Line Interface**: Added `cli.py` using `simple_parsing` for easy configuration via the CLI.
- **Flexible Data Loading**: Enhanced the `DataPoint` class and `load_data` function to handle various data formats.
- **Improved Logging**: Replaced print statements with a logging framework for better control over output.
- **Error Handling**: Improved exception handling in both the CLI and core classes.

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository on GitHub.
2. Create a new branch for your feature or bug fix.
3. Make your changes and ensure that tests pass.
4. Submit a pull request with a detailed description of your changes.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

For questions or suggestions, feel free to reach out to the authors:

- **Alex Volkov**: [[email protected]](mailto:[email protected])
- **Anish Shah**: [[email protected]](mailto:[email protected])
- **Thomas Capelle**: [[email protected]](mailto:[email protected])

## Acknowledgments

- [simple_parsing](https://github.com/lebrice/simple_parsing) for simplifying argument parsing.
- [Weave](https://github.com/wandb/weave) for providing modeling infrastructure.
- [litellm](https://github.com/BerriAI/litellm) for lightweight LLM integration.
- All contributors and users who provided feedback and suggestions.

---

Happy evaluating!