
feat(benchmarks) Add LLM evaluation pipeline for Medical challenge #3768

Merged: 37 commits, Sep 9, 2024
Changes from 7 commits

Commits
a56f2ee  Init medical eval (yan-gao-GY, Jul 10, 2024)
d1b6bfe  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Jul 10, 2024)
2ba7a39  Update pyproject.toml (yan-gao-GY, Jul 10, 2024)
ea45e4b  Update readme (yan-gao-GY, Jul 12, 2024)
e7fbb62  Update readme (yan-gao-GY, Jul 12, 2024)
a2344ea  Update readme (yan-gao-GY, Jul 15, 2024)
d00479c  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Jul 15, 2024)
7f40d3a  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
2327052  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
72db769  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
a06646b  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
5e9e6a7  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
b2529bf  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
f1c1804  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
a1fe792  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
87e1b29  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Aug 7, 2024)
f63e83e  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Aug 7, 2024)
7d58f1c  Merge overall readme (yan-gao-GY, Aug 7, 2024)
a885bde  Merge overall readme (yan-gao-GY, Aug 7, 2024)
6e9de01  update readme & evaluate.py (yan-gao-GY, Aug 7, 2024)
76785d6  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Aug 7, 2024)
5638c5b  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Aug 8, 2024)
9b50e89  Replace pyproject.toml with requirements.txt (yan-gao-GY, Aug 8, 2024)
3c2bb9e  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Aug 13, 2024)
9f02f33  Update top readme (yan-gao-GY, Aug 13, 2024)
ef92bea  Remove useless import (yan-gao-GY, Aug 15, 2024)
85f0d94  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Aug 23, 2024)
929b962  Update license (yan-gao-GY, Aug 23, 2024)
b520b06  Merge branch 'main' into add-llm-medical-eval (jafermarq, Sep 2, 2024)
978ad0f  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Sep 5, 2024)
42763c8  Simplify code (yan-gao-GY, Sep 6, 2024)
8529b12  Formatting (yan-gao-GY, Sep 6, 2024)
a10542a  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Sep 6, 2024)
822cf4d  Update benchmarks/flowertune-llm/evaluation/medical/README.md (yan-gao-GY, Sep 9, 2024)
2b72996  Merge branch 'main' into add-llm-medical-eval (yan-gao-GY, Sep 9, 2024)
4d1a39a  Merge branch 'main' into add-llm-medical-eval (jafermarq, Sep 9, 2024)
1b51125  Add ref for instruction (yan-gao-GY, Sep 9, 2024)
44 changes: 44 additions & 0 deletions benchmarks/flowertune-llm/evaluation/medical/README.md
@@ -0,0 +1,44 @@
## Evaluation for Medical challenge

We leverage the medical question answering (QA) metric provided by [Meditron](https://github.com/epfLLM/meditron/tree/main/evaluation) to evaluate our fine-tuned LLMs.
Three datasets have been selected for this evaluation: [PubMedQA](https://huggingface.co/datasets/bigbio/pubmed_qa), [MedMCQA](https://huggingface.co/datasets/medmcqa), and [MedQA](https://huggingface.co/datasets/bigbio/med_qa).


### Step 0. Set up Environment

```shell
git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/medical ./flowertune-eval-medical && rm -rf flower && cd flowertune-eval-medical
```

Then, install dependencies with:

```shell
# From a new python environment, run:
pip install -e .

# Log in to your Hugging Face account
huggingface-cli login
```

### Step 1. Generate model answers to medical questions

```bash
python inference.py \
--peft-path=/path/to/fine-tuned-peft-model-dir/ # e.g., ./peft_1
--dataset-name=pubmedqa # chosen from [pubmedqa, medmcqa, medqa]
--run-name=fl # arbitrary name for this run
```
The answers will be saved to `benchmarks/generations/[dataset_name]-[run_name].jsonl` by default.
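If you want to sanity-check the generations before scoring, each line of that JSONL file is one record. A minimal sketch for peeking at it (the field names `prompt`, `gold` and `answer` are assumptions based on `benchmarks.py` and may differ depending on `inference.py`):

```python
import json

# Print the first few generation records (field names are assumptions)
with open("benchmarks/generations/pubmedqa-fl.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print({k: record.get(k) for k in ("prompt", "gold", "answer")})
        if i >= 2:
            break
```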


### Step 2. Calculate accuracy

```bash
python evaluate.py \
--dataset-name=pubmedqa # chosen from [pubmedqa, medmcqa, medqa]
--run-name=fl # run_name used in Step 1
```
The accuracy value will be printed on the screen.

> [!NOTE]
> Please ensure that you provide all **three accuracy values** for three evaluation datasets when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
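
Since the leaderboard expects accuracy on all three datasets, a small convenience loop over Steps 1 and 2 can help; this is only a sketch and assumes the same PEFT checkpoint and `--run-name` for every dataset:

```bash
# Run inference and scoring for all three medical QA datasets
for ds in pubmedqa medmcqa medqa; do
    python inference.py --peft-path=/path/to/fine-tuned-peft-model-dir/ --dataset-name=$ds --run-name=fl
    python evaluate.py --dataset-name=$ds --run-name=fl
done
```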
343 changes: 343 additions & 0 deletions benchmarks/flowertune-llm/evaluation/medical/benchmarks.py
@@ -0,0 +1,343 @@
# This python file is adapted from https://github.com/epfLLM/meditron/blob/main/evaluation/benchmarks.py

import json
import os
import random

import pandas as pd

from datasets import Dataset, load_dataset

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))


def benchmark_factory(name):
"""Creates a benchmark object.

:param name: str, with the benchmark name.
    :return: the instantiated benchmark object.
"""
# Note: benchmark is instantiated *after* selection.
factories = {
"medmcqa": MedMCQA,
"pubmedqa": ClosedPubMedQA,
"medqa": MedQA,
}
if name not in factories:
raise ValueError(
"Benchmark {} not found. \
Select one of the following: {}".format(
name, list(factories.keys())
)
)
return factories[name](name)
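
# Example: benchmark_factory("pubmedqa") returns a ClosedPubMedQA instance,
# benchmark_factory("medqa") a MedQA instance, and so on.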


def load_instruction(prompt_name):
"""Loads the instruction for the given benchmark.

    :param prompt_name: str, the name of the prompt to be used
"""
path = os.path.join(ROOT_DIR, "instructions.json")
if not os.path.exists(path):
raise FileNotFoundError(
"Please save the different prompts to instructions.json"
)

with open(path) as f:
prompts = json.load(f)
return prompts[prompt_name]
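
# Expected shape of instructions.json (assumed here from how add_instruction()
# consumes the returned dict; the actual prompts ship with the benchmark):
# {
#     "pubmedqa": {"system": "...", "user": "..."},
#     "medmcqa": {"system": "...", "user": "..."},
#     "medqa": {"system": "...", "user": "..."}
# }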


class Benchmark:
def __init__(self, name):
"""Class to implement a benchmark for evaluation.

:param name: str, with the benchmark name.
:param path: str (optional), the path to the benchmark data.
:param splits: list of str, the splits of the data: train / test
:param hub_name: str, the name of the HuggingFace hub dataset.
:param dir_name: str, the name of the directory where the data is stored.
:param train_data: HuggingFace Dataset, the train data.
:param test_data: HuggingFace Dataset, the test data.
:param generations: HuggingFace Dataset, the generations.
:param subsets: list of str (optional), the subsets of the data to download from
the HuggingFace hub.
"""
self.name = name
self.path = None
self.splits = None
self.hub_name = None
self.dir_name = None
self.train_data = None
self.test_data = None
self.generations = None
self.subsets = None

def load_from_hub(self):
"""Downloads the benchmark data from the HuggingFace hub (for 1st time loading)
This is specific to each benchmark and must be implemented in the extended
class."""
print(f"Downloading benchmark from HuggingFace hub ({self.hub_name}).")
try:
if self.subsets is None:
load_dataset(
self.hub_name,
cache_dir=os.path.join(ROOT_DIR, "benchmarks", "datasets"),
trust_remote_code=True,
download_mode="force_redownload",
)
else:
for subset in self.subsets:
load_dataset(
self.hub_name,
subset,
cache_dir=os.path.join(ROOT_DIR, "benchmarks", "datasets"),
trust_remote_code=True,
download_mode="force_redownload",
)
        except Exception:
raise ValueError(
"Default Huggingface loader failed for benchmark {}. \
Try implementing a custom load_from_hub function.".format(
self.name
)
)

def load_data(self, partition="train"):
"""Loads benchmark data from a local directory, or from the HuggingFace hub if
not yet downloaded. Based on the input partition type, instantiates the
respective class attribute.

        :param partition: str, the split of the data: train / test
"""
print("=" * 50 + f"\nLoading data for benchmark {self.name}.\n")
if partition not in self.splits:
raise ValueError(
"Please provide a valid partition split: {}".format(self.splits)
)
if not os.path.exists(self.path):
os.makedirs(self.path)
self.load_from_hub()
try:
if self.subsets is None:
if partition == "train":
self.train_data = load_dataset(self.path, split=partition)
elif partition in ["test", "validation"]:
self.test_data = load_dataset(self.path, split=partition)
else:
if partition == "train":
self.train_data = aggregate_datasets(
self.path, self.subsets, partition=partition
)
elif partition in ["test", "validation"]:
self.test_data = aggregate_datasets(
self.path, self.subsets, partition=partition
)

except ValueError as e:
print(e)
raise ValueError(
"Couldn't load benchmark {} from local path.".format(self.name)
)

def preprocessing(self, partition="train"):
"""Applies a custom pre-processing over the partition. If instruction is
provided, preprends it to the question Updates the train or test self
attributes.

:param _preprocess: function: dict -> dict, the preprocessing function to apply.
:param partition: str, the split of the data: train / test
"""
try:
if partition == "train":
self.train_data = self.train_data.map(self.custom_preprocessing)
elif partition in ["test", "validation"]:
self.test_data = self.test_data.map(self.custom_preprocessing)
else:
raise ValueError(
"Please provide a valid partition split: train or test"
)
except Exception as e:
print(e)
raise ValueError(
"Error when pre-processing {} {} data.".format(self.name, partition)
)

def custom_preprocessing(self):
"""Wraps a pre-processing function (dict -> dict) specific to the benchmark.
Needs to be overriden in the extended class.

The return dictionary must contains keys 'prompt' & 'answer' for inference to
work.
"""
raise NotImplementedError("Implement custom_preprocessing() in a child class.")

def add_instruction(self, instruction=None, partition="train"):
"""Adds instructions to the data based on the input partition.

        :param instruction: dict, with the `system` and `user` instructions wrapped around each prompt (`system` before, `user` after)
:param partition: str, the split of the data: train / test
"""

def _add_instruction(row):
row["prompt"] = "{}\n{}\n{}\n".format(
instruction["system"], row["prompt"], instruction["user"]
)
return row

if partition == "train":
self.train_data = self.train_data.map(_add_instruction)
elif partition == "test" or partition == "validation":
self.test_data = self.test_data.map(_add_instruction)
else:
raise ValueError(
"Please provide a valid partition split: {}".format(self.splits)
)

def add_generations(self, data):
"""Adds the generations to the respective class attribute as a HuggingFace
Dataset.

:param data: pd.DataFrame or HuggingFace Dataset
"""
if isinstance(data, pd.DataFrame):
self.generations = Dataset.from_pandas(data)
elif isinstance(data, Dataset):
self.generations = data

def save_generations(self, dataset_name, run_name):
"""Saves the generations in the respective directory."""
        path = os.path.join(ROOT_DIR, "benchmarks", "generations")
        os.makedirs(path, exist_ok=True)
gen_path = os.path.join(path, f"{dataset_name}-{run_name}.jsonl")

self.generations.to_json(gen_path, orient="records")
print(
"Stored {} generations to the following path: {}".format(
self.name, gen_path
)
)


class MedMCQA(Benchmark):
"""MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset
designed to address real-world medical entrance exam questions.

Huggingface card: https://huggingface.co/datasets/medmcqa
"""

def __init__(self, name="medmcqa") -> None:
super().__init__(name)
self.hub_name = "medmcqa"
self.dir_name = "medmcqa"
self.path = os.path.join(ROOT_DIR, "benchmarks", "datasets", self.dir_name)
self.splits = ["train", "validation", "test"]
self.num_options = 4

@staticmethod
def custom_preprocessing(row):
options = [row["opa"], row["opb"], row["opc"], row["opd"]]
answer = int(row["cop"])
row["prompt"] = format_mcq(row["question"], options)
row["gold"] = chr(ord("A") + answer) if answer in [0, 1, 2, 3] else None
return row


class ClosedPubMedQA(Benchmark):
"""PubMedQA is a novel biomedical question answering (QA) dataset. Its task is to
answer research biomedical questions with yes/no/maybe using PubMed abstracts.

Huggingface card: https://huggingface.co/datasets/bigbio/pubmed_qa
"""

def __init__(self, name="pubmedqa") -> None:
super().__init__(name)
self.hub_name = "bigbio/pubmed_qa"
self.dir_name = "bigbio___pubmed_qa"
self.path = os.path.join(ROOT_DIR, "benchmarks", "datasets", self.dir_name)
self.splits = ["train", "validation", "test"]
self.subsets = ["pubmed_qa_labeled_fold0_source"]
self.num_options = 3

@staticmethod
def custom_preprocessing(row):
context = "\n".join(row["CONTEXTS"])
row["prompt"] = f"{context}\n{row['QUESTION']}"
row["gold"] = row["final_decision"]
row["long_answer"] = row["LONG_ANSWER"]
return row


class MedQA(Benchmark):
"""MedQA is a dataset for solving medical problems collected from the professional
medical board exams.

Huggingface card: https://huggingface.co/datasets/bigbio/med_qa
"""

def __init__(self, name="medqa") -> None:
super().__init__(name)
self.hub_name = "bigbio/med_qa"
self.dir_name = "bigbio___med_qa"
self.path = os.path.join(ROOT_DIR, "benchmarks", "datasets", self.dir_name)
self.splits = ["train", "validation", "test"]
self.num_options = 5
self.subsets = ["med_qa_en_4options_source"]

@staticmethod
def custom_preprocessing(row):
choices = [opt["value"] for opt in row["options"]]
row["prompt"] = format_mcq(row["question"], choices)
for opt in row["options"]:
if opt["value"] == row["answer"]:
row["gold"] = opt["key"]
break
return row


def format_mcq(question, options):
"""
Formats a multiple choice question with the given options.
Uses the format recommended by: https://huggingface.co/blog/evaluating-mmlu-leaderboard

'Question: What is the capital of France?

Options:
A. London
B. Paris
C. Berlin
D. Rome'

:param question: str, the question
:param options: list of str, the options
:return: str, the formatted question
"""
if not question.endswith("?") and not question.endswith("."):
question += "?"
options_str = "\n".join([f"{chr(65+i)}. {options[i]}" for i in range(len(options))])
prompt = "Question: " + question + "\n\nOptions:\n" + options_str
return prompt


def aggregate_datasets(path, subsets, partition="train"):
"""Takes as input a Huggingface DatasetDict with subset name as key, and Dataset as
value. Returns a pd.DataFrame with all subsets concatenated.

:param subsets: list of str, the subsets of the data to download from the
HuggingFace hub.
:return: pd.DataFrame
"""
dataframes = []
for subset in subsets:
subset_data = load_dataset(os.path.join(path, subset), split=partition)
subset_df = pd.DataFrame(subset_data.map(lambda x: {"subset": subset, **x}))
dataframes.append(subset_df)
aggregate_df = pd.concat(dataframes, axis=0)
aggregate = Dataset.from_pandas(aggregate_df)
if "__index_level_0__" in aggregate.column_names:
aggregate = aggregate.remove_columns("__index_level_0__")
return aggregate
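
For orientation, here is a minimal sketch of how the classes above fit together end to end. It assumes `instructions.json` contains an entry keyed by the benchmark name with `system` and `user` fields; the actual driver scripts (`inference.py`, `evaluate.py`) are not shown in this diff and may differ:

```python
import pandas as pd

from benchmarks import benchmark_factory, load_instruction

# Build the benchmark, download the data and create the 'prompt' / 'gold' columns
benchmark = benchmark_factory("pubmedqa")
benchmark.load_data(partition="test")
benchmark.preprocessing(partition="test")

# Wrap every prompt with the system/user instruction (the key name is an assumption)
benchmark.add_instruction(load_instruction("pubmedqa"), partition="test")

# Placeholder generations; a real run would fill 'answer' with model outputs
generations = pd.DataFrame(
    {
        "prompt": benchmark.test_data["prompt"],
        "answer": ["maybe"] * len(benchmark.test_data),
    }
)
benchmark.add_generations(generations)
benchmark.save_generations(dataset_name="pubmedqa", run_name="fl")
```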