MedConceptsQA

An open-source medical concepts question answering benchmark.

The benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.

The paper is available at https://www.sciencedirect.com/science/article/pii/S0010482524011740.
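For a quick look at the data outside the evaluation harness, the dataset can be loaded directly with the HuggingFace `datasets` library. A minimal sketch; the configuration names are discovered at runtime rather than assumed:

```python
# Minimal sketch: inspect MedConceptsQA from the HuggingFace Hub.
# Configuration and split names are listed at runtime, not assumed.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("ofir408/MedConceptsQA")
print(configs)  # available vocabulary/difficulty configurations

dataset = load_dataset("ofir408/MedConceptsQA", configs[0])
split = next(iter(dataset))   # first available split
print(dataset[split][0])      # one QA example as a dict
```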

If you use MedConceptsQA or find this repository useful for your research or work, please cite us as follows:

@article{SHOHAM2024109089,
  title   = {MedConceptsQA: Open source medical concepts QA benchmark},
  journal = {Computers in Biology and Medicine},
  volume  = {182},
  pages   = {109089},
  year    = {2024},
  issn    = {0010-4825},
  doi     = {10.1016/j.compbiomed.2024.109089},
  url     = {https://www.sciencedirect.com/science/article/pii/S0010482524011740},
  author  = {Ofir Ben Shoham and Nadav Rappoport}
}

How To Run?

Install the required dependencies:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Run the benchmark evaluation:

lm_eval --model hf --model_args pretrained=MODEL_ID --tasks med_concepts_qa --device cuda:0 --num_fewshot SHOTS_NUM --batch_size auto --limit 250 --output_path OUTPUT_RESULTS_DIR_PATH

Replace MODEL_ID with the HuggingFace model name or a local path to the pretrained model you want to evaluate.

Replace OUTPUT_RESULTS_DIR_PATH with the directory path to store the results CSV file.

Replace SHOTS_NUM with the number of few-shot examples. The default is 4; for zero-shot evaluation, use 0.

Few-shot evaluation:

lm_eval --model hf --model_args pretrained=BioMistral/BioMistral-7B-DARE --tasks med_concepts_qa --device cuda:0 --num_fewshot 4 --batch_size auto --limit 250 --output_path results/few_shot/250_examples/

Zero-shot evaluation:

lm_eval --model hf --model_args pretrained=BioMistral/BioMistral-7B-DARE --tasks med_concepts_qa --device cuda:0 --num_fewshot 0 --batch_size auto --limit 250 --output_path results/zero_shot/250_examples/

Run GPT benchmark evaluation:

Install the requirements: pip install -r requirements.txt

Set OPENAI_API_KEY as an environment variable containing your OpenAI API key, then run:

python gpt4_runner.py --model_id gpt-4-0125-preview --shots_num 4 --total_eval_examples_num 250 --output_results_dir_path results/few_shot/250_examples/

GPT-4 zero-shot evaluation:

python gpt4_runner.py --model_id gpt-4-0125-preview --shots_num 0 --total_eval_examples_num 250 --output_results_dir_path results/zero_shot/250_examples/
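gpt4_runner.py implements the full benchmark loop. Purely for illustration, here is a minimal sketch of posing one multiple-choice question through the OpenAI chat completions API; the prompt format and example question below are assumptions, not the script's actual behavior:

```python
# Illustrative sketch only -- NOT the repository's gpt4_runner.py.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_multiple_choice(model_id: str, question: str, options: dict) -> str:
    """Ask the model to answer a question with a single option letter."""
    option_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt = (
        f"{question}\n{option_text}\n"
        "Answer with the letter of the correct option only."
    )
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Hypothetical example question (not taken from the benchmark):
answer = ask_multiple_choice(
    "gpt-4-0125-preview",
    "What does the ICD-10-CM code E11.9 describe?",
    {"A": "Type 1 diabetes mellitus with ketoacidosis",
     "B": "Type 2 diabetes mellitus without complications",
     "C": "Essential (primary) hypertension",
     "D": "Acute bronchitis, unspecified"},
)
print(answer)
```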

Leaderboard:

Zero-shot:

| Model Name | Accuracy | CI |
|---|---|---|
| gpt-4-0125-preview | 52.489 | 2.064 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 48.471 | 2.065 |
| HPAI-BSC/Qwen2.5-Aloe-Beta-72B | 48.347 | 2.065 |
| m42-health/Llama3-Med42-70B | 47.093 | 2.062 |
| meta-llama/Meta-Llama-3-70B-Instruct | 47.076 | 2.062 |
| aaditya/Llama3-OpenBioLLM-70B | 41.849 | 2.039 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 38.462 | 2.010 |
| gpt-3.5-turbo | 37.058 | 1.996 |
| meta-llama/Meta-Llama-3-8B-Instruct | 34.8 | 1.968 |
| aaditya/Llama3-OpenBioLLM-8B | 29.431 | 1.883 |
| johnsnowlabs/JSL-MedMNX-7B | 28.649 | 1.868 |
| epfl-llm/meditron-70b | 28.133 | 1.858 |
| dmis-lab/meerkat-7b-v1.0 | 27.982 | 1.855 |
| BioMistral/BioMistral-7B-DARE | 26.836 | 1.831 |
| epfl-llm/meditron-7b | 26.107 | 1.814 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 25.929 | 1.811 |
| dmis-lab/biobert-v1.1 | 25.636 | 1.804 |
| UFNLP/gatortron-large | 25.298 | 1.796 |
| PharMolix/BioMedGPT-LM-7B | 24.924 | 1.787 |

Few-shot:

| Model Name | Accuracy | CI |
|---|---|---|
| gpt-4-0125-preview | 61.911 | 3.475 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 58.720 | 3.523 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 58.142 | 3.530 |
| meta-llama/Meta-Llama-3-70B-Instruct | 57.867 | 3.534 |
| m42-health/Llama3-Med42-70B | 56.551 | 3.547 |
| aaditya/Llama3-OpenBioLLM-70B | 53.387 | 3.57 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 41.671 | 3.528 |
| gpt-3.5-turbo | 41.476 | 3.526 |
| meta-llama/Meta-Llama-3-8B-Instruct | 40.693 | 3.516 |
| aaditya/Llama3-OpenBioLLM-8B | 35.316 | 3.421 |
| epfl-llm/meditron-70b | 34.809 | 3.409 |
| johnsnowlabs/JSL-MedMNX-7B | 32.436 | 3.35 |
| BioMistral/BioMistral-7B-DARE | 28.702 | 3.237 |
| PharMolix/BioMedGPT-LM-7B | 28.204 | 3.22 |
| dmis-lab/meerkat-7b-v1.0 | 28.187 | 3.219 |
| epfl-llm/meditron-7b | 26.231 | 3.148 |
| dmis-lab/biobert-v1.1 | 25.982 | 3.138 |
| UFNLP/gatortron-large | 25.093 | 3.102 |
| HPAI-BSC/Qwen2.5-Aloe-Beta-72B | 25.058 | 3.101 |
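The README does not define the CI column. If it is the half-width of a 95% normal-approximation binomial confidence interval on accuracy (an assumption, as is the example count n below), it can be reproduced in a few lines:

```python
# Sketch under an ASSUMPTION: the CI column is the half-width of a 95%
# normal-approximation binomial confidence interval on accuracy.
import math

def ci95_half_width(accuracy_pct: float, n: int) -> float:
    """95% normal-approximation CI half-width, in percentage points."""
    p = accuracy_pct / 100.0
    return 1.96 * math.sqrt(p * (1.0 - p) / n) * 100.0

# Hypothetical example: gpt-4-0125-preview zero-shot accuracy of 52.489%
# with n = 2250 evaluated examples (an assumed count, e.g. 9 subtasks at
# --limit 250 each) gives roughly 2.06, close to the tabulated 2.064.
print(round(ci95_half_width(52.489, 2250), 3))
```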

If you wish to submit your model for evaluation, please open a GitHub issue with your model's HuggingFace name.
