
Commit 5dae219

Merge pull request #32 from scicode-bench/zilinghan/hf

Zilinghan/hf

2 parents f10d88f + 0aa4bd8

File tree

11 files changed: +81 −168 lines


README.md

+14 −9

@@ -7,6 +7,8 @@ This repo contains the evaluation code for the paper "[SciCode: A Research Codin
 
 ## 🔔News
 
+**[2025-02-17]: SciCode benchmark is available at [HuggingFace Datasets](https://huggingface.co/datasets/SciCode1/SciCode)!**
+
 **[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini are added.**
 
 **[2025-01-24]: SciCode has been integrated with [`inspect_ai`](https://inspect.ai-safety-institute.org.uk/) for easier and faster model evaluations.**
@@ -54,27 +56,30 @@ SciCode sources challenging and realistic research-level coding problems across
 | Mixtral-8x22B-Instruct | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
 | Llama-3-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |
 
+## Instructions to evaluate a new model using `inspect_ai` (recommended)
+
 
-## Instructions to evaluate a new model
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation. Run the following steps:
 
 1. Clone this repository `git clone git@github.com:scicode-bench/SciCode.git`
 2. Install the `scicode` package with `pip install -e .`
 3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
-4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/)) for more information
-5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests
-
-
-## Instructions to evaluate a new model using `inspect_ai` (recommended)
-
-Scicode has been integrated with `inspect_ai` for easier and faster model evaluation, compared with the methods above. You need to run the first three steps in the [above section](#instructions-to-evaluate-a-new-model), and then go to the `eval/inspect_ai` directory, setup correspoinding API key, and run the following command:
+4. Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:
 
 ```bash
 cd eval/inspect_ai
 export OPENAI_API_KEY=your-openai-api-key
 inspect eval scicode.py --model openai/gpt-4o --temperature 0
 ```
 
-For more detailed information of using `inspect_ai`, see [`eval/inspect_ai` readme](eval/inspect_ai/)
+💡 For more detailed information on using `inspect_ai`, see the [`eval/inspect_ai` readme](eval/inspect_ai/).
+
+## Instructions to evaluate a new model in two steps (deprecated)
+
+Note that this is a deprecated way of evaluating models; using `inspect_ai` is the recommended way. Please use this method only if `inspect_ai` does not work for your needs. Run the first three steps in the above section, then run the following two commands:
+
+4. Run `eval/scripts/gencode.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
+5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests
 
 ## More information and FAQ
 
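Since the news entry above points the benchmark at HuggingFace Datasets, a quick way to confirm the published splits is to load them directly with the `datasets` library (added as a dependency in this commit). This is a minimal sketch, assuming network access to the HuggingFace Hub; the `problem_id` and `sub_steps` field names are taken from the evaluation scripts changed below.

```python
# Minimal sketch: inspect the SciCode release on HuggingFace Datasets.
# Assumes `datasets` is installed (it is added to pyproject.toml in this commit)
# and that the machine has network access to the HuggingFace Hub.
from datasets import load_dataset

for split in ("validation", "test"):
    ds = load_dataset("SciCode1/SciCode", split=split)
    first = ds[0]
    # Field names follow eval/scripts/gencode.py and test_generated_code.py.
    print(split, len(ds), first["problem_id"], len(first["sub_steps"]))
```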

eval/data/problems_all.jsonl

-65
This file was deleted.

eval/data/problems_dev.jsonl

-15
This file was deleted.

eval/inspect_ai/README.md

+6 −6

@@ -16,26 +16,26 @@ However, there are some additional command line arguments that could be useful a
 
 - `--max-connections`: Maximum amount of API connections to the evaluated model.
 - `--limit`: Limit of the number of samples to evaluate in the SciCode dataset.
-- `-T input_path=<another_input_json_file>`: This is useful when user wants to change to another json dataset (e.g., the dev set).
+- `-T split=validation/test`: Whether the user wants to run on the small `validation` set (15 samples) or the large `test` set (65 samples).
 - `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
 - `-T h5py_file=<your_h5py_file>`: This is used if your h5py file is not downloaded in the recommended directory.
 - `-T with_background=True/False`: Whether to include problem background.
 - `-T mode=normal/gold/dummy`: This provides two additional modes for sanity checks.
   - `normal` mode is the standard mode to evaluate a model
-  - `gold` mode can only be used on the dev set which loads the gold answer
+  - `gold` mode can only be used on the validation set, which loads the gold answer
   - `dummy` mode does not call any real LLMs and generates some dummy outputs
 
-For example, user can run five sames on the dev set with background as
+For example, the user can run five samples on the validation set with background as follows:
 
 ```bash
 inspect eval scicode.py \
     --model openai/gpt-4o \
     --temperature 0 \
     --limit 5 \
-    -T input_path=../data/problems_dev.jsonl \
-    -T output_dir=./tmp/dev \
+    -T split=validation \
+    -T output_dir=./tmp/val \
     -T with_background=True \
-    -T mode=gold
+    -T mode=normal
 ```
 
 User can run the evaluation on `Deepseek-v3` using together ai via the following command:
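The validation-split run shown in the diff above can also be driven from Python rather than the `inspect eval` CLI. The sketch below is an assumption about the `inspect_ai.eval` programmatic API (model, limit, and generation options as keyword arguments); it is not taken from this commit and should be checked against the `inspect_ai` documentation.

```python
# Hedged sketch: a programmatic equivalent of the CLI example above.
# Assumes it is run from eval/inspect_ai/ so that scicode.py is importable,
# and that inspect_ai.eval accepts limit and temperature keyword arguments.
from inspect_ai import eval
from scicode import scicode  # the @task defined in eval/inspect_ai/scicode.py

eval(
    scicode(split="validation", output_dir="./tmp/val", with_background=True),
    model="openai/gpt-4o",
    temperature=0,
    limit=5,
)
```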

eval/inspect_ai/scicode.py

+7 −7

@@ -5,7 +5,7 @@
 from typing import Any
 from pathlib import Path
 from inspect_ai import Task, task
-from inspect_ai.dataset import json_dataset, Sample
+from inspect_ai.dataset import Sample, hf_dataset
 from inspect_ai.solver import solver, TaskState, Generate
 from inspect_ai.scorer import scorer, mean, metric, Metric, Score, Target
 from scicode.parse.parse import extract_function_name, get_function_from_code
@@ -392,26 +392,26 @@ async def score(state: TaskState, target: Target):
 
 @task
 def scicode(
-    input_path: str = '../data/problems_all.jsonl',
+    split: str = 'test',
     output_dir: str = './tmp',
     with_background: bool = False,
     h5py_file: str = '../data/test_data.h5',
     mode: str = 'normal',
 ):
-    dataset = json_dataset(
-        input_path,
-        record_to_sample
+
+    dataset = hf_dataset(
+        'SciCode1/SciCode',
+        split=split,
+        sample_fields=record_to_sample,
     )
     return Task(
         dataset=dataset,
         solver=scicode_solver(
-            input_path=input_path,
             output_dir=output_dir,
             with_background=with_background,
             mode=mode,
         ),
         scorer=scicode_scorer(
-            input_path=input_path,
             output_dir=output_dir,
             with_background=with_background,
             h5py_file=h5py_file,

eval/scripts/README.md

+6 −6

@@ -23,19 +23,19 @@ TOGETHERAI_API_KEY = 'your_api_key'
 To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:
 
 ```bash
-python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
+python eval/scripts/gencode.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
 ```
 
 To generate code using **GPT-4o** (with default settings), go to the root of this repo and run:
 
 ```bash
-python eval/scripts/gencode_json.py --model gpt-4o
+python eval/scripts/gencode.py --model gpt-4o
 ```
 
 If you want to include **scientist-annotated background** in the prompts, use the `--with-background` flag:
 
 ```bash
-python eval/scripts/gencode_json.py --model gpt-4o --with-background
+python eval/scripts/gencode.py --model gpt-4o --with-background
 ```
 
 Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
@@ -44,11 +44,11 @@ In this repository, **we only support evaluating with previously generated code
 
 ### Command-Line Arguments
 
-When running the `gencode_json.py` script, you can use the following options:
+When running the `gencode.py` script, you can use the following options:
 
 - `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
+- `--split`: Specifies which problem split (either `validation` or `test`) to run on.
 - `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
-- `--input-path`: Directory containing the JSON files describing the problems. Default is `eval/data/problems_all.jsonl`.
 - `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
 - `--with-background`: If enabled, includes the problem background in the generated code.
 - `--temperature`: Controls the randomness of the output. Default is 0.
@@ -66,7 +66,7 @@ Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZ
 To evaluate the generated code using a specific model, go to the root of this repo and use the following command:
 
 ```bash
-python eval/scripts/test_generated_code.py --model "model_name"
+python eval/scripts/test_generated_code.py --model "model_name"
 ```
 
 Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code is generated with **scientist-annotated background**.

eval/scripts/gencode_json.py → eval/scripts/gencode.py

+10 −9

@@ -4,7 +4,7 @@
 from scicode.parse.parse import (
     extract_function_name,
     get_function_from_code,
-    read_from_jsonl
+    read_from_hf_dataset,
 )
 from scicode.gen.models import extract_python_script, get_model_function
 
@@ -150,18 +150,19 @@ def get_cli() -> argparse.ArgumentParser:
     parser.add_argument(
         "--model", type=str, default="gpt-4o", help="Model name"
     )
+    parser.add_argument(
+        "--split",
+        type=str,
+        default="test",
+        choices=["validation", "test"],
+        help="Dataset split manner",
+    )
     parser.add_argument(
         "--output-dir",
         type=Path,
         default=Path("eval_results", "generated_code"),
         help="Output directory",
     )
-    parser.add_argument(
-        "--input-path",
-        type=Path,
-        default=Path("eval", "data", "problems_all.jsonl"),
-        help="Input directory",
-    )
     parser.add_argument(
         "--prompt-dir",
         type=Path,
@@ -183,8 +184,8 @@ def get_cli() -> argparse.ArgumentParser:
 
 
 def main(model: str,
+         split: str,
         output_dir: Path,
-         input_path: Path,
         prompt_dir: Path,
         with_background: bool,
         temperature: float
@@ -194,7 +195,7 @@ def main(model: str,
         prompt_dir=prompt_dir, with_background=with_background, temperature=temperature
     )
    prompt_template = BACKGOUND_PROMPT_TEMPLATE if with_background else DEFAULT_PROMPT_TEMPLATE
-    data = read_from_jsonl(input_path)
+    data = read_from_hf_dataset(split)
     for problem in data:
         prob_id = problem['problem_id']
         steps = len(problem['sub_steps'])
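The net effect of the `gencode.py` changes is that generation now iterates a HuggingFace split instead of a local jsonl file. A minimal sketch of that data path (the prompt templating and model calls in the script are elided here):

```python
# Sketch of the new data path in eval/scripts/gencode.py: problems come from
# the HuggingFace split selected by --split instead of --input-path.
from scicode.parse.parse import read_from_hf_dataset

data = read_from_hf_dataset(split="test")   # or "validation"
for problem in data:
    prob_id = problem["problem_id"]
    n_steps = len(problem["sub_steps"])
    # gencode.py generates one prompt/completion per sub-step of each problem.
    print(f"{prob_id}: {n_steps} sub-steps")
```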

eval/scripts/test_generated_code.py

+27 −25

@@ -7,7 +7,7 @@
 import argparse
 
 from scicode.parse.parse import H5PY_FILE
-from scicode.parse.parse import read_from_jsonl
+from scicode.parse.parse import read_from_hf_dataset
 
 
 PROB_NUM = 80
@@ -20,16 +20,23 @@ def _get_background_dir(with_background):
     return "with_background" if with_background else "without_background"
 
 
-def test_code(model_name, code_dir, log_dir, output_dir,
-              jsonl_path, dev_set=False, with_background=False):
+def test_code(
+    model_name,
+    split,
+    code_dir,
+    log_dir,
+    output_dir,
+    with_background=False
+):
 
-    jsonl_data = read_from_jsonl(jsonl_path)
+    scicode_data = read_from_hf_dataset(split)
+    scicode_data = [data for data in scicode_data]
     json_dct = {}
     json_idx = {}
 
-    for prob_data in jsonl_data:
+    for prob_data in scicode_data:
         json_dct[prob_data['problem_id']] = len(prob_data['sub_steps'])
-        json_idx[prob_data['problem_id']] = jsonl_data.index(prob_data)
+        json_idx[prob_data['problem_id']] = scicode_data.index(prob_data)
     start_time = time.time()
 
     code_dir_ = Path(code_dir, model_name, _get_background_dir(with_background))
@@ -44,7 +51,7 @@ def test_code(
         file_step = file_name.split(".")[1]
 
         code_content = file_path.read_text(encoding='utf-8')
-        json_content = jsonl_data[json_idx[file_id]]
+        json_content = scicode_data[json_idx[file_id]]
         step_id = json_content["sub_steps"][int(file_step) - 1]["step_number"]
         test_lst = json_content["sub_steps"][int(file_step) - 1]["test_cases"]
         assert_file = Path(tmp_dir, f'{step_id}.py')
@@ -119,14 +126,14 @@ def run_script(script_path):
         correct_prob[i] == tot_prob[i]
         and tot_prob[i] != 0)
 
-    print(f'correct problems: {correct_prob_num}/{DEV_PROB_NUM if dev_set else PROB_NUM - DEV_PROB_NUM}')
-    print(f'correct steps: {len(correct_step)}/{DEV_STEP_NUM if dev_set else STEP_NUM}')
+    print(f'correct problems: {correct_prob_num}/{DEV_PROB_NUM if (split == "validation") else PROB_NUM - DEV_PROB_NUM}')
+    print(f'correct steps: {len(correct_step)}/{DEV_STEP_NUM if (split == "validation") else STEP_NUM}')
 
     Path(output_dir).mkdir(parents=True, exist_ok=True)
 
     with open(f'{output_dir}/{model_name}_{_get_background_dir(with_background)}.txt', 'w') as f:
-        f.write(f'correct problems: {correct_prob_num}/{DEV_PROB_NUM if dev_set else PROB_NUM - DEV_PROB_NUM}\n')
-        f.write(f'correct steps: {len(correct_step)}/{DEV_STEP_NUM if dev_set else STEP_NUM}\n\n')
+        f.write(f'correct problems: {correct_prob_num}/{DEV_PROB_NUM if (split == "validation") else PROB_NUM - DEV_PROB_NUM}\n')
+        f.write(f'correct steps: {len(correct_step)}/{DEV_STEP_NUM if (split == "validation") else STEP_NUM}\n\n')
         f.write(f'duration: {test_time} seconds\n')
         f.write('\ncorrect problems: ')
         f.write(f'\n\n{[i + 1 for i in range(PROB_NUM) if correct_prob[i] == tot_prob[i] and tot_prob[i] != 0]}\n')
@@ -144,6 +151,13 @@ def get_cli() -> argparse.ArgumentParser:
     parser.add_argument(
         "--model", type=str, default="gpt-4o", help="Model name"
     )
+    parser.add_argument(
+        "--split",
+        type=str,
+        default="test",
+        choices=["validation", "test"],
+        help="Data split"
+    )
     parser.add_argument(
         "--code-dir",
         type=Path,
@@ -162,17 +176,6 @@ def get_cli() -> argparse.ArgumentParser:
         default=Path("eval_results"),
         help="Eval results directory",
     )
-    parser.add_argument(
-        "--jsonl-path",
-        type=Path,
-        default=Path("eval", "data", "problems_all.jsonl"),
-        help="Path to jsonl file",
-    )
-    parser.add_argument(
-        "--dev-set",
-        action='store_true',
-        help="Test dev set if enabled",
-    ),
     parser.add_argument(
         "--with-background",
         action="store_true",
@@ -182,17 +185,16 @@ def get_cli() -> argparse.ArgumentParser:
 
 
 def main(model: str,
+         split: str,
         code_dir: Path,
         log_dir: Path,
         output_dir: Path,
-         jsonl_path: Path,
-         dev_set: bool,
         with_background: bool
 ) -> None:
     if not Path(H5PY_FILE).exists():
         raise FileNotFoundError("Please download the numeric test results before testing generated code.")
     model = Path(model).parts[-1]
-    test_code(model, code_dir, log_dir, output_dir, jsonl_path, dev_set, with_background)
+    test_code(model, split, code_dir, log_dir, output_dir, with_background)
 
 
 if __name__ == "__main__":
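For the tester, the `--dev-set`/`--jsonl-path` pair is replaced by a single `--split` flag, and the problem index is built from the HuggingFace split. A small sketch of that indexing step, mirroring the lines added above (the `enumerate` form is an equivalent of the `list.index` call in the diff):

```python
# Sketch: how test_generated_code.py now indexes problems from the HF split.
from scicode.parse.parse import read_from_hf_dataset

split = "validation"                      # or "test", as with --split
scicode_data = list(read_from_hf_dataset(split))
json_dct = {p["problem_id"]: len(p["sub_steps"]) for p in scicode_data}
json_idx = {p["problem_id"]: i for i, p in enumerate(scicode_data)}
print(f"{len(scicode_data)} problems, {sum(json_dct.values())} sub-steps in '{split}'")
```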

pyproject.toml

+2 −1

@@ -32,12 +32,13 @@ dependencies = [
     "pytest",
     "pytest-cov",
     "litellm",
+    "inspect-ai",
+    "datasets",
     # requirements for execution
     "numpy",
     "scipy",
     "matplotlib",
     "sympy",
-    "inspect-ai",
 ]
 
 # Classifiers help users find your project by categorizing it.

src/scicode/parse/parse.py

+5 −0

@@ -8,6 +8,7 @@
 import scipy
 import numpy as np
 from sympy import Symbol
+from datasets import load_dataset
 
 OrderedContent = list[tuple[str, str]]
 
@@ -56,6 +57,10 @@ def read_from_jsonl(file_path):
             data.append(json.loads(line.strip()))
     return data
 
+def read_from_hf_dataset(split='validation'):
+    dataset = load_dataset('SciCode1/SciCode', split=split)
+    return dataset
+
 def rm_comments(string: str) -> str:
     ret_lines = []
     lines = string.split('\n')
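The new helper is a thin wrapper around `datasets.load_dataset`, so it returns a `datasets.Dataset` (cached locally by the `datasets` library after the first download) rather than a plain list. A hedged usage sketch; the 15/65 split sizes come from the `eval/inspect_ai` README changed in this commit:

```python
# Usage sketch for the new helper in src/scicode/parse/parse.py.
from scicode.parse.parse import read_from_hf_dataset

val = read_from_hf_dataset()              # defaults to the 'validation' split
test = read_from_hf_dataset(split="test")
print(len(val), len(test))                # expected 15 and 65 per the README
```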
