Commit 332acbd

Merge pull request #24 from scicode-bench/zilinghan/doc-update
update readme
2 parents 69a8cfc + d340aee commit 332acbd

1 file changed: +15 -1 lines changed

eval/inspect_ai/README.md

+15-1
@@ -37,4 +37,18 @@ inspect eval scicode.py \
3737
-T mode=gold
3838
```
3939

40-
For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).
40+
For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).
41+
42+
### Extra: How Is SciCode Evaluated Under the Hood?
43+
44+
During evaluation, the sub-steps of each SciCode main problem are passed in order to the evaluated LLM, along with the necessary prompts and the LLM's responses to previous sub-steps. The Python code generated by the LLM is parsed and saved to disk, then run against test cases to determine whether each sub-step passes or fails. A main problem is considered solved only if the LLM passes all of its sub-steps.
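
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual SciCode or `inspect_ai` implementation: `evaluate_main_problem`, `generate`, and `run_tests` are hypothetical names, the prompt format is an assumption, and the parsing and save-to-disk steps are omitted for brevity.

```python
from typing import Callable, Dict, List


def evaluate_main_problem(
    sub_steps: List[str],
    generate: Callable[[str], str],
    run_tests: Callable[[int, str], bool],
) -> Dict[str, object]:
    """Hypothetical sketch: feed sub-steps in order, carrying earlier responses forward."""
    previous_code: List[str] = []
    passed: List[bool] = []
    for index, description in enumerate(sub_steps):
        # Assumed prompt format: earlier responses followed by the current sub-step.
        prompt = "\n\n".join(previous_code + [description])
        code = generate(prompt)  # stand-in for the LLM call
        previous_code.append(code)
        # Stand-in for running the saved code against the sub-step's test cases.
        passed.append(run_tests(index, code))
    # The main problem counts as solved only if every sub-step passed.
    return {"sub_step_passed": passed, "main_solved": bool(passed) and all(passed)}


if __name__ == "__main__":
    steps = ["Write add(a, b).", "Write double_sum(a, b) using add."]
    result = evaluate_main_problem(
        steps,
        generate=lambda prompt: "# generated code placeholder",
        run_tests=lambda index, code: index == 0,  # pretend only the first step passes
    )
    print(result)
```

In this sketch a single failing sub-step is enough to mark the main problem as unsolved, which is why the main problem resolve rate is much lower than the subproblem rate.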
45+
46+
### Extra: Reproducibility of `inspect_ai` Integration
47+
We used the SciCode `inspect_ai` integration to evaluate OpenAI's GPT-4o and compared the results with those from the original evaluation method. The table below shows the comparison.
48+
49+
*[💡 Slightly different results between runs are common due to the randomness of LLM generations.]*
50+
51+
| Method | Main Problem Resolve Rate (%) | <span style="color:grey">Subproblem (%)</span> |
|---------------------------|-------------------------------------|-------------------------------------------------------|
| `inspect_ai` Evaluation | <div align="center">**3.1 (2/65)**</div> | <div align="center" style="color:grey">25.1</div> |
| Original Evaluation | <div align="center">**1.5 (1/65)**</div> | <div align="center" style="color:grey">25.0</div> |
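
The main problem percentages in the table follow directly from the solved counts shown in parentheses. As a small sanity check, assuming the rate is the solved fraction expressed as a percentage rounded to one decimal place (`resolve_rate` is a hypothetical helper, not part of SciCode):

```python
def resolve_rate(solved: int, total: int) -> float:
    """Return the share of solved problems as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * solved / total, 1)


print(resolve_rate(2, 65))  # 3.1
print(resolve_rate(1, 65))  # 1.5
```

With only 65 main problems, one extra solved problem shifts the rate by about 1.5 points, so the gap between the two rows corresponds to a single problem.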
