3 files changed: +26 −7 lines changed

@@ -34,6 +34,7 @@ SciCode sources challenging and realistic research-level coding problems across
 | 🥇 OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
 | 🥈 Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
 | 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
+| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
 | Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
 | GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
 | GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
@@ -14,10 +14,11 @@ inspect eval scicode.py --model <your_model> --temperature 0
 However, there are some additional command-line arguments that can be useful as well.

-- `--max_connections`: Maximum number of API connections to the evaluated model.
+- `--max-connections`: Maximum number of API connections to the evaluated model.
 - `--limit`: Limits the number of samples to evaluate from the SciCode dataset.
 - `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to another JSON dataset (e.g., the dev set).
 - `-T output_dir=<your_output_dir>`: Changes the default output directory (`./tmp`).
+- `-T h5py_file=<your_h5py_file>`: Used if your h5py file is not downloaded to the recommended directory.
 - `-T with_background=True/False`: Whether to include the problem background.
 - `-T mode=normal/gold/dummy`: Provides two additional modes for sanity checks.
   - `normal` mode is the standard mode for evaluating a model
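As a side note, the `-T key=value` flags forward task-specific arguments to the task definition. The toy sketch below (not `inspect_ai`'s actual implementation — the real CLI does its own parsing and type coercion) illustrates how such pairs can be collected into a task-argument dict:

```python
def parse_task_args(argv):
    """Collect -T key=value pairs into a dict of task arguments.

    Toy illustration only; inspect_ai performs its own parsing
    and richer type coercion.
    """
    task_args = {}
    it = iter(argv)
    for token in it:
        if token == "-T":
            key, _, value = next(it).partition("=")
            # Coerce the booleans used by flags such as with_background.
            if value in ("True", "False"):
                value = value == "True"
            task_args[key] = value
    return task_args

args = parse_task_args(
    ["-T", "output_dir=./tmp", "-T", "with_background=False", "-T", "mode=gold"]
)
print(args)  # {'output_dir': './tmp', 'with_background': False, 'mode': 'gold'}
```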
@@ -37,6 +38,19 @@ inspect eval scicode.py \
   -T mode=gold
 ```
+Users can run the evaluation on `Deepseek-v3` using Together AI via the following command:
+
+```bash
+export TOGETHER_API_KEY=<YOUR_API_KEY>
+inspect eval scicode.py \
+  --model together/deepseek-ai/DeepSeek-V3 \
+  --temperature 0 \
+  --max-connections 2 \
+  --max-tokens 32784 \
+  -T output_dir=./tmp/deepseek-v3 \
+  -T with_background=False
+```
+
 For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).

 ### Extra: How SciCode Is Evaluated Under the Hood?

@@ -336,12 +336,16 @@ async def solve(state: TaskState, generate: Generate) -> TaskState:
     elif params["mode"] == "gold":
         response_from_llm = generate_gold_response(state.metadata, idx + 1)
     else:
-        # ===Model Generation===
-        state.user_prompt.text = prompt
-        state_copy = copy.deepcopy(state)
-        result = await generate(state=state_copy)
-        response_from_llm = result.output.completion
-        # ===Model Generation===
+        try:
+            # ===Model Generation===
+            state.user_prompt.text = prompt
+            state_copy = copy.deepcopy(state)
+            result = await generate(state=state_copy)
+            response_from_llm = result.output.completion
+            # ===Model Generation===
+        except Exception:
+            print(f"Failed to generate response for problem {prob_id} step {idx + 1}.")
+            response_from_llm = generate_dummy_response(prompt)
     prompt_assistant.register_previous_response(
         prob_data=state.metadata,
         response=response_from_llm,
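The net effect of this change is a fallback pattern: if model generation raises, the step is filled with a dummy response so the remaining steps of the problem can still run. A minimal self-contained sketch of the same pattern, with a stand-in `generate` and a hypothetical `generate_dummy_response` (both assumptions, not the repo's real implementations):

```python
import asyncio


def generate_dummy_response(prompt: str) -> str:
    # Stand-in for the repo's helper; returns a placeholder completion.
    return "# dummy response\npass"


async def generate(prompt: str) -> str:
    # Stand-in for the real model call; raise to simulate an API failure.
    raise RuntimeError("model API unavailable")


async def solve_step(prompt: str, prob_id: str, idx: int) -> str:
    try:
        # ===Model Generation===
        return await generate(prompt)
    except Exception:
        # Fall back so one failed step does not abort the whole problem.
        print(f"Failed to generate response for problem {prob_id} step {idx + 1}.")
        return generate_dummy_response(prompt)


response = asyncio.run(solve_step("step 1 prompt", "13", 0))
print(response.startswith("# dummy"))  # True
```

Using a dummy completion (rather than re-raising) trades accuracy on the failed step for coverage of the later steps, which matches the diff's intent of keeping multi-step evaluation alive through transient API errors.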