
Commit 28dfd8c

error handling to API calls; and deepseek-v3 results

1 parent d340aee

3 files changed: +26 −7 lines

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -34,6 +34,7 @@ SciCode sources challenging and realistic research-level coding problems across
 | 🥇 OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
 | 🥈 Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
 | 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
+| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
 | Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
 | GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
 | GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
```

eval/inspect_ai/README.md

Lines changed: 15 additions & 1 deletion
````diff
@@ -14,10 +14,11 @@ inspect eval scicode.py --model <your_model> --temperature 0
 
 However, there are some additional command line arguments that could be useful as well.
 
-- `--max_connections`: Maximum amount of API connections to the evaluated model.
+- `--max-connections`: Maximum amount of API connections to the evaluated model.
 - `--limit`: Limit of the number of samples to evaluate in the SciCode dataset.
 - `-T input_path=<another_input_json_file>`: This is useful when user wants to change to another json dataset (e.g., the dev set).
 - `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
+- `-T h5py_file=<your_h5py_file>`: This is used if your h5py file is not downloaded in the recommended directory.
 - `-T with_background=True/False`: Whether to include problem background.
 - `-T mode=normal/gold/dummy`: This provides two additional modes for sanity checks.
   - `normal` mode is the standard mode to evaluate a model
@@ -37,6 +38,19 @@ inspect eval scicode.py \
     -T mode=gold
 ```
 
+User can run the evaluation on `Deepseek-v3` using together ai via the following command:
+
+```bash
+export TOGETHER_API_KEY=<YOUR_API_KEY>
+inspect eval scicode.py \
+    --model together/deepseek-ai/DeepSeek-V3 \
+    --temperature 0 \
+    --max-connections 2 \
+    --max-tokens 32784 \
+    -T output_dir=./tmp/deepseek-v3 \
+    -T with_background=False
+```
+
 For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).
 
 ### Extra: How SciCode are Evaluated Under the Hood?
````
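The `-T key=value` flags above are task arguments that override per-task defaults such as `output_dir=./tmp`. As a minimal illustration of that override pattern — this is not `inspect_ai`'s actual implementation, and the default values shown are only the ones the README mentions — the merge could be sketched as:

```python
# Defaults taken from the README; the parsing logic below is a hypothetical
# sketch, not inspect_ai's real -T handling.
DEFAULTS = {
    "output_dir": "./tmp",
    "with_background": False,
    "mode": "normal",
}

def apply_task_args(pairs):
    """Fold a list of 'key=value' strings over the default parameters."""
    params = dict(DEFAULTS)
    for pair in pairs:
        key, _, value = pair.partition("=")
        # Coerce the two boolean spellings the README uses (True/False).
        if value in ("True", "False"):
            value = (value == "True")
        params[key] = value
    return params

params = apply_task_args(["output_dir=./tmp/deepseek-v3", "with_background=False"])
print(params["output_dir"])   # ./tmp/deepseek-v3
```

Unrecognized keys are simply added to the dict here; a real CLI would likely reject them.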

eval/inspect_ai/scicode.py

Lines changed: 10 additions & 6 deletions
```diff
@@ -336,12 +336,16 @@ async def solve(state: TaskState, generate: Generate) -> TaskState:
     elif params["mode"] == "gold":
         response_from_llm = generate_gold_response(state.metadata, idx+1)
     else:
-        # ===Model Generation===
-        state.user_prompt.text = prompt
-        state_copy = copy.deepcopy(state)
-        result = await generate(state=state_copy)
-        response_from_llm = result.output.completion
-        # ===Model Generation===
+        try:
+            # ===Model Generation===
+            state.user_prompt.text = prompt
+            state_copy = copy.deepcopy(state)
+            result = await generate(state=state_copy)
+            response_from_llm = result.output.completion
+            # ===Model Generation===
+        except:
+            print(f"Failed to generate response for problem {prob_id} step {idx+1}.")
+            response_from_llm = generate_dummy_response(prompt)
     prompt_assistant.register_previous_response(
         prob_data=state.metadata,
         response=response_from_llm,
```
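The commit's fallback pattern can be sketched in isolation as follows. Here `flaky_generate` and `generate_dummy_response` are hypothetical stand-ins for `inspect_ai`'s `generate()` and SciCode's dummy-response helper; the sketch also catches `Exception` rather than using a bare `except`, so that `KeyboardInterrupt` still propagates.

```python
import asyncio

def generate_dummy_response(prompt: str) -> str:
    """Hypothetical stand-in for SciCode's fallback helper: a placeholder
    completion that lets later steps of the pipeline keep running."""
    return "# generation failed; placeholder completion\npass"

async def flaky_generate(prompt: str) -> str:
    """Simulated model call standing in for inspect_ai's generate();
    it always fails here, to exercise the fallback path."""
    raise ConnectionError("simulated API failure")

async def generate_with_fallback(prompt: str, prob_id: str, idx: int) -> str:
    try:
        # ===Model Generation===
        return await flaky_generate(prompt)
    except Exception:
        # Log and substitute a dummy response instead of aborting the run;
        # catching Exception (not a bare except) leaves KeyboardInterrupt alone.
        print(f"Failed to generate response for problem {prob_id} step {idx+1}.")
        return generate_dummy_response(prompt)

response = asyncio.run(generate_with_fallback("some prompt", "13", 0))
```

Because a failed step yields a dummy completion rather than an exception, one transient API error no longer discards the whole multi-step problem.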
