error handling to API calls; and deepseek-v3 results #26

Merged 1 commit on Jan 31, 2025
Changes from all commits
1 change: 1 addition & 0 deletions README.md
@@ -34,6 +34,7 @@ SciCode sources challenging and realistic research-level coding problems across
| 🥇 OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
| 🥈 Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
| 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
| Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
| GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
| GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
16 changes: 15 additions & 1 deletion eval/inspect_ai/README.md
@@ -14,10 +14,11 @@ inspect eval scicode.py --model <your_model> --temperature 0

However, several additional command-line arguments may also be useful; a combined example is shown after the `gold`-mode command below.

- `--max-connections`: Maximum number of concurrent API connections to the evaluated model (renamed from `--max_connections`).
- `--limit`: Limits the number of SciCode samples to evaluate.
- `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to a different JSON dataset (e.g., the dev set).
- `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
- `-T h5py_file=<your_h5py_file>`: Use this if your HDF5 file is not downloaded to the recommended directory.
- `-T with_background=True/False`: Whether to include problem background.
- `-T mode=normal/gold/dummy`: Provides two additional modes (`gold` and `dummy`) for sanity checks.
  - `normal` mode is the standard mode to evaluate a model
@@ -37,6 +38,19 @@ inspect eval scicode.py \
-T mode=gold
```
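For instance, a quick run over ten samples of another dataset might combine several of these arguments. This is a sketch; `<your_model>` and `<dev_set_json_file>` are placeholders, not files shipped with this repo:

```bash
inspect eval scicode.py \
    --model <your_model> \
    --temperature 0 \
    --max-connections 4 \
    --limit 10 \
    -T input_path=<dev_set_json_file> \
    -T output_dir=./tmp/dev \
    -T with_background=False
```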

Users can run the evaluation on `Deepseek-v3` through Together AI with the following command:

```bash
export TOGETHER_API_KEY=<YOUR_API_KEY>
inspect eval scicode.py \
--model together/deepseek-ai/DeepSeek-V3 \
--temperature 0 \
--max-connections 2 \
--max-tokens 32784 \
-T output_dir=./tmp/deepseek-v3 \
-T with_background=False
```

For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).

### Extra: How Is SciCode Evaluated Under the Hood?
16 changes: 10 additions & 6 deletions eval/inspect_ai/scicode.py
@@ -336,12 +336,16 @@ async def solve(state: TaskState, generate: Generate) -> TaskState:
elif params["mode"] == "gold":
response_from_llm = generate_gold_response(state.metadata, idx+1)
else:
# ===Model Generation===
state.user_prompt.text = prompt
state_copy = copy.deepcopy(state)
result = await generate(state=state_copy)
response_from_llm = result.output.completion
# ===Model Generation===
try:
# ===Model Generation===
state.user_prompt.text = prompt
state_copy = copy.deepcopy(state)
result = await generate(state=state_copy)
response_from_llm = result.output.completion
# ===Model Generation===
except:
print(f"Failed to generate response for problem {prob_id} step {idx+1}.")
response_from_llm = generate_dummy_response(prompt)
prompt_assistant.register_previous_response(
prob_data=state.metadata,
response=response_from_llm,
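The new `try/except` means a failed API call for one step no longer aborts the entire run: the failure is logged and a dummy completion is substituted so later steps can still be evaluated. A minimal, self-contained sketch of the same fallback pattern, with `call_model` and `dummy_response` as hypothetical stand-ins for the repo's actual helpers:

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Hypothetical stand-in for the real API call; may raise on network/API errors.
    raise TimeoutError("simulated API failure")

def dummy_response(prompt: str) -> str:
    # Hypothetical placeholder so downstream steps still receive an input.
    return "# generation failed; placeholder response"

async def generate_with_fallback(prompt: str, prob_id: str, step: int) -> str:
    try:
        # Attempt the real model generation.
        return await call_model(prompt)
    except Exception:
        # On failure, log the problem/step and fall back to a dummy response
        # so one bad API call does not abort the whole evaluation run.
        print(f"Failed to generate response for problem {prob_id} step {step}.")
        return dummy_response(prompt)

if __name__ == "__main__":
    print(asyncio.run(generate_with_fallback("prompt text", "13", 1)))
```

Note that the merged code uses a bare `except:`, which also catches `KeyboardInterrupt` and `SystemExit`; the sketch above uses the narrower, more conventional `except Exception`.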