Update docs for bash, ctf
john-b-yang committed Oct 29, 2023
1 parent a6fe7c7 commit c2f8b82

Showing 14 changed files with 47 additions and 21 deletions.
8 changes: 3 additions & 5 deletions data/README.md
@@ -1,12 +1,10 @@
# InterCode Data
This folder contains 1. the data for all datasets that have been converted to be InterCode-compatible and operable with one of the `bash`, `python`, or `sql` InterCode environments, and 2. the evaluation results from experiments presented in the main paper.
This folder contains 1. the data for all datasets that have been converted to be InterCode-compatible and operable with one of the InterCode environments, and 2. the evaluation results from experiments presented in the main paper.

The directory is laid out as follows:
```
├── bash/
│ ├── nl2bash/
│ └── test_bash_queries.json
├── ctf
├── nl2bash/
├── python
│ ├── apps/
│ └── mbpp/
@@ -18,7 +16,7 @@ The directory is laid out as follows:
├── bird/
└── spider/
```
The `bash`, `python`, and `sql` folders contain corresponding datasets that have been migrated to be compatible with the named InterCode environment. Each subfolder usually contains a `README.md` describing how the conversion was performed, a `transform.py` that performs the described conversion, and data files that are the task instances and any additional setup required to run the dataset on InterCode.
Each folder contains the corresponding datasets that have been migrated to be compatible with one of the five InterCode environments. Each subfolder usually contains a `README.md` describing how the conversion was performed, a `transform.py` that performs the described conversion, and data files that are the task instances and any additional setup required to run the dataset on InterCode.

The `results` directory has model-specific folders, each of which contains experiment results for the model run with different strategies (i.e. Try Again, ReAct, Plan & Solve). Each json file contains a list of trajectories, where each trajectory corresponds to information from a single task episode.
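For illustration, a minimal sketch of loading one trajectory file (the `data/results` path and the per-file layout of a JSON list of per-episode trajectories are assumptions based on the description above; exact file names vary by model and strategy):
```python
import json
from pathlib import Path

# Walk the model-specific results folders and load one trajectory file.
# Path and layout are assumptions based on the description above.
for traj_file in sorted(Path("data/results").glob("*/*.json")):
    with open(traj_file) as f:
        trajectories = json.load(f)
    print(traj_file, len(trajectories))  # file path and number of task episodes
    break
```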

10 changes: 0 additions & 10 deletions data/bash/nl2bash/README.md

This file was deleted.

26 changes: 26 additions & 0 deletions data/ctf/README.md
@@ -0,0 +1,26 @@
# InterCode-CTF Dataset
For a comprehensive description of InterCode-CTF, please refer to our workshop paper, which discusses the motivation behind this environment, the formulation of Capture the Flag as an interactive coding task, the collection procedure, and some initial experiments evaluating existing SOTA LMs on this dataset:

**[Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag](https://john-b-yang.github.io/static/misc/preprint_InterCode_CTF.pdf)**
[John Yang](https://john-b-yang.github.io/), [Akshara Prabhakar](https://aksh555.github.io/), [Shunyu Yao](https://ysymyth.github.io/), [Kexin Pei](https://sites.google.com/site/kexinpeisite/), [Karthik Narasimhan](https://www.cs.princeton.edu/~karthikn/)

## Task Description
Task instances were manually collected from the [picoCTF](https://play.picoctf.org/practice) platform.
Each task instance has three components:
* `query`: Natural language instruction
* `gold`: The correct flag
* `task_assets/{task_id}`: Digital assets (e.g., code, images, executables) necessary for task completion

**Initialization.** At the beginning of each task episode, a task worker is given the `query` as the first observation.
The task worker must then interact with a Bash shell within an Ubuntu OS to solve the task.
Each task episode also places the task worker within a task-instance-specific folder (`task_assets/{task_id}`) of the shell.

**Task Completion.** The task worker can then write Python and Bash commands to navigate the shell, investigate the given digital assets, and attempt to find the flag.
If it finds a flag, formatted as `picoCTF{...}`, it must then run `submit picoCTF{...}` as an action. If the submitted value matches the `gold` flag, task completion is successful.
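As a sketch, the completion check this implies looks roughly like the following (names here are hypothetical; the actual InterCode-CTF environment code may differ):
```python
# Illustrative sketch of the completion check described above; names are
# hypothetical and the actual environment logic may differ.
def is_successful(submitted_flag: str, gold_flag: str) -> bool:
    # Success requires an exact match between the value passed to
    # `submit` and the task instance's `gold` flag.
    return submitted_flag.strip() == gold_flag.strip()
```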

## Dataset
The `ic_ctf.json` file contains the following fields: `{task_id, query, gold, source, tags}`.
* `query` and `gold` contain the aforementioned information
* `task_id` is a unique identifier connecting the instance with the associated assets in `task_assets/`
* `source` is a URL to the original problem found on `picoCTF`
* `tags` is a list of the categories associated with the task instance
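
As a quick illustration, here is one way to load and inspect the dataset (assuming `ic_ctf.json` is a JSON array of such records and the path is relative to the repository root):
```python
import json

# Load the CTF task instances and inspect one record. Field names follow
# the schema above; the JSON-array layout and path are assumptions.
with open("data/ctf/ic_ctf.json") as f:
    tasks = json.load(f)

task = tasks[0]
print(task["task_id"])               # links the instance to task_assets/
print(task["query"])                 # natural language instruction
print(task["gold"])                  # the correct flag, e.g. picoCTF{...}
print(task["source"], task["tags"])  # picoCTF URL and category tags
```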
12 changes: 12 additions & 0 deletions data/nl2bash/README.md
@@ -0,0 +1,12 @@
# NL2Bash Dataset Transformation
* [Link](https://github.com/TellinaTool/nl2bash) to NL2Bash.

## Dataset Description
This dataset consists of 200 natural-language-to-Bash `[query, gold]` pairs acquired from the NL2Bash dataset.
To ensure that the `gold` solution to each query is non-trivial, entities such as folders and files have been renamed in both `query` and `gold` in order to ground the solution in a file system.

To this end, we also create four different file systems for a more diverse evaluation setting:
* [fs_1](nl2bash_fs_1.json) has 60 queries based on the file system defined in [setup_fs_1](../../docker/bash_scripts/setup_nl2b_fs_1.sh)
* [fs_2](nl2bash_fs_2.json) has 53 queries based on the file system defined in [setup_fs_2](../../docker/bash_scripts/setup_nl2b_fs_2.sh)
* [fs_3](nl2bash_fs_3.json) has 60 queries based on the file system defined in [setup_fs_3](../../docker/bash_scripts/setup_nl2b_fs_3.sh)
* [fs_4](nl2bash_fs_4.json) has 27 general bash queries that are not grounded to any specific file system
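
A minimal sketch of reading one of these splits (assuming each file is a JSON array of `{query, gold}` records and the path is relative to the repository root):
```python
import json

# Read the fs_1 split and print one [query, gold] pair. The record layout
# is an assumption based on the description above.
with open("data/nl2bash/nl2bash_fs_1.json") as f:
    pairs = json.load(f)

print(len(pairs))         # expected: 60 queries for fs_1
print(pairs[0]["query"])  # natural language instruction
print(pairs[0]["gold"])   # reference bash command
```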
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion data/sql/bird/README.md
@@ -11,7 +11,7 @@ python transform.py <Path to BIRD dataset>
* You can also run this script on the

## Dataset Description
The script performs very simple adjustments to the BIRD-SQQL dataset to create an InterCode compatible version.
The script performs very simple adjustments to the BIRD-SQL dataset to create an InterCode compatible version.

To create the `ic_bird.json` task instances, the `transform.py` script iterates through the `dev.json` file and performs the following steps:
1. Changes the names of the following fields:
2 changes: 1 addition & 1 deletion run_demo.py
@@ -19,7 +19,7 @@ def preprocess_sql(record: Dict) -> List:
return [f"use {db}"]

DEMO_MAP = {
"bash": {"env": BashEnv, "image_name": "intercode-nl2bash", "data_path": "./data/bash/nl2bash/nl2bash_fs_1.json"},
"bash": {"env": BashEnv, "image_name": "intercode-nl2bash", "data_path": "./data/nl2bash/nl2bash_fs_1.json"},
"python": {"env": PythonEnv, "image_name": "intercode-python", "data_path": "./data/python/mbpp/ic_mbpp.json"},
"sql": {"env": SqlEnv, "image_name": "docker-env-sql", "data_path": "./data/sql/bird/ic_bird.json", "preprocess": preprocess_sql},
"ctf": {"env": CTFEnv, "image_name": "intercode-ctf", "data_path": "./data/ctf/ic_ctf.json", "preprocess": preprocess_ctf},
4 changes: 2 additions & 2 deletions scripts/expr_multi_turn.sh
@@ -3,7 +3,7 @@
# Data Paths
# - (SQL) ./data/sql/spider/ic_spider_dev.json
# - (SQL) ./data/test/sql_queries.csv
# - (Bash) ./data/bash/nl2bash/nl2bash.json
# - (Bash) ./data/nl2bash/nl2bash.json
# - (Bash) ./data/test/bash_queries.json

# Environments
@@ -20,7 +20,7 @@

# Bash Call
# python -m experiments.eval_n_turn \
# --data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
# --data_path ./data/nl2bash/nl2bash_fs_1.json \
# --dialogue_limit 7 \
# --env bash \
# --image_name intercode-nl2bash \
2 changes: 1 addition & 1 deletion scripts/expr_plan_solve.sh
@@ -1,6 +1,6 @@
# Bash Call
python -m experiments.eval_plan_solve \
--data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
--data_path ./data/nl2bash/nl2bash_fs_1.json \
--env bash \
--image_name intercode-nl2bash \
--log_dir logs/experiments \
2 changes: 1 addition & 1 deletion scripts/expr_react.sh
@@ -1,6 +1,6 @@
# Bash Call
python -m experiments.eval_react \
--data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
--data_path ./data/nl2bash/nl2bash_fs_1.json \
--env bash \
--image_name intercode-nl2bash \
--log_dir logs/experiments \
