Update docs for bash, ctf
john-b-yang committed Oct 29, 2023
1 parent a6fe7c7 commit c2f8b82

Showing 14 changed files with 47 additions and 21 deletions.
8 changes: 3 additions & 5 deletions data/README.md
@@ -1,12 +1,10 @@
# InterCode Data
This folder contains 1. the data for all datasets that have been converted to be InterCode-compatible and operable with one of the `bash`, `python`, or `sql` InterCode environments, and 2. the evaluation results from experiments presented in the main paper.
This folder contains 1. the data for all datasets that have been converted to be InterCode-compatible and operable with one of the InterCode environments, and 2. the evaluation results from experiments presented in the main paper.

The directory is laid out as follows:
```
├── bash/
│ ├── nl2bash/
│ └── test_bash_queries.json
├── ctf
├── nl2bash/
├── python
│ ├── apps/
│ └── mbpp/
@@ -18,7 +16,7 @@ The directory is laid out as follows:
├── bird/
└── spider/
```
The `bash`, `python`, and `sql` folders contain corresponding datasets that have been migrated to be compatible with the named InterCode environment. Each subfolder usually contains a `README.md` describing how the conversion was performed, a `transform.py` that performs the described conversion, and data files that are the task instances and any additional setup required to run the dataset on InterCode.
Each folder contains the corresponding datasets that have been migrated to be compatible with one of the five InterCode environments. Each subfolder usually contains a `README.md` describing how the conversion was performed, a `transform.py` that performs the described conversion, and data files that are the task instances and any additional setup required to run the dataset on InterCode.

The `results` directory has model-specific folders, each of which contains experiment results for the model run with different strategies (i.e. Try Again, ReAct, Plan & Solve). Each json file contains a list of trajectories, where each trajectory corresponds to information from a single task episode.
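For illustration, a minimal sketch of loading one trajectory file (the `data/results` path and the per-file layout of a JSON list of per-episode trajectories are assumptions based on the description above; exact file names vary by model and strategy):
```python
import json
from pathlib import Path

# Walk the model-specific results folders and load one trajectory file.
# Path and layout are assumptions based on the description above.
for traj_file in sorted(Path("data/results").glob("*/*.json")):
    with open(traj_file) as f:
        trajectories = json.load(f)
    print(traj_file, len(trajectories))  # file path and number of task episodes
    break
```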

10 changes: 0 additions & 10 deletions data/bash/nl2bash/README.md

This file was deleted.

26 changes: 26 additions & 0 deletions data/ctf/README.md
@@ -0,0 +1,26 @@
# InterCode-CTF Dataset
For a comprehensive description of InterCode-CTF, please refer to our workshop paper, which discusses the motivation behind this environment, the formulation of Capture the Flag as an interactive coding task, the collection procedure, and some initial experiments evaluating existing SOTA LMs on this dataset:

**[Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag](https://john-b-yang.github.io/static/misc/preprint_InterCode_CTF.pdf)**
[John Yang](https://john-b-yang.github.io/), [Akshara Prabhakar](https://aksh555.github.io/), [Shunyu Yao](https://ysymyth.github.io/), [Kexin Pei](https://sites.google.com/site/kexinpeisite/), [Karthik Narasimhan](https://www.cs.princeton.edu/~karthikn/)

## Task Description
Task instances were manually collected from the [picoCTF](https://play.picoctf.org/practice) platform.
Each task instance has three components:
* `query`: Natural language instruction
* `gold`: The correct flag
* `task_assets/{task_id}`: Digital assets (e.g., code, images, executables) necessary for task completion

**Initialization.** At the beginning of each task episode, a task worker is given the `query` as the first observation.
The task worker must then interact with a Bash shell within an Ubuntu OS to solve the task.
Each task episode also places the task worker within a task-instance-specific folder (`task_assets/{task_id}`) of the shell.

**Task Completion.** The task worker can then write Python and Bash commands to navigate the shell, investigate the given digital assets, and attempt to find the flag.
If it finds a flag, formatted as `picoCTF{...}`, it must then run `submit picoCTF{...}` as an action. If the submitted value matches the `gold` flag, task completion is successful.
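As a sketch, the completion check this implies looks roughly like the following (names here are hypothetical; the actual InterCode-CTF environment code may differ):
```python
# Illustrative sketch of the completion check described above; names are
# hypothetical and the actual environment logic may differ.
def is_successful(submitted_flag: str, gold_flag: str) -> bool:
    # Success requires an exact match between the value passed to
    # `submit` and the task instance's `gold` flag.
    return submitted_flag.strip() == gold_flag.strip()
```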

## Dataset
The `ic_ctf.json` file contains the following fields: `{task_id, query, gold, source, tags}`.
* `query` and `gold` contain the aforementioned information
* `task_id` is a unique identifier connecting the instance with the associated assets in `task_assets/`
* `source` is a URL to the original problem found on `picoCTF`
* `tags` is a list of the categories associated with the task instance
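
As a quick illustration, here is one way to load and inspect the dataset (assuming `ic_ctf.json` is a JSON array of such records and the path is relative to the repository root):
```python
import json

# Load the CTF task instances and inspect one record. Field names follow
# the schema above; the JSON-array layout and path are assumptions.
with open("data/ctf/ic_ctf.json") as f:
    tasks = json.load(f)

task = tasks[0]
print(task["task_id"])               # links the instance to task_assets/
print(task["query"])                 # natural language instruction
print(task["gold"])                  # the correct flag, e.g. picoCTF{...}
print(task["source"], task["tags"])  # picoCTF URL and category tags
```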
12 changes: 12 additions & 0 deletions data/nl2bash/README.md
@@ -0,0 +1,12 @@
# NL2Bash Dataset Transformation
* [Link](https://github.com/TellinaTool/nl2bash) to NL2Bash.

## Dataset Description
This dataset consists of 200 natural-language-to-Bash `[query, gold]` pairs acquired from the NL2Bash dataset.
To ensure that the `gold` solution to each query is non-trivial, entities such as folders and files have been renamed in both `query` and `gold` in order to ground the solution in a file system.

To this end, we also create four different file systems for a more diverse evaluation setting:
* [fs_1](nl2bash_fs_1.json) has 60 queries based on the file system defined in [setup_fs_1](../../docker/bash_scripts/setup_nl2b_fs_1.sh)
* [fs_2](nl2bash_fs_2.json) has 53 queries based on the file system defined in [setup_fs_2](../../docker/bash_scripts/setup_nl2b_fs_2.sh)
* [fs_3](nl2bash_fs_3.json) has 60 queries based on the file system defined in [setup_fs_3](../../docker/bash_scripts/setup_nl2b_fs_3.sh)
* [fs_4](nl2bash_fs_4.json) has 27 general bash queries that are not grounded to any specific file system
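
A minimal sketch of reading one of these splits (assuming each file is a JSON array of `{query, gold}` records and the path is relative to the repository root):
```python
import json

# Read the fs_1 split and print one [query, gold] pair. The record layout
# is an assumption based on the description above.
with open("data/nl2bash/nl2bash_fs_1.json") as f:
    pairs = json.load(f)

print(len(pairs))         # expected: 60 queries for fs_1
print(pairs[0]["query"])  # natural language instruction
print(pairs[0]["gold"])   # reference bash command
```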
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion data/sql/bird/README.md
@@ -11,7 +11,7 @@ python transform.py <Path to BIRD dataset>
* You can also run this script on the

## Dataset Description
The script performs very simple adjustments to the BIRD-SQQL dataset to create an InterCode compatible version.
The script performs very simple adjustments to the BIRD-SQL dataset to create an InterCode compatible version.

To create the `ic_bird.json` task instances, the `transform.py` script iterates through the `dev.json` file and performs the following steps:
1. Changes the names of the following fields:
2 changes: 1 addition & 1 deletion run_demo.py
@@ -19,7 +19,7 @@ def preprocess_sql(record: Dict) -> List:
return [f"use {db}"]

DEMO_MAP = {
"bash": {"env": BashEnv, "image_name": "intercode-nl2bash", "data_path": "./data/bash/nl2bash/nl2bash_fs_1.json"},
"bash": {"env": BashEnv, "image_name": "intercode-nl2bash", "data_path": "./data/nl2bash/nl2bash_fs_1.json"},
"python": {"env": PythonEnv, "image_name": "intercode-python", "data_path": "./data/python/mbpp/ic_mbpp.json"},
"sql": {"env": SqlEnv, "image_name": "docker-env-sql", "data_path": "./data/sql/bird/ic_bird.json", "preprocess": preprocess_sql},
"ctf": {"env": CTFEnv, "image_name": "intercode-ctf", "data_path": "./data/ctf/ic_ctf.json", "preprocess": preprocess_ctf},
4 changes: 2 additions & 2 deletions scripts/expr_multi_turn.sh
@@ -3,7 +3,7 @@
# Data Paths
# - (SQL) ./data/sql/spider/ic_spider_dev.json
# - (SQL) ./data/test/sql_queries.csv
# - (Bash) ./data/bash/nl2bash/nl2bash.json
# - (Bash) ./data/nl2bash/nl2bash.json
# - (Bash) ./data/test/bash_queries.json

# Environments
@@ -20,7 +20,7 @@

# Bash Call
# python -m experiments.eval_n_turn \
# --data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
# --data_path ./data/nl2bash/nl2bash_fs_1.json \
# --dialogue_limit 7 \
# --env bash \
# --image_name intercode-nl2bash \
2 changes: 1 addition & 1 deletion scripts/expr_plan_solve.sh
@@ -1,6 +1,6 @@
# Bash Call
python -m experiments.eval_plan_solve \
--data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
--data_path ./data/nl2bash/nl2bash_fs_1.json \
--env bash \
--image_name intercode-nl2bash \
--log_dir logs/experiments \
2 changes: 1 addition & 1 deletion scripts/expr_react.sh
@@ -1,6 +1,6 @@
# Bash Call
python -m experiments.eval_react \
--data_path ./data/bash/nl2bash/nl2bash_fs_1.json \
--data_path ./data/nl2bash/nl2bash_fs_1.json \
--env bash \
--image_name intercode-nl2bash \
--log_dir logs/experiments \
