release: v2.1.0+0.1.8+8.19
Merge pull request #21 from JetBrains-Research/v2.1.0-dev
K-dizzled authored May 11, 2024
2 parents 17e4be3 + af0e80d commit ff7bfc9
Showing 28 changed files with 1,596 additions and 58 deletions.
8 changes: 6 additions & 2 deletions .gitignore
@@ -4,7 +4,7 @@ node_modules
.vscode-test/
*.vsix
logs
.DS_Store
**/**/.DS_Store

# Test resources
src/test/resources/coqProj/.vscode
@@ -15,4 +15,8 @@ src/test/resources/coqProj/.vscode
**/*.aux
src/test/resources/coqProj/.Makefile.coq.d
src/test/resources/coqProj/Makefile.coq
src/test/resources/coqProj/Makefile.coq.conf
src/test/resources/coqProj/Makefile.coq.conf

# Ignore the generated build files in datasets inside benchmarks
dataset/**/result
dataset/**/.vscode/
5 changes: 5 additions & 0 deletions .gitmodules
@@ -0,0 +1,5 @@
[submodule "dataset/imm"]
path = dataset/imm
url = https://github.com/weakmemory/imm.git
branch = coq819
ignore = dirty
4 changes: 3 additions & 1 deletion .vscodeignore
@@ -9,4 +9,6 @@ vsc-extension-quickstart.md
**/*.map
**/*.ts
logs/**
_opam/**
_opam/**
./out/test/**
dataset
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Change Log

### 2.1.0
Major:
- Create a benchmarking system (still under active development). A guide on how to use it is in the README.
- Conduct an experiment on the performance of different LLMs using the developed infrastructure. The benchmarking report is located in the [docs folder](etc/docs/benchmarking_report01.md).
- Correctly handle and display errors that occur when the user settings are not set correctly.

Minor:
- Set the order of contributed settings.
- Add a comprehensive user settings guide to the README.
- Fix an issue with the Grazie service not correctly accepting Coq ligatures.
- Fix an issue that occurred when a generated proof contained the `Proof using {...}.` construct.

### 2.0.0
- Added multiple strategies for ranking theorems from the working file. As the LLM context window is limited, we sometimes need to choose a subset of theorems to provide as references to the LLM. Two ranking strategies are available so far, with more to come: one picks theorems randomly, the other ranks them by distance from the hole.
- Different holes are now solved in parallel, which is a huge performance improvement.
Expand Down
131 changes: 126 additions & 5 deletions README.md
@@ -6,7 +6,24 @@

`coqpilot` is now in early beta and we are seeking feedback. Please feel free to open an issue for any problem you encounter or any feature you would like to see in the future.

***Important:*** Version 2.0.0 has not yet been tested thoroughly. Please feel free to open an issue if you encounter any problem. Moreover, stay tuned for the upcoming 2.0.1 release.
# Table of Contents

- 🚀 [Coqpilot Overview](#coqpilot)
- 📋 [Requirements](#requirements)
- 🔍 [Brief Technical Overview](#brief-technical-overview)
- 💡 [Example Usage](#example-usage)
- 🛠 [Installation](#installation)
- ▶️ [Coq-lsp Installation](#coq-lsp-installation)
- 🤖 [Building Locally](#building-locally)
- ⚠️ [Important Information](#important)
- ⚙️ [Extension Settings](#extension-settings)
- 📐 [Guide to Model Configuration](#guide-to-model-configuration)
- 🎛 [How VsCode settings work](#how-vscode-settings-work)
- 🧰 [Model Configuration](#model-configuration)
- 📌 [Contributed Commands](#contributed-commands)
- 📊 [Benchmark](#benchmark)
- 🔜 [Future Plans](#future-plans)
- 📜 [Release Notes](#release-notes)

## Requirements

@@ -102,22 +119,126 @@ Comment: Such files are not visible in the vscode explorer, because plugin adds

This extension contributes the following settings:

* `coqpilot.contextTheoremsRankerType` : The type of theorems ranker that will be used to select theorems for proof generation (when context is smaller than taking all of them). Either randomly or by distance from the theorem, with the currently generated admit.
* `coqpilot.contextTheoremsRankerType` : The type of theorem ranker used to select theorems for proof generation (when the context is too small to fit all of them). Theorems can be ranked randomly, by Jaccard index (a similarity metric), or by distance from the currently observed admit.
* `coqpilot.loggingVerbosity` : The verbosity of the logs; can be `info` or `debug`.

* `coqpilot.openAiModelsParameters`, `coqpilot.predefinedProofsModelsParameters`, `coqpilot.grazieModelsParameters` and `coqpilot.lmStudioModelsParameters`:

Each of these settings is modified in `settings.json` and contains an array of models for the corresponding service. Each model is used for generation independently, and multiple models can be defined for a single service. For example, you can define parameters for two open-ai GPT models, one using `gpt-3.5` and the other `gpt-4`. CoqPilot will first try to generate proofs using the first model and, if that fails, will try the second one. This way CoqPilot iterates over all services (currently four of them) and, for each service, over all of its models, as sketched below.
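
For instance, a two-model open-ai setup could look roughly like this (a minimal sketch: the values are illustrative, and the remaining fields from the full `openAiModelsParameters` example below should be filled in as well):
```json
{
    "coqpilot.openAiModelsParameters": [
        {
            "modelName": "gpt-3.5-turbo-0301",
            "apiKey": "***your-api-key***",
            "choices": 10
        },
        {
            "modelName": "gpt-4",
            "apiKey": "***your-api-key***",
            "choices": 10
        }
    ]
}
```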

## Guide to Model Configuration

### How VsCode settings work

A common way to change the settings contributed by the extension is to open the `settings.json` file, or to click `Edit in settings.json` on some field in the settings UI. Say the extension contributes a setting `A` with default value `a'`. When you click edit, this field is copied into `settings.json` with the value `a'`:
```json
{
"A": "a'"
}
```
From that moment until you remove this field from `settings.json`, the file is the source of truth for this setting. In other words, to reset the setting `A` back to its default, you have to remove the field from the file completely.

### Model Configuration

As mentioned in the previous section, four services are currently supported.

By default, only the `predefinedProofs` and `open-ai` services are enabled. The first one tries the `auto.` tactic, and the second one has a single model, `gpt-3.5`. The other services default to empty arrays, which means that no models are created for them.
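
In `settings.json` terms, these defaults correspond roughly to the following sketch (the `modelName` value is an arbitrary placeholder, and the default `open-ai` model is omitted for brevity):
```json
{
    "coqpilot.predefinedProofsModelsParameters": [
        {
            "modelName": "predefined-auto",
            "tactics": ["auto."]
        }
    ],
    "coqpilot.grazieModelsParameters": [],
    "coqpilot.lmStudioModelsParameters": []
}
```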

Every service is configured with an array of independent models, which makes it easy to experiment with different models and their parameters.

The simplest service to configure is `predefinedProofs`:
```json
{
    "coqpilot.predefinedProofsModelsParameters": [
        {
            "modelName": "Any name",
            "tactics": [
                "reflexivity.",
                "simpl. reflexivity.",
                "auto."
            ]
        }
    ]
}
```
The model name here is only used for convenience, so it may be any string.

The most commonly used service is `open-ai` (`grazie` and `lmStudio` are configured very similarly):
```json
{
    "coqpilot.openAiModelsParameters": [
        {
            "temperature": 1,
            "apiKey": "***your-api-key***",
            "modelName": "gpt-3.5-turbo-0301",
            "choices": 10,
            "systemPrompt": "Generate proof...",
            "newMessageMaxTokens": 2000,
            "tokensLimit": 4096,
            "multiroundProfile": {
                "maxRoundsNumber": 1,
                "proofFixChoices": 1,
                "proofFixPrompt": "Unfortunately, the last proof is not correct..."
            }
        }
    ]
}
```
Don't forget to set the `apiKey` field; by default it is set to `None`. Also make sure that your open-ai key is valid and has enough credits to run the models. A free-tier key will not work (it has very restrictive rate limits). You can check your key here: https://platform.openai.com/playground/chat. If the playground works, the key is probably valid.

The multi-round profile configures the number of attempts to fix an incorrect proof. If the proof does not check, the compiler message is sent to the LLM with a request to repair it. The number of rounds for one proof is set by `maxRoundsNumber`, and the number of choices generated when fixing a proof by `proofFixChoices`. By default, both values are set to 1, which means that **no** attempts to fix the proof are made: each proof is generated only once, i.e. multi-round fixing is turned off. 0 is not a valid value for either `maxRoundsNumber` or `proofFixChoices`.
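
For example, under the reading above, a profile like the following sketch (illustrative values) would allow up to two additional repair rounds, with two fix candidates per round:
```json
{
    "multiroundProfile": {
        "maxRoundsNumber": 3,
        "proofFixChoices": 2,
        "proofFixPrompt": "Unfortunately, the last proof is not correct..."
    }
}
```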

Another thing to keep in mind: we are still in beta, and the settings may change fairly often. When that happens, your re-defined settings (which are stored apart from the extension) are not updated automatically, and this can lead to exceptions. If you get an error or the plugin does not start after an update, double-check the settings. An easy way is to remove the setting completely from `settings.json`, e.g. `openAiModelsParameters`, then go to the UI and click `Edit in settings.json` on the `openAiModelsParameters` field. It will be filled with the updated default values, and afterwards you can re-define the settings as you want.

## Contributed Commands

* `coqpilot.perform_completion_under_cursor`: Try to generate proof for the goal under the cursor.
* `coqpilot.perform_completion_for_all_admits`: Try to prove all holes (admitted goals) in the current file.
* `coqpilot.perform_completion_in_selection`: Try to prove holes (admitted goals) in the selection.

## Planned Features
## Benchmark

- Add benchmarking options for various models: soon.
To run benchmarks on a project, apart from installing and building CoqPilot manually as described above, you will need to download the projects that are used as benchmark datasets. These projects are added as submodules of the repository. To download them, run the following commands:
```bash
git submodule init
git submodule update
```
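Equivalently, a single command should also work with a standard git setup:
```bash
git submodule update --init --recursive
```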
After that, you need to build the projects. Be careful: the actively maintained way to build these projects is `nix`. Moreover, when adding your own projects, make sure that they are built using `coq-8.19.0`.
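
When adding your own project, its submodule entry can mirror the `imm` entry shown in `.gitmodules` above; for example (the URL, branch, and path are placeholders):
```bash
git submodule add -b <your-coq-8.19-branch> https://github.com/<you>/<your-project>.git dataset/<your-project>
```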

First things first, the process of running the benchmark is not fully automated yet; we are working on it. For now, one project (one unit containing a nix environment) should be run at a time. Say you are going to run the benchmark on the `imm` project. You will have to do the following:

<!-- 0. Go to the `imm` subdirectory and add a `_CoqProject` file in the root with the following:
```
-I result/lib/coq/8.19/user-contrib/imm
-R result/lib/coq/8.19/user-contrib/imm imm
```
This is needed for the lsp-server to correctly resolve file dependencies. -->

1. Install nix, as specified [here](https://nixos.org/download.html).

2. Install the needed caches:
```bash
nix-env -iA nixpkgs.cachix && cachix use coq && cachix use coq-community && cachix use math-comp
cachix use weakmemory
```

3. Go to the `imm` subdirectory, apply the nix environment (without it the project will NOT build), and build the project:
```bash
cd dataset/imm
nix-build
nix-shell
```
4. Make sure the `_CoqProject` file was successfully generated in the root of the project. Return to the project root without exiting the nix-shell, then run the benchmark:
```bash
cd ../../
npm run benchmark
```

## Future Plans

- Currently the user needs to enter the nix shell manually to get the correct environment for the benchmarks. We are working on automating this process.
- The benchmarking system is evolving and will soon become more stable, with smart scheduling (choosing models and services depending on availability and token limits) and automatically generated, informative reports.
- Get rid of the overhead caused by the hacks around coq-lsp and the aux files.

## Release Notes

45 changes: 45 additions & 0 deletions dataset/auto_benchmark.v
@@ -0,0 +1,45 @@
Theorem test : forall (A : Type) (P : A -> Prop) (x : A), P x -> P x.
Proof.
admit.
Admitted.

Theorem test2 : forall (A : Type) (P : A -> Prop) (x : A), P x -> P x.
Proof.
intros A P x H.
admit.
Admitted.

Theorem test2nat1 : forall n : nat, n = 0 \/ n <> 0.
Proof.
intros n.
destruct n.
- admit.
- right.
discriminate.
Admitted.

Theorem test2nat2 : forall n : nat, n = 0 \/ n <> 0.
Proof.
intros n.
destruct n.
{
admit.
}
{
admit.
}
Admitted.

Theorem test_thr : forall n:nat, 0 + n = n.
Proof.
intros n. Print plus.
admit.
(* reflexivity. *)
Admitted.

Lemma test_thr1 : forall n:nat, 0 + n + 0 = n.
Proof.
intros n. Print plus.
admit.
(* reflexivity. *)
Admitted.
1 change: 1 addition & 0 deletions dataset/imm
Submodule imm added at 14025d
19 changes: 19 additions & 0 deletions dataset/mixed_benchmark.v
@@ -0,0 +1,19 @@
Theorem test_plus_1 : forall n:nat, 1 + n = S n.
Proof.
auto.
Qed.

Theorem test_thr : forall n:nat, 0 + n = n.
Proof.
admit.
Admitted.

Theorem add_comm : forall n m:nat, n + m = m + n.
Proof.
admit.
Admitted.

Theorem add_comm_1 : forall n:nat, n + 1 = 1 + n.
Proof.
admit.
Admitted.
