Evaluation - Simulate usage using examples produced from sessions (#253)
# Use Simulation For Evaluation

This PR completely overhauls how we do evaluation, as outlined in [TN011EVALDATA](https://foyle.io/docs/tech-notes/tn011_eval_data/).

One of the major pain points in our approach to evaluation has been building up a sufficiently large dataset. This PR solves that problem by using examples generated from sessions produced by actual usage. The more we use Foyle, the more data we have available for evaluation.

Another challenge has been deciding what to use as the set of learned examples during evaluation. Using actual sessions solves this problem because sessions are ordered in time. During evaluation we start out with no learned examples. We then replay the sessions in the same order they occurred. Foyle can then learn from those sessions, using its normal learning process, to improve accuracy on subsequent examples.

# Making the Evaluator a Simulator

To achieve this, we rework the Evaluator to act more like a simulator: it simulates what a user would do by using the sessions as examples of intent and actions.

We refactor the Evaluator to follow the pattern we first used in the AssertJob, in which the experiment driver (the evaluator) interacts with the Agent via RPC. This makes it easy to set up and configure an independent instance of the Agent with suitable parameters for the experiment.

# Use sqlite for storing the results

We rewrite the evaluator to use sqlite, rather than pebble, to store the evaluation results. This gives much better querying capabilities for exploring the results. We store the EvalResult proto in JSON rather than binary format so that we can use sqlite's capabilities to query the data.

# Level 1 Evals

This PR deletes the Assertor code because all of these changes render it out of date. In a subsequent PR we should integrate the level 1 assertions into the evaluator. Tracked in #261

# Code Cleanup

Delete the code for computing the distance between expected and actual programs. We have switched to LLM as judge. That metric is likely no longer useful because the generated code is often a multi-line mini program that the metric couldn't handle.

Delete the data/eval directory. It held handcrafted evaluation examples expressed as markdown files. With this PR we are making two changes:

1. Store EvalExamples as protos to allow richer data representations
2. Produce evaluation datasets from logs and actual usage

Fix #140
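The replay loop described under "Use Simulation For Evaluation" can be sketched roughly as follows. This is a minimal illustration and not the evaluator's actual code: the `Session` and `Agent` types, `Replay`, and `demoSessions` are hypothetical stand-ins, and the toy agent "learns" by memorizing exact intents, whereas the real Agent is reached over RPC and uses Foyle's own learning process.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Session is a hypothetical, simplified stand-in for a recorded session:
// what the user wanted (Intent) and what they actually ran (Action).
type Session struct {
	Time   time.Time
	Intent string
	Action string
}

// Agent is a toy agent that can only answer from what it has learned so far.
type Agent struct {
	learned map[string]string
}

// NewAgent returns an agent with no learned examples.
func NewAgent() *Agent { return &Agent{learned: map[string]string{}} }

// Generate returns the learned action for an intent, or "" if none is known.
func (a *Agent) Generate(intent string) string { return a.learned[intent] }

// Learn records the action the user actually took for an intent.
func (a *Agent) Learn(s Session) { a.learned[s.Intent] = s.Action }

// Replay orders the sessions by time and, for each one, asks the agent for a
// prediction before letting it learn from that session. It returns the number
// of predictions that matched the recorded action.
func Replay(agent *Agent, sessions []Session) int {
	ordered := append([]Session(nil), sessions...) // do not mutate the caller's slice
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Time.Before(ordered[j].Time)
	})
	correct := 0
	for _, s := range ordered {
		if agent.Generate(s.Intent) == s.Action {
			correct++
		}
		agent.Learn(s)
	}
	return correct
}

// demoSessions returns made-up sessions, deliberately out of time order.
func demoSessions() []Session {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []Session{
		{Time: t0.Add(2 * time.Hour), Intent: "list pods", Action: "kubectl get pods"},
		{Time: t0, Intent: "list pods", Action: "kubectl get pods"},
		{Time: t0.Add(time.Hour), Intent: "show logs", Action: "kubectl logs"},
	}
}

func main() {
	fmt.Println("correct predictions:", Replay(NewAgent(), demoSessions()))
}
```

With the demo data, the first time each intent appears the agent has nothing to go on; only the repeated "list pods" intent is answered from a previously learned session, which is the effect the time-ordered replay is meant to measure.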
Showing 71 changed files with 1,875 additions and 1,801 deletions.
@@ -0,0 +1,67 @@
package cmd

import (
	"fmt"
	"os"
	"strings"

	"github.com/go-logr/zapr"
	"github.com/jlewi/foyle/protos/go/foyle/v1alpha1"
	"github.com/pkg/errors"
	"github.com/spf13/cobra"
	"go.uber.org/zap"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
)

// NewProtoToJsonCmd creates a command for converting a proto to json
func NewProtoToJsonCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "prototojson <file>",
		Short: "Dump the binary proto file to json",
		Run: func(cmd *cobra.Command, args []string) {
			err := func() error {
				log := zapr.NewLogger(zap.L())
				if len(args) == 0 {
					log.Info("prototojson takes at least one argument which should be the path of the proto to dump.")
					// Return early; otherwise args[0] below would panic.
					return errors.New("missing required argument: path of the proto file")
				}

				file := args[0]

				// The proto type is inferred from the file name suffix.
				var message proto.Message
				var typeName string
				if strings.HasSuffix(file, ".evalexample.binpb") {
					message = &v1alpha1.EvalExample{}
					typeName = "EvalExample"
				}

				if strings.HasSuffix(file, ".example.binpb") {
					message = &v1alpha1.Example{}
					typeName = "Example"
				}

				if message == nil {
					return errors.Errorf("The type of proto could not be determined from the path suffix for file: %s", file)
				}
				data, err := os.ReadFile(file)
				if err != nil {
					return errors.Wrapf(err, "Error reading file %s", file)
				}

				if err := proto.Unmarshal(data, message); err != nil {
					return errors.Wrapf(err, "Error unmarshalling proto of type %s from file %s", typeName, file)
				}

				// Print the message as human-readable, multi-line JSON.
				jsonP := protojson.Format(message)
				fmt.Fprintf(os.Stdout, "%s\n", jsonP)
				return nil
			}()
			if err != nil {
				fmt.Printf("Error running convert;\n %+v\n", err)
				os.Exit(1)
			}
		},
	}

	return cmd
}
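The suffix-based type detection in `NewProtoToJsonCmd` could also be written as a lookup table, which may be easier to extend if more proto types need to be supported. The sketch below is only an illustration using the standard library: `suffixTypes` and `TypeNameForFile` are hypothetical names that do not appear in the commit, and the sketch returns the type name as a string instead of constructing a `proto.Message`.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixTypes maps a file name suffix to the name of the proto type it holds.
// The two entries mirror the suffixes checked by the command above.
var suffixTypes = []struct {
	Suffix   string
	TypeName string
}{
	{".evalexample.binpb", "EvalExample"},
	{".example.binpb", "Example"},
}

// TypeNameForFile returns the proto type name implied by the file's suffix,
// or an error if the suffix is not recognized.
func TypeNameForFile(file string) (string, error) {
	for _, st := range suffixTypes {
		if strings.HasSuffix(file, st.Suffix) {
			return st.TypeName, nil
		}
	}
	return "", fmt.Errorf("the type of proto could not be determined from the path suffix for file: %s", file)
}

func main() {
	for _, f := range []string{"a.evalexample.binpb", "b.example.binpb", "c.txt"} {
		name, err := TypeNameForFile(f)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s holds a %s\n", f, name)
	}
}
```

In a real command each table entry would presumably carry a constructor such as `func() proto.Message` rather than a string, so that the matched entry can be passed straight to `proto.Unmarshal`.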
@@ -0,0 +1,5 @@
package agent

const (
	TraceIDHeader = "Foyle-Trace-ID"
)
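The diff only defines the header name, so how it is used is not visible here. One plausible use, assumed purely for illustration, is that the Agent attaches a trace ID to its HTTP response so that the evaluator can link a result back to the Agent's trace. The handler, the request path, and the trace value below are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// TraceIDHeader matches the constant added in the diff above.
const TraceIDHeader = "Foyle-Trace-ID"

// handler is a hypothetical agent endpoint that reports the ID of the trace
// it produced in a response header. The value is hard coded for the demo.
func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set(TraceIDHeader, "trace-123")
	w.WriteHeader(http.StatusOK)
}

// traceIDFromHandler sends one in-memory request to h and returns the trace
// ID header from the response. No network listener is involved.
func traceIDFromHandler(h http.Handler) string {
	req := httptest.NewRequest(http.MethodPost, "/generate", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Result().Header.Get(TraceIDHeader)
}

// demoTraceID runs the demo handler and returns the trace ID it reported.
func demoTraceID() string {
	return traceIDFromHandler(http.HandlerFunc(handler))
}

func main() {
	fmt.Println("trace id:", demoTraceID())
}
```

Keeping the header name in a shared constant means the side that sets the header and the side that reads it cannot drift apart, which is presumably why it lives in the `agent` package.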
@@ -0,0 +1,19 @@
-- name: UpdateResult :exec
INSERT OR REPLACE INTO results
(id, time, proto_json)
VALUES
(?, ?, ?);

-- name: GetResult :one
SELECT * FROM results
WHERE id = ?;


-- name: ListResults :many
-- This queries for results.
-- Results are listed in descending order of time (most recent first) because the primary use is for resuming
-- in the evaluator
SELECT * FROM results
WHERE (:cursor = '' OR time < :cursor)
ORDER BY time DESC
LIMIT :page_size;
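The cursor logic in `ListResults` can be illustrated with an in-memory emulation. This is a sketch and not the generated query code: the `Result` struct, `ListAll`, and `demoRows` are hypothetical, and it assumes the `time` column holds strings that sort lexicographically in chronological order (for example RFC 3339 timestamps in UTC), which the diff does not confirm.

```go
package main

import (
	"fmt"
	"sort"
)

// Result holds the two columns of the results table that matter for paging.
type Result struct {
	ID   string
	Time string // assumed to sort lexicographically in time order
}

// ListResults emulates the SQL query:
//   WHERE (:cursor = '' OR time < :cursor) ORDER BY time DESC LIMIT :page_size
func ListResults(rows []Result, cursor string, pageSize int) []Result {
	sorted := append([]Result(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Time > sorted[j].Time })
	page := []Result{}
	for _, r := range sorted {
		if len(page) >= pageSize {
			break
		}
		if cursor == "" || r.Time < cursor {
			page = append(page, r)
		}
	}
	return page
}

// ListAll pages through every row, using the time of the last row on each
// page as the cursor for the next page. It returns IDs, most recent first.
func ListAll(rows []Result, pageSize int) []string {
	ids := []string{}
	cursor := ""
	for {
		page := ListResults(rows, cursor, pageSize)
		if len(page) == 0 {
			break
		}
		for _, r := range page {
			ids = append(ids, r.ID)
		}
		cursor = page[len(page)-1].Time
	}
	return ids
}

// demoRows returns made-up rows in no particular order.
func demoRows() []Result {
	return []Result{
		{ID: "a", Time: "2024-01-01T00:00:00Z"},
		{ID: "c", Time: "2024-01-03T00:00:00Z"},
		{ID: "b", Time: "2024-01-02T00:00:00Z"},
	}
}

func main() {
	fmt.Println(ListAll(demoRows(), 2))
}
```

One property worth noting: because the comparison is strict (`time < :cursor`), rows that share the exact timestamp of the last row on a page would be skipped on the next page, so this style of cursor works best when timestamps are effectively unique.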