[RFC] Fuzzy behavior tests/CI for Merlin #1657
Merlin already has quite a few end-to-end tests. All of those are behavior-specific: we want to test one concrete Merlin behavior, so we create specific input for it and test Merlin on that input. In addition to those tests, we now also want to test Merlin's behavior non-specifically, on a large scale of randomized input.
This PR builds on top of PR #1642. The main differences are:

- The `return` data is finer-grained.

This implements both issues #1634 and #1635. Here, I'm doing both things in one workflow, which I think is the right thing to do. Closes #1634, closes #1635.
Approach
I'd like to discuss the approach. So, what do you think about the following?
Local tests
Similarly to PR #1642, I'm proposing to have the tests locally as well, not only in CI. Same as there: they would not run on every `dune test` invocation.

Test cost and maintenance data
I've optimized the tool `merl-an`, which we use to generate the samples, and I've chosen the sample base carefully. With that, the time seems reasonable to me for CI, while the space still seems quite big to me to add to the repo.

Time
If the test base is already fetched and built (that should be the case most of the time now thanks to Dune caching):
If the base needs to be fetched and built:
Both timings come from my local computer.
Space
There are two big test files that would get into the Merlin repo with this approach:
`category_data.t` and `full_responses.t`. They could be made more compact, but at a price: see Handle the JSON output below. We could also make them smaller by choosing fewer samples. Currently, their sizes are:

- `category_data.t`: 4.7M, 189460 lines
- `full_responses.t`: 36M, 1168705 lines
If we do go forward with this, we should probably use Git LFS. Does anyone have experience with that?
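For illustration, the setup could look roughly like this (the paths are hypothetical; the exact patterns are up for discussion):

```sh
# One-time setup per clone (assumes git-lfs is installed)
git lfs install

# Track the two large generated test files via LFS (hypothetical paths)
git lfs track "tests/fuzzy/category_data.t" "tests/fuzzy/full_responses.t"

# The patterns land in .gitattributes, which needs to be committed
git add .gitattributes
```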
Suggested workflow
CI
For 23482 samples, as in this RFC, I expect the CI to take about 15-20 min (side note: I could push the GitHub yml from the #1642 PR, even though the CI would currently fail, to make sure this is true).

There are other CI jobs on Merlin that also seem to take that long. Is that true? If so, we could try running this CI on every PR (that fulfills a few conditions, such as ml/mli files having been modified, etc.) and see whether that ends up being annoying or ok. We can also optimize the sample size further.
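To make the "only when ml/mli files were modified" condition concrete, a GitHub Actions trigger could be restricted roughly like this (a sketch only; the real workflow file would come from the #1642 PR, and the patterns are assumptions):

```yaml
# Hypothetical trigger: only run the fuzzy tests when OCaml sources change
on:
  pull_request:
    paths:
      - '**.ml'
      - '**.mli'
```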
If you think it's too long to run on every PR, I have a potential alternative workflow in mind.
Locally
Locally, `dune test` won't run these tests. People can run them via `FUZZING=true dune test`, which they should do iff they expect or fear a behavior change. That's one of the weak points of this approach, since that will take about 10 min locally, or even 15 min when the sample base isn't cached.
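For clarity, the intended local usage would be as follows (the variable name is the one proposed above; the exact dune gating mechanism still needs to be worked out):

```sh
# Default: the fuzzy behavior tests are skipped
dune test

# Opt in to the fuzzy behavior tests (roughly 10-15 min)
FUZZING=true dune test
```

Keeping them opt-in keeps the default `dune test` fast; the cost is that running them is a manual and fairly long step.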
Two in one
This test generates two different data sets: category data in `category_data.t`, i.e. the implementation of issue #1635, and full Merlin responses in `full_responses.t`, i.e. the implementation of issue #1634.

Category data
The Merlin responses can face regressions/improvements on different levels. When treating such a big amount of test data as here, it's helpful to reduce the data and separate the different levels.
Return classes
One level a regression can happen on is the return class. That's why in `category_data.t`, the responses are reduced to "belongs to one of these six return categories", e.g. the category "successful return", or the category "properly reported error", etc.
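Just to make the idea tangible, the six return categories can be pictured as a small variant type. This is a hypothetical sketch, not merl-an's actual representation; only two of the six category names appear in this RFC:

```ocaml
(* Hypothetical sketch of the category reduction used in category_data.t.
   The real data distinguishes six categories; only two are named in this
   RFC, the last constructor stands in for the rest. *)
type return_category =
  | Successful_return         (* the query returned a meaningful result *)
  | Properly_reported_error   (* Merlin reported the problem as a proper error response *)
  | Other_category            (* placeholder for the remaining four categories *)
```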
Server crashes

Another interesting kind of regression is server crashes. That's why `category_data.t` also contains the server's query number.

Full responses
By reducing the responses, however, regressions/behavior changes can also be missed. That's why the test also contains the full responses in `full_responses.t`.

Impact
Positive
Concrete regressions we would have caught in the past
I've checked what analyzing this test would look like by running it on a couple of concrete `locate` regressions that have happened in the past. Both regressions would have been caught by the new test.

For the first one, the test diff is here. It's short and easy to read. The full responses data would have caught the regression; the category data wouldn't. The regression was about `locate` in the context of functor applications.

For the second one, the test diff is here. It's also short and easy to read. Both the full responses data and the category data would have caught the regression. The regression was about `locate` in the context of modules without implementation.

Behavior changes on big PRs such as compiler bumps
I've also had a look at what the test diffs would look like on big PRs such as compiler bumps.
The diff for the 4.13 bump is here. Unfortunately, it contains a bit of unnecessary noise (we need to fix that). The category data is easy to read, and it directly shows a regression in `occurrences`: it sometimes returns an empty list now, where before it would find occurrences. The full responses are a bit harder to read, but also contain interesting changes.

The diff for the 4.14 bump is here. It's similar.
General
I also had to update the test between the mentioned PRs. Interestingly, all of those updates contain regressions as well (while also containing quite a few improvements, of course). E.g.:
One interesting diff is this one. It contains one of the `locate` regressions whose fix is mentioned above. Additionally, it contains a few more `locate` regressions. One of them: the return is now an exception, where before it was a value. Those regressions were introduced at some point between the 4.14 bump and this PR.

Limits
Test code base
The test code base is buildable; in particular, it doesn't have errors and doesn't contain holes. So no regressions in those contexts can be found.
To give one example: the regression in the context of holes and PPXs wouldn't have been caught. I didn't even bother to try. We could "overfit" this test to that regression by adding a workflow to `merl-an` that produces that situation. I don't think that's worth it. For those cases, the normal input-specific end-to-end tests are more suitable.

Sample size
The samples can't cover all edge cases, so we'll always miss things.
To give one example: originally, I included `case-analysis` in the queries that are run. I've checked whether a regression making the server crash on certain `case-analysis` queries would have been caught. It wouldn't have.

Implementation details and their consequences
This is a quick PoC with the goal of gathering some data (see Impact) and discussing the approach/workflow I'm suggesting. There are several implementation details we'll need to address before thinking about merging this. E.g.:
Where does `merl-an` live

The current workflow relies on having `merl-an` installed. Instead, I suggest upstreaming it to Merlin and having Dune build it as part of the test.
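A rough sketch of what that could look like once `merl-an` lives in the Merlin repo — the stanza layout, paths, and test names are assumptions, not what this PR currently does:

```
; Hypothetical dune file next to the fuzzy cram tests: depending on the
; locally built merl-an executable makes dune build it before the tests run.
(cram
 (applies_to category_data full_responses)
 (deps %{exe:merl-an/bin/main.exe}))
```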
Where do the Irmin dependencies come from
The current workflow relies on having the Irmin dependencies installed. If we wanted that workflow, we'd have to add all Irmin dependencies as Merlin test dependencies. Instead, I suggest vendoring Irmin and its dependencies. Note:
CI integration
The CI integration can be taken from PR #1642. We just need to make sure the tests are agnostic to running locally vs in CI.
Filter out noise
Related: by not using `merlin-wrapper`, the tests aren't filtering out the noise coming from e.g. switch-specific and compiler-specific things.

Furthermore, the current [query_num] adds a lot of noise as well: as soon as there has been one server crash, all the following queries will be marked in the diff as well.
Handle the JSON output
Currently, the tests pipe the JSON through `jq`. That makes both the JSON itself and its diffs very readable. However, it makes the files extremely long. We need to find a good balance between strongly formatted JSON (piping through `jq`) and very compact JSON (`merl-an`'s output directly).
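To make the trade-off concrete, the two extremes look roughly like this (the file name is just for illustration):

```sh
# Pretty-printed (current approach): readable JSON and readable diffs,
# but the test files become very long
jq . response.json

# Compact output: one line per JSON value, much smaller files,
# at the cost of harder-to-read diffs
jq -c . response.json
```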
Sample set choice
The current sample choice is already quite conscious and adjusted, and I don't think it's bad (I can give details if that's interesting). However, there are still three ways in which we could improve it:

- Better sample distribution: pitag-ha/merl-an#28.
- Optimize the test code base even more: can it be smaller while catching the same? Bigger while catching more?
- Optimize the number of samples per file: can it be fewer while catching the same? More while catching more?
Queries
E.g.: do we want to include `case-analysis`?

Minor improvements/details
There are several details we could improve if we want to, e.g.:
The `return` category data could be even finer-grained. E.g. when a list is returned, it's interesting to know its size. For example, that's interesting for `complete-prefix`, where changes often consist in having more or fewer suggestions than before.

Also, we could use the dune diff workflow directly instead of cram tests. The cram tests just cat files, t