[RFC] Fuzzy behavior tests/CI for Merlin #1657
Merlin already has quite a few end-to-end tests. All of those are behavior-specific: we want to test one concrete Merlin behavior, so we create specific input for it and test Merlin on that input. In addition to those tests, we now also want to test Merlin's behavior non-specifically, on a large scale of randomized input.
This PR builds on top of PR #1642. The main differences are:

- The `return` data is finer-grained.

This implements both issues #1634 and #1635. Here, I'm doing both things in one workflow, which I think is the right thing to do. Closes #1634, closes #1635.
Approach
I'd like to discuss the approach. So, what do you think about the following?
Local tests
Similarly to PR #1642, I'm proposing to have the tests locally as well, not only in CI. Same as there: they would not run on every `dune test` invocation.

Test cost and maintenance data
I've optimized the tool `merl-an`, which we use to generate the samples, and I've chosen the sample base carefully. With that, the time seems reasonable to me for CI, while the space still seems quite big to me to add to the repo.

Time
If the test base is already fetched and built (that should be the case most of the time now thanks to Dune caching):
If the base needs to be fetched and built:
Both timings come from my local computer.
Space
There are two big test files that would get into the Merlin repo with this approach:
`category_data.t` and `full_responses.t`. They could be made more compact, but at a price: see Handle the JSON output below. We could also make them smaller by choosing fewer samples. Currently, their sizes are:

- `category_data.t`: 4.7M, 189460 lines
- `full_responses.t`: 36M, 1168705 lines
If we do go forward with this, we should probably use Git LFS. Does anyone have experience with that?
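For illustration, the setup could look roughly like this (the paths are hypothetical; the exact patterns are up for discussion):

```sh
# One-time setup per clone (assumes git-lfs is installed)
git lfs install

# Track the two large generated test files via LFS (hypothetical paths)
git lfs track "tests/fuzzy/category_data.t" "tests/fuzzy/full_responses.t"

# The patterns land in .gitattributes, which needs to be committed
git add .gitattributes
```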
Suggested workflow
CI
For 23482 samples, as in this RFC, I expect the CI to take about 15-20 min (side note: I could push the GitHub yml from the #1642 PR, even though the CI would currently fail, to make sure this is true).

There are other CI jobs on Merlin that also seem to take that long. Is that true? If so, we could try running this CI on every PR (that fulfills a few conditions, such as ml/mli files having been modified, etc.) and see whether that ends up being annoying or ok. We can also optimize the sample size further.
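To make the "only when ml/mli files were modified" condition concrete, a GitHub Actions trigger could be restricted roughly like this (a sketch only; the real workflow file would come from the #1642 PR, and the patterns are assumptions):

```yaml
# Hypothetical trigger: only run the fuzzy tests when OCaml sources change
on:
  pull_request:
    paths:
      - '**.ml'
      - '**.mli'
```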
If you think it's too long to run on every PR, I have a potential alternative workflow in mind.
Locally
Locally, `dune test` won't run these tests. People can run them via `FUZZING=true dune test`, which they should do iff they expect or fear a behavior change. That's one of the weak points of this approach, since that will take about 10 min locally, or even 15 min when the sample base isn't cached.
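For clarity, the intended local usage would be as follows (the variable name is the one proposed above; the exact dune gating mechanism still needs to be worked out):

```sh
# Default: the fuzzy behavior tests are skipped
dune test

# Opt in to the fuzzy behavior tests (roughly 10-15 min)
FUZZING=true dune test
```

Keeping them opt-in keeps the default `dune test` fast; the cost is that running them is a manual and fairly long step.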
Two in one
This test generates two different data sets: category data in `category_data.t`, i.e. the implementation of issue #1635, and full Merlin responses in `full_responses.t`, i.e. the implementation of issue #1634.

Category data
The Merlin responses can face regressions/improvements on different levels. When treating such a big amount of test data as here, it's helpful to reduce the data and separate the different levels.
Return classes
One level a regression can happen on is the return class. That's why in `category_data.t`, the responses are reduced to "belongs to one of these six return categories", e.g. the category "successful return", or the category "properly reported error", etc.
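Just to make the idea tangible, the six return categories can be pictured as a small variant type. This is a hypothetical sketch, not merl-an's actual representation; only two of the six category names appear in this RFC:

```ocaml
(* Hypothetical sketch of the category reduction used in category_data.t.
   The real data distinguishes six categories; only two are named in this
   RFC, the last constructor stands in for the rest. *)
type return_category =
  | Successful_return         (* the query returned a meaningful result *)
  | Properly_reported_error   (* Merlin reported the problem as a proper error response *)
  | Other_category            (* placeholder for the remaining four categories *)
```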
Server crashes

Another interesting kind of regression is server crashes. That's why `category_data.t` also contains the server's query number.

Full responses
By reducing the responses, however, regressions/behavior changes can also be missed. That's why the test also contains the full responses in `full_responses.t`.

Impact
Positive
Concrete regressions we would have caught in the past
I've checked what analyzing this test would look like by running it on a couple of concrete `locate` regressions that have happened in the past. Both regressions would have been caught by the new test.

For the first one, the test diff is here. It's short and easy to read. The full responses data would have caught the regression; the category data wouldn't. The regression was about `locate` in the context of functor applications.

For the second one, the test diff is here. It's also short and easy to read. Both the full responses data and the category data would have caught the regression. The regression was about `locate` in the context of modules without implementation.

Behavior changes on big PRs such as compiler bumps
I've also had a look at what the test diffs would look like on big PRs such as compiler bumps.
The diff for the 4.13 bump is here. Unfortunately, it contains a bit of unnecessary noise (we need to fix that). The category data is easy to read, and it directly shows a regression in `occurrences`: it sometimes returns an empty list now, where before it would find occurrences. The full responses are a bit harder to read, but also contain interesting changes.

The diff for the 4.14 bump is here. It's similar.
General
I also had to update the test between the mentioned PRs. Interestingly, all of those updates contain regressions as well (while also containing quite a few improvements, of course). E.g.:
One interesting diff is this one. It contains one of the `locate` regressions whose fix is mentioned above. Additionally, it contains a few more `locate` regressions. One of them: the return is now an exception, where before it was a value. Those regressions were introduced at some point between the 4.14 bump and this PR.

Limits
Test code base
The test code base is buildable; in particular, it doesn't have errors and doesn't contain holes. So no regressions in those contexts can be found.
To give one example: the regression in the context of holes and PPXs wouldn't have been caught. I didn't even bother to try. We could "overfit" this test to that regression by adding a workflow to `merl-an` that produces that situation. I don't think that's worth it. For those cases, the normal input-specific end-to-end tests are more suitable.

Sample size
The samples can't cover all edge cases, so we'll always miss things.
To give one example: originally, I included `case-analysis` in the queries that are run. I've checked whether a regression making the server crash on certain `case-analysis` queries would have been caught. It wouldn't have.

Implementation details and their consequences
This is a quick PoC with the goal of gathering some data (see Impact) and discussing the approach/workflow I'm suggesting. There are several implementation details we'll need to address before thinking about merging this. E.g.:
Where does `merl-an` live

The current workflow relies on having `merl-an` installed. Instead, I suggest upstreaming it to Merlin and having Dune build it as part of the test.
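A rough sketch of what that could look like once `merl-an` lives in the Merlin repo — the stanza layout, paths, and test names are assumptions, not what this PR currently does:

```
; Hypothetical dune file next to the fuzzy cram tests: depending on the
; locally built merl-an executable makes dune build it before the tests run.
(cram
 (applies_to category_data full_responses)
 (deps %{exe:merl-an/bin/main.exe}))
```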
Where do the Irmin dependencies come from
The current workflow relies on having the Irmin dependencies installed. If we wanted that workflow, we'd have to add all Irmin dependencies as Merlin test dependencies. Instead, I suggest vendoring Irmin and its dependencies. Note:
CI integration
The CI integration can be taken from PR #1642. We just need to make sure the tests are agnostic to running locally vs in CI.
Filter out noise
Related: by not using `merlin-wrapper`, the tests aren't filtering out the noise coming from e.g. switch-specific and compiler-specific things.

Furthermore, the current [query_num] adds a lot of noise as well: as soon as there has been one server crash, all the following queries will be marked in the diff as well.
Handle the JSON output
Currently, the tests pipe the JSON through `jq`. That makes both the JSON itself and its diffs very readable. However, it makes the files extremely long. We need to find a good balance between strongly formatted JSON (piping through `jq`) and very compact JSON (`merl-an`'s output directly).
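To make the trade-off concrete, the two extremes look roughly like this (the file name is just for illustration):

```sh
# Pretty-printed (current approach): readable JSON and readable diffs,
# but the test files become very long
jq . response.json

# Compact output: one line per JSON value, much smaller files,
# at the cost of harder-to-read diffs
jq -c . response.json
```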
Sample set choice
The current sample choice is already quite conscious and adjusted, and I don't think it's bad (I can give details if that's interesting). However, there are still three ways in which we could improve it:

- Better sample distribution: pitag-ha/merl-an#28.
- Optimize the test code base even more: can it be smaller while catching the same? Bigger while catching more?
- Optimize the number of samples per file: can it be fewer while catching the same? More while catching more?
Queries
E.g.: do we want to include `case-analysis`?

Minor improvements/details
There are several details we could improve if we want to, e.g.:
The `return` category data could be even finer-grained. E.g. when a list is returned, it's interesting to know its size. For example, that's interesting for `complete-prefix`, where changes often consist in having more or fewer suggestions than before.

Also, we could use the dune diff workflow directly instead of cram tests. The cram tests just cat files, t