The idea of fuzzing Relax has been raised in some community meetings, so this issue is meant to discuss how we might approach the problem and how a Relax fuzzer could be implemented. For comparison, please see my ongoing thread about fuzzing Relay (I hope to have an official RFC out as soon as I've worked out some issues with running tests and have implemented a way to cluster stack traces).
Motivation for Fuzzing
Fuzzing is a method for testing compilers by randomly generating input programs en masse and examining the resulting errors when compiling and perhaps running them. Ideally, the program generation would be in some way checked against the language specification to ensure that the generated programs are valid in the language; this would ensure that any error reported must be a compiler bug. Fuzzing would help ensure the robustness of the compiler infrastructure, since our unit tests contain only a small number of handcrafted test cases and are far from exhaustive. By generating more test cases that exercise more language features, we will be able to be more confident in the correctness of the implementation.
General Approach
I propose that Relax fuzzing can be handled in much the same manner as Relay is handled in my proposal. However, some care will have to be taken to address the fact that shape analysis in Relax is best-effort rather than part of the type system and that Relax allows shapes to be checked dynamically (resulting in errors at run time).
At a high level, here are my proposed guiding principles for fuzzing Relax:
Rather than have a single program that handles all fuzzing, we should instead aim to have a fuzzing library that allows for implementing program generators as needed. The library will provide utilities to make it simpler to write program generators; we could conceive of the generators as creating programs according to different policies. For instance, one generator could insist on all shapes being static and on there being no shape errors at run time (which would likely force it to be very conservative), which would make it easier to write test oracles. Another generator could include symbolic shapes but make no guarantees as to whether shapes could fail to match at run time. Still another generator could create programs that purposely fail to type check, to ensure our type system catches failure cases. Separating out the generators will make it simpler to test different cases we care about while keeping the complexity (I hope) manageable.
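To make the library-of-generators idea concrete, here is a minimal sketch of the structure I have in mind, where shared utilities live in a base class and each policy is its own generator. Every name here is hypothetical; none of this is existing TVM code:

```python
import random
from abc import ABC, abstractmethod


class ProgramGenerator(ABC):
    """Hypothetical base class: shared generation utilities would live here,
    while subclasses encode distinct generation policies."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    @abstractmethod
    def generate_program(self):
        """Produce a single test program (e.g., as TVMScript text)."""


class StaticShapeGenerator(ProgramGenerator):
    """Policy: only static shapes and no run-time shape errors, which is
    conservative but makes test oracles easy to write."""

    def generate_program(self):
        # Placeholder body; a real generator would build a Relax module here.
        dims = tuple(self.rng.randint(1, 8) for _ in range(2))
        return f"# program using only the static shape {dims}"


class IllTypedGenerator(ProgramGenerator):
    """Policy: deliberately ill-typed programs, to check that the type
    checker rejects them."""

    def generate_program(self):
        return "# program containing an intentional type mismatch"
```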
In general, we could approach program generation as in the Relay proposal, with a "type-directed" approach: starting with a goal type, we choose an expression that could produce that type and then proceed recursively according to the typing rules to generate any subexpressions until we fill in the entire program. Shapes could be treated analogously, though we may have to be careful in how we reason about operators, given that shapes can contain complex expressions.
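A minimal sketch of that recursion, using a toy type language rather than real Relax types, just to show the "apply typing rules backwards" structure:

```python
import random

rng = random.Random(0)


def gen_expr(goal_type, depth=0):
    """Generate an expression of the given (toy) type by applying typing
    rules backwards: pick a construct that concludes the goal type, then
    recursively generate its premises."""
    # At the depth limit (or with some probability), emit a leaf of the goal type.
    if depth > 3 or rng.random() < 0.3:
        return f"var_of_{goal_type}"
    if goal_type == "Tensor":
        # Rule for add: both operands must themselves be Tensors.
        lhs = gen_expr("Tensor", depth + 1)
        rhs = gen_expr("Tensor", depth + 1)
        return f"add({lhs}, {rhs})"
    if goal_type == "Tuple":
        # Rule for tuple construction: generate each field.
        fields = [gen_expr("Tensor", depth + 1) for _ in range(2)]
        return f"({', '.join(fields)})"
    return f"var_of_{goal_type}"


print(gen_expr("Tensor"))
```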
Similarly to the type "solvers" in my Relay fuzzing proposal, we should have a registry of type and shape solvers for Relax operators. The typing rules are generally a lot easier than those in Relay, but we may need more reasoning for dealing with shapes (especially symbolic ones). The goal of these solvers would be to apply typing and shape rules backwards, to allow us to start from a goal type/shape and figure out what arguments we could use. (I don't see a clear way to apply the existing rules in the "forward" direction while still guaranteeing that the final results will work, though I would be happy to hear any suggestions for how it could be done.) As in the Relay fuzzing proposal, the registries of solvers should allow for registering multiple different solvers that implement different policies, to allow for testing different functionality.
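A hypothetical shape for such a registry, with multiple "backward" solvers per operator so that different policies can coexist (again, this is a sketch, not existing TVM code):

```python
import random

# Hypothetical registry mapping operator names to lists of backward solvers.
# Each solver takes a goal result shape and proposes argument shapes that
# would yield it.
SOLVER_REGISTRY = {}


def register_solver(op_name):
    def wrap(fn):
        SOLVER_REGISTRY.setdefault(op_name, []).append(fn)
        return fn
    return wrap


@register_solver("add")
def add_same_shape(goal_shape):
    # Simplest policy: both arguments share the goal shape exactly.
    return [goal_shape, goal_shape]


@register_solver("add")
def add_broadcast_scalar(goal_shape):
    # Alternative policy: broadcast a scalar against the goal shape.
    return [goal_shape, ()]


def solve(op_name, goal_shape, rng):
    # Choose among the registered solvers; a policy could also filter them.
    return rng.choice(SOLVER_REGISTRY[op_name])(goal_shape)


print(solve("add", (3, 4), random.Random(0)))
```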
When possible, we should make generation probabilities for different grammar constructs configurable. This would allow for implementing different generation policies just by changing parameters.
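One lightweight way to expose those knobs, purely as an illustration: keep per-construct weights in a table that the generator samples from, so that a new policy is just a new weight table.

```python
import random

# Illustrative default weights for choosing the next grammar construct.
DEFAULT_WEIGHTS = {"call": 4, "tuple": 2, "if": 1, "const": 3}


def choose_construct(weights, rng):
    constructs, w = zip(*weights.items())
    return rng.choices(constructs, weights=w, k=1)[0]


# A policy that never emits control flow is just a different table.
no_branching = dict(DEFAULT_WEIGHTS, **{"if": 0})
print(choose_construct(no_branching, random.Random(42)))
```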
We could imagine two disciplines for fuzzing. The first is small-scale fuzzing included as part of the CI: for example, if you implement a new operator, you could include a small number of randomly generated tests for that op, though we should fix the random seeds to avoid flakiness. The second is larger-scale program generation to test more complicated conditions; this might be more expensive and could instead be run periodically (once every N commits, or as a "dashboard" entirely outside of CI).
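To illustrate the CI-scale discipline, a fixed-seed test for a single operator might look like the following sketch, where both `compiled_add` and `reference_add` are stand-ins (a real test would invoke the compiler under test and a reference implementation):

```python
import random

import pytest

SEEDS = [0, 1, 2]  # fixed seeds keep these tests deterministic in CI


def reference_add(a, b):
    """Stand-in for a reference implementation of the operator."""
    return [x + y for x, y in zip(a, b)]


def compiled_add(a, b):
    """Stand-in for running the operator through the compiler under test."""
    return [x + y for x, y in zip(a, b)]


@pytest.mark.parametrize("seed", SEEDS)
def test_fuzz_add_against_reference(seed):
    rng = random.Random(seed)
    a = [rng.uniform(-1, 1) for _ in range(8)]
    b = [rng.uniform(-1, 1) for _ in range(8)]
    assert compiled_add(a, b) == pytest.approx(reference_add(a, b))
```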
We could reuse some of the infrastructure in my Relay fuzzing POC and thus focus only on the peculiarities of Relax, particularly if the Relay fuzzer is adopted in mainline.
Test Oracles and Generation Policies for Using Them
Type-checking: Given a well-typed program, ensure that the compiler raises no type errors (precondition: generate a well-typed program)
Rejecting wrong types: Given a badly typed program, ensure the compiler rejects it (precondition: generate a program known to be incorrectly typed)
Parser round-tripping: Given a Relax program, ensure that pretty-printing it and parsing the result returns the original program (precondition: any program should suffice). A sketch of this oracle appears after this list.
Normalization: Ensure that normalization succeeds and does not affect the run-time semantics of a Relax program (precondition: any program should suffice, though we would need a reference interpreter; programs known to be free of shape errors would be preferable).
Differential equality: Given a reference interpreter, ensure that our preferred Relax compilers do not affect the run-time semantics of a given Relax program (precondition: any program should suffice; we may want to test programs that we expect to fail at run time separately from programs we expect not to fail at run time).
Testing individual operators: Generate programs consisting of single calls to operators to validate the operators against reference implementations. This could also be used to test their typing and shape rules (including by generating input types and shapes known to fail).
This is not an exhaustive list, but it gives some sense of the conditions we could test with fuzzing and what program generators we would need to implement to handle them.
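As one concrete example, here is what the parser round-tripping oracle from the list above might look like. The printer, parser, and structural-equality check are passed in as parameters because their exact entry points are not settled; the names here are placeholders, not existing APIs:

```python
def roundtrip_oracle(module, print_relax, parse_relax, structurally_equal):
    """Check that pretty-printing and re-parsing preserves the program.

    `print_relax`, `parse_relax`, and `structurally_equal` are hypothetical
    stand-ins for the TVMScript printer, parser, and a structural-equality
    check, respectively.
    """
    printed = print_relax(module)
    reparsed = parse_relax(printed)
    assert structurally_equal(module, reparsed), (
        "round-trip changed the program:\n" + printed
    )
```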
Side Note: Fuzzing the TVMScript Parser
So far I have discussed generating Relax programs, but it would also be important for us to test our TVMScript parser. Parser round-tripping from the Relax side is one way to do this, but it would account only for test cases corresponding to valid Relax programs. We would likely also want to test error cases, and there are different ways we could approach that. The most robust would be implementing a Python AST generator that produces both ASTs we expect the parser to handle and ASTs we expect it to reject. A simpler approach that might not be as comprehensive would be generating Relax ASTs, pretty-printing them into Python, and then artificially injecting faults into the Python AST to test error handling.
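A sketch of the simpler fault-injection approach, using only the standard library (Python 3.9+ for `ast.unparse`); in practice we would mutate the pretty-printed Relax program rather than an arbitrary snippet:

```python
import ast
import random


def inject_fault(source, seed=0):
    """Parse Python source, rename one identifier to something undefined,
    and unparse the result. A well-behaved parser should reject the mutated
    program with a clean diagnostic rather than crash."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    names = [node for node in ast.walk(tree) if isinstance(node, ast.Name)]
    if names:
        rng.choice(names).id = "_undefined_name_"
    return ast.unparse(tree)


print(inject_fault("def f(x):\n    return x + x\n"))
```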
Complications
In my opinion, there are three significant impediments to pursuing Relax fuzzing, and they are related to each other:
The specification is still very much in flux, making it hard to determine what "should" happen for a given program.
Many correctness conditions would be hard to check without a reference interpreter, which would require agreement on the specification as well as some effort to implement (hopefully not very much, given that a reference interpreter is supposed to be simple).
Related to a reference interpreter, even if we have agreement on the language specification, it is not obvious what to use for reference implementations of operators, though we have some options: using a basic TIR implementation of each operator, calling out to external libraries, etc.
I think our ability to test interesting conditions with fuzzing would be relatively limited without a reference interpreter, though we would still be able to check basic properties, such as type/shape checking succeeding, parser round-tripping, and compiler passes not crashing. For fuzzing in the near term, we could stick to testing only these simpler conditions (I will start writing proof-of-concept fuzzers), though we would eventually want deeper validation.
Further Points for Discussion
Many of the same community-related concerns that apply to the Relay fuzzing proposal apply here as well:
How should fuzzing results be presented to the community? Who will be responsible for examining them? It is easy to generate programs on a machine and run test oracles on them, but we are likely to encounter bugs and should have some way to present them to the community members who will debug them.
How often should large-scale fuzzing trials be run? We wouldn't want to create too much of a burden on the CI infrastructure. In general, we should have some way to make governance decisions about the fuzzing infrastructure.
It will likely be necessary to ensure new operators or language features are also incorporated into the fuzzers.
In terms of infrastructure in general, where should the fuzzing infrastructure be located in the repo? I've taken a rather ad hoc approach with the Relay fuzzer prototype, placing it in python/tvm/relay/testing/fuzzing with infrastructure scripts in tests/scripts/relay_fuzzing, but I wonder if there might be better places for such things.