Automated testing for the stacks-blockchain #3732

kantai · 2023-05-31T16:07:13Z

kantai
May 31, 2023
Maintainer

After the various issues addressed by epochs 2.2, 2.3, and 2.4, a key take-away should be that the stacks-blockchain repository simply does not have enough automated testing in place. While each of the issues addressed could have been caught with better unit tests, knowing a priori which paths must be exercised in which combinations is a hard problem.

There were three main issues uncovered and addressed during this time:

1. pox-2 contained a bug in stack-increase which only surfaces when there are multiple stackers. The existing tests for stack-increase only covered the case of a single stacker. Note that even branch coverage would not have highlighted this issue.
2. The epoch 2.2 activation created an issue for trait invocation compatibility: the trait conversion checks were applied with an epoch equality check on epoch 2.1, meaning that once epoch 2.2 activated, the trait conversions did not work. Again, code coverage would not surface this issue.
3. A bug in the computation of least_supertype led to the type checker allowing construction of tuples via methods like append when the appended tuple contains additional keys. The deserializer rejects structs that are serialized in this way. These code paths not only had high code coverage, but also had fuzz testing in place. SIP-024 and Epoch 2.4 addressed this issue by sanitizing the outputs of these methods, and sanitizing inputs from data sources, contract calls, and transaction arguments.

In this discussion, I want to talk about what kinds of automated testing would have captured these issues without knowing about them beforehand. This is different from regression testing: in regression testing, we simply add unit tests for these cases. I'm arguing that we need to expand the automated testing apparatuses of the stacks-blockchain so that these bugs, and perhaps others, are caught before they're known.

Addressing PoX-2 Bug

Ultimately, the stack-increase bug and the also-addressed "swimming in multiple pools" bug are smart contract bugs. They should be caught using techniques that are applicable to any smart contract. Expanding tests like the contract_tests module in the stacks-blockchain repo or the hirosystems/stacks-2-1-testing tests (which use Stacks DevnetJS) should be able to catch these bugs. However, those tests are all hand written. Instead, the style of testing that should be applied here is something along the lines of property testing (or quickcheck).

There are two options for applying property testing. One is to combine a TS/JS property testing framework with DevnetJS. The second is to use a property testing framework to drive a TestPeer based tester.

DevnetJS Property Testing

DevnetJS property testing would allow us to write property assertions in TypeScript, and then use DevnetJS to run a full end-to-end test with the stacks-node. This has the further benefit that this work would also provide something like a demo case to other contract writers interested in property testing using the Clarinet tools. The downside of this is that DevnetJS tests require some non-trivial build up and tear down: they run a bitcoind regtest node and a stacks-node, and even the simplest tests can take about 30 seconds on speedy machines. Successful property testing must run each attempt very quickly. For complex issues like the SIP-024 issue, ~2000 tests per second are required to find the issue in ~1 hour; at 30 seconds/test, the same number of tests would require 2,500 days. However, the input space of smart contracts is smaller than the input space of least_supertype, so it may be the case that even at 30 seconds/test, a property tester would be able to surface the pox-2 issues quickly.
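The throughput figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function and constant names are made up for illustration, the runs are assumed to be strictly sequential, and the 1 second/test figure is the TestPeer estimate given in the next section.

```typescript
const SECONDS_PER_HOUR = 3600;
const SECONDS_PER_DAY = 86400;

// Number of test cases executed at a given rate over a given duration.
function casesExecuted(casesPerSecond: number, seconds: number): number {
  return casesPerSecond * seconds;
}

// Wall-clock days needed to run `cases` sequentially at `secondsPerCase` each.
function daysToRun(cases: number, secondsPerCase: number): number {
  return (cases * secondsPerCase) / SECONDS_PER_DAY;
}

// ~2000 cases/second for ~1 hour, as quoted for the SIP-024 issue.
const casesNeeded = casesExecuted(2000, SECONDS_PER_HOUR);

// The same number of cases at DevnetJS speed (~30 s) and TestPeer speed (~1 s).
const daysAt30Seconds = daysToRun(casesNeeded, 30);
const daysAt1Second = daysToRun(casesNeeded, 1);

console.log(casesNeeded, daysAt30Seconds, daysAt1Second.toFixed(1));
// prints: 7200000 2500 83.3
```

Running cases in parallel would divide these figures by the number of workers, which is one reason the per-case cost matters more than the raw totals.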

A related option is to use Clarinet's normal test execution (rather than devnetjs) to test the contracts. However, it would require altering Clarinet's handling of "special" contracts like pox-* so that lockups can be sufficiently tested. I do not know how complex this would be, and whether or not it would create a situation where we're more likely to be testing the clarinet integration than the actual pox contract.

TestPeer Property Testing

The other option is to use the TestPeer struct in the stacks-blockchain's rust testing framework and combine it with a rust property testing framework. These tests can be much faster than the DevnetJS tests, but would still be able to exercise pox specific code like lockups and realized reward set payouts. A typical TestPeer based test takes ~1 second on a fast machine, which is much faster than DevnetJS, though it still may not be fast enough for property testing. A downside of this approach is that the TestPeer is not an end-to-end testing struct in the same way that DevnetJS is -- it does not test the neon_node implementation -- and so it is something in between a full e2e test and a unit test. I'm not too concerned about this: the property testing we're worried about is of the contract itself, and not necessarily the integration elements of the neon_node as those must be tested elsewhere. The bigger downsides are that I think this framework will require more implementation effort to get working (input generation is probably the tricky part) and then would have limited benefit to other contract writers: they don't write rust tests for their contracts.

Proposal

1. Implement a property testing framework for pox-2 using DevnetJS and see if it can catch the two pox-2 bugs within 2 hours of execution time.
2. If it cannot, explore the viability of testing pox-2 via normal clarinet execution (the only requirement should be the lock handling).
3. If the above isn't viable, implement a property testing framework for pox-2 using TestPeer, and see if it can catch the two pox-2 bugs within 2 hours.

Addressing the 2.2 trait bug

The 2.2 trait bug should have been uncovered through some automated tests. Adding 2.2 did not alter any related lines of code, so surfacing this issue absolutely would have required some automated testing.

I think there are two possible approaches here:

1. Use the rstest (epoch, clarity_version) testing framework for any new unit tests. This may require further pairing (i.e., a quadratic number of (epoch, epoch) pairs).
2. Speculative execution of existing blocks with the new epoch rules active. Some failures should be expected, while others would be surprising (e.g., in the 2.2 case, every trait invocation failing would have been surprising; pox-2 failing would not have been).

Option 1 is easier, and probably should be done in all cases -- this would simply expand existing unit tests, and place a requirement on future PRs that any epoch-activated feature would need to have unit tests that use rstest templates.
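As an illustration of what running one assertion across every epoch buys, here is a minimal sketch. It is written in TypeScript only to keep a single language for the examples in this thread; the real tests would be Rust rstest templates. The epoch list, function names, and the gate itself are hypothetical stand-ins, not the actual stacks-blockchain code, and the property only applies to features that are not intentionally retired in a later epoch.

```typescript
// Hypothetical epoch list in activation order.
const EPOCHS = ["2.0", "2.05", "2.1", "2.2", "2.3", "2.4"] as const;
type Epoch = (typeof EPOCHS)[number];

// Position of an epoch in activation order.
function ordinal(epoch: Epoch): number {
  return EPOCHS.indexOf(epoch);
}

// Equality gate: the feature is tied to exactly one epoch, so it
// silently switches off as soon as the next epoch activates.
function traitConversionEnabledBuggy(epoch: Epoch): boolean {
  return epoch === "2.1";
}

// Ordering gate: the feature stays on from its activation epoch onward.
function traitConversionEnabledFixed(epoch: Epoch): boolean {
  return ordinal(epoch) >= ordinal("2.1");
}

// Property checked for every epoch: once a gate has been enabled, it must
// stay enabled in all later epochs. Returns the epochs where it regresses.
function epochsWhereGateRegresses(gate: (epoch: Epoch) => boolean): Epoch[] {
  const regressions: Epoch[] = [];
  let seenEnabled = false;
  for (const epoch of EPOCHS) {
    const enabled = gate(epoch);
    if (seenEnabled && !enabled) regressions.push(epoch);
    seenEnabled = seenEnabled || enabled;
  }
  return regressions;
}

console.log(epochsWhereGateRegresses(traitConversionEnabledBuggy)); // [ '2.2', '2.3', '2.4' ]
console.log(epochsWhereGateRegresses(traitConversionEnabledFixed)); // []
```

The point of the sketch is that the test author does not need to know which epoch breaks: adding a new entry to the epoch list is enough for the existing assertion to run against it.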

Option 2 requires a bit more effort -- I did this style of testing manually for 2.4 using the test/replay-block branch, which wasn't too hard, but only because I "faked" it, by making the value_serializing() method return true for all epochs: epoch-2.4 was not actually speculatively activated. Speculatively activating an epoch is possible. However, by itself it invalidates a block: the epoch is stored in the clarity db at the end of the block execution. This isn't too problematic, as long as the testing framework was checking for transaction event mutations (i.e., transaction 1 had events a, b, c and returned x in default rules, but only events a, b and returned y in 2.4 rules).
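The event-mutation check described above could look roughly like the following. This is a sketch under assumptions: the `Receipt` shape, the field names, and the idea of matching by txid are hypothetical, and the real replay tooling would produce richer event data than plain strings.

```typescript
// Hypothetical receipt shape: the events a transaction emitted and its result.
interface Receipt {
  txid: string;
  events: string[];
  result: string;
}

interface Divergence {
  txid: string;
  baseline: Receipt;
  speculative: Receipt;
}

// Two receipts agree when the result and the ordered event list are identical.
function sameReceipt(a: Receipt, b: Receipt): boolean {
  return (
    a.result === b.result &&
    a.events.length === b.events.length &&
    a.events.every((event, i) => event === b.events[i])
  );
}

// Compare receipts from replaying a block under the default rules with
// receipts from replaying it under the speculatively activated epoch.
// Transactions missing from the speculative run are ignored in this sketch.
function diffReceipts(baseline: Receipt[], speculative: Receipt[]): Divergence[] {
  const byTxid = new Map<string, Receipt>();
  for (const receipt of speculative) byTxid.set(receipt.txid, receipt);

  const divergences: Divergence[] = [];
  for (const base of baseline) {
    const spec = byTxid.get(base.txid);
    if (spec !== undefined && !sameReceipt(base, spec)) {
      divergences.push({ txid: base.txid, baseline: base, speculative: spec });
    }
  }
  return divergences;
}

// Usage: transaction 1 loses event "c" and returns "y" under the new rules.
const defaultRun: Receipt[] = [
  { txid: "tx1", events: ["a", "b", "c"], result: "x" },
  { txid: "tx2", events: ["d"], result: "ok" },
];
const speculativeRun: Receipt[] = [
  { txid: "tx1", events: ["a", "b"], result: "y" },
  { txid: "tx2", events: ["d"], result: "ok" },
];
console.log(diffReceipts(defaultRun, speculativeRun).map((d) => d.txid)); // [ 'tx1' ]
```

A human (or an allow-list) would still need to decide which divergences are expected for the epoch under test and which are surprising.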

Proposal

Extend existing trait tests with epoch rstest template. Make sure that the test fails on epoch-2.2, and would have failed in the epoch-2.2 release image.

Addressing the 2.2.0.0.1 pox_2_unlock_height bug

This bug was caught before activating; however, it could have led to an early chain split. The issue was that the pox_2_unlock_height needed to be epoch gated. This necessitated a quick hotfix (i.e., 2.2.0.0.1). This is the kind of bug that could be caught with the genesis sync. For addressing this issue, I think there are two things to do:

1. Expanded clarity integration tests which apply a lock and assert a specific unlock height during epoch 2.1. This would require that the testing setup apply a 2.2 unlock before that specific height. This is really regression testing, because it requires some specific knowledge of the issue to write the test. Mutation testing (more on this later) would perhaps help us surface areas where additional unit tests would be useful.
2. Non-speculative replay block testing combined with account state lookups. Basically, use the replay-block command to replay all blocks (i.e., the work of a genesis sync). If that discovers a surprise invalidation for 2.2.0.0.0, then great, that is sufficient. If that's not the case, we would need to expand this to also do some account lookups, and compare against the expected account value using the prior release. This is harder and more time consuming, but it may be necessary.

Addressing the `least_supertype` bug

Here, property testing (or really, more assertive fuzz testing) would have discovered this issue. The fuzz testing I did for the sanitization support exercises the property at fault for this bug (i.e., the constructed clarity value does not match the expected type); this is implemented in a branch, currently test/sanitize-fuzzing. This fuzzing test has the nice property that it doesn't require any code changes in the Clarity library itself (the arbitrary implementation is done in the fuzzing codebase), meaning that it could be tested against specific versions.
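The property in question ("a constructed value must match its expected type") can be illustrated with a toy model. Everything below is hypothetical and heavily simplified: tuple types are reduced to a list of key names, `admitsLenient` merely stands in for an over-permissive type check, and `admitsStrict` stands in for the deserializer. It is not the Clarity implementation and not the code in the fuzzing branch.

```typescript
// Seeded PRNG (mulberry32) so that a run can be repeated exactly.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type TupleValue = Record<string, number>;
type TupleType = string[]; // toy tuple type: just the declared (distinct) keys
type AdmissionCheck = (type: TupleType, value: TupleValue) => boolean;

// Over-permissive check (deliberate bug): extra keys are tolerated.
const admitsLenient: AdmissionCheck = (type, value) =>
  type.every((key) => key in value);

// Strict check, standing in for the deserializer: keys must match exactly.
const admitsStrict: AdmissionCheck = (type, value) =>
  Object.keys(value).length === type.length && admitsLenient(type, value);

// Toy `append`: adds `value` to `list` only if the admission check passes.
function append(list: TupleValue[], entryType: TupleType, value: TupleValue,
                admits: AdmissionCheck): TupleValue[] | null {
  return admits(entryType, value) ? list.concat([value]) : null;
}

// Build a tuple that contains each key of `universe` with probability 1/2.
function randomTuple(rand: () => number, universe: string[]): TupleValue {
  const value: TupleValue = {};
  for (const key of universe) {
    if (rand() < 0.5) value[key] = Math.floor(rand() * 10);
  }
  return value;
}

// Property: whatever `append` accepts must also pass the strict check.
function findTypeMismatch(admits: AdmissionCheck, runs: number, seed: number): TupleValue | null {
  const rand = mulberry32(seed);
  const entryType: TupleType = ["a", "b"];
  for (let run = 0; run < runs; run++) {
    const candidate = randomTuple(rand, ["a", "b", "c"]);
    const result = append([], entryType, candidate, admits);
    if (result !== null && !result.every((entry) => admitsStrict(entryType, entry))) {
      return candidate; // accepted by append, rejected by the strict check
    }
  }
  return null;
}
```

With the lenient check, `findTypeMismatch(admitsLenient, 200, 1)` is expected to return a tuple with the extra key `c`; with the strict check it returns `null`. The useful part is that the property is stated without knowing which operation violates it.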

In order to deal with this in the future, I think the important thing is to require explicit property testing on any PR units going forward. This requires a lot of thought on the part of the PR submitter -- explicitly enumerating the guarantees that each function should provide. But I think that work is worthwhile and not just for testing.

moodmosaic · 2023-06-05T14:37:11Z

moodmosaic
Jun 5, 2023
Maintainer

> Implement a property testing framework for

(Co-founder of @hedgehogqa here.) Implementing a property testing framework from scratch is rarely a good idea. I would suggest first to see if you can integrate with some of the existing, mature, battle-hardened, tools that are out there.

When I was working on adding property testing (and fuzzing) in clarinet (hirosystems/clarinet#398) I did a fair amount of research and picked fast-check as the underlying library.

It has all the modern features of a prop/fuzz testing library, e.g. model testing, integrated shrinking, control over the scope of generated values, and many other useful functions.

Even though property testing (and fuzz testing) are superior to traditional, example-based, testing, what makes a huge difference is the ability to create a simplified model which you can then compare with the smart contract's state (after executing randomized commands on it).

That is really the technique that has the chance to detect unexpected bugs, and that's what the guys on Ethereum are doing, for example, Echidna (was using Hedgehog ❤️), dapp.tools (was using QuickCheck), foundry (uses proptest), etc. They call it invariant testing.
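To make the model-comparison idea concrete, here is a minimal hand-rolled sketch. It deliberately avoids any library so that it is self-contained; in practice a mature library such as fast-check would replace the loop and the generator. The "pool" and its bug are invented for illustration (a cached total that is only right with a single stacker) and are not the actual pox-2 code or the actual stack-increase bug.

```typescript
// Seeded PRNG (mulberry32) so that a failing run can be reproduced.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Command {
  kind: "stack" | "increase";
  stacker: number;
  amount: number;
}

// Simplified model: per-stacker locked amounts; the total is always recomputed.
class PoolModel {
  private locked = new Map<number, number>();
  apply(cmd: Command): void {
    const current = this.locked.get(cmd.stacker);
    if (cmd.kind === "stack" && current === undefined) {
      this.locked.set(cmd.stacker, cmd.amount);
    } else if (cmd.kind === "increase" && current !== undefined) {
      this.locked.set(cmd.stacker, current + cmd.amount);
    }
  }
  total(): number {
    let sum = 0;
    this.locked.forEach((amount) => { sum += amount; });
    return sum;
  }
}

// Toy system under test: keeps a cached total, with a deliberate bug.
class ToyPool {
  private locked = new Map<number, number>();
  cachedTotal = 0;
  apply(cmd: Command): void {
    const current = this.locked.get(cmd.stacker);
    if (cmd.kind === "stack" && current === undefined) {
      this.locked.set(cmd.stacker, cmd.amount);
      this.cachedTotal += cmd.amount;
    } else if (cmd.kind === "increase" && current !== undefined) {
      const next = current + cmd.amount;
      this.locked.set(cmd.stacker, next);
      this.cachedTotal = next; // BUG: correct only when there is one stacker
    }
  }
}

// Run random command sequences and compare the pool against the model after
// every command. Returns the failing command history, or null if none found.
function findCounterexample(stackers: number, runs: number, seed: number): Command[] | null {
  const rand = mulberry32(seed);
  for (let run = 0; run < runs; run++) {
    const pool = new ToyPool();
    const model = new PoolModel();
    const history: Command[] = [];
    for (let step = 0; step < 20; step++) {
      const cmd: Command = {
        kind: rand() < 0.5 ? "stack" : "increase",
        stacker: Math.floor(rand() * stackers),
        amount: 1 + Math.floor(rand() * 100),
      };
      history.push(cmd);
      pool.apply(cmd);
      model.apply(cmd);
      if (pool.cachedTotal !== model.total()) return history;
    }
  }
  return null;
}
```

With `stackers` set to 1 every branch of `ToyPool` is exercised and the invariant still holds, which mirrors the observation that branch coverage would not have flagged the single-stacker tests. With three stackers, a random sequence is expected to trip the invariant almost immediately.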

In 2022, I've encoded all these techniques pretty much successfully into Hiro's Clarinet and got some working prototypes as well, for example it could detect the bug that's hidden in this clarity contract: https://explorer.hiro.so/txid/0x5864dabc9122732e16fcebd5ddaa727db8614eaee59499967c18011c1ddbd5b8?chain=testnet

I love Stacks, so absolutely feel free to ping me with any questions or any help needed. /cc @igorsyl

1 reply

kantai Jun 7, 2023
Maintainer Author

> > Implement a property testing framework for
>
> (Co-founder of @hedgehogqa here.) Implementing a property testing framework from scratch is rarely a good idea. I would suggest first to see if you can integrate with some of the existing, mature, battle-hardened, tools that are out there.

Totally -- what I meant above is to try to integrate an existing property testing framework into one of the testing systems for contracts: either a Rust property testing framework, or a TS/JS one that integrates with Clarinet or Devnet.JS.

> When I was working on adding property testing (and fuzzing) in clarinet (hirosystems/clarinet#398) I did a fair amount of research and picked fast-check as the underlying library.
>
> It has all the modern features of a prop/fuzz testing library, e.g. model testing, integrated shrinking, control over the scope of generated values, and many other useful functions.
>
> Even though property testing (and fuzz testing) are superior to traditional, example-based, testing, what makes a huge difference is the ability to create a simplified model which you can then compare with the smart contract's state (after executing randomized commands on it).

I agree with this completely. As a bonus, this would also benefit the codebase by providing each component or contract with a set of properties and models that describe their behavior. This would become an additional avenue of "code documentation".

> That is really the technique that has the chance to detect unexpected bugs, and that's what the guys on Ethereum are doing, for example, Echidna (was using Hedgehog ❤️), dapp.tools (was using QuickCheck), foundry (uses proptest), etc. They call it invariant testing.
>
> In 2022, I've encoded all these techniques pretty much successfully into Hiro's Clarinet and got some working prototypes as well, for example it could detect the bug that's hidden in this clarity contract: https://explorer.hiro.so/txid/0x5864dabc9122732e16fcebd5ddaa727db8614eaee59499967c18011c1ddbd5b8?chain=testnet
>
> I love Stacks, so absolutely feel free to ping me with any questions or any help needed. /cc @igorsyl

This is awesome, and we definitely appreciate this! Once we start looking into testing with Clarinet or Devnet.JS, we'll be sure to ping you.

moodmosaic · 2023-06-05T16:06:13Z

moodmosaic
Jun 5, 2023
Maintainer

> explore the viability of testing pox-2 via normal clarinet execution

That's a good option as it allows you to write tests in clarinet and then turn them into prop/fuzz tests (not to be confused with model/invariant tests)

This uses a forked version of clarinet, and it was a prototype, but just to give a rough idea of what things may look like:

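// Prototype property test from a forked Clarinet. `runs`, `logs`, and `data`
// appear to be options of that fork rather than of upstream Clarinet.
Clarinet.test({
  name: "ascii-to-buff",
  runs: 1000, // presumably the number of generated inputs to try
  logs: true,
  data: {
    // Bounds on the generated string that is passed to `fn` as `input`.
    input: {
      minLength: 0,
      maxLength: 127,
    }
  },
  fn(chain: Chain, account: Account, input: string) {
    // Property: the contract's conversion yields the same bytes as Buffer.from.
    chain.callReadOnlyFn(
      "convert7",
      "ascii-to-buff",
      [types.ascii(input)],
      account.address,
    ).result.expectBuff(Buffer.from(input));
  },
});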

You can see the output from that test in PromptECO/clarity-sequence#4 (comment).

0 replies

moodmosaic · 2023-06-06T07:54:52Z

moodmosaic
Jun 6, 2023
Maintainer

> If the above isn't viable, implement a property testing framework for pox-2 using TestPeer

I'm not aware of TestPeer, do you have any references/links?

At the very least it should be able to provide fine-grained control over (1) the scope and (2) shrinking of generated values, and (3) replay of failures.

(1) so you can be explicit about what you're generating (otherwise you may never detect edge cases!)
(2) when an edge case is detected, the library should be able to (at least try) to provide the simplest possible counterexample (shrinking)
(3) a seed that you can feed into the library so that it can replay the failure (this is often useful when a failure is detected overnight as part of the nightly CI build!)

In Rust, proptest should have all the above, and in TypeScript fast-check has all the above (and more).
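For readers unfamiliar with these three capabilities, the following hand-rolled sketch shows each one in miniature. It is not how proptest or fast-check are implemented (both use far more sophisticated, integrated shrinking), and all names here are hypothetical; it only illustrates what scope, shrinking, and seed-based replay mean.

```typescript
// Seeded PRNG (mulberry32): the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// (1) Scope: explicit bounds on what gets generated.
interface Scope {
  maxLength: number;
  maxValue: number;
}

function generate(rand: () => number, scope: Scope): number[] {
  const length = Math.floor(rand() * (scope.maxLength + 1));
  const values: number[] = [];
  for (let i = 0; i < length; i++) {
    values.push(Math.floor(rand() * (scope.maxValue + 1)));
  }
  return values;
}

// (2) Shrinking: greedily drop, halve, or decrement elements for as long as
// the smaller input still fails. Terminates because every accepted step
// reduces the length or the sum of the input.
function shrink(input: number[], fails: (xs: number[]) => boolean): number[] {
  let current = input;
  let improved = true;
  while (improved) {
    improved = false;
    const candidates: number[][] = [];
    current.forEach((x, i) => {
      candidates.push(current.filter((_, j) => j !== i));
      if (x > 0) {
        candidates.push(current.map((y, j) => (j === i ? Math.floor(y / 2) : y)));
        candidates.push(current.map((y, j) => (j === i ? y - 1 : y)));
      }
    });
    for (const candidate of candidates) {
      if (fails(candidate)) {
        current = candidate;
        improved = true;
        break;
      }
    }
  }
  return current;
}

interface Failure {
  seed: number;       // (3) Replay: feed this seed back in to reproduce
  original: number[];
  shrunk: number[];
}

function check(property: (xs: number[]) => boolean, scope: Scope,
               runs: number, seed: number): Failure | null {
  const rand = mulberry32(seed);
  for (let run = 0; run < runs; run++) {
    const input = generate(rand, scope);
    if (!property(input)) {
      return { seed, original: input, shrunk: shrink(input, (xs) => !property(xs)) };
    }
  }
  return null;
}

// Example property that is false for some inputs: "the sum stays below 100".
const sumBelow100 = (xs: number[]): boolean => xs.reduce((a, b) => a + b, 0) < 100;
```

Calling `check(sumBelow100, { maxLength: 10, maxValue: 50 }, 200, 7)` twice gives the same failure both times, and the shrunk input sums to exactly 100. Narrowing the scope to `maxLength: 1` makes the failure unreachable, which is the point of item (1): the scope decides which edge cases can be found at all.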

3 replies

kantai Jun 7, 2023
Maintainer Author

TestPeer is just an implementation of a stacks-node for use in unit tests -- it generates mock bitcoin block data and mines stacks blocks in that mock environment, invoking almost all of the code paths in a nearly identical way to a real stacks-node (so it is something like an integration test, but it is not an end-to-end test).

A lot of the cargo tests in this repo use that struct. For example, the pox-3 tests use it: https://github.com/stacks-network/stacks-blockchain/blob/master/src/chainstate/stacks/boot/pox_3_tests.rs#L565

By itself, it isn't a testing framework, but it could be integrated with a property testing framework. We would need to figure out what the appropriate input spaces for the tests would be: tests for a contract could invoke any of the contract's methods with any inputs, and the input space would also need to cover block boundaries (i.e., are the random transactions in the same block or in multiple blocks). I believe that is tractable.

moodmosaic Jun 7, 2023
Maintainer

It would be interesting to see how shrinking works on this when an edge case is detected. The way I set this up in Clarinet, you'd get all the calls that lead to the bug, e.g.:

[Screenshot of the invariant testing output; not preserved here.]

That's invariant testing, and the clarity contract used in the screenshot is this: https://explorer.hiro.so/txid/0x5864dabc9122732e16fcebd5ddaa727db8614eaee59499967c18011c1ddbd5b8?chain=testnet

moodmosaic Jun 10, 2023
Maintainer

Invariant testing image in its original resolution: https://user-images.githubusercontent.com/287532/244202803-6fe93f2a-897a-4af6-ba4e-cb41c8c093a5.png

moodmosaic · 2023-06-06T20:30:29Z

moodmosaic
Jun 6, 2023
Maintainer

> this is implemented in a branch currently test/sanitize-fuzzing

This seems to be in the right direction, however it is important to be aware of the distribution of test cases: if test data is not well distributed then conclusions drawn from the test results may be invalid.

Does the arbitrary crate alongside libfuzzer_sys for (structure-aware) fuzzing in Rust offer something like this?

QuickCheck, Hedgehog, fast-check (partially), and other libraries with 'Arbitrary'-like¹ functionality offer this. See QuickCheck's docs on that subject, for example, and this talk by John Hughes.

'Arbitrary'-like¹: A pair of a test/fuzz data generator function and a shrinker/reducer function.

1 reply

kantai Jun 7, 2023
Maintainer Author

As far as I know, the arbitrary crate only translates a byte vector into the input type. I don't think it makes any guarantees about the distribution of the test cases (each of which is just a byte vector): the fuzzing library is what controls that. Rust's libfuzzer support uses LLVM's libFuzzer, but AFL could be used instead. libFuzzer is a code-coverage-directed fuzzer, so it tries to explore new code paths and does random exploration between those. I don't know enough about its exact algorithm to tell you if it suffers from distribution problems, but it could.

kantai · 2023-07-12T19:07:52Z

kantai
Jul 12, 2023
Maintainer Author

> If it cannot, explore the viability of testing pox-2 via normal clarinet execution (the only requirement should be the lock handling).

This should be fairly possible!

I opened a PR on the Clarinet repo that does this. I raised some questions about how best to implement it. One thing that came out of it would be that if Clarinet wanted to support this, it would be very nice to refactor out the lock handling code into a new workspace member.

See: hirosystems/clarinet#1074

0 replies
