Automated testing for the stacks-blockchain #3732
Replies: 5 comments 5 replies
-
(Co-founder of @hedgehogqa here.) Implementing a property testing framework from scratch is rarely a good idea. I would suggest first to see if you can integrate with some of the existing, mature, battle-hardened, tools that are out there. When I was working in adding property testing (and fuzzing) in clarinet (hirosystems/clarinet#398) I've done a fair amount of research and picked fast-check as the underlying library. It has all the modern features of a prop/fuzz testing library, e.g. model testing, integrated shrinking, control over the scope of generated values, and many other useful functions. Even though property testing (and fuzz testing) are superior to traditional, example-based, testing, what makes a huge difference is the ability to create a simplified model which you can then compare with the smart contract's state (after executing randomized commands on it). That is really the technique that has the chance to detect unexpected bugs, and that's what the guys on Ethereum are doing, for example, Echidna (was using Hedgehog ❤️), dapp.tools (was using QuickCheck), foundry (uses proptest), etc. They call it invariant testing. In 2022, I've encoded all these techniques pretty much successfully into Hiro's Clarinet and got some working prototypes as well, for example it could detect the bug that's hidden in this clarity contract: https://explorer.hiro.so/txid/0x5864dabc9122732e16fcebd5ddaa727db8614eaee59499967c18011c1ddbd5b8?chain=testnet I love Stacks, so absolutely feel free to ping me with any questions or any help needed. /cc @igorsyl |
Beta Was this translation helpful? Give feedback.
-
That's a good option as it allows you to write tests in clarinet and then turn them into prop/fuzz tests (not to be confused with model/invariant tests) This uses a forked version of clarinet, and it was a prototype, but just to get a rough idea of how things may look like: Clarinet.test({
name: "ascii-to-buff",
runs: 1000,
logs: true,
data: {
input: {
minLength: 0,
maxLength: 127,
}
},
fn(chain: Chain, account: Account, input: string) {
chain.callReadOnlyFn(
"convert7",
"ascii-to-buff",
[types.ascii(input)],
account.address,
).result.expectBuff(Buffer.from(input));
},
}); You can see the output from that test in PromptECO/clarity-sequence#4 (comment). |
Beta Was this translation helpful? Give feedback.
-
I'm not aware of At the very least it should be able to provide fine-grained control over (1) the scope and (2) shrinking of generated values, and (3) replay of failures. (1) so you can be explicit about what you're generating (otherwise you may never detect edge cases!) In Rust, proptest should have all the above, and in TypeScript fast-check has all the above (and more). |
Beta Was this translation helpful? Give feedback.
-
This seems to be in the right direction, however it is important to be aware of the distribution of test cases: if test data is not well distributed then conclusions drawn from the test results may be invalid. Does the QuickCheck, Hedgehog, fast-check (partially), and other libraries with 'Arbitrary'-like1 functionality offer this. See QuickCheck's docs on that subject, for example, and this talk by John Hughes. 'Arbitrary'-like1: A pair of a test/fuzz data generator function and a shrinker/reducer function. |
Beta Was this translation helpful? Give feedback.
-
This should be fairly possible! I opened a PR on the Clarinet repo that does this. I raised some questions about how best to implement it. One thing that came out of it would be that if Clarinet wanted to support this, it would be very nice to refactor out the lock handling code into a new workspace member. |
Beta Was this translation helpful? Give feedback.
-
After the various issues addressed by epochs 2.2, 2.3, and 2.4, a key take-away should be that the stacks-blockchain repository simply does not have enough automated testing in place. While each of the issues addressed could have been caught with better unit tests, knowing a priori which paths must be exercised in which combinations is a hard problem.
There were three main issues uncovered and addressed during this time:
pox-2
contained a bug instacks-increase
which only surfaces when there are multiple stackers. The existing code coverage forstacks-increase
only tested the case of a single stacker. Note that even branch coverage would not have highlighted this issue.least_supertype
led to the type checker allowing construction of tuples via methods likeappend
when the append tuple contains additional keys. The deserializer rejects structs that are serialized in this way. These code paths not only had high code coverage, but also had fuzz testing in place. SIP-024 and Epoch 2.4 addressed this issue by sanitizing the outputs of these methods, and sanitizing inputs from data sources, contract calls, and transaction arguments.In this discussion, I want to talk about what kinds of automated testing would have captured these issues without knowing about them beforehand. This is different from regression testing: in regression testing, we simply add unit tests for these cases. I'm arguing that we need to expand the automated testing apparatuses of the stacks-blockchain so that these bugs, and perhaps others, are caught before they're known.
Addressing PoX-2 Bug
Ultimately, the
stacks-increase
bug and the also-addressed "swimming in multiple pools" bug are smart contract bugs. They should be caught using techniques that are applicable to any smart contracts. Expanding tests like thecontract_tests
module in the stacks-blockchain repo or the hirosystems/stacks-2-1-testing tests (which use Stacks DevnetJS) should be able to catch these bugs. However, those tests are all hand written. Instead, the style of testing that should be applied here is something along the lines of property testing (or quickcheck).There are two options for applying property testing. One is to combine a TS/JS property testing framework with DevnetJS. The second is to use a property testing framework to drive a
TestPeer
based tester.DevnetJS Property Testing
DevnetJS property testing would allow us to write property assertions in typescript, and then use DevnetJS to run a full end-to-end test with the stacks-node. This has the further benefit that this work would also provide something like a demo case to other contract writers interested in property testing using the Clarinet tools. The downside of this is that DevnetJS tests require some non-trivial build up and tear down: they run a bitcoind regtest node and a stacks-node, and even the simplest tests can take about 30 seconds on speedy machines. Successful property testing must run each attempt very quickly. For complex issues like the SIP-024 issue, ~2000 tests per second are required to find the issue in ~ 1 hour; at 30 seconds/test, that would require 2,500 days. However, the input space of smart contracts is smaller than the input space of
least_supertype
, so it may be the case that even at 30 seconds / test, a property tester would be able to surface this issue quickly.A related option is to use Clarinet's normal test execution (rather than devnetjs) to test the contracts. However, it would require altering Clarinet's handling of "special" contracts like
pox-*
so that lockups can be sufficiently tested. I do not know how complex this would be, and whether or not it would create a situation where we're more likely to be testing the clarinet integration than the actual pox contract.TestPeer Property Testing
The other option is to use the
TestPeer
struct in the stacks-blockchain's rust testing framework and combine it with a rust property testing framework. These tests can be much faster than the DevnetJS tests, but would still be able to exercise pox specific code like lockups and realized reward set payouts. A typicalTestPeer
based test takes ~1 second on a fast machine, which is much faster than DevnetJS, though it still may not be fast enough for property testing. A downside of this approach is that theTestPeer
is not an end-to-end testing struct in the same way that DevnetJS is -- it does not test theneon_node
implementation -- and so it is something in between a full e2e test and a unit test. I'm not too concerned about this: the property testing we're worried about is of the contract itself, and not necessarily the integration elements of the neon_node as those must be tested elsewhere. The bigger downsides are that I think this framework will require more implementation effort to get working (input generation is probably the tricky part) and then would have limited benefit to other contract writers: they don't write rust tests for their contracts.Proposal
pox-2
using DevnetJS and see if it can catch the two pox-2 bugs within 2 hours of execution time.pox-2
via normal clarinet execution (the only requirement should be the lock handling).pox-2
usingTestPeer
, and see if it can catch the two pox-2 bugs within 2 hours.Addressing the 2.2 trait bug
The 2.2 trait bug should have been uncovered through some automated tests. Adding 2.2 did not alter any related lines of code, so surfacing this issue absolutely would have required some automated testing.
I think there's two possible approaches here:
rstest
epoch, clarity_version testing framework to create tests for any new unit tests. This may require further pairing (i.e., quadraticepoch, epoch
pairs).Option 1 is easier, and probably should be done in all cases -- this would simply expand existing unit tests, and place a requirement on future PRs that any epoch-activated feature would need to have unit tests that use rstest templates.
Option 2 requires a bit more effort -- I did this style of testing manually for 2.4 using the
test/replay-block
branch, which wasn't too hard, but only because I "faked" it, by making thevalue_serializing()
method return true for all epochs: epoch-2.4 was not actually speculatively activated. Speculatively activating an epoch is possible. However, by itself it invalidates a block: the epoch is stored in the clarity db at the end of the block execution. This isn't too problematic, as long as the testing framework was checking for transaction event mutations (i.e., transaction 1 had eventsa, b, c
and returnedx
in default rules, but only eventsa, b
and returnedy
in 2.4 rules).Proposal
rstest
template. Make sure that the test fails on epoch-2.2, and would have failed in the epoch-2.2 release image.Addressing the 2.2.0.0.1 pox_2_unlock_height bug
This bug was caught before activating, however it could have led to an early chain split. The issue was that the pox_2_unlock_height needed to be epoch gated. This necessitated a quick hotfix (i.e., 2.2.0.0.1). This is the kind of bug that could be caught with the genesis sync. For addressing this issue, I think there's two things to do:
Expanded clarity integration tests which apply a lock and assert a specific unlock height during epoch 2.1. This would require that the testing setup apply a 2.2 unlock before that specific height. This is really regression testing, because it requires some specific knowledge of the issue to write the test. Mutation testing (more on this later) would perhaps help us surface areas where additional unit tests would be useful.
Non-speculative replay block testing combined with account state lookups. Basically, use the
replay-block
command to replay all blocks (i.e., the work of a genesis sync). If that discovers a surprise invalidation for 2.2.0.0.0, then great, that is sufficient. If that's not the case, we would need to expand this to also do some account lookups, and compare against the expected account value using the prior release. This is harder and more time consuming, but it may be necessary.Addressing the
least_supertype
bugHere, property testing (or really, more assertive fuzz testing) would have discovered this issue. The fuzz testing I did for the sanitization support exercises the property at fault for this bug (i.e., the constructed clarity value does not match the expected type), this is implemented in a branch currently test/sanitize-fuzzing. This fuzzing test has the nice property that it doesn't require any code changes in the Clarity library itself (the arbitrary implementation is done in the fuzzing codebase), meaning that it could be tested against specific versions.
In order to deal with this in the future, I think the important thing is to require explicit property testing on any PR units going forward. This requires a lot of thought on the part of the PR submitter -- explicitly enumerating the guarantees that each function should provide. But I think that work is worthwhile and not just for testing.
Beta Was this translation helpful? Give feedback.
All reactions