refactor!: black-box integration tests #5087

0x009922 · 2024-09-19T08:47:27Z

Context

Very early draft, work in progress!

iroha_test_network has a few problems:

- It doesn't test Iroha as a black box; instead, it runs peers directly, relying on internal implementation details.
- The execution flow of peers is messy and hard to debug and control (e.g. on restarts). Logs from all peers are mixed together and very noisy.
- Messy API: it is hard to fine-tune the network and peers for specific cases; there is high coupling with defaults; thread::sleep calls with the pipeline time are used instead of precise lifecycle hooks.
- Non-optimal defaults: most tests don't need 4 peers with 4s of pipeline time. A single peer with close-to-zero block time runs significantly faster.

Black-boxing means running Iroha through its CLI as a separate process, just as users do.

Solution

Re-implement iroha_test_network:

- Black-boxing: run irohad as a direct child process.
  - Create a temp dir for every randomly generated peer and store its state, configs, and logs there. They can be inspected (and maybe even replayed??) after the tests run.
  - Configuration through raw TOML: feel what your users feel!
  - Works with any irohad build profile (e.g. debug, release).
- Allocate ports automatically (fslock-based solution): no more manual port assignment. Works within a single process (cargo test) and across processes (cargo nextest, see Use Nextest #4987).
- Higher-level, dynamic, flexible, async-first API that helps in writing expressive tests.
- Optimal defaults: a single peer with very short timings.
- Faster!
- Bonus: test execution can now be terminated immediately with SIGINT (Ctrl + C). Previously, it would hang.
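The per-peer directory and automatic port allocation described above can be sketched with std alone. This is a minimal illustration under assumptions, not the iroha_test_network implementation: PeerDir, prepare_peer_dir, and free_port are hypothetical names, the file layout (config.toml, storage/, stdout.log) is invented for the example, and free_port binds to port 0 instead of using the fslock-based allocator the PR describes, so it is racy across processes.

```rust
use std::fs;
use std::io;
use std::net::{Ipv4Addr, TcpListener};
use std::path::{Path, PathBuf};

/// Hypothetical on-disk layout of one peer's working directory.
#[derive(Debug)]
pub struct PeerDir {
    pub root: PathBuf,
    pub config: PathBuf,
    pub storage: PathBuf,
    pub log: PathBuf,
}

/// Asks the OS for a currently free TCP port by binding to port 0.
/// The port is released when `listener` is dropped, so another process may
/// take it before the peer binds it; this is a simplified stand-in for the
/// PR's fslock-based allocator, which coordinates across processes.
pub fn free_port() -> io::Result<u16> {
    let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    Ok(listener.local_addr()?.port())
}

/// Creates `<base>/<peer_name>/` holding the raw TOML config, an empty
/// storage directory, and a reserved path for the peer's captured output.
pub fn prepare_peer_dir(base: &Path, peer_name: &str, config_toml: &str) -> io::Result<PeerDir> {
    let root = base.join(peer_name);
    let storage = root.join("storage");
    fs::create_dir_all(&storage)?; // also creates `root`
    let config = root.join("config.toml");
    fs::write(&config, config_toml)?;
    let log = root.join("stdout.log");
    Ok(PeerDir { root, config, storage, log })
}
```

A caller would pick a base such as std::env::temp_dir() joined with a unique run name, allocate one port per peer, and render those ports into each peer's raw TOML before spawning irohad. Keeping everything under one directory per peer is what makes the state and logs easy to inspect after a run.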

Other changes:

- It turns out some integration tests were broken but passed for unrelated reasons, primarily because of the messy test network API.
- Make irohad a binary-only crate: don't expose Iroha as a library and remove samples from it. Closes [suggestion] Refactor Iroha CLI #4136.
- Minor changes in iroha_core and iroha_torii.
- Remove unstable_network tests: they directly violate black-boxing by using FreezeStatus to make peers faulty. To be re-implemented in some other way.

Further steps

- [suggestion] Fully asynchronous client API #3130 is a must.
- Make iroha_test_network an executable and use it in pytests and SDKs.
- Move tests of business logic from black-boxed integration tests to iroha_core itself.

Flaky tests

- restarted_peer_should_have_the_same_asset_amount: possibly due to a bug in Iroha? See the FIXME comment there.
- extra_functional::connected_peers::* sometimes fail due to [BUG] Sumeragi panics with "index out of bounds" message after unregistering a peer #5104.

Migration Guide (optional)

TODO

Review notes (optional)

Because Client is still blocking, there is some ugly code with spawn_blocking.

Checklist

I've read CONTRIBUTING.md.
(optional) I've written unit tests for the code changes.
All review comments have been resolved.
All CI checks pass.

Undraft:

Update all integration tests
~~Update~~ remove benches and examples
Support running from CI (set irohad binary path via ENV?)
Stabilise flaky tests


Erigara · 2024-09-20T10:21:45Z

crates/iroha/tests/integration/extra_functional/multiple_blocks_created.rs

+    tokio::spawn(async move {
+        while let Ok(e) = events.recv().await {
+            match e {
+                PeerLifecycleEvent::LogBlockCommitted { height } => {
+                    println!("Last peer block committed: {height}")
+                }
+                _ => {}
+            }
+        }
+    });


This task is not synchronized with the main one.
Is its only purpose to print messages?

Also, the logger is no longer available in integration tests, right?

Yes, this is only for debugging. It sometimes fails because the timeout is too short (which is convenient for local development). This needs further adjustment.
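An alternative to a detached printing task is to have the test itself wait on the event stream with an explicit deadline, so a too-short timeout surfaces as an error in the test rather than in a background task. A sketch under assumptions: it uses a std mpsc channel instead of the async channel in the snippet above, and PeerEvent, WaitError, and wait_for_height are hypothetical, loosely modelled on PeerLifecycleEvent::LogBlockCommitted.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical peer event, loosely modelled on `PeerLifecycleEvent::LogBlockCommitted`.
#[derive(Debug, Clone, PartialEq)]
pub enum PeerEvent {
    BlockCommitted { height: u64 },
    Terminated,
}

#[derive(Debug, PartialEq)]
pub enum WaitError {
    Timeout,
    PeerTerminated,
    ChannelClosed,
}

/// Blocks until a committed block at height >= `target` is reported,
/// returning that height, or fails once `timeout` has elapsed overall.
pub fn wait_for_height(
    events: &Receiver<PeerEvent>,
    target: u64,
    timeout: Duration,
) -> Result<u64, WaitError> {
    let deadline = Instant::now() + timeout;
    loop {
        // Shrink the per-recv timeout so the total wait never exceeds `timeout`.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(PeerEvent::BlockCommitted { height }) if height >= target => return Ok(height),
            Ok(PeerEvent::BlockCommitted { .. }) => {} // lower block, keep waiting
            Ok(PeerEvent::Terminated) => return Err(WaitError::PeerTerminated),
            Err(RecvTimeoutError::Timeout) => return Err(WaitError::Timeout),
            Err(RecvTimeoutError::Disconnected) => return Err(WaitError::ChannelClosed),
        }
    }
}
```

Because the wait happens on the test's own control flow, the timeout becomes a visible, tunable parameter of each test instead of a hidden property of a background task.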

Also, the logger is no longer available in integration tests, right?

The test_logger() setup function is still available and works as before, but I want to discourage its usage, mainly because it requires initialisation before logs can be seen, which adds a layer of complexity to already complex tests.

If more advanced logging is needed (beyond println!), we can use the log crate, for example. tracing is a bit too heavy imo.


Erigara · 2024-09-20T12:57:25Z

crates/iroha_test_network/src/lib.rs

+const PEER_KILL_TIMEOUT: Duration = Duration::from_secs(2);
+
+// TODO: read from ENV?
+const IROHA_BIN: &'static str = "/Users/qua/dev/iroha/target/release/irohad";


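Can we do something like this?

// CARGO_MANIFEST_DIR is crates/iroha_test_network; the target dir lives at the workspace root
const IROHA_BIN: &str = concat!(env!("CARGO_MANIFEST_DIR"), "/../../target/release/irohad");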

Sure, that would be better, though still temporary. I plan to:

Add ability to switch between debug and release

Add a check before executing Iroha that the binary exists and give a recommendation to run cargo build --bin irohad.

Or maybe even run cargo build?

Replaced with a which lookup, i.e. irohad must be available in $PATH before running the test network.

The easiest way is to run cargo install --path crates/irohad.
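A rough std-only equivalent of the which lookup, with an optional explicit override, might look like this. Assumptions: resolve_irohad is a hypothetical helper, taking the override from IROHAD_EXEC (the variable the old CI step exported) is only one possible way to address the "set irohad binary path via ENV?" item, and unlike which, it only checks that the file exists, not that it is executable.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Resolves the `irohad` binary. An explicit override path wins; otherwise each
/// directory in a PATH-style list is searched in order, similar to `which`.
pub fn resolve_irohad(
    override_path: Option<&OsStr>,
    path_var: Option<&OsStr>,
) -> Result<PathBuf, String> {
    if let Some(path) = override_path {
        let path = PathBuf::from(path);
        return if path.is_file() {
            Ok(path)
        } else {
            Err(format!("irohad override does not exist: {}", path.display()))
        };
    }
    // `irohad` on Unix, `irohad.exe` on Windows.
    let exe_name = format!("irohad{}", env::consts::EXE_SUFFIX);
    path_var
        .into_iter()
        .flat_map(|paths| env::split_paths(paths))
        .map(|dir| dir.join(&exe_name))
        .find(|candidate| candidate.is_file()) // does not check the executable bit
        .ok_or_else(|| {
            "irohad not found in $PATH; try `cargo install --path crates/irohad`".to_string()
        })
}
```

In the test network it could be called as resolve_irohad(std::env::var_os("IROHAD_EXEC").as_deref(), std::env::var_os("PATH").as_deref()). Taking both values as parameters instead of reading the environment inside keeps the helper easy to test.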


mversic · 2024-09-24T05:58:23Z

crates/iroha/tests/integration/asset.rs

+    let (network, _rt) = NetworkBuilder::new().start_blocking().unwrap();
+    let test_client = network.client();


I suppose a network with 1 peer is created by default? I think that's ok. No need to test consensus every time

However, I hope that network.client() returns a random client when a multi-node network is started

I suppose a network with 1 peer is created by default? I think that's ok. No need to test consensus every time

Yes, it is.

However, I hope that network.client() returns a random client when a multi-node network is started

Currently it doesn't, but in light of your other comment about returning random peers by default, it will.

Made network.client() and network.peer() both return random ones.
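Picking a random peer for network.client() and network.peer() spreads test traffic across peers in multi-peer networks instead of always hitting the same one. A dependency-free sketch: Network, peer, and random_index are hypothetical, and hashing with a fresh RandomState is only a crude stand-in for a proper RNG such as the rand crate; it is not how iroha_test_network does it.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Returns a pseudo-random index in `0..len`. Each `RandomState::new()` is
/// initialized with random keys, so hashing the same value yields varying
/// results. Crude, but enough for spreading load across peers.
fn random_index(len: usize) -> usize {
    assert!(len > 0, "network has no peers");
    let mut hasher = RandomState::new().build_hasher();
    hasher.write_usize(len);
    (hasher.finish() % len as u64) as usize
}

/// Hypothetical test network that hands out a random peer on each call.
pub struct Network<P> {
    peers: Vec<P>,
}

impl<P> Network<P> {
    pub fn new(peers: Vec<P>) -> Self {
        assert!(!peers.is_empty(), "network needs at least one peer");
        Self { peers }
    }

    /// Any peer, chosen at random, so multi-peer tests do not always hit the same node.
    pub fn peer(&self) -> &P {
        &self.peers[random_index(self.peers.len())]
    }

    pub fn peers(&self) -> &[P] {
        &self.peers
    }
}
```

With a single-peer network (the default), peer() always returns that one peer, so existing single-peer tests behave the same.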


0x009922 · 2024-09-26T03:46:21Z

With a hot start (all WASMs pre-built) and a release build of Iroha, all integration tests now take less than 30 seconds to complete. Most of them complete within 10 seconds. Time trigger tests take a long time and need a more thorough rewrite (which I am not going to do here).

Some tests are still flaky, though (e.g. restarted_peer_should_have_the_same_asset_amount, network_stable_after_add_and_after_remove_peer, and the long-running multiple_blocks_created).

nxsaken · 2024-09-27T07:26:52Z

.github/workflows/iroha2-dev-pr.yml

@@ -76,8 +76,7 @@ jobs:
          path: ${{ env.DOCKER_COMPOSE_PATH }}
      - name: Run tests, with coverage
        run: |
-          cargo build --bin irohad
-          export IROHAD_EXEC=$(realpath ./target/debug/irohad)
+          mold --run cargo install --path ./crates/irohad 


If we end up fixing the cache, this will likely break without a `which irohad ||` guard, because cargo install returns an error if the binary is already installed.

Will do, but it doesn't currently work anyway =(

I have yet to investigate why (the logs take 200 MB).

Added a `which irohad` check.

cargo install works, by the way. But I got a huge number of "multiple items" errors in Rust std while it compiles WASMs/UI tests; I haven't figured it out yet.

0x009922 · 2024-10-01T09:00:49Z

Investigation: CI seems to fail due to rust-lang/wg-cargo-std-aware#68. iroha_wasm_builder enables the build-std feature and also silently tries to remove instrument-coverage. I guess that stopped working (I don't know why yet).

https://github.com/0x009922/iroha/blob/6926c61f021f514f29ad6f547274726733ff9128/crates/iroha_wasm_builder/src/lib.rs#L195-L196

https://github.com/0x009922/iroha/blob/6926c61f021f514f29ad6f547274726733ff9128/crates/iroha_wasm_builder/src/lib.rs#L370-L374
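For context, one way to silently drop the coverage flag is to filter it out of CARGO_ENCODED_RUSTFLAGS, where cargo separates flags with the 0x1f (ASCII unit separator) byte. This is a hypothetical sketch, not the code at the linked lines: strip_instrument_coverage is an invented name, and it only handles the -Cinstrument-coverage and -C instrument-coverage spellings.

```rust
/// Separator cargo places between flags in `CARGO_ENCODED_RUSTFLAGS` (ASCII unit separator).
const SEP: &str = "\x1f";

/// Returns the encoded flag list without `-Cinstrument-coverage` or the
/// two-token `-C instrument-coverage` form; all other flags keep their order.
pub fn strip_instrument_coverage(encoded: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut tokens = encoded.split(SEP).filter(|t| !t.is_empty()).peekable();
    while let Some(token) = tokens.next() {
        if token == "-Cinstrument-coverage" {
            continue;
        }
        if token == "-C" && tokens.peek() == Some(&"instrument-coverage") {
            tokens.next(); // skip the value token too
            continue;
        }
        kept.push(token);
    }
    kept.join(SEP)
}
```

Since nothing is reported when the flag is dropped, a change in how the flag reaches the builder can make this kind of filtering stop working without any visible error.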


0x009922 · 2024-10-02T23:25:56Z

This PR is becoming more and more God-like: it does too many things.

After fixing CI, I am considering splitting it into smaller PRs as much as possible.


0x009922 · 2024-10-03T01:52:31Z

Finally! All checks are truly green!

CI has successfully run all workspace tests with no default features in a single run within 53 seconds!

     Summary [  52.830s] 614 tests run: 614 passed (1 flaky), 9 skipped
   FLAKY 2/3 [   3.059s] iroha::mod integration::extra_functional::connected_peers::connected_peers_with_f_2_1_2

Same with all features enabled:

     Summary [  53.448s] 615 tests run: 615 passed (2 flaky), 9 skipped
   FLAKY 3/3 [   2.463s] iroha::mod integration::extra_functional::connected_peers::connected_peers_with_f_1_0_1
   FLAKY 2/3 [   2.899s] iroha::mod integration::extra_functional::connected_peers::connected_peers_with_f_2_1_2

Note: these tests are expected to be flaky due to #5104.

0x009922 · 2024-10-03T02:19:11Z

Closing this PR in order to split it into approximately these smaller chunks:

- refactor!: black-box integration tests (still huge, with all test changes and the irohad revamp)
- build(wasm_samples): don't run iroha_wasm_builder from tests, pre-compile WASM samples
- refactor(iroha_torii): single TCP listener
- feat(iroha_core): graceful shutdown without network packets
- fix(irohad, iroha_core): compile without default features
- fix(iroha_config): broken trusted peers check

0x009922 added Enhancement New feature or request Tests labels Sep 19, 2024

0x009922 self-assigned this Sep 19, 2024

github-actions bot added the api-changes Changes in the API for client libraries label Sep 19, 2024

Erigara assigned Erigara and unassigned Erigara Sep 19, 2024

0x009922 force-pushed the black-box-test-network branch from 2cc8911 to bf482f8 September 20, 2024 04:54

nxsaken reviewed Sep 20, 2024


crates/iroha/Cargo.toml (resolved)

Erigara mentioned this pull request Sep 20, 2024

fix: fix flaky unstable network tests #5013 (closed)

Erigara reviewed Sep 20, 2024


mversic reviewed Sep 23, 2024


crates/iroha_test_network/Cargo.toml (outdated, resolved)

0x009922 force-pushed the black-box-test-network branch from c71c692 to 3e86d7b September 24, 2024 05:07

mversic reviewed Sep 24, 2024


crates/iroha/tests/integration/asset.rs (resolved)

mversic reviewed Sep 24, 2024


crates/iroha_test_network/src/lib.rs (outdated, resolved)

0x009922 mentioned this pull request Sep 26, 2024

[BUG] Sumeragi panics with "index out of bounds" message after unregistering a peer #5104 (closed)

0x009922 force-pushed the black-box-test-network branch from 970429a to 2cc42a2 September 26, 2024 03:41

nxsaken reviewed Sep 27, 2024


0x009922 force-pushed the black-box-test-network branch from 657617d to 574e713 September 30, 2024 02:52

Erigara assigned Erigara and mversic Sep 30, 2024

0x009922 assigned nxsaken Oct 1, 2024

0x009922 force-pushed the black-box-test-network branch from 7124982 to 9d61ee4 October 1, 2024 07:54

0x009922 force-pushed the black-box-test-network branch from d4ccbbe to 8e5b726 October 2, 2024 08:24

0x009922 changed the title ~~refactor!: re-implement iroha_test_network~~ refactor!: black-box integration tests Oct 2, 2024

0x009922 added 2 commits October 2, 2024 17:25

refactor!: black-box integration tests

c304e9e


fix: lints, correct upload of executor.wasm

9450aec


0x009922 force-pushed the black-box-test-network branch from 8e5b726 to 9450aec October 2, 2024 23:25

0x009922 added 2 commits October 3, 2024 08:55

fix: make iroha_core compile without telemetry feature

7176e92


ci: copy executor from script; debug; single test command

481c789


0x009922 unassigned nxsaken, Erigara and mversic Oct 3, 2024

ci: remove debug, enable full tests

8446547

And remove extra `iroha_wasm_builder` dependency

0x009922 closed this Oct 3, 2024
