e2e: Improve Celestia-Node Integration Testing #662

jbowen93 · 2022-04-29T12:15:15Z

Inspired by discussion in #626

The ephemeral cluster as it used for integration testing in celestiaorg/optimint and celestiaorg/ethermint currently stands up a full cluster consisting of:

celestia-app nodes
celestia bridge nodes
celestia light nodes
A DALC
celestia ethermint nodes

We should create an integration test using an ephemeral cluster with celestia app, bridge, light nodes and a DALC. This will allow us to properly test for breakages in the external interfaces of the network.

I would like to leave both the network topology (number of each node type) and the specific tests to run up to the celestia node team.

Wondertan · 2022-04-29T13:26:04Z

Ephemeral cluster:

Only enables spawning a whole network and doesnt allow writing actual tests. There is no analogs to fundamental assert statement to check that something is right or not.
Reauires more work to be considered useful for integrational testing.
Forces to write config first test scenarios, which is inconvenient for developers. Recently, we defined within team that devs are the one who write integration tests.
Creates overhead for developers requiring them to understand devops tools and their apis

The current state of us using Swamp doesn't have the problems above and allows better control on the tests scenarios. The only downside right now is that we don't have a way to spawn the App/Core node for the Swamp, but it is solvable problem that we should stick to.

liamsi · 2022-04-29T14:19:12Z

Even if the ephemeral cluster does not allow for easily adding tests right now, I think simply hard-coding 1-3 very simple stupid tests (e.g. checking writing and reading data roundtrips from ethermint to celestia) would ber very valuable and would have caught the bug that Tomasz found. It is not meant as a replacement for the swamp/intergration tests but a rather complimentary thing. Different from the swamp tests, this could simply run on each release instead. Though for this to be somewhat useful we must have a more frequent releases (of node) than currently.

Wondertan · 2022-04-29T23:37:09Z

Yeah, we could make those complimentary tests on the DALC level/repo and still probably should. Not on the node level.

It is also not very clear what does mean by hardcoding mean, how does it looks like.

This is a concern out if this repo and this issue should be transfered to DALC and runnned on DALC releases.

liamsi · 2022-04-30T00:35:24Z

Yes, but soon™ there will be no DALC repo anymore.

jbowen93 · 2022-04-30T19:24:58Z

The "ephemeral cluster" is a bag of docker compose configs orchestrated by bash scripts. It can be used to stand up just app nodes, app+node, app+node+dalc, app+node+dalc+ethermint, etc. When I say "network topology us up to node team" this is what I mean.

When you say "it doesn't support tests" I think you mean there isn't a defined framework for writing tests. To write tests you:

Start a cluster with your desired topology
Make REST/Rpc calls as an external caller

The goal is to not to write nuanced tests that execute specific code flows with specific parameters. The goal is to run a "smoke test" so we have a general understanding of how well the system works as a whole at a given commit. The information we have available to determine how well things are working will be the same information validators and clients have. If the metrics/logs/analytics we provide are insufficient to know if that network works we should fix that.

Regarding this being more work for devs, I agree completely. But this is the reality of distributed systems work, knowing your code works in isolation or with mocked interfaces is insufficient. Better to pay the cost up front.

Some that could alleviate this pain:

Myself and @Bidon15 work to develop a test harness in go that allows for more configurability when setting up a network (we can use starport/ignite cli as inspiration). This will also allow for a cleaner interface to integrate testing.
Prioritize DevOps hire
Social commitment that this testing is non blocking. Reviewers understand that this is best effort. My strong preference is to run these on every push to main because a large benefit is having a historical record of when things stopped working.

renaynay · 2022-05-02T10:45:00Z

Hey catching up on this now -- last friday, @liamsi @adlerjohn and I briefly discussed release cadence. We haven't yet gotten @Wondertan's input on it but we have a couple options:

release every month (usually aligned with monthly call)
release more frequently (every week - every two weeks)
release once we complete a feature we want to test against our longer-standing dev/testnet

If we choose the option to release more frequently, we could standup the ephemeral cluster on every release of celestia-node instead of every push to main since we push very frequently to main.

One thing to consider is that we need some sort of mechanism to pipe through/display error logs because otherwise we will have to spend time looking through logs to try and catch potential errors, etc (and if we decide to spin up a cluster on every push to main, that's just not realistic).

Please take a look at https://github.com/ethereum/hive for an idea of how geth's integration test setup works. One thing to consider is that the entire test harness runs on every commit to master (which is much less frequent than celestia-node as the project is in maintenance mode). But -- one nice thing is that they have this UI https://hivetests.ethdevops.io/ that displays the results of every test run + logs.

If we want to make our CI useful and not burdensome, we should consider putting a lot of effort into this. Not necessarily create a UI, but I would say a really solid smoke test that tests for the success of our key features:

is the light node able to sample every block in the network
is the full node able to reconstruct every block
can a message be posted and retrieved from network successfully?
can large block be handled

just a couple basic test cases and some way to display success/failure.

jbowen93 · 2022-05-04T17:28:00Z

Renay, thanks so much for the detailed post. I have my Google SDLC bias and am still of the opinion that frequent pushes to main don't require you to reduce the frequency of integration testing.

We are using Github Actions because they're good and free. The current optimint/ethermint docker build takes 5 mins and the ephemeral-cluster test takes 3 minutes and the logs are aggregated for us. Example

You can see a list of runs for a specific job (example) and there's filtering available.

I agree we should put significantly more effort into this but even getting a single job run post PR (once branch is merged to main) that generates a sha tagged docker image and gives us a record of test runs to look would be a big win.

liamsi · 2022-05-14T13:20:36Z

BTW, I am not advocating to use the ephemeral cluster for implementing all e2e testing. If anything some extremely basic tests could be written (via using RPC endpoints for instance: 1) write data to celestia as an app, e.g. ethermint, 2) read that data per namespace again). I think we should maybe revisit this after we have more involved test infra ready (e.g after celestiaorg/test-infra#22).

Wondertan · 2022-05-16T09:25:30Z

Ok, so there was a bit of misunderstanding then. I would like to have an ephemeral cluster as a tool to launch the whole stack quickly, dev on it, and have some examples of applications running on celestia-node. The development of e2e tests on it would be a pita. Running ethermint over it also sounds like a good idea, but outside of this repo and somewhere in our ethermint fork. Celestia-node must not import its downstream dependants.

Bidon15 · 2022-09-20T13:28:27Z

Closing this one in favour of upcoming rollkit/rollkit#518
and #7

jbowen93 self-assigned this Apr 29, 2022

adlerjohn added this to Celestia Node Apr 29, 2022

adlerjohn moved this to TODO in Celestia Node Apr 29, 2022

adlerjohn added kind:improvement e2e e2e testing related labels Apr 29, 2022

jbowen93 added ci:actions ci:docker Related to docker + dockerhub flow labels Apr 29, 2022

Bidon15 closed this as completed Sep 20, 2022

Repository owner moved this from TODO to Done in Celestia Node Sep 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

e2e: Improve Celestia-Node Integration Testing #662

e2e: Improve Celestia-Node Integration Testing #662

jbowen93 commented Apr 29, 2022 •

edited

Loading

Wondertan commented Apr 29, 2022

liamsi commented Apr 29, 2022

Wondertan commented Apr 29, 2022 •

edited

Loading

liamsi commented Apr 30, 2022

jbowen93 commented Apr 30, 2022

renaynay commented May 2, 2022

jbowen93 commented May 4, 2022

liamsi commented May 14, 2022

Wondertan commented May 16, 2022 •

edited

Loading

Bidon15 commented Sep 20, 2022

e2e: Improve Celestia-Node Integration Testing #662

e2e: Improve Celestia-Node Integration Testing #662

Comments

jbowen93 commented Apr 29, 2022 • edited Loading

Wondertan commented Apr 29, 2022

liamsi commented Apr 29, 2022

Wondertan commented Apr 29, 2022 • edited Loading

liamsi commented Apr 30, 2022

jbowen93 commented Apr 30, 2022

renaynay commented May 2, 2022

jbowen93 commented May 4, 2022

liamsi commented May 14, 2022

Wondertan commented May 16, 2022 • edited Loading

Bidon15 commented Sep 20, 2022

jbowen93 commented Apr 29, 2022 •

edited

Loading

Wondertan commented Apr 29, 2022 •

edited

Loading

Wondertan commented May 16, 2022 •

edited

Loading