Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e: Improve Celestia-Node Integration Testing #662

Closed
jbowen93 opened this issue Apr 29, 2022 · 10 comments
Closed

e2e: Improve Celestia-Node Integration Testing #662

jbowen93 opened this issue Apr 29, 2022 · 10 comments
Assignees
Labels
ci:actions ci:docker Related to docker + dockerhub flow e2e e2e testing related

Comments

@jbowen93
Copy link
Contributor

jbowen93 commented Apr 29, 2022

Inspired by discussion in #626

The ephemeral cluster as it used for integration testing in celestiaorg/optimint and celestiaorg/ethermint currently stands up a full cluster consisting of:

  • celestia-app nodes
  • celestia bridge nodes
  • celestia light nodes
  • A DALC
  • celestia ethermint nodes

We should create an integration test using an ephemeral cluster with celestia app, bridge, light nodes and a DALC. This will allow us to properly test for breakages in the external interfaces of the network.

I would like to leave both the network topology (number of each node type) and the specific tests to run up to the celestia node team.

@jbowen93 jbowen93 self-assigned this Apr 29, 2022
@adlerjohn adlerjohn moved this to TODO in Celestia Node Apr 29, 2022
@adlerjohn adlerjohn added kind:improvement e2e e2e testing related labels Apr 29, 2022
@jbowen93 jbowen93 added ci:actions ci:docker Related to docker + dockerhub flow labels Apr 29, 2022
@Wondertan
Copy link
Member

Ephemeral cluster:

  • Only enables spawning a whole network and doesnt allow writing actual tests. There is no analogs to fundamental assert statement to check that something is right or not.
  • Reauires more work to be considered useful for integrational testing.
  • Forces to write config first test scenarios, which is inconvenient for developers. Recently, we defined within team that devs are the one who write integration tests.
  • Creates overhead for developers requiring them to understand devops tools and their apis

The current state of us using Swamp doesn't have the problems above and allows better control on the tests scenarios. The only downside right now is that we don't have a way to spawn the App/Core node for the Swamp, but it is solvable problem that we should stick to.

@liamsi
Copy link
Member

liamsi commented Apr 29, 2022

Even if the ephemeral cluster does not allow for easily adding tests right now, I think simply hard-coding 1-3 very simple stupid tests (e.g. checking writing and reading data roundtrips from ethermint to celestia) would ber very valuable and would have caught the bug that Tomasz found. It is not meant as a replacement for the swamp/intergration tests but a rather complimentary thing. Different from the swamp tests, this could simply run on each release instead. Though for this to be somewhat useful we must have a more frequent releases (of node) than currently.

@Wondertan
Copy link
Member

Wondertan commented Apr 29, 2022

Yeah, we could make those complimentary tests on the DALC level/repo and still probably should. Not on the node level.

It is also not very clear what does mean by hardcoding mean, how does it looks like.

This is a concern out if this repo and this issue should be transfered to DALC and runnned on DALC releases.

@liamsi
Copy link
Member

liamsi commented Apr 30, 2022

Yes, but soon™ there will be no DALC repo anymore.

@jbowen93
Copy link
Contributor Author

The "ephemeral cluster" is a bag of docker compose configs orchestrated by bash scripts. It can be used to stand up just app nodes, app+node, app+node+dalc, app+node+dalc+ethermint, etc. When I say "network topology us up to node team" this is what I mean.

When you say "it doesn't support tests" I think you mean there isn't a defined framework for writing tests. To write tests you:

  1. Start a cluster with your desired topology
  2. Make REST/Rpc calls as an external caller

The goal is to not to write nuanced tests that execute specific code flows with specific parameters. The goal is to run a "smoke test" so we have a general understanding of how well the system works as a whole at a given commit. The information we have available to determine how well things are working will be the same information validators and clients have. If the metrics/logs/analytics we provide are insufficient to know if that network works we should fix that.

Regarding this being more work for devs, I agree completely. But this is the reality of distributed systems work, knowing your code works in isolation or with mocked interfaces is insufficient. Better to pay the cost up front.

Some that could alleviate this pain:

  • Myself and @Bidon15 work to develop a test harness in go that allows for more configurability when setting up a network (we can use starport/ignite cli as inspiration). This will also allow for a cleaner interface to integrate testing.
  • Prioritize DevOps hire
  • Social commitment that this testing is non blocking. Reviewers understand that this is best effort. My strong preference is to run these on every push to main because a large benefit is having a historical record of when things stopped working.

@renaynay
Copy link
Member

renaynay commented May 2, 2022

Hey catching up on this now -- last friday, @liamsi @adlerjohn and I briefly discussed release cadence. We haven't yet gotten @Wondertan's input on it but we have a couple options:

  • release every month (usually aligned with monthly call)
  • release more frequently (every week - every two weeks)
  • release once we complete a feature we want to test against our longer-standing dev/testnet

If we choose the option to release more frequently, we could standup the ephemeral cluster on every release of celestia-node instead of every push to main since we push very frequently to main.

One thing to consider is that we need some sort of mechanism to pipe through/display error logs because otherwise we will have to spend time looking through logs to try and catch potential errors, etc (and if we decide to spin up a cluster on every push to main, that's just not realistic).

Please take a look at https://github.com/ethereum/hive for an idea of how geth's integration test setup works. One thing to consider is that the entire test harness runs on every commit to master (which is much less frequent than celestia-node as the project is in maintenance mode). But -- one nice thing is that they have this UI https://hivetests.ethdevops.io/ that displays the results of every test run + logs.

If we want to make our CI useful and not burdensome, we should consider putting a lot of effort into this. Not necessarily create a UI, but I would say a really solid smoke test that tests for the success of our key features:

  • is the light node able to sample every block in the network
  • is the full node able to reconstruct every block
  • can a message be posted and retrieved from network successfully?
  • can large block be handled

just a couple basic test cases and some way to display success/failure.

@jbowen93
Copy link
Contributor Author

jbowen93 commented May 4, 2022

Renay, thanks so much for the detailed post. I have my Google SDLC bias and am still of the opinion that frequent pushes to main don't require you to reduce the frequency of integration testing.

We are using Github Actions because they're good and free. The current optimint/ethermint docker build takes 5 mins and the ephemeral-cluster test takes 3 minutes and the logs are aggregated for us. Example

You can see a list of runs for a specific job (example) and there's filtering available.

I agree we should put significantly more effort into this but even getting a single job run post PR (once branch is merged to main) that generates a sha tagged docker image and gives us a record of test runs to look would be a big win.

@liamsi
Copy link
Member

liamsi commented May 14, 2022

BTW, I am not advocating to use the ephemeral cluster for implementing all e2e testing. If anything some extremely basic tests could be written (via using RPC endpoints for instance: 1) write data to celestia as an app, e.g. ethermint, 2) read that data per namespace again). I think we should maybe revisit this after we have more involved test infra ready (e.g after celestiaorg/test-infra#22).

@Wondertan
Copy link
Member

Wondertan commented May 16, 2022

Ok, so there was a bit of misunderstanding then. I would like to have an ephemeral cluster as a tool to launch the whole stack quickly, dev on it, and have some examples of applications running on celestia-node. The development of e2e tests on it would be a pita. Running ethermint over it also sounds like a good idea, but outside of this repo and somewhere in our ethermint fork. Celestia-node must not import its downstream dependants.

@Bidon15
Copy link
Member

Bidon15 commented Sep 20, 2022

Closing this one in favour of upcoming rollkit/rollkit#518
and #7

@Bidon15 Bidon15 closed this as completed Sep 20, 2022
Repository owner moved this from TODO to Done in Celestia Node Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci:actions ci:docker Related to docker + dockerhub flow e2e e2e testing related
Projects
No open projects
Archived in project
Development

No branches or pull requests

6 participants