e2e: Improve Celestia-Node Integration Testing #662
Comments
Ephemeral cluster:
The current state of us using Swamp doesn't have the problems above and allows better control over test scenarios. The only downside right now is that we don't have a way to spawn the App/Core node for the Swamp, but that is a solvable problem that we should commit to solving.
Even if the ephemeral cluster does not allow for easily adding tests right now, I think simply hard-coding 1-3 very simple, stupid tests (e.g. checking that writing and reading data roundtrips from ethermint to celestia) would be very valuable and would have caught the bug that Tomasz found. It is not meant as a replacement for the swamp/integration tests but rather a complementary thing. Unlike the swamp tests, this could simply run on each release instead. Though for this to be somewhat useful, we must have more frequent releases (of node) than we currently do.
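A minimal sketch of what such a hard-coded roundtrip check could look like, assuming the cluster exposes simple HTTP endpoints for submitting and reading data; the URLs and wire format here are hypothetical placeholders, not the real celestia-node or ethermint APIs:

```go
package smoke

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// roundtrip submits a payload to the cluster and reads it back,
// failing if the two don't match. Both URLs are hypothetical
// placeholders for whatever RPC endpoints the cluster exposes.
func roundtrip(writeURL, readURL string, payload []byte) error {
	// 1) Write data to celestia as an app (e.g. ethermint) would.
	resp, err := http.Post(writeURL, "application/octet-stream", bytes.NewReader(payload))
	if err != nil {
		return fmt.Errorf("write failed: %w", err)
	}
	resp.Body.Close()

	// 2) Give the network time to include the data in a block.
	time.Sleep(5 * time.Second)

	// 3) Read the data back (e.g. per namespace) and compare.
	resp, err = http.Get(readURL)
	if err != nil {
		return fmt.Errorf("read failed: %w", err)
	}
	defer resp.Body.Close()
	got, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading response: %w", err)
	}
	if !bytes.Equal(got, payload) {
		return fmt.Errorf("roundtrip mismatch: wrote %d bytes, read %d", len(payload), len(got))
	}
	return nil
}
```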
Yeah, we could make those complementary tests at the DALC level/repo, and probably still should. Not at the node level. It is also not very clear what "hardcoding" means here, or what it would look like. This is a concern outside of this repo, and this issue should be transferred to the DALC and run on DALC releases.
Yes, but soon™ there will be no DALC repo anymore.
The "ephemeral cluster" is a bag of docker compose configs orchestrated by bash scripts. It can be used to stand up just app nodes, app+node, app+node+dalc, app+node+dalc+ethermint, etc. When I say "network topology us up to node team" this is what I mean. When you say "it doesn't support tests" I think you mean there isn't a defined framework for writing tests. To write tests you:
The goal is not to write nuanced tests that execute specific code flows with specific parameters. The goal is to run a "smoke test" so we have a general understanding of how well the system works as a whole at a given commit. The information we have available to determine how well things are working will be the same information validators and clients have. If the metrics/logs/analytics we provide are insufficient to know whether the network works, we should fix that. Regarding this being more work for devs, I agree completely. But this is the reality of distributed systems work: knowing your code works in isolation or with mocked interfaces is insufficient. Better to pay the cost up front. Some things that could alleviate this pain:
Hey, catching up on this now -- last Friday, @liamsi, @adlerjohn and I briefly discussed release cadence. We haven't yet gotten @Wondertan's input on it, but we have a couple of options:
If we choose the option to release more frequently, we could stand up the ephemeral cluster on every release of celestia-node instead of on every push to main, since we push to main very frequently. One thing to consider is that we need some sort of mechanism to pipe through/display error logs, because otherwise we will have to spend time digging through logs to catch potential errors (and if we decide to spin up a cluster on every push to main, that's just not realistic). Please take a look at https://github.com/ethereum/hive for an idea of how geth's integration test setup works. One thing to note is that their entire test harness runs on every commit to master (which is much less frequent than for celestia-node, as the project is in maintenance mode). But one nice thing is that they have a UI, https://hivetests.ethdevops.io/, that displays the results of every test run plus logs. If we want to make our CI useful and not burdensome, we should consider putting a lot of effort into this. Not necessarily creating a UI, but I would say a really solid smoke test that tests for the success of our key features:
Just a couple of basic test cases and some way to display success/failure.
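Even something this small would cover "some way to display success/failure": a hedged Go sketch of a check list with a PASS/FAIL summary, where the individual probes (header sync, sampling, state access, ...) are placeholders to be filled in:

```go
package smoke

import "fmt"

// check pairs a human-readable name with a probe against the running
// network; the probes themselves are placeholders to be filled in.
type check struct {
	name string
	run  func() error
}

// report runs every check and prints one PASS/FAIL line per check, so
// a CI log (or a future UI) shows at a glance what works at a commit.
// It returns the number of failures for use as an exit code.
func report(checks []check) (failed int) {
	for _, c := range checks {
		if err := c.run(); err != nil {
			failed++
			fmt.Printf("FAIL  %-32s %v\n", c.name, err)
			continue
		}
		fmt.Printf("PASS  %s\n", c.name)
	}
	return failed
}
```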
Renay, thanks so much for the detailed post. I have my Google SDLC bias and am still of the opinion that frequent pushes to main don't require you to reduce the frequency of integration testing. We are using GitHub Actions because they're good and free. The current optimint/ethermint docker build takes 5 minutes, the ephemeral-cluster test takes 3 minutes, and the logs are aggregated for us (example). You can see a list of runs for a specific job (example), and there's filtering available. I agree we should put significantly more effort into this, but even getting a single job run post-PR (once the branch is merged to main) that generates a sha-tagged docker image and gives us a record of test runs to look at would be a big win.
BTW, I am not advocating using the ephemeral cluster for implementing all e2e testing. If anything, some extremely basic tests could be written (using RPC endpoints, for instance: 1) write data to celestia as an app, e.g. ethermint; 2) read that data back per namespace). I think we should maybe revisit this after we have more involved test infra ready (e.g. after celestiaorg/test-infra#22).
Ok, so there was a bit of a misunderstanding then. I would like to have an ephemeral cluster as a tool to launch the whole stack quickly, develop on it, and have some examples of applications running on celestia-node. Developing e2e tests on it would be a pain. Running ethermint over it also sounds like a good idea, but that belongs outside of this repo, somewhere in our ethermint fork. Celestia-node must not import its downstream dependents.
Closing this one in favour of the upcoming rollkit/rollkit#518
Inspired by discussion in #626
The ephemeral cluster as it is used for integration testing in celestiaorg/optimint and celestiaorg/ethermint currently stands up a full cluster consisting of:
We should create an integration test using an ephemeral cluster with celestia-app, bridge nodes, light nodes, and a DALC. This will allow us to properly test for breakages in the external interfaces of the network.
I would like to leave both the network topology (number of each node type) and the specific tests to run up to the celestia node team.