
[RFC] CI setups for metrics measurements #6

Closed
grahamwhaley opened this issue Feb 23, 2018 · 16 comments

Comments

@grahamwhaley
Contributor

Hi.
For Clear Containers we have a Jenkins CI system running (inside Intel) that executes a set of metrics to try to detect and report performance regressions. We'd like to discuss options for setting up such a system for Kata.

There are a few requirements and features for such a system - I'll list them here to help the discussion:

  1. The tests really need to run on 'bare metal'. Running them in a virtualised or cloud environment would introduce too much 'noise' into the results, making regression detection unviable (at least for the time- and compute-based tests, which is most of them).
  2. The current system comprises a Jenkins master (which can be a VM/cloud instance) and bare metal Jenkins slave build bots.
  3. The build bots do not have to be particularly powerful machines (at present) - the metrics do not currently stress 'at scale' setups.
  4. Having different build bots running different Linux distributions would be good for extra coverage, but starting with just one is better than nothing.
  5. We may wish to discuss isolating the bare metal build machines from any other infrastructure. These machines will automatically download, build and execute any PRs submitted to the projects, with no guarantee of what that code will do (yes, that may sound a touch paranoid, but stuff does happen).
  6. Slightly longer term, we should also look at tying the results into a storage and visualisation system for both historical data reference and trend tracking. My current thought is to store the results in Elasticsearch and use Kibana for the visualisation/analysis; a minimal sketch follows this list.
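
For illustration, a minimal sketch of what pushing a single result into Elasticsearch could look like, assuming a local Elasticsearch endpoint; the index name, field names and example values are invented for this sketch, not an agreed schema:

```python
# Sketch only: index one metrics result into Elasticsearch so Kibana can chart it.
# The endpoint, index name and document fields below are assumptions for illustration.
import datetime

import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "kata-metrics"             # hypothetical index name

def store_result(test_name, value, units, commit):
    doc = {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "test": test_name,
        "value": value,
        "units": units,
        "commit": commit,
    }
    # POST /<index>/_doc asks Elasticsearch to index a new document; Kibana can
    # then build a time-series dashboard over the index, keyed on "test".
    resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=doc, timeout=10)
    resp.raise_for_status()
    return resp.json()["_id"]

if __name__ == "__main__":
    store_result("boot-time", 1.92, "seconds", "abc1234")
```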

/cc @emonty @gnawux for any input from your side regarding any setups you currently have or infrastructure available, etc.

@egernst egernst added the next label Mar 7, 2018
@egernst
Member

egernst commented Mar 7, 2018

@grahamwhaley - FYI, I talked with Clark at OpenStack and he pointed me to http://lists.openstack.org/pipermail/openstack-dev/2018-March/127972.html

The TL;DR is that the existing OpenStack infrastructure doesn't have a set solution for this at the moment. IMO our best bet is the current solution we're using/migrating to (that is, a bare-metal slave).

@fungi

fungi commented Mar 7, 2018

Basically, the resources currently donated to the OpenStack community CI system are all virtual machines in (frequently oversubscribed) public cloud providers, so performance can vary significantly across providers, within a single provider, or even on the same virtual machine from one minute to the next. To get comparable performance between runs we need an entirely different kind of donated server resource (which basically reduces to dedicated bare metal).
Luckily the framework we're using could likely consume this new kind of donation without any real modification, depending on how it's presented, so it's more a matter of donor logistics than anything else.

@sboeuf

sboeuf commented Mar 22, 2018

@grahamwhaley the proposal looks good to me! I'm sure it won't be a huge deal to set up, but I agree with @fungi that we need to find those (gently offered :)) dedicated bare-metal machines and pin them to the metrics CI.

@grahamwhaley
Contributor Author

Hi all (@annabellebertooch @fungi @mnaser @cboylan @egernst @sboeuf et al.).
In yesterday's metrics CI presentation to the architecture committee we discussed raising the topic (again) of whether it is feasible to:

  • support the metrics CI under Zuul
  • which also implies identifying a way to get some build hardware with performance guarantees.

First, can I check whether the situation with the OSF/Zuul build hosts has changed at all - do we now have any dedicated bare metal machines, for instance? (See the thread referenced above: http://lists.openstack.org/pipermail/openstack-dev/2018-March/127972.html)

If not, can we discuss whether there is any way to allocate or schedule any of the cloud hardware/systems so that a job gets a 'whole machine' or 'single VM' allocation, avoiding noisy-neighbor situations? I will note that internally we are effectively doing this already - running the jobs within a fresh VM per build on a bare metal machine whilst ensuring there is only ever a single VM running on that host at any one time.
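
As an aside, a minimal sketch of one way that 'only one VM per host at a time' constraint could be enforced on a build machine; the lock file path and the start/destroy/run scripts are placeholders, not our actual slave scripts:

```python
# Sketch only: serialise metrics builds on a bare metal host so that at most one
# build VM runs at any time. The lock path and helper scripts are placeholders.
import fcntl
import subprocess

LOCK_PATH = "/var/lock/kata-metrics-vm.lock"   # arbitrary per-host lock file

def run_isolated_job(job_cmd):
    with open(LOCK_PATH, "w") as lock:
        # Block until no other build VM holds the lock on this host.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            subprocess.run(["./start-fresh-vm.sh"], check=True)  # placeholder: boot a clean VM
            subprocess.run(job_cmd, check=True)                  # run the metrics job against that VM
        finally:
            subprocess.run(["./destroy-vm.sh"])                  # placeholder: tear the VM down
            fcntl.flock(lock, fcntl.LOCK_UN)

if __name__ == "__main__":
    run_isolated_job(["./run-metrics.sh", "--repo", "proxy"])
```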

Referencing the previous OpenStack thread above, I understand that the cloud hosts might be oversubscribed and thus such an idea may not have been feasible before - but I thought I'd at least check the status.

And for our other cloud partners (@jessfraz @jon) - if you were able to source some dedicated hosts or hardware in your clouds for Kata metrics CI, that would be great :-)
@vielmetti, @gnawux - maybe you have some thoughts and input from the packet.net side?

Thanks everybody!

@cboylan

cboylan commented Jul 31, 2018

The OpenStack Zuul instance does not have access to dedicated bare metal instances currently. @mnaser is probably in the best position to know how feasible setting up "single VM" hypervisors is (as he manages the hypervisors).

From Zuul's perspective it is largely just a matter of plugging whatever resources we end up with into Nodepool and consuming them from there. The currently tested drivers (those known to work) can support this by talking to OpenStack APIs (specifically Nova, for VMs or bare metal) or via a simple static setup that speaks SSH to a preconfigured server.

There is an untested Azure driver as well, but I have no idea how well it would be expected to work until we find a way to test and use it.
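
To make the 'simple static setup' option a bit more concrete, here is a rough sketch of the general shape of a Nodepool static-driver entry, built as a Python dict and dumped to YAML; the provider name, hostname, label and username are invented, and the exact key names should be checked against the Nodepool documentation rather than taken from here:

```python
# Sketch only: the rough shape of a Nodepool static-driver entry for a
# preconfigured bare metal host, built as a dict and dumped to YAML.
# Hostname, label, provider and username are invented; verify key names
# against the Nodepool documentation before using anything like this.
import yaml  # PyYAML

nodepool_config = {
    "labels": [{"name": "kata-metrics-baremetal"}],
    "providers": [
        {
            "name": "packet-static",
            "driver": "static",
            "pools": [
                {
                    "name": "main",
                    "nodes": [
                        {
                            "name": "metrics01.example.net",  # the preconfigured server
                            "labels": "kata-metrics-baremetal",
                            "username": "zuul",               # account Nodepool reaches over SSH
                        }
                    ],
                }
            ],
        }
    ],
}

print(yaml.safe_dump(nodepool_config, sort_keys=False))
```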

@vielmetti

@grahamwhaley there is an existing sponsored Kata Containers CI project at Packet, perhaps we can fold this all into one shared effort. Reference https://github.com/WorksOnArm/cluster/issues/31

@grahamwhaley
Contributor Author

Ah, thanks @vielmetti, I'd not seen that thread/request.
I see that request covered ARM hardware. AFAICT there is nothing in the metrics codebase that should not work on ARM, and I can help the ARM-aware folks get that set up to run (probably once they have the ARM QA CI up and stable).
Myself, I'm looking for x86 hardware - can we cover that on the same ticket?
If so, I think we can get PR coverage running with a single Packet t1.small.x86, and would then likely need another one to cover master branch metrics tracking. One caveat: the 8 GB of RAM on those might be a bit small for some of the metrics, but I think we can only find out by trying it...
If you ack that we are good to do a trial on a t1.small.x86, then I'll open a Packet account and we can go from there. Would you like the formal request over on the Packet ticket itself?
And, many thanks for your support!

@vielmetti

@grahamwhaley if you can put a request (this text is fine) in the Works on Arm cluster referenced above, then I can invite you to that project and you can continue on there. Also, if you could cc that request to [email protected], to my attention, that will speed up a whole bunch of other processes to get you on board. Thanks!

@grahamwhaley
Contributor Author

For completeness, things have been a bit quiet, so I have also applied through the CNCF program for access to packet.net hardware:
cncf/cluster#83
/cc @vielmetti

@grahamwhaley
Contributor Author

An update, then.
The CNCF have very generously granted us access to their Community Infrastructure Lab resources on packet.net.
I will start bringing up the metrics CI by setting up a Kata metrics CI job for the proxy repo on our existing Jenkins master, and then work on how we:

  • configure the instances on packet.net to have all the bits installed that we need to run the metrics - for instance, whether we can cloud-init that or need to do a dynamic install on fresh boots (see the sketch after this list).
  • tie the instances into Jenkins - for instance, whether we have a Nodepool method for packet.net.
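
For the first bullet, a minimal sketch of the kind of cloud-init user-data we might hand to packet.net so a fresh instance comes up ready to run the metrics; the package list and setup-script URL are illustrative guesses, not a final configuration:

```python
# Sketch only: cloud-init user-data that a fresh packet.net instance could boot
# with. The package list and the setup-script URL are illustrative, not final.
import yaml  # PyYAML

cloud_config = {
    "package_update": True,
    "packages": ["git", "make", "gcc", "docker.io", "default-jre"],  # JRE for the Jenkins agent
    "runcmd": [
        # Placeholder: fetch and run whatever script installs the Kata components
        # and registers this node with the Jenkins master.
        ["sh", "-c", "curl -fsSL https://example.com/kata-metrics-setup.sh | sh"],
    ],
}

# cloud-init expects the document to start with the "#cloud-config" marker.
user_data = "#cloud-config\n" + yaml.safe_dump(cloud_config, sort_keys=False)
print(user_data)
```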

@jodh-intel
Contributor

\o/ - great news! 😄

@grahamwhaley
Contributor Author

Update time.
I've commissioned a packet.net t1.small.x86 running 24/7, setting it up with the Jenkins slave VM configs from our scripts.
That is now tied into our Jenkins master for the proxy, shim and agent repos to run on PRs. For example, you can see the proxy job here.

We'll monitor those three PR hook/builds for the next few days, and then, provided they are producing stable results, also enable the runtime and tests repos.

@jodh-intel
Contributor

🎆 🎈 Awesome news! 🎆 🎈

Thanks again to @vielmetti and packet.net! 😄

@vielmetti

Thanks @jodh-intel - and I want to make sure that the CNCF also gets proper credit for this!

@grahamwhaley
Contributor Author

@vielmetti - let's discuss offline with @annabellebertooch what can be done to attribute credit appropriately etc.

@grahamwhaley
Contributor Author

I think we can close this issue. We have the packet.net machine running the metrics CI. I still have some work to do to understand some 'noise' in the results and slowness on that machine, but that does not require this issue to stay open. Closing now...

GabyCT pushed a commit to GabyCT/ci that referenced this issue Feb 12, 2019
Add a script that will be the **single** source of all static tests
run before building kata containers components.

Initially, it simply runs the `checkcommits` tool from this repository,
but will be extended later to run linters, etc.

All other kata containers repositories should invoke this script to
avoid a proliferation of (different) static check scripts.

Fixes kata-containers#6.

Signed-off-by: James O. D. Hunt <[email protected]>