
[RFC] CI setups for metrics measurements #6

Closed
grahamwhaley opened this issue Feb 23, 2018 · 16 comments

Comments

@grahamwhaley
Contributor

Hi.
For Clear Containers we have a Jenkins CI system running (inside Intel) that executes a set of metrics to try to detect and report performance regressions. We'd like to discuss options for setting up such a system for Kata.

There are a few requirements and features for such a system - I'll list them here to help the discussion:

  1. The tests really need to run on 'bare metal'. Running them in a virtualised or cloud environment would introduce too much 'noise' into the results, making regression detection unviable (at least for the time- and compute-based tests, which is most of them).
  2. The current system comprises a Jenkins master (which can be a VM/cloud instance) and bare metal Jenkins slave build bots.
  3. The build bots do not have to be particularly powerful machines (at present) - the metrics do not currently stress 'at scale' setups.
  4. Having different build bots running different Linux distributions would be good for extra coverage, but starting with just one is better than nothing.
  5. We may wish to discuss isolating the bare metal build machines from any other infrastructure. These machines will automatically download, build and execute any PRs submitted to the projects, with no guarantee of what that code will do (yes, that may sound a touch paranoid, but stuff does happen).
  6. Slightly longer term, we should also look at tying the results into a storage and visualisation system for both historical data reference and trend tracking. My current thought is to store the results in Elasticsearch and use Kibana for the visualisation/analysis; a minimal sketch follows this list.
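
For illustration, a minimal sketch of what pushing a single result into Elasticsearch could look like, assuming a local Elasticsearch endpoint; the index name, field names and example values are invented for this sketch, not an agreed schema:

```python
# Sketch only: index one metrics result into Elasticsearch so Kibana can chart it.
# The endpoint, index name and document fields below are assumptions for illustration.
import datetime

import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "kata-metrics"             # hypothetical index name

def store_result(test_name, value, units, commit):
    doc = {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "test": test_name,
        "value": value,
        "units": units,
        "commit": commit,
    }
    # POST /<index>/_doc asks Elasticsearch to index a new document; Kibana can
    # then build a time-series dashboard over the index, keyed on "test".
    resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=doc, timeout=10)
    resp.raise_for_status()
    return resp.json()["_id"]

if __name__ == "__main__":
    store_result("boot-time", 1.92, "seconds", "abc1234")
```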

/cc @emonty @gnawux for any input from your side regarding any setups you currently have or infrastructure available, etc.

@egernst egernst added the next label Mar 7, 2018
@egernst
Member

egernst commented Mar 7, 2018

@grahamwhaley - FYI, I talked with Clark at OpenStack and he pointed me to http://lists.openstack.org/pipermail/openstack-dev/2018-March/127972.html

The TL;DR is that the existing OpenStack infrastructure doesn't have a set solution for this at the moment. IMO our best bet is the current solution we're using/migrating to (that is, a bare-metal slave).

@fungi

fungi commented Mar 7, 2018

Basically, the resources currently donated to the OpenStack community CI system are all virtual machines in (frequently oversubscribed) public cloud providers, so performance can vary significantly across providers, within a single provider, or even on the same virtual machine from one minute to the next. To get comparable performance between runs we need an entirely different kind of donated server resource (which basically reduces to dedicated bare metal).
Luckily the framework we're using could likely consume this new kind of donation without any real modification, depending on how it's presented, so it's more a matter of donor logistics than anything else.

@sboeuf

sboeuf commented Mar 22, 2018

@grahamwhaley the proposal looks good to me! I'm sure it won't be a huge deal to set up, but I agree with @fungi that we need to find those (gently offered :)) dedicated bare-metal machines and pin them to the metrics CI.

@grahamwhaley
Contributor Author

Hi all (@annabellebertooch @fungi @mnaser @cboylan @egernst @sboeuf et al.).
In yesterday's metrics CI presentation to the architecture committee we discussed raising the topic (again) of whether it is feasible to:

  • support the metrics CI under Zuul
  • which also implies identifying a way to get some build hardware with performance guarantees.

First, can I check whether the situation with the OSF/Zuul build hosts has changed at all - do we now have any dedicated bare metal machines, for instance? (See the thread referenced above: http://lists.openstack.org/pipermail/openstack-dev/2018-March/127972.html)

If not, can we discuss whether there is any way to allocate or schedule any of the cloud hardware/systems so that a job gets a 'whole machine' or 'single VM' allocation, avoiding noisy-neighbor situations? I will note that internally we are effectively doing this already - running the jobs within a fresh VM per build on a bare metal machine whilst ensuring there is only ever a single VM running on that host at any one time.
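
As an aside, a minimal sketch of one way that 'only one VM per host at a time' constraint could be enforced on a build machine; the lock file path and the start/destroy/run scripts are placeholders, not our actual slave scripts:

```python
# Sketch only: serialise metrics builds on a bare metal host so that at most one
# build VM runs at any time. The lock path and helper scripts are placeholders.
import fcntl
import subprocess

LOCK_PATH = "/var/lock/kata-metrics-vm.lock"   # arbitrary per-host lock file

def run_isolated_job(job_cmd):
    with open(LOCK_PATH, "w") as lock:
        # Block until no other build VM holds the lock on this host.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            subprocess.run(["./start-fresh-vm.sh"], check=True)  # placeholder: boot a clean VM
            subprocess.run(job_cmd, check=True)                  # run the metrics job against that VM
        finally:
            subprocess.run(["./destroy-vm.sh"])                  # placeholder: tear the VM down
            fcntl.flock(lock, fcntl.LOCK_UN)

if __name__ == "__main__":
    run_isolated_job(["./run-metrics.sh", "--repo", "proxy"])
```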

Referencing the previous OpenStack thread above, I understand that the cloud hosts might be oversubscribed and thus such an idea may not have been feasible before - but I thought I'd at least check the status.

And for our other cloud partners (@jessfraz @jon) - if you were able to source some dedicated hosts or hardware in your clouds for Kata metrics CI, that would be great :-)
@vielmetti, @gnawux - maybe you have some thoughts and input from the packet.net side?

Thanks everybody!

@cboylan

cboylan commented Jul 31, 2018

The OpenStack Zuul instance does not have access to dedicated bare metal instances currently. @mnaser is probably in the best position to know how feasible setting up "single VM" hypervisors is (as he manages the hypervisors).

From Zuul's perspective it is largely just a matter of plugging whatever resources we end up with into Nodepool and consuming them from there. The currently tested drivers (those known to work) can support this by talking to OpenStack APIs (specifically Nova, for VMs or bare metal) or via a simple static setup that speaks SSH to a preconfigured server.

There is an untested Azure driver as well, but I have no idea how well it would be expected to work until we find a way to test and use it.
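
To make the 'simple static setup' option a bit more concrete, here is a rough sketch of the general shape of a Nodepool static-driver entry, built as a Python dict and dumped to YAML; the provider name, hostname, label and username are invented, and the exact key names should be checked against the Nodepool documentation rather than taken from here:

```python
# Sketch only: the rough shape of a Nodepool static-driver entry for a
# preconfigured bare metal host, built as a dict and dumped to YAML.
# Hostname, label, provider and username are invented; verify key names
# against the Nodepool documentation before using anything like this.
import yaml  # PyYAML

nodepool_config = {
    "labels": [{"name": "kata-metrics-baremetal"}],
    "providers": [
        {
            "name": "packet-static",
            "driver": "static",
            "pools": [
                {
                    "name": "main",
                    "nodes": [
                        {
                            "name": "metrics01.example.net",  # the preconfigured server
                            "labels": "kata-metrics-baremetal",
                            "username": "zuul",               # account Nodepool reaches over SSH
                        }
                    ],
                }
            ],
        }
    ],
}

print(yaml.safe_dump(nodepool_config, sort_keys=False))
```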

@vielmetti

@grahamwhaley there is an existing sponsored Kata Containers CI project at Packet, perhaps we can fold this all into one shared effort. Reference https://github.com/WorksOnArm/cluster/issues/31

@grahamwhaley
Contributor Author

Ah, thanks @vielmetti, I'd not seen that thread/request.
I see that request covered ARM hardware. AFAICT there is nothing in the metrics codebase that should not work on ARM, and I can help the ARM-aware folks get that set up to run (probably once they have the ARM QA CI up and stable).
Myself, I'm looking for x86 hardware - can we cover that on the same ticket?
If so, I think we can get PR coverage running with a single Packet t1.small.x86, and would then likely need another one to cover master branch metrics tracking. One caveat: the 8 GB of RAM on those might be a bit small for some of the metrics, but I think we can only find out by trying it...
If you ack that we are good to do a trial on a t1.small.x86, then I'll open a Packet account and we can go from there. Would you like the formal request over on the Packet ticket itself?
And, many thanks for your support!

@vielmetti

@grahamwhaley if you can put a request (this text is fine) in the Works on Arm cluster referenced above, then I can invite you to that project and you can continue on there. Also, if you could cc that request to [email protected], to my attention, that will speed up a whole bunch of other processes to get you on board. Thanks!

@grahamwhaley
Contributor Author

For completeness, things have been a bit quiet, so I have also applied through the CNCF program for access to packet.net hardware:
cncf/cluster#83
/cc @vielmetti

@grahamwhaley
Contributor Author

An update, then.
The CNCF have very generously granted us access to their Community Infrastructure Lab resources on packet.net.
I will start bringing up the metrics CI by setting up a Kata metrics CI job for the proxy repo on our existing Jenkins master, and then work on how we:

  • configure the instances on packet.net to have all the bits installed that we need to run the metrics - for instance, whether we can cloud-init that or need to do a dynamic install on fresh boots (see the sketch after this list).
  • tie the instances into Jenkins - for instance, whether we have a Nodepool method for packet.net.
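
For the first bullet, a minimal sketch of the kind of cloud-init user-data we might hand to packet.net so a fresh instance comes up ready to run the metrics; the package list and setup-script URL are illustrative guesses, not a final configuration:

```python
# Sketch only: cloud-init user-data that a fresh packet.net instance could boot
# with. The package list and the setup-script URL are illustrative, not final.
import yaml  # PyYAML

cloud_config = {
    "package_update": True,
    "packages": ["git", "make", "gcc", "docker.io", "default-jre"],  # JRE for the Jenkins agent
    "runcmd": [
        # Placeholder: fetch and run whatever script installs the Kata components
        # and registers this node with the Jenkins master.
        ["sh", "-c", "curl -fsSL https://example.com/kata-metrics-setup.sh | sh"],
    ],
}

# cloud-init expects the document to start with the "#cloud-config" marker.
user_data = "#cloud-config\n" + yaml.safe_dump(cloud_config, sort_keys=False)
print(user_data)
```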

@jodh-intel
Contributor

\o/ - great news! 😄

@grahamwhaley
Contributor Author

Update time.
I've commissioned a packet.net t1.small.x86 running 24/7, setting it up with the Jenkins slave VM configs from our scripts.
That is now tied into our Jenkins master for the proxy, shim and agent repos to run on PRs. For example, you can see the proxy job here.

We'll monitor those three PR hook/builds for the next few days, and then, provided they are producing stable results, also enable the runtime and tests repos.

@jodh-intel
Contributor

🎆 🎈 Awesome news! 🎆 🎈

Thanks again to @vielmetti and packet.net! 😄

@vielmetti

Thanks @jodh-intel - and I want to make sure that the CNCF also gets proper credit for this!

@grahamwhaley
Contributor Author

@vielmetti - let's discuss offline with @annabellebertooch what can be done to attribute credit appropriately etc.

@grahamwhaley
Contributor Author

I think we can close this issue. We have the packet.net machine running the metrics CI. I still have some work to do to understand some 'noise' in the results and slowness on that machine, but that does not require this issue to stay open. Closing now...

GabyCT pushed a commit to GabyCT/ci that referenced this issue Feb 12, 2019
Add a script that will be the **single** source of all static tests
run before building kata containers components.

Initially, it simply runs the `checkcommits` tool from this repository,
but will be extended later to run linters, etc.

All other kata containers repositories should invoke this script to
avoid a proliferation of (different) static check scripts.

Fixes kata-containers#6.

Signed-off-by: James O. D. Hunt <[email protected]>