2022.05.09

Vanessa Surjadidjaja edited this page May 9, 2022 · 1 revision

CUGS presentation

Fallout: A Monitoring Infrastructure Supporting Informed System Acceptance

Fallout provides an infrastructure that makes the stand-up process easier to monitor and reduces the time spent in the testing and stand-up phases of factory and on-the-floor acceptance testing.

It is used to find outlier components: network links, processor thermals, memory utilization, CPU utilization, or any other component you have samplers for.
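As a rough illustration of what "finding an outlier component" can mean, here is a minimal sketch that flags readings far from the fleet median using a modified z-score. This is not Fallout's method: the function name `find_outliers`, the threshold, and the sample temperatures are all hypothetical and only show the general idea.

```python
from statistics import median

def find_outliers(readings, threshold=3.5):
    """Return names of components whose reading is far from the fleet median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    The threshold of 3.5 is a common rule of thumb, assumed here.
    """
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the readings equal the median: anything else stands out.
        return sorted(name for name, v in readings.items() if v != med)
    return sorted(
        name for name, v in readings.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )

# Hypothetical CPU temperatures (degrees C) from one sampler across five nodes.
cpu_temps_c = {
    "node01": 61.0,
    "node02": 63.5,
    "node03": 62.0,
    "node04": 60.5,
    "node05": 88.0,
}
print(find_outliers(cpu_temps_c))  # ['node05']
```

A median-based score is used rather than mean and standard deviation because a single bad component does not shift the median much, which matters when only a handful of nodes are under test.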

Constraints of stand-up and factory testing:

  • Limited access to a repo satisfying external software dependencies
  • Limited ability to modify boot images
  • Limited or missing access to shared/remote storage
  • Vendor wanting minimal external influences

Fallout uses Google graphs and Grafana to identify outlier and rule-violating behaviors. It uses an aggregator and sampler setup similar to the usual LDMS one.
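To make "rule-violating behavior" concrete, the sketch below checks sampled metrics against simple per-metric limits. The metric names, limits, and the `check_rules` helper are assumptions for illustration only; they are not taken from Fallout or from the LDMS API.

```python
# Hypothetical acceptance rules: each maps a metric name to a pass/fail predicate.
RULES = {
    "cpu_temp_c": lambda v: v <= 85.0,    # assumed thermal limit
    "mem_util_pct": lambda v: v <= 90.0,  # assumed memory utilization limit
}

def check_rules(samples, rules):
    """Return (node, metric, value) for every sampled value that breaks a rule."""
    violations = []
    for node, metrics in sorted(samples.items()):
        for metric, is_ok in rules.items():
            # Skip metrics a node does not report instead of treating them as failures.
            if metric in metrics and not is_ok(metrics[metric]):
                violations.append((node, metric, metrics[metric]))
    return violations

# Hypothetical latest samples per node, as an aggregator might collect them.
samples = {
    "node01": {"cpu_temp_c": 61.0, "mem_util_pct": 42.0},
    "node02": {"cpu_temp_c": 88.0, "mem_util_pct": 95.5},
}
for node, metric, value in check_rules(samples, RULES):
    print(f"{node}: {metric}={value} violates rule")
```

Unlike outlier detection, a rule check does not depend on the rest of the fleet, so it can still flag a problem that affects every node equally.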

Link to presentation:
