2020.08.04
The documentation needs to be improved:
- SPANK plugin documentation
- PAPI plugin documentation including relationship with SYS-PAPI
- PAPI Store documentation
- SYSPAPI plugin documentation and relationship with PAPI
- SYSPAPI store documentation
- How do we get all events coming from the SPANK plugin to arrive on a connection with the same user-id? Currently the job_start event arrives on a connection from root.
- Eric: What about a feature to be able to specify a hardware counter event mask directly with PAPI/SYSPAPI?
All LDMSD Streams data is conveyed over an LDMS Transport. LDMS Transports are authenticated in one of three ways: "none", "ovis", and "munge".
- "none" : No authentication is done. This should only be used when all users are trusted.
- "ovis" : Secret word authentication. The remote peer is known to know the secret word; effectively every user is trusted as root.
- "munge" : The uid/gid of the remote peer is trusted by virtue of being verified through a third party
The trust level of data published through the stream service is known at the transport level; however, the subscribing client does not have direct access to this information. We should consider adding API/event data updates to convey it, so that subscribers can make decisions about how and to what extent to trust the data.
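For illustration only, one possible shape for such an addition; everything below (the enum, struct, and field names) is invented for this sketch and is not part of the current streams API:

```c
#include <sys/types.h>

/* Hypothetical sketch: per-event credential/trust information that the
 * stream service could hand to subscriber callbacks. None of these
 * names exist in the current LDMSD streams API. */
enum ldmsd_stream_trust {
	LDMSD_STREAM_TRUST_NONE,   /* publisher connected with "none" auth */
	LDMSD_STREAM_TRUST_SHARED, /* "ovis" shared secret: trusted as root */
	LDMSD_STREAM_TRUST_CRED,   /* "munge": uid/gid below were verified */
};

struct ldmsd_stream_pub_cred {
	enum ldmsd_stream_trust trust; /* how the publishing connection authenticated */
	uid_t uid;                     /* publisher uid; valid only for ..._CRED */
	gid_t gid;                     /* publisher gid; valid only for ..._CRED */
};
```

A subscriber could then, for example, discard job events whose trust level is LDMSD_STREAM_TRUST_NONE.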
We should add a white-list/black-list feature to ldmsd to allow/disallow the publishing of data.
Add a "persistent-connection" API for streams.
There has been long-standing interest in running 'ldmsd' as a particular user. There are a number of ways to do this, including:
- Run a separate ldmsd daemon as user 'someone'
- Add the capability for an ldmsd running as root to fork, seteuid/setgid('someone'), and continue
  - RDMA transports hate to be fork/exec'd
  - What configuration, if any, does the child inherit from the parent?
There are security issues around this:
- What uid/gid do you allow 'someone' to apply to the sets this ldmsd publishes?
- Simply disallowing the setting of uid/gid through the API is not sufficient, because the user could write a sampler that modifies the set memory directly
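For reference, the fork-then-drop approach from the list above usually looks like the following. This is a minimal sketch of standard POSIX privilege dropping, not existing ldmsd code, and the function name is invented:

```c
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Minimal sketch: fork a child and permanently drop it to 'user'.
 * Real code would also have to address the RDMA-after-fork and
 * configuration-inheritance issues noted above. */
pid_t spawn_as_user(const char *user)
{
	struct passwd *pw = getpwnam(user);
	if (!pw)
		return -1;
	pid_t pid = fork();
	if (pid != 0)
		return pid; /* parent, or -1 on fork failure */
	/* Child: order matters. Drop supplementary groups and the gid
	 * while still root, then the uid; after setuid() succeeds there
	 * is no way back to root. */
	if (setgroups(0, NULL) || setgid(pw->pw_gid) || setuid(pw->pw_uid)) {
		perror("drop privileges");
		_exit(1);
	}
	/* ... continue here as 'user': create and publish sets ... */
	_exit(0);
}
```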
- Mark S: understanding and advertising the data sizes and the computational cost (derived from the set data size), especially for multiple feeds going from the aggregators off to the monitoring cluster.
- Phil R: ...and measuring it live (Ben's dstat sampler, and then v4-5 tracking of when ldmsds are not keeping up).
- Chris M: holding data at the aggregator because of connections to Kafka, etc.
- Melissa A and Chris: getting rid of Kafka.
- What are the issues related to monitoring ldmsd from a web service?
- permissions are carried in the set metadata, but we are ignoring them at the store
- what would have to change (if anything) at intermediate LDMS daemons for apps to get access to some data for response (security limitations)
- changing the rwx model to enable finer-grained access control
- if you don't have permissions, you can't get the handle needed to look up the set, and you also cannot push changes to that set. You can change your own local copy, of course, but even then you would have to have authenticated into the LDMS ecosystem.
- TODO: big security review and understanding; will take it to GitHub issues.
How do we go about reducing the number of schema in the system? The immutable nature of a schema, coupled with most systems having many different kinds of nodes, has resulted in an unreasonably large number of schema.
The problems are things like:
- different numbers of cores
  - multiple schema for metric sets that keep data per core, even though they contain the "exact same" data
- different architectures (e.g. KNL vs. Haswell)
  - /proc/meminfo and /proc/vmstat have extra entries that result in extra schema
- simple configuration inconsistency
  - the schema name is set by the configuration, which means that if the configuration is not consistent, identical metric sets will have different schema names
All of the above create real issues with analysis and visualization.
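As a concrete illustration of the configuration-inconsistency case, a sketch of two sampler configurations (host and schema names hypothetical) that make identical meminfo data land under two different schema:

```
# Node A's ldmsd config: schema named "meminfo"
load name=meminfo
config name=meminfo producer=nid00001 instance=nid00001/meminfo schema=meminfo
start name=meminfo interval=1000000

# Node B's ldmsd config: same sampler, same data, but the drifted
# schema= value creates a second, incompatible schema downstream
load name=meminfo
config name=meminfo producer=nid00002 instance=nid00002/meminfo schema=meminfo_v2
start name=meminfo interval=1000000
```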
Set groups can be used to get around issues with system resources that come and go, e.g. disks and network interfaces. Set groups are mutable collections of other metric sets. One solution to the system-resource problem could be to create separate sets for each resource and then 'group' them together into a single named entity so they can be fetched remotely as one unit.
How it works:
- A group is a configuration construct managed by ldmsd; it is not part of the LDMS protocol.
- A metric set is created for the group; however, this metric set contains only the names of its members. When an aggregator updates the group, it gets the (potentially updated) list of member sets.
- A group is created at the source with a schema name containing a string that identifies it as a group.
- When the updater callback sees this string in the schema name, it knows to look up all the entries in the group at once, and similarly for update.
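A sketch of what this looks like in v4 ldmsd configuration, assuming the setgroup_* commands; the group and instance names are hypothetical:

```
# On the sampler daemon: create a group and insert per-disk sets into it.
setgroup_add name=node1/diskgroup producer=node1 interval=1000000
setgroup_ins name=node1/diskgroup instance=node1/disk_sda,node1/disk_sdb

# When a disk goes away, only the membership changes; an aggregator
# looking up node1/diskgroup continues to work unchanged.
setgroup_rm name=node1/diskgroup instance=node1/disk_sdb
```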