2020.08.04
The documentation needs to be improved:
- SPANK plugin documentation
- PAPI plugin documentation, including its relationship with SYSPAPI
- PAPI Store documentation
- SYSPAPI plugin documentation and its relationship with PAPI
- SYSPAPI store documentation
- How do we get all events coming from the SPANK plugin to arrive on a connection with the same user-id? Currently the job_start event arrives on a connection from root.
- Eric: What about a feature to be able to specify a hardware counter event mask directly with PAPI/SYSPAPI?
All LDMSD Streams data is conveyed over an LDMS Transport. LDMS Transports are authenticated in one of three ways: "none", "ovis", and "munge".
- "none" : No authentication is done. This should only be used when all users are trusted.
- "ovis" : Secret word authentication. The remote peer is known to know the secret word; effectively every user is trusted as root.
- "munge" : The uid/gid of the remote peer is trusted by virtue of being verified through a third party
The trust level of data published through the stream service is known; however, the subscribing client does not have direct access to this information. We should consider adding API/event data updates to convey this information so that subscribers can make decisions about how, and to what extent, to trust this data.
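As a hypothetical illustration of that suggestion, the sketch below shows what a credential-aware stream delivery context could look like on the subscriber side. None of these type or field names exist in the current LDMSD stream API; they are invented here purely to make the idea concrete.

```c
/*
 * Hypothetical sketch only: none of these type or field names exist in the
 * current LDMSD stream API. It illustrates the suggestion above, i.e. that
 * the subscriber callback could be handed the authentication method and the
 * verified uid/gid of the publishing connection.
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

enum stream_auth_method {
    STREAM_AUTH_NONE,   /* "none"  - no authentication           */
    STREAM_AUTH_OVIS,   /* "ovis"  - shared secret word          */
    STREAM_AUTH_MUNGE,  /* "munge" - uid/gid verified via munged */
};

/* Delivery context a subscriber callback might receive (invented names). */
struct stream_delivery {
    enum stream_auth_method auth; /* how the publisher's connection was authenticated */
    uid_t uid;                    /* publisher uid (meaningful for "munge")           */
    gid_t gid;                    /* publisher gid (meaningful for "munge")           */
    const char *data;             /* stream payload                                   */
    size_t data_len;
};

/* Example subscriber policy: only act on data from authenticated peers. */
static int stream_cb(const struct stream_delivery *d, void *ctxt)
{
    (void)ctxt;
    if (d->auth == STREAM_AUTH_NONE) {
        fprintf(stderr, "ignoring unauthenticated stream data\n");
        return 0;
    }
    printf("accepted %zu bytes from uid=%d gid=%d\n",
           d->data_len, (int)d->uid, (int)d->gid);
    return 0;
}

int main(void)
{
    const char *msg = "{\"event\":\"job_start\"}";
    struct stream_delivery d = {
        .auth = STREAM_AUTH_MUNGE,
        .uid = 1001, .gid = 100,
        .data = msg, .data_len = strlen(msg),
    };
    return stream_cb(&d, NULL);
}
```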
We should add a white-list/black-list feature to ldmsd to allow/disallow the publishing of data.
Add a "persistent-connection" API for streams.
There has been long-standing interest in running 'ldmsd' as a particular user. There are a number of ways to do this, including:
- Run a separate ldmsd daemon as user 'someone'
- Add a capability for an ldmsd running as root to fork, seteuid/setegid(someone), and continue (see the sketch after this list)
  - RDMA transports hate to be fork/exec'd
  - What configuration, if any, is inherited from the parent?
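A minimal sketch of that fork-and-drop-privileges flow is below (this is not ldmsd code). The note above mentions seteuid/setegid; for simplicity the child here does a permanent drop with setgid/setuid, and "nobody" stands in for 'someone'.

```c
/*
 * Minimal sketch (not ldmsd code) of the fork-and-drop-privileges flow:
 * a root process forks and the child permanently switches to user "someone"
 * before continuing. Groups must be dropped before the uid, because after
 * setuid() the process no longer has permission to change its gids.
 */
#define _GNU_SOURCE
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_as(const char *user)
{
    struct passwd *pw = getpwnam(user);
    if (!pw) {
        fprintf(stderr, "no such user: %s\n", user);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: drop supplementary groups, then gid, then uid */
        if (setgroups(0, NULL) || setgid(pw->pw_gid) || setuid(pw->pw_uid)) {
            perror("drop privileges");
            _exit(1);
        }
        /* ... continue as 'user': open transports, load samplers, etc. ... */
        printf("child running as uid=%d gid=%d\n", (int)getuid(), (int)getgid());
        _exit(0);
    }
    /* parent: supervise (here we just wait for) the child */
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    /* "nobody" stands in for 'someone'; actually switching requires root */
    return run_as("nobody") ? EXIT_FAILURE : EXIT_SUCCESS;
}
```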
There are security issues around this:
- What uid/gid do you allow 'someone' to apply to the sets this ldmsd publishes?
- Simply disallowing the setting of uid/gid with the API is not sufficient, because the user could write a sampler that modifies the set memory directly
- Mark S: understanding and advertising the data sizes and the computation (from the set data size), especially for multiple feeds going from the aggregators off to the monitoring cluster.
- Phil R: ...and measuring it live (Ben's dstat sampler, and then v4-5 tracking of whether ldmsd's are keeping up).
- we may want to add network usage counters to the ldmsd zap transport and publish the result (each ldms message has a known size). Count message bytes and message events as atomic uint64_t; RDMA pulls need to be counted at the puller, and sock perhaps at both ends. dstat currently reports read/write io bytes. (A counter sketch follows these discussion notes.)
- Chris M: holding data at the aggregator because of connecting to Kafka etc.
- Melissa A and Chris: getting rid of Kafka.
- What are the issues related to monitoring ldmsd from a web service?
- permissions (uid/gid) are carried in the meta data, but we are ignoring them at the store
- what would have to change (if anything) at intermediate LDMS for apps to get access to some data for response (security limitations)
- changing the rwx model to enable more fine-grained access control
- if you don't have permissions, you can't get a handle to the set. You also cannot push changes to that set. You can change your own local copy, of course, but then you would have to have been authenticated to be in the ldms ecosystem.
- TODO: big security review and understanding. will take it to GitHub issues.
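As a concrete illustration of the network-counter item above (invented names, not the actual zap/ldms code), the sketch below keeps per-transport message and byte counters as atomic uint64_t values that the send/receive paths bump and a dstat-like reporter reads.

```c
/*
 * Illustrative sketch of the counter idea in the notes above (not the actual
 * zap/ldms API): per-transport message and byte counters kept as atomic
 * uint64_t so hot send/receive paths can update them without a lock, while a
 * dstat-like reporter reads them at any time.
 */
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct xprt_stats {
    _Atomic uint64_t msg_events;  /* number of messages sent/received */
    _Atomic uint64_t msg_bytes;   /* total message bytes              */
};

/* Called on every send/receive completion; each ldms message has a known size. */
static inline void xprt_stats_count(struct xprt_stats *s, uint64_t msg_len)
{
    atomic_fetch_add_explicit(&s->msg_events, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->msg_bytes, msg_len, memory_order_relaxed);
}

/* Read side, e.g. for publishing the totals into a metric set. */
static void xprt_stats_read(struct xprt_stats *s, uint64_t *events, uint64_t *bytes)
{
    *events = atomic_load_explicit(&s->msg_events, memory_order_relaxed);
    *bytes  = atomic_load_explicit(&s->msg_bytes, memory_order_relaxed);
}

int main(void)
{
    struct xprt_stats stats = { 0 };
    xprt_stats_count(&stats, 4096); /* e.g. an RDMA pull of a 4 KiB set, counted at the puller */
    xprt_stats_count(&stats, 512);  /* e.g. a sock message, possibly counted at both ends      */
    uint64_t ev, by;
    xprt_stats_read(&stats, &ev, &by);
    printf("events=%" PRIu64 " bytes=%" PRIu64 "\n", ev, by);
    return 0;
}
```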
How do we go about reducing the number of schema in the system? The immutable nature of a schema, coupled with most systems having many different kinds of nodes, has resulted in an unreasonably large number of schema.
The problems are things like:
- different number of cores
  - multiple schema for metric sets that keep data per core even though they contain the "exact same" data
- different architectures (e.g. knl vs. haswell)
  - /proc/meminfo and /proc/vmstat have extra entries that result in extra schema
- simple configuration inconsistency
  - the schema name is set by the configuration, which means that if the configuration is not consistent, identical metric sets will have different schema names
All of the above create real issues with analysis and visualization.
For schema that have a large number of instances (e.g. /dev/disk), it is probably better to do the following (a sketch in C follows this list):
- have a single metric set that contains all of the disks
- when the number of disks on the node changes:
  - ldms_set_delete() --> sends a message to the peer that the set is gone
  - create a new schema and associated metric set with the greater or lesser number of disks
  - ldms_set_new(new_schema), ldms_set_publish() --> peer gets notified there's a new set
- it becomes incumbent upon the consumer (e.g. store_csv, store_sos) to determine and honor the number of entries in the schema:
  - this could be done by convention, with metrics that have a certain "known" prefix, e.g. "resource_count:"
- when the set contains a smaller number of resources (e.g. DVS mount points), the ldmsd set group is a reasonable solution since the I/O multiplication that results is not a big issue
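Here is a sketch in C of that delete/recreate flow, using the ldms_* calls named above. The schema and metric names ("procdiskstats_%d", "read_bytes", "write_bytes", "resource_count") are illustrative, and exact function signatures and header paths may vary across LDMS versions.

```c
/*
 * Sketch of the delete/recreate flow described above. The ldms_* calls are
 * the ones named in the notes; exact signatures, header paths, and metric
 * names may differ between LDMS versions, so treat this as illustrative
 * rather than build-ready.
 */
#include <stdint.h>
#include <stdio.h>
#include <ldms/ldms.h>          /* header location may differ per install */

static ldms_set_t disk_set;     /* current per-disk metric set */
static ldms_schema_t disk_schema;

/* (Re)build the metric set when the number of disks on the node changes. */
int rebuild_disk_set(const char *instance, int n_disks)
{
    char schema_name[128];
    int rc_idx;

    /* 1. drop the old set; the peer is told the set is gone */
    if (disk_set) {
        ldms_set_delete(disk_set);
        disk_set = NULL;
    }
    if (disk_schema) {
        ldms_schema_delete(disk_schema);
        disk_schema = NULL;
    }

    /* 2. build a new schema sized for the current disk count */
    snprintf(schema_name, sizeof(schema_name), "procdiskstats_%d", n_disks);
    disk_schema = ldms_schema_new(schema_name);
    if (!disk_schema)
        return -1;
    /* the "known" metric a store can use to honor the entry count */
    rc_idx = ldms_schema_metric_add(disk_schema, "resource_count", LDMS_V_U64);
    ldms_schema_metric_array_add(disk_schema, "read_bytes",
                                 LDMS_V_U64_ARRAY, n_disks);
    ldms_schema_metric_array_add(disk_schema, "write_bytes",
                                 LDMS_V_U64_ARRAY, n_disks);

    /* 3. create and publish the new set; the peer is notified of the new set */
    disk_set = ldms_set_new(instance, disk_schema);
    if (!disk_set)
        return -1;
    ldms_metric_set_u64(disk_set, rc_idx, (uint64_t)n_disks);
    return ldms_set_publish(disk_set);
}
```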
Set groups can be used to get around issues with system resources that come and go, e.g. disks, network interfaces. Set groups are mutable collections of other metric sets. One solution to the system resource problem could be to create separate sets for each resource and then 'group' them together into a single named entity so they can be fetched remotely as one entity.
How it works:
- A group is a configuration construct managed by ldmsd. It is not part of the LDMS protocol
- A metric set is created; however, this metric set contains only the names of its members. When an aggregator updates this group, it gets the (potentially updated) list of member metric sets
- A group is created at the source with a schema name that has a string in the name that identifies it as a group
- When the updater callback "sees" this name in the schema, it knows to look up all the entries in the group at once, and similarly with update.
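To make that description concrete, here is a hypothetical, self-contained illustration (not the ldmsd implementation; the marker string, struct, and function names are all invented) of an updater that recognizes a group by a marker in the schema name and then fans out to its members.

```c
/*
 * Hypothetical illustration of the group mechanism described above; this is
 * NOT the actual ldmsd implementation and all names here are invented. The
 * idea: the group is itself a metric set whose schema name carries a marker
 * string and whose only content is the list of member set names, so an
 * updater that recognizes the marker looks up and updates every member.
 */
#include <stdio.h>
#include <string.h>

#define GROUP_SCHEMA_MARKER "grp_"      /* invented marker string */

struct group_view {
    const char *schema_name;            /* e.g. "grp_dvs_mounts"  */
    const char *members[8];             /* member instance names  */
    int n_members;
};

/* Stand-in for looking up and updating one member set over the transport. */
static void lookup_and_update(const char *instance)
{
    printf("lookup+update %s\n", instance);
}

/* Updater callback: if the schema name marks this set as a group, fan out. */
static void updater_cb(const struct group_view *g)
{
    if (strncmp(g->schema_name, GROUP_SCHEMA_MARKER,
                strlen(GROUP_SCHEMA_MARKER)) != 0) {
        /* not a group: handle it as an ordinary metric set */
        return;
    }
    for (int i = 0; i < g->n_members; i++)
        lookup_and_update(g->members[i]);
}

int main(void)
{
    struct group_view g = {
        .schema_name = "grp_dvs_mounts",
        .members = { "nid00001/dvs0", "nid00001/dvs1" },
        .n_members = 2,
    };
    updater_cb(&g);
    return 0;
}
```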
People would like to have rpm packaging files for, at least, CentOS/RHEL and SLES. An ovis.spec file (with a generically working config) is provided in this directory. Note that this file would need to be updated every time 'make dist' produces a different archive name.
- Probably need to script the creation of the real ovis.spec from an ovis.spec.in so that the tar.gz file referenced in the spec file has the correct name
- Need to explore build dependencies
- e.g. Cython, libssl, libcrypt, etc...
- Add an ./rpm directory under ovis. In this directory we have the following:
  - Centos
    - 7
      - ovis.spec
    - 8
      - ovis.spec
  - SLES
    - 12.x
      - ovis.spec
    - 15.x
      - ovis.spec
    - ...
With this file we can do:
- make dist
- rpmbuild -ba ovis.spec
It would be great to have a GitHub trigger that would do a configure and make CFLAGS="-Wall -Werror" ...