add more info to readme
Signed-off-by: jason yang <[email protected]>
JasonYangShadow committed Oct 6, 2023
1 parent 1e7add4 commit 05ef3c6
Showing 1 changed file with 31 additions and 10 deletions.
41 changes: 31 additions & 10 deletions README.md
@@ -1,12 +1,27 @@
# Apptheus

A redesigned Prometheus Pushgateway for ephemeral and batch jobs.
Apptainer connects to Prometheus. A redesigned Prometheus Pushgateway for short-lived jobs.

## Background
To provide a unified way of collecting Apptainer stats data, we plan to employ the cgroup feature, which requires putting the starter (starter-suid) program under a newly created
sub-cgroup so that container stats can be collected and visualized.
Prometheus is a widely adopted open-source metrics collection and monitoring tool.
> Note: Prometheus only supports the pull model, meaning that Prometheus regularly (`scrape_interval = x`) pulls data from metric sources. If users want to push data to Prometheus, a metric cache component such as Pushgateway is needed. See [https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/](https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/).
To collect the cgroup stats, we are planning to deeply customize the [Pushgateway](https://github.com/apptainer/apptheus) tool, tailoring its features and adding an additional security policy. We call this tool `Apptheus`, meaning Apptainer links to Prometheus.
Pushgateway acts as a bridge (metric cache, metric source) to Prometheus, targeting metrics collection for `short-lived (ephemeral) jobs`. Unlike normal jobs, from which Prometheus can easily collect metrics, short-lived jobs require a more flexible way to push metrics. Pushgateway provides both push and pull features: any tool can push its metrics to it, and Prometheus can pull metric data from it.

> Pushgateway is somewhat similar to [Node-exporter](https://prometheus.io/docs/guides/node-exporter/), but Pushgateway is more general and can receive pushed metrics, whereas Node-exporter is specialized for collecting machine metrics.
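
The push and pull roles described above can be illustrated with a minimal in-memory sketch. This is not Pushgateway's or Apptheus's actual implementation; the class name `MetricCache`, its methods, and the metric names are hypothetical and chosen only for illustration. It shows a job pushing its metrics into a cache, and a scraper later pulling everything in the Prometheus text exposition format.

```python
class MetricCache:
    """Toy metric cache: jobs push metrics in, a scraper pulls them out."""

    def __init__(self):
        # job name -> {metric name: value}
        self._groups = {}

    def push(self, job, metrics):
        """Replace the metrics stored for `job` (what a short-lived job does before it exits)."""
        self._groups[job] = dict(metrics)

    def scrape(self):
        """Render every cached sample as one line of Prometheus text exposition format."""
        lines = []
        for job in sorted(self._groups):
            for name, value in sorted(self._groups[job].items()):
                lines.append(f'{name}{{job="{job}"}} {value}\n')
        return "".join(lines)


if __name__ == "__main__":
    cache = MetricCache()
    cache.push("batch_job", {"cpu_seconds_total": 1.5})  # hypothetical job and metric
    print(cache.scrape(), end="")
```

The cache keeps the last pushed values after the job has exited, which is what lets a pull-based scraper see metrics from jobs that no longer exist.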
#### Common Prometheus Architecture
![prometheus architecture](https://www.devopsschool.com/blog/wp-content/uploads/2021/01/What-is-Prometheus-Architecutre-1024x615.png)
> Referenced from [https://www.devopsschool.com/blog/what-is-prometheus-and-how-it-works/](https://www.devopsschool.com/blog/what-is-prometheus-and-how-it-works/)

When collecting metrics from Apptainer, several requirements should be satisfied:
1. Low invasiveness. We do not want to develop a tool that is tightly bound to Apptainer and requires intrusive changes to Apptainer itself.
2. Cgroup stats. We want to use an existing Linux feature for metrics collection.
3. Security. A customized security policy helps verify whether the caller is trusted.
4. Customized push policy. The push interval used to sample container metrics can be freely configured.

To provide a unified way of collecting Apptainer stats data, we need to put the starter (starter-suid) program under a newly created
sub-cgroup so that container stats can be collected and visualized. To collect the cgroup stats, we deeply customized the [Pushgateway](https://github.com/apptainer/apptheus) tool, tailoring its features and adding an additional security policy. We call this newly created tool `Apptheus`, meaning Apptainer links to Prometheus.
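
As a rough illustration of what collecting cgroup stats can look like, the sketch below parses a cgroup v2 flat-keyed file such as `cpu.stat`, where each line is a key followed by an integer value. It assumes a cgroup v2 hierarchy; the function names and the example directory are hypothetical, and this is not taken from the Apptheus source.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse cgroup v2 'flat keyed' content ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


def read_cpu_stat(cgroup_dir):
    """Read cpu.stat from a cgroup directory, e.g. a sub-cgroup created for one container."""
    return parse_flat_keyed(Path(cgroup_dir, "cpu.stat").read_text())


if __name__ == "__main__":
    sample = "usage_usec 1500\nuser_usec 1000\nsystem_usec 500\n"
    print(parse_flat_keyed(sample))
```

Because each container's starter process sits in its own sub-cgroup, reading that directory yields per-container figures without instrumenting the container itself.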

> Note that this tool can be used to monitor any program; it originated from the development of an Apptainer RFE.
@@ -18,18 +33,24 @@ To collect the cgroup stats, we are planning to deeply custormize the [Pushgatew
```
GET /metrics
```

## Workflow

> Note that Apptheus should be started with privileges, which means the Unix socket it creates is also privileged. The implementation therefore changes the permission of the newly created Unix socket to `0o777`, which is why an additional security check is needed, i.e., checking whether the calling program is trusted.
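
One way such a trust check could work on Linux is sketched below. This is an assumption for illustration, not a description of how Apptheus actually verifies callers: it asks the kernel for the peer's PID via `SO_PEERCRED`, resolves that PID's executable through `/proc`, and compares it against a set of trusted paths. The function names are hypothetical.

```python
import os
import socket
import struct


def peer_executable(conn):
    """Return the executable path of the process at the other end of a Unix socket.

    Linux only: relies on SO_PEERCRED and the /proc filesystem.
    """
    # The kernel returns a struct ucred made of three C ints: pid, uid, gid.
    creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i"))
    pid, _uid, _gid = struct.unpack("3i", creds)
    return os.readlink(f"/proc/{pid}/exe")


def is_trusted(conn, trusted_paths):
    """Hypothetical policy: accept the caller only if its executable is in the trusted set."""
    return peer_executable(conn) in trusted_paths


if __name__ == "__main__":
    # A connected socket pair stands in for a client connecting to the gateway socket.
    server_side, client_side = socket.socketpair()
    print(is_trusted(server_side, {"/usr/local/libexec/apptainer/bin/starter"}))
    server_side.close()
    client_side.close()
```

Since a socket with mode `0o777` lets any local user connect, the check has to rely on something the kernel reports about the peer rather than on file permissions. Resolving `/proc/<pid>/exe` for another user's process generally requires privileges, which fits a service that is started privileged.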
### Apptainer uses Apptheus
## Apptainer uses Apptheus

![workflow](doc/apptainer.png)

https://github.com/apptainer/apptheus/assets/2051711/b33c5f20-a030-4b91-a6a7-bc62fe1fc6b8


## Important Opts
## Important CLI Options
1. `--socket.path="/run/apptheus/gateway.sock"`, local socket path for verification. Default value is `/run/apptheus/gateway.sock`.
2. `--trust.path=""`, multiple trusted program paths separated by ';'. For example, for the Apptainer starter, the path is usually `/usr/local/libexec/apptainer/bin/starter`.
3. `--monitor.inverval=0.5s`, cgroup stat sample interval.
3. `--monitor.inverval=0.5s`, cgroup stat sample interval.
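
To make the value formats above concrete, here is a small sketch of how a ';'-separated trust path list and an interval such as `0.5s` could be interpreted. The helper names are hypothetical, and the duration parser is a deliberate simplification that accepts a single number with one unit; it is an assumption that the real tool accepts the same syntax.

```python
import re

# Milliseconds per supported unit (simplified: one number plus one unit only).
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_trust_paths(value):
    """Split a ';'-separated list of trusted program paths, ignoring empty entries."""
    return [p.strip() for p in value.split(";") if p.strip()]


def parse_interval_seconds(value):
    """Convert strings such as '0.5s' or '500ms' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value)
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_MS[unit] / 1000


if __name__ == "__main__":
    print(parse_trust_paths("/usr/local/libexec/apptainer/bin/starter"))
    print(parse_interval_seconds("0.5s"))
```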

### Additional Info
1. Custom metrics with Pushgateway, Prometheus and Grafana (By Nokia) [https://youtu.be/w_jvj0QKrec?si=9ykBj0U03J-b0Z6m&t=2001](https://youtu.be/w_jvj0QKrec?si=9ykBj0U03J-b0Z6m&t=2001)
2. Spring Boot Actuator lists Prometheus as a production-ready feature. [https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html](https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html)
3. Other tools that also use Pushgateway and Prometheus:
- [https://doc.arroyo.dev/introduction#metrics](https://doc.arroyo.dev/introduction#metrics)
- [https://deckhouse.io/documentation/v1.49/modules/303-prometheus-pushgateway/examples.html](https://deckhouse.io/documentation/v1.49/modules/303-prometheus-pushgateway/examples.html)
- [https://docs.dapr.io/operations/observability/metrics/prometheus/](https://docs.dapr.io/operations/observability/metrics/prometheus/)
