Merge pull request #73 from danielerez/alarm_handler
MGMT-16977: Mapping Alerts to AlarmEventRecords
danielerez authored Mar 18, 2024
2 parents 3a24a17 + 25fbdef commit 3fbe178
Showing 10 changed files with 1,098 additions and 37 deletions.
79 changes: 78 additions & 1 deletion README.md
@@ -97,7 +97,7 @@ deployment-manager-server` [configuration](.vscode/launch.json).

The resource server exposes endpoints for retrieving resource types, resource pools
and resources objects. The server relies on the Search Query API of ACM hub.
Follow the these [instructions](docs/dev/env.md#search-query-api) to enable
Follow the these [instructions](docs/dev/env_acm.md#search-query-api) to enable
and configure the search API access.

The required URL and token can be obtained
@@ -161,3 +161,80 @@ To get a list of resources in a resource pool:
$ curl -s http://localhost:8002/o2ims-infrastructureInventory/v1/resourcePools/{resourcePoolId}
/resources | jq
```

#### Alarm server

The alarm server exposes endpoints for retrieving alarms (AlarmEventRecord objects).
The server relies on the Alertmanager API of the Observability operator.
Follow these [instructions](docs/dev/env_acm.md#observability) to enable
and configure Observability.

The required URL and token can be obtained
as follows:

```
$ export BACKEND_URL=$(
oc get route -n open-cluster-management-observability alertmanager -o json |
jq -r '"https://" + .spec.host'
)
$ export BACKEND_TOKEN=$(
oc create token -n openshift-oauth-apiserver oauth-apiserver-sa --duration=24h
)
$ export RESOURCE_SERVER_URL=http://localhost:8002/o2ims-infrastructureInventory/v1/
```

Start the alarm server with a command like this:

```
$ ./oran-o2ims start alarm-server \
--log-level=debug \
--log-file=stdout \
--api-listener-address=localhost:8003 \
--cloud-id=123 \
--backend-url="${BACKEND_URL}" \
--backend-token="${BACKEND_TOKEN}" \
--resource-server-url="${RESOURCE_SERVER_URL}"
```

Notes:
* See more details regarding `api-listener-address` and `cloud-id` in the previous [section](#deployment-manager-server).
* The alarm server requires the `resource-server-url`, which is needed for fetching information about resources that are associated with retrieved alarms.

For more information about other command line flags use the `--help` flag:

```
$ ./oran-o2ims start alarm-server --help
```

##### Run and Debug

Inside _VS Code_ use the _Run and Debug_ option with the `start
alarm-server` [configuration](.vscode/launch.json).

##### Requests Examples

###### GET Alarm List

To get a list of alarms:
```
$ curl -s http://localhost:8003/o2ims-infrastructureMonitoring/v1/alarms | jq
```

###### GET an Alarm

To get a specific alarm:
```
$ curl -s http://localhost:8003/o2ims-infrastructureMonitoring/v1/alarms/{alarmEventRecordId} | jq
```

###### GET Alarm Probable Causes

To get a list of alarm probable causes:
```
$ curl -s http://localhost:8003/o2ims-infrastructureMonitoring/v1/alarmProbableCauses | jq
```

Notes:
* This API is not defined by the O2ims Interface Specification.
* The server supports the `alarmProbableCauses` endpoint for exposing a custom list of probable causes.
* The list is available in the [data folder](data/alarms/probable_causes.json) and can be customized and maintained as required.
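
When customizing the list, it can help to check that the file still parses. The following is a minimal sketch, assuming the entries keep the three keys used in `data/alarms/probable_causes.json`; `ProbableCause` and `parseProbableCauses` are hypothetical names and are not taken from the server code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProbableCause mirrors one entry of the probable causes list
// (hypothetical type, field names follow the JSON keys in the file).
type ProbableCause struct {
	ID          string `json:"probableCauseId"`
	Name        string `json:"name"`
	Description string `json:"description"`
}

// parseProbableCauses decodes a JSON array of probable causes.
func parseProbableCauses(data []byte) ([]ProbableCause, error) {
	var causes []ProbableCause
	if err := json.Unmarshal(data, &causes); err != nil {
		return nil, fmt.Errorf("failed to parse probable causes: %w", err)
	}
	return causes, nil
}

func main() {
	// Inline sample with one entry copied from the data file; a real
	// check would read the file contents with os.ReadFile instead.
	sample := []byte(`[
		{
			"probableCauseId": "NodeClockNotSynchronising",
			"name": "Clock not synchronising.",
			"description": "Clock on ostest-extraworker-1 is not synchronising. Ensure NTP is configured on this host."
		}
	]`)

	causes, err := parseProbableCauses(sample)
	if err != nil {
		panic(err)
	}
	for _, c := range causes {
		fmt.Printf("%s: %s\n", c.ID, c.Name)
	}
}
```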
32 changes: 32 additions & 0 deletions data/alarms/probable_causes.json
@@ -0,0 +1,32 @@
[
{
"probableCauseId": "Watchdog",
"name": "An alert that should always be firing to certify that Alertmanager is working properly.",
"description": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"
},
{
"probableCauseId": "UpdateAvailable",
"name": "Your upstream update recommendation service recommends you update your cluster.",
"description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.spoke1.redhat.com/settings/cluster/."
},
{
"probableCauseId": "NodeClockNotSynchronising",
"name": "Clock not synchronising.",
"description": "Clock on ostest-extraworker-1 is not synchronising. Ensure NTP is configured on this host."
},
{
"probableCauseId": "ClusterNotUpgradeable",
"name": "One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.",
"description": "In most cases, you will still be able to apply patch releases. Reason AdminAckRequired. For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.spoke1.redhat.com/settings/cluster/."
},
{
"probableCauseId": "AlertmanagerReceiversNotConfigured",
"name": "Receivers (notification integrations) are not configured on Alertmanager",
"description": "Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager."
},
{
"probableCauseId": "HighOverallControlPlaneMemory",
"name": "Memory utilization across all control plane nodes is high, and could impact responsiveness and stability.",
"description": "Given three control plane nodes, the overall memory utilization may only be about 2/3 of all available capacity. This is because if a single control plane node fails, the kube-apiserver and etcd may be slow to respond. To fix this, increase memory of the control plane nodes."
}
]
13 changes: 8 additions & 5 deletions docs/dev/env_devscripts.md
@@ -57,14 +57,15 @@ export REDFISH_EMULATOR_IGNORE_BOOT_DEVICE=True

### Installation

#### Run
#### Deploy env
```bash
cd dev-scripts
make
```

#### Clean
#### Destroy env
```bash
cd dev-scripts
make clean
```

@@ -93,7 +94,7 @@ sudo dnf install xinetd
```bash
cat /etc/NetworkManager/dnsmasq.d/openshift-ostest.conf
```
E.g. address=/api.ostest.test.metalkube.org/11.0.0.5
E.g. address=/.apps.ostest.test.metalkube.org/11.0.0.4

##### Add config file
*/etc/xinetd.d/openshift*
@@ -108,9 +109,11 @@ service openshift-ingress-ssl
protocol = tcp
user = root
wait = no
redirect = 10.0.0.5 443
redirect = 10.0.0.4 443
port = 443
only_from = 0.0.0.0/0
cps = 1000 0
instances = 1000
per_source = UNLIMITED
}
```
@@ -202,7 +205,7 @@ sudo systemctl start agent

### Import a spoke cluster
Navigate to web console:
* All Clusters > Infrastructure > Clusters > Cluster list > spoke0 > Import cluster
* All Clusters > Infrastructure > Clusters > Cluster list > spoke0 > Actions > Import cluster

### Access a spoke cluster
```bash
48 changes: 48 additions & 0 deletions internal/cmd/server/start_alarm_server.go
@@ -237,6 +237,11 @@ func (c *AlarmServerCommand) run(cmd *cobra.Command, argv []string) error {
return err
}

// Create the handler for alarms probable causes:
if err := c.createAlarmProbableCausesHandler(router); err != nil {
return err
}

// Start the API server:
apiListener, err := network.NewListener().
SetLogger(c.logger).
@@ -319,6 +324,49 @@ func (c *AlarmServerCommand) createAlarmHandler(
return nil
}

// This API is not defined by the O2ims Interface Specification.
// It is used for exposing the custom list of alarm probable causes.
func (c *AlarmServerCommand) createAlarmProbableCausesHandler(router *mux.Router) error {

// Create the handler:
handler, err := service.NewAlarmProbableCauseHandler().
SetLogger(c.logger).
Build()
if err != nil {
c.logger.Error(
"Failed to create handler",
"error", err,
)
return exit.Error(1)
}

// Create the routes:
adapter, err := service.NewAdapter().
SetLogger(c.logger).
SetPathVariables("probableCauseID").
SetHandler(handler).
Build()
if err != nil {
c.logger.Error(
"Failed to create adapter",
"error", err,
)
return exit.Error(1)
}
router.Handle(
"/o2ims-infrastructureMonitoring/{version}/alarmProbableCauses",
adapter,
).Methods(http.MethodGet)
router.Handle(
"/o2ims-infrastructureMonitoring/{version}/alarmProbableCauses/{probableCauseID}",
adapter,
).Methods(http.MethodGet)

return nil
}

func (c *AlarmServerCommand) generateAlarmmanagerApiUrl(backendURL string) (string, error) {
u, err := url.Parse(backendURL)
if err != nil {
2 changes: 1 addition & 1 deletion internal/openapi/handler_test.go
@@ -131,7 +131,7 @@ var _ = Describe("Handler", func() {
})

It("All paths start with the expected prefix", func() {
Expect(spec).To(MatchJQ(`[.paths | keys[] | select(startswith("/o2ims-infrastructureInventory/") | not)] | length`, 0))
Expect(spec).To(MatchJQ(`[.paths | keys[] | select(startswith("/o2ims-infrastructure") | not)] | length`, 0))
})

It("Contains the expected schemas", func() {
98 changes: 98 additions & 0 deletions internal/openapi/spec.yaml
@@ -42,6 +42,9 @@ tags:
- name: resources
description: |
Information about resources.
- name: alarms
description: |
Information about alarms.
paths:

@@ -284,6 +287,50 @@ paths:
schema:
$ref: "#/components/schemas/Resource"

/o2ims-infrastructureMonitoring/{version}/alarms:
get:
operationId: getAlarms
summary: Get alarms
description: |
Returns the list of AlarmEventRecords.
parameters:
- $ref: "#/components/parameters/excludeFields"
- $ref: "#/components/parameters/fields"
- $ref: "#/components/parameters/filter"
- $ref: "#/components/parameters/version"
tags:
- alarms
responses:
'200':
description: |
Successfully obtained the list of alarms.
content:
application/json:
schema:
$ref: "#/components/schemas/Alarms"

/o2ims-infrastructureMonitoring/{version}/alarms/{alarmEventRecordId}:
get:
operationId: getAlarm
summary: Get an alarm
description: |
Returns the details of an AlarmEventRecord.
parameters:
- $ref: "#/components/parameters/alarmEventRecordId"
- $ref: "#/components/parameters/excludeFields"
- $ref: "#/components/parameters/fields"
- $ref: "#/components/parameters/version"
tags:
- alarms
responses:
'200':
description: |
Successfully obtained the details of the alarm.
content:
application/json:
schema:
$ref: "#/components/schemas/Alarm"

components:

parameters:
@@ -338,6 +385,16 @@ components:
type: string
example: node_8_cores_amd64

alarmEventRecordId:
name: alarmEventRecordId
description: |
Unique identifier of an AlarmEventRecord.
in: path
required: true
schema:
type: string
example: ClusterNotUpgradeable_spoke1

fields:
name: fields
description: |
@@ -593,3 +650,44 @@ components:
type: array
items:
$ref: "#/components/schemas/Resources"

Alarm:
description: |
Information about an AlarmEventRecord.
type: object
properties:
alarmEventRecordId:
type: string
example: "ClusterNotUpgradeable_spoke1"
resourceID:
type: string
example: "my-node"
resourceTypeID:
type: string
example: "node_8_cores_amd64"
alarmRaisedTime:
type: string
example: "2024-03-10T13:21:33.613Z"
alarmChangedTime:
type: string
example: "2024-03-14T12:18:35.487Z"
alarmDefinitionID:
type: string
example: "ClusterNotUpgradeable"
probableCauseID:
type: string
example: "ClusterNotUpgradeable"
perceivedSeverity:
type: string
example: "info"
extensions:
type: object
example:
cluster: "spoke0"

Alarms:
description: |
List of alarms.
type: array
items:
$ref: "#/components/schemas/Alarm"