Starter Project #1: Deduplicating alarms to improve usability of Acto #215

tylergu · 2023-05-15T00:44:17Z

Description

Acto has found 56 bugs in 11 operators, but it reports more than two thousand alarms in total, because it raises duplicated alarms for the same bug.
The large number of alarms is a usability problem for Acto: users need to inspect a large number of alarms every time they run Acto, yet they find only a few bugs. It also makes Acto's evaluation labor-intensive.
We want to reduce or eliminate the duplicated alarms users have to inspect for each unique bug.

Solution

I have two solutions in mind. The first requires users to inspect the alarms once and then write rules that automate the inspection. It is very actionable, would improve the usability of Acto, and would dramatically reduce Acto's evaluation overhead. The second aims to deduplicate the alarms automatically. It is more ambitious but less concrete than the first.

Solution 1: making alarm inspection "one time effort" by writing rules

This solution aims to make alarm inspection for each bug a one-time effort.
Our experience in inspecting alarms is that alarms caused by the same bug share similar triggering conditions and root causes. For example, Acto found a bug in cass-operator: it is unable to delete labels from Pods/Services. This bug is triggered every time Acto tries to delete annotations/labels in the CR. It causes duplicated alarms because multiple properties in cass-operator's CR correspond to annotations/labels.

To make the inspection a one-time effort, we can provide a way for users to describe the mapping from alarms to the bug. For example, from the cass-operator label/annotation bug described above, we know that the bug is triggered whenever Acto deletes any label/annotation property in the CR, and we know which properties in the CR are labels/annotations. We can then describe the mapping by writing a rule like the following:

bug:
  name: cass-330
  input:
    properties:
      - ".*additionalLabels.*"
      - ".*additionalAnnotations.*"
    prev: ".*"
    curr: null

This way any alarm corresponding to this bug can be automatically inspected in the future.

The alarm inspection can also be turned into an interactive process: users inspect one alarm and write one rule, and the rule then automatically handles the remaining alarms corresponding to that bug so that users don't have to inspect them.
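To make the rule format concrete, here is a minimal Python sketch of how a rule like cass-330 could be matched against alarms. The Alarm and BugRule classes, their field names, and the triage function are hypothetical and are not Acto's actual data model; the sketch assumes each alarm can be reduced to one changed property together with its previous and current value.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Alarm:
    # Hypothetical shape of an alarm; Acto's real result format may differ.
    property_path: str  # e.g. "spec.additionalLabels.foo"
    prev: Any           # value before the change
    curr: Any           # value after the change (None means deleted)


@dataclass
class BugRule:
    # Mirrors the YAML rule: regexes over the property path and the values.
    name: str
    properties: List[str]
    prev: Optional[str]  # regex over str(prev); None means "value must be absent"
    curr: Optional[str]  # regex over str(curr); None means "value must be absent"


def _value_matches(pattern: Optional[str], value: Any) -> bool:
    if pattern is None:
        return value is None
    return value is not None and re.fullmatch(pattern, str(value)) is not None


def matches(rule: BugRule, alarm: Alarm) -> bool:
    path_ok = any(re.fullmatch(p, alarm.property_path) for p in rule.properties)
    return (
        path_ok
        and _value_matches(rule.prev, alarm.prev)
        and _value_matches(rule.curr, alarm.curr)
    )


def triage(
    alarms: List[Alarm], rules: List[BugRule]
) -> Tuple[Dict[str, List[Alarm]], List[Alarm]]:
    """Split alarms into those explained by a known rule and those left to inspect."""
    known: Dict[str, List[Alarm]] = {}
    unknown: List[Alarm] = []
    for alarm in alarms:
        rule = next((r for r in rules if matches(r, alarm)), None)
        if rule is None:
            unknown.append(alarm)
        else:
            known.setdefault(rule.name, []).append(alarm)
    return known, unknown
```

In the interactive workflow, users would only look at the unknown list, write a rule for the first alarm they diagnose, and rerun triage to shrink the list.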

Actions:

Reproduce the cass-operator experiment and go through the alarms to get familiar with what information Acto provides for each alarm, and what information is needed to classify each alarm as a false positive or as a particular bug.
For each bug in cass-operator, determine the information needed to map the alarms to the bug. If the information provided by Acto is not enough, what additional information is needed?
Design an interface for writing the mapping from alarms to bugs. It can be any reasonable format; the rule shown above is just an example. If YAML is not expressive enough, we can consider other formats for the interface.

Solution 2: deduplicating alarms

This is a much more ambitious solution: deduplicate alarms without any manual effort.
There is existing work on bucketing failed tests:

https://clairelegoues.com/papers/ase18scb.pdf
https://www.comp.nus.edu.sg/~abhik/pdf/FASE17.pdf
However, I am not sure how these techniques can be applied to Acto's alarms.
Acto's alarms are hard to deduplicate because we do not have much information:
Acto runs end-to-end tests, so we do not have the program trace
the majority of the bugs do not cause explicit errors, so it is hard to know which part of the code caused the bug

The first step for this solution would be to deduplicate the bugs with explicit errors, e.g. crash bugs.
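As one possible starting point for that first step, the sketch below buckets alarms that carry an explicit error message by a normalized signature. This is a generic text-normalization heuristic, not a technique taken from the linked papers or from Acto; the (alarm id, message) input format, the sample ids, and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Volatile substrings that differ between runs of the same crash.
# Order matters: UUIDs and hex addresses are replaced before bare numbers.
_VOLATILE = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def signature(message: str) -> str:
    """Normalize an error message so that duplicates map to the same string."""
    for pattern, placeholder in _VOLATILE:
        message = pattern.sub(placeholder, message)
    # Collapse whitespace so formatting differences do not split buckets.
    return " ".join(message.split())


def bucket(alarms: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (alarm_id, error_message) pairs by normalized signature."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for alarm_id, message in alarms:
        buckets[signature(message)].append(alarm_id)
    return dict(buckets)
```

Users would then inspect one representative alarm per bucket. Such normalization can both over-merge and under-merge, so the buckets are only a first approximation.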

tianyin · 2023-06-05T22:57:22Z

Discussed the task with @Spedoske and he will take it.

The plan is to first use Acto to test the RabbitMQ operator and inspect the testing results. This will give @Spedoske a good idea of how duplicated the alarms are.

Then we can implement the dedup feature.

@Spedoske please read the papers linked in the task and think about how to do it. It's a very challenging task actually.

Spedoske · 2023-06-09T02:58:53Z

The plan is to first use Acto to test the RabbitMQ operator and inspect the testing results.

I can run Acto now. I got the alarm report and can also reproduce some of the trials.
I planned to use Kubernetes events to classify misconfigurations and bugs, as well as to bucket the alarms.
See #221.

tianyin · 2023-06-09T03:02:43Z

I planned to use Kubernetes events to classify misconfigurations and bugs

@Spedoske before you start implementing anything, let's make sure we do the following two things:

Inspect all 73 alarms (I know it's tedious, but it will really help you)
Discuss the solution -- it's unclear how Kubernetes events will do the magic. We may want to involve Professor @owolabileg, who is more knowledgeable.

tianyin changed the title ~~Onboarding task #1: deduplicating alarms to improve usability of Acto~~ Starter project #1: Deduplicating alarms to improve usability of Acto May 20, 2023

tianyin changed the title ~~Starter project #1: Deduplicating alarms to improve usability of Acto~~ Starter Project #1: Deduplicating alarms to improve usability of Acto May 20, 2023

tianyin assigned Spedoske Jun 5, 2023

tianyin added the good first issue and task labels Jun 5, 2023

tianyin removed the task label Jul 4, 2023

tianyin unassigned Spedoske Feb 26, 2024
