New bugs get reported every day, and existing tickets in Launchpad get updated with comments, patches, and status changes. The triager's job is to review the traffic, deal with invalid or trivial issues, ignore stuff that's just noise, and flag items of importance, urgency, and/or readiness.
Triaging work is shared across the team, with the goal of successfully reviewing every bug filed against Ubuntu Server-related packages. Each triaging session focuses on a subset of tickets that changed during a specific time period, such as the previous day. The triager isn't responsible for solving the reported problem, only for ensuring it gets the appropriate attention.
All newly-reported issues need a triager's review. A review involves analysing a bug to determine a) whether the bug is valid, and b) whether enough information was provided. If both are true, the bug is marked 'Triaged'. Otherwise, it's set to a more appropriate state with a comment explaining why, and what the next actions are (if any).
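The review outcome can be summarised as a small decision function. This is only an illustrative sketch: the function name and its boolean inputs are hypothetical, and the returned strings simply reuse the states named in this document ("Triaged", "Invalid", "Incomplete").

```python
def review_new_bug(is_valid: bool, has_enough_info: bool) -> str:
    """Return the status a triager would set after reviewing a new bug."""
    if not is_valid:
        # e.g. a usage question rather than a real bug
        return "Invalid"
    if not has_enough_info:
        # ask the reporter for logs, version details, steps to reproduce
        return "Incomplete"
    return "Triaged"
```

Whatever the outcome, any state other than 'Triaged' should still be accompanied by a comment explaining why.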
Older issues generally require no triager action if they're progressing through their normal workflow. However, the triager should watch for comments providing new information that may make the issue more actionable.
Items in the triaging queue tend to fall into a few categories that are handled in different ways:
Bug reports are often used to track status of packaging workflow tasks:
- Sync requests
- Merge requests
- Stable release updates (SRU)
- Main Inclusion Requests (MIR)
- Freeze exception requests (FFe, UI-FFe, et al.)
- Package promotion/demotion ("Seed management/changes")
These will generally either be filed by, or assigned to, a team member. If not, investigate further. For properly owned tickets, 'No action is required' by the triager, unless something unusually weird is going on -- in which case 'Raise with the Team'. That raising could be "bringing it up in post-standup", "bringing it up in the weekly bug meeting" or "bringing it up in a chat channel and pinging related developers".
TL;DR -- do not stay alone with the weirdness. Group experience and group decisions win.
Problems that look like the result of an SRU update to a server package need to receive top priority. For such bugs, 'Add to Server-Todo Queue' and 'Raise with the Team'.
If possible, verify whether rolling back to the prior version of the package fixes the issue, and whether re-installing the update brings the issue back. If both hold, also tag the bug 'regression-update'.
Be aware that bugs are sometimes described as regressions simply because they're new to the user, and not due to actually being introduced from an SRU update. For example, the user could have had a bad config file already, but the SRU triggers a restart of the service and that is when the user notices the problem and files a bug, thinking it was the update that introduced it.
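The rollback check can be expressed as a tiny predicate. This is a sketch only: the function and parameter names are hypothetical, and the two inputs are the results of the manual checks (downgrade, then re-install) performed by the triager.

```python
from typing import List


def regression_tags(fixed_by_rollback: bool, returns_on_reinstall: bool) -> List[str]:
    """Suggest tags once the rollback/re-install check has been done.

    Only when rolling back fixes the issue AND re-installing the update
    brings it back is the update itself the likely cause.
    """
    if fixed_by_rollback and returns_on_reinstall:
        return ["regression-update"]
    # e.g. a pre-existing bad config exposed by a service restart
    return []
```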
A report of a regression caused by a security update should be passed to the security team. Tag the bug 'regression-update', mark it "Public Security" and subscribe ~ubuntu-security, which will notify them. You may also wish to ping them in #ubuntu-security on Libera.Chat to ensure handover, but since they monitor Public Security bugs this is not strictly required.
For urgent and important issues, such as ones with potential for widespread breakage, immediately 'Add to Server-Todo Queue'.
Issues that involve an important customer or VIP, or that turn up frequently in search results, can generate a lot of "repeat visits", so to speak. Just 'Raise with the Team', unless the issue is already well known, in which case 'No action is required' by the triager.
Sometimes users report problems that are really just a misunderstanding of how to use their system. Kindly redirect them to more appropriate venues for help, and/or lend any obvious advice, but otherwise close the ticket as "Invalid" due to being 'Not a Bug'.
If the ticket is missing log files, version details, or other information necessary to decide how to handle the bug, ask the user for what's needed and set the ticket to "Incomplete" due to 'Not Enough Info'.
If the issue is clearly the same as another report:
- Mark one as duplicate of the better reported (or older) issue.
- If there are existing tickets that sound very similar, make sure they are mentioned in a comment. Ask the reporter(s) to review those and identify whether it is indeed a duplicate, and if it is not, then they should elaborate on how this new one differs.
If an existing ticket pops up in the queue due to a comment, check if it is adding information.
- If it's just noise (e.g. "me too!"), 'No action is required' by the triager.
- If it seems to describe a legitimate problem, but one completely unrelated to the originally-reported issue, recommend filing a new bug report.
- Otherwise, outline the next steps needed for the bug to make progress so that interested parties have some solid guidance on what they can do to help.
If the issue can't be reproduced (e.g. in an LXC container), ask the reporter to provide the 'Missing Steps to Reproduce'.
If there seems to be enough information to reproduce the bug, try to do so in an LXC container -- itemise the steps to follow, and how to identify that the bug has indeed occurred. If it all looks good, subscribe the server team, or if the issue looks urgent and/or important 'Add to Server-Todo Queue'.
An issue that can be reproduced in a supported Ubuntu release, but not in the current development version of Ubuntu, may qualify for SRU processing.
- Make sure the reproduction steps are clearly outlined.
- If the issue is minor/trivial, it probably won't be worth SRUing, so should be closed as already fixed in development.
- If the issue is a request for a feature (not a bug fix), it may not qualify as eligible for SRU. If that's the case, either close it as fixed in development, or add bug tasks for the requested releases set as 'Wishlist', and close the main bug task as fixed.
- If the issue is a bug fix and looks important, then determine which supported Ubuntu releases will need the fix and add 'Bug Tasks' as appropriate. Even though the issue is fixed in development, leave the main task open in this case because otherwise Launchpad may not display it in reports and lists.
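The three cases above can be sketched as a lookup from the kind of issue to a handling plan. Everything here is illustrative: `SruPlan`, `plan_for_fixed_in_devel`, the `kind` strings and the `track_feature` switch are hypothetical names, not part of Launchpad or any team tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SruPlan:
    close_main_task: bool          # close the development task as fixed?
    add_release_tasks: bool        # add bug tasks for supported releases?
    release_task_importance: Optional[str] = None


def plan_for_fixed_in_devel(kind: str, track_feature: bool = False) -> SruPlan:
    """Plan for an issue reproducible in a stable release but not in devel."""
    if kind == "minor":
        # probably not worth an SRU: close as already fixed in development
        return SruPlan(close_main_task=True, add_release_tasks=False)
    if kind == "feature":
        if track_feature:
            # keep the request visible per release, marked 'Wishlist'
            return SruPlan(True, True, "Wishlist")
        return SruPlan(close_main_task=True, add_release_tasks=False)
    if kind == "important-fix":
        # leave the main task open so Launchpad keeps listing the bug
        return SruPlan(close_main_task=False, add_release_tasks=True)
    raise ValueError(f"unknown kind: {kind!r}")
```

Deciding whether an issue is "minor" or "important" remains a judgement call for the triager; the sketch only records what follows from that decision.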
If the issue has been solved in Debian, it will likely be worth merging and/or SRUing the fix.
- Make sure steps to reproduce it are identified.
- If the issue affects the development release, it is a merge opportunity. If past feature-freeze, decide if it's worth a freeze exception. Make sure there is a merge bug in Launchpad for the package, and consider 'Add to Server-Todo Queue'.
- If the issue affects a stable release and looks SRU-worthy, determine which supported Ubuntu releases will need the fix and add 'Bug Tasks' as appropriate.
All the steps of the "Already fixed in Debian" category apply here as well. We can help Debian when appropriate by:
- Filing a Debian bug about it, or chiming in if there is an existing one.
- If you create a PR for Ubuntu that can be used almost as-is, consider sending one via Salsa as well.
- Aligning our solution with Debian is not only kind, but also helps to avoid long term complex divergence and deltas.
If in doubt, or if none of the above applies, consider bringing it up via chat -- or if looking for a group discussion (and decision) tag it server-triage-discuss. We try to resolve these bugs together in our weekly meeting.
A few of our packages have common issue patterns or best practice triaging actions. This section lists them so that anyone on triage duty can find all of them in one place.
MySQL often has low quality bug reports submitted by users not fully aware of which dependency brought it onto their system. Those often fall into a few common usage errors.
Furthermore, there are a few long-standing issues that affect many users but often are reported as new.
- Due to that, we've found that a good first step in triaging MySQL is checking for duplicates.
  - In the days of mysql-5.7 (Xenial/Bionic) we tagged the common core bugs that one would duplicate new bugs to. Those are available under the mysql-5.7 triage tag.
  - Since mysql-8.0 we no longer use that tag; instead, it turned out to be more reliable to just look at recent mysql-8.0 bugs by heat to spot the duplication candidates.
- If not a duplicate, then still please update the bug title from the usual Apport "failed on postinst" to whatever the bug really is about for better recognition of the issue in any kind of overview that just lists the titles.
If these reports are about the inability to access devices or "permission denied" issues, the user often does not realise that libvirt applies an AppArmor profile to the guest for enhanced security.
If the dmesg output from the time of occurrence is not available in the bug report, ask the reporter to check for AppArmor denials at the time the problem triggers.
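When a reporter does attach kernel log output, the relevant lines can be picked out mechanically. The following is a minimal sketch, assuming denials appear as audit lines containing `apparmor="DENIED"` and that the per-guest profiles generated by libvirt have names starting with `libvirt-`; the function name and the `profile_prefix` parameter are hypothetical.

```python
from typing import List


def apparmor_denials(dmesg_text: str, profile_prefix: str = "libvirt-") -> List[str]:
    """Return kernel log lines that record an AppArmor denial for a guest profile."""
    marker = 'apparmor="DENIED"'
    # Assumption: guest profiles are named with this prefix
    profile = f'profile="{profile_prefix}'
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if marker in line and profile in line
    ]
```

The `name=` and `denied_mask=` fields of any matching line usually show which path the guest was refused access to, which is a good starting point for the follow-up comment.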
We subscribe ~ubuntu-server directly to a bug to track our community bug backlog while the bug meets the following criteria:
- Anything that, if the bug turns out to be valid, is something that would be under the ~ubuntu-server remit to fix (common use cases but not obscure ones -- although nothing stops an individual volunteering to work on an obscure use case).
  - By definition, if it's something that we wouldn't fix and request volunteers for even if we had time, then it doesn't warrant a subscription.
- This subscription is for the Ubuntu Server triage community and is not for tracking internal Canonical customer requests. Whether a Canonical customer has made a request about a particular bug makes no difference, and provides no additional priority under this process. A Canonical customer bug may still be subscribed if it qualifies under these criteria.
- If the bug is assigned to someone on our team, leave the team subscribed.
When the bug no longer meets these criteria, we unsubscribe from it.
The server-todo tag is our tag, which we use to represent "valid and we should work on it", i.e. better than just the usual "valid" backlog.
We want to assign bugs from this queue regularly. To avoid losing traction there is a weekly bug housekeeping meeting (see below) to ensure no bug gets blocked or forgotten for too long.
The goal is to have this list at around 30-40 bugs most of the time. If it
drops lower, we can refill the list with candidates from the ~ubuntu-server
subscribed bugs. However, if the list grows significantly out of this range it
becomes unrealistic to expect those issues to be handled in time, and we
should communicate that to the reporters.
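As a rough illustration, the target range can be checked with a few lines of code. The names below are hypothetical, and what counts as growing "significantly" out of range is left as a judgement call: the sketch only reports whether the count is below, inside or above the 30-40 target.

```python
TARGET_MIN = 30
TARGET_MAX = 40


def queue_advice(tagged_count: int) -> str:
    """Classify the current number of server-todo tagged bugs."""
    if tagged_count < TARGET_MIN:
        # refill from the ~ubuntu-server subscribed bugs
        return "refill"
    if tagged_count > TARGET_MAX:
        # if far above target, tell reporters handling may be delayed
        return "over-target"
    return "ok"
```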
Qualification for server-todo:
- Whatever we think that we want to work on soon. For example:
  - An important new technology for Ubuntu-Server users.
  - Great community engagement that provides debugging and/or patches.
- We want to avoid bugs where the next step will take significantly longer than one day to complete, unless the bug is particularly important. For example:
  - An Ubuntu-only feature that is important for our users -> OK to be in the list despite likely needing more time.
  - A valid crash report, but a corner case affecting only one user. All low-hanging fruit and obvious checks are done, therefore the next debug step is estimated to take at least a week -> this might be OK for the backlog, but not really for server-todo.
Make sure it is clear if the bug needs work in development or needs SRUs, by defining bug tasks accordingly. These bug tasks can help in identifying current vs. obsolete bugs.
- Only bugs that qualify for the backlog qualify here. If they aren’t suitable for the backlog (e.g. not actionable by us) then they get dropped from both server-todo and the backlog.
  - If there are any updates to the case they will reappear in the triage queue and can be reevaluated then.
The tooling will help to report these to the triager.
- Server subscription expiration -- default after 180 days
If nobody touched a bug for 180 days (~= 1 release cycle) it is reasonable to check for changed conditions. Quite often, for example, a patch one was waiting on might be available now, or a newer release fixed the bug already.
Essentially, anything that is listed here needs to be fully re-triaged to ensure the list reflects the current status. After the 180 days you will also have metrics on how many more people are affected by the bug (importance/#affected). Most commonly, it turns out that:
- Recent releases upstream (or even already in Ubuntu) have the fix -> Re-triage, consider tagging server-todo for SRU.
- The bug should have been supported by the community but nothing happened -> Re-triage importance, consider dropping the ~ubuntu-server subscription.
- A bug that was formerly considered a "real case" no longer qualifies (e.g. alternative solutions have taken hold as the way to do it) -> Re-triage importance, consider dropping the ~ubuntu-server subscription.
If you're unsure, add the server-triage-discuss tag and bring it up at the next standup.
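The 180-day rule itself is simple date arithmetic. A minimal sketch, assuming the date of the last update to the bug is already known (how the team tooling obtains it is not shown here, and the names are hypothetical):

```python
from datetime import datetime, timedelta

SUBSCRIPTION_EXPIRY = timedelta(days=180)  # roughly one release cycle


def needs_retriage(last_updated: datetime, now: datetime) -> bool:
    """True when nobody has touched the bug for 180 days or more."""
    return now - last_updated >= SUBSCRIPTION_EXPIRY
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test and makes the reference date obvious.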
Overall, we want to be honest with the bug reporter, to try to understand why an issue was not worked on, and to explain it if possible. Also, if we drop server-todo or the ~ubuntu-server subscription for any of the reasons above, always add an explanatory comment. If reporters disagree with our re-triage they will respond on the bug, and it will show up in the daily triage duty the next day to be reconsidered with their point of view taken into consideration.
In addition to the daily triage (and our ongoing dedication to resolving bugs we picked up that way) the Server Team has introduced a weekly bug housekeeping session. In that session we go through various lists, ensuring that no bugs or issues are forgotten, blocked or stalled for too long.
There are a few additional steps we take, which change over time based on current priorities. For example, looking for good candidates to do MREs in the future, or trying to assign our remaining cleanup on the Discourse documentation feedback backlog. While these may change, there is a core structure to what we always look at in this meeting:
- Check the server-todo tagged bug list:
  - Get list via ustriage:
    clear; ustriage --no-show-triage --extended --show-tagged --tag server-todo -S savebugs/todo-$(date -I'seconds').yaml -C $(ls -1t savebugs/* | head -n 1)
    The list of last week's bugs helps to identify new/closed cases and is tracked in the helpers repository.
  - Check size (see min/max above) of the server-todo tagged bug list.
  - Ensure assigned bugs make reasonable progress:
    - Discuss blockers/reasons if there was no progress.
    - Notice, enjoy and celebrate progress that was made.
  - Ensure unassigned bugs find an owner:
    - Ensure long term unassigned bugs are re-evaluated (is there a reason why they are not tackled?)
- Look at update-excuses by team to spot anything that needs our attention to migrate. If there are any, assign team members to analyse the case and ensure things are progressing.
- Look at our merges schedule to identify if we have fallen behind on any of them.
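The idea behind comparing this week's server-todo list with last week's saved list can be illustrated with plain set operations. This sketch does not reproduce ustriage's own comparison or its saved file format; it assumes the bug numbers have already been extracted from both lists, and the function name is hypothetical.

```python
from typing import Dict, List, Set


def compare_bug_lists(previous: Set[int], current: Set[int]) -> Dict[str, List[int]]:
    """Split bug numbers into new, dropped and carried-over cases."""
    return {
        # tagged since the last meeting
        "new": sorted(current - previous),
        # closed or untagged since the last meeting
        "dropped": sorted(previous - current),
        # still on the list; check these for progress
        "carried": sorted(previous & current),
    }
```

The "carried" group is the one to walk through for progress and blockers, while "new" and "dropped" give a quick view of how the list changed over the week.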
Future outlook:
As capacity permits, we also want to start looking at the sponsoring queue.
We have several stakeholders to keep up-to-date on things we've found during triage. We also want to keep the community generally informed, as well as raising issues within the team to ensure they are not being forgotten.
For the community we send a mail to [email protected] that summarises how many bugs we've triaged, and touches on the noteworthy cases. This can also be used to CC additional people that (for case-specific reasons) should be aware of a case.
An example of that would be if a security fix caused an upgrade-regression which would make us CC the uploader and/or ubuntu-security. This mail should also contain relevant information from documentation triage.
Furthermore, on cases that need immediate attention (or at least awareness) we might:
- Bring them up in the daily standup (if they need a discussion/decision that one can't do alone).
- Ping a subject matter expert via IRC/Mattermost.
In some cases, a package maintainer might already be aware and following the case. To avoid endless re-pings on such a case, the agreement is that if the maintainer is personally subscribed (i.e. with their Launchpad username, not just indirectly via teams like Ubuntu Virtualization) then we consider the maintainer to be aware and we will not do extra pings/mentions/CC.
Depending on load we might shift things, but generally one team member is assigned to triage duty each day from Tuesday to Friday.
Monday is often more work and includes more low quality bugs since it includes all of the weekend. Therefore, Monday is rotated through all eligible bug triagers. This is organised internally in the team's Jira, and automatically creates a Task with the "bug-triage" label assigned to the person on rotation.
The ustriage tool is available as a snap and serves as our triage tool. It is maintained publicly on GitHub as ubuntu server triage.
It has options to identify bugs for the triage of the day as well as serving as a helper to check our tagged bugs, ensuring that nothing falls through the cracks. The README.md of the linked project has more details and use case examples.
This info is kept around for a while in case one encounters older bugs that still use those tags or processes. Eventually these sections will be removed.
Since early 2022 we have had enough control over the backlog bugs that the former set of server-next + server-todo bugs was reduced to just server-todo. The reason for this was that we found we had the capacity to work on bugs that didn't fulfil the rather strict rules we used to have for server-next.
Since the backlog is bigger than what can be achieved in a short time, there is extra classification via the tag server-next. That tag is set by the triager (or anyone else working on the root-cause analysis or a fix) to reflect that this is an issue that shall be tackled by the team's resources "next".
Another reason to add server-next in some cases is to preserve high quality contributions from the community. An example might be a report where the user has already bisected the problem and created a patch. In those cases the benefit diminishes through bit rot far too quickly, so handling it next helps to retain the work the reporters did. Conversely, it might encourage others to provide more high quality bugs.
The goal is to have this list at around 20 bugs most of the time. If it drops below that, we can refill it with candidates from the ~ubuntu-server subscribed bugs. But if it grows significantly out of this range it is unrealistic to expect those issues to be handled in time, and we should communicate that to the reporters.
The rules of the server-next tag are as follows:
- Must not tag unless the bug is actionable. This doesn't mean it must have a patch, only that a developer has enough information to work on the bug, even if it means more debugging.
- Tag only if one of these two things is true:
  - Delays will discourage this excellent community contribution.
  - You believe it affects a major use case for Ubuntu Server users. In this case you should also set the bug Importance.
- The set of all bugs tagged server-next must be kept small. If it grows, the lowest priority bugs tagged server-next must be removed until the list isn’t too big.
- This tag is for the Ubuntu Server triage community and is not for tracking internal Canonical customer requests. Whether a Canonical customer has made a request in relation to a particular bug makes no difference and provides no additional priority under this process. A Canonical customer bug may still be tagged if it qualifies under these criteria.
- If the bug is assigned to or otherwise owned by someone on our team, there is no need to tag it.
- Remove the tag when the bug is assigned to or otherwise owned by someone on our team.
Server-next expiration - default after 60 days
If we considered a bug actionable and added it to server-next, but then no update happened in 60 days, that usually means something went wrong. Often bugs are blocked on external constraints. This needs to be evaluated case by case. Most commonly, it turns out:
- that the bug is not solvable/reasonable the way it was planned -> re-triage, maybe drop server-next.
- that it is actually fixed or otherwise progressed without update -> update the bug.
- that we failed to give it the required focus -> add the server-triage-discuss tag to the bug and bring it up at the next standup.