Skip to content

Commit 960c67f

Browse files
committed
Introduce confidence levels
1 parent 933e58d commit 960c67f

File tree

3 files changed

+82
-46
lines changed

3 files changed

+82
-46
lines changed

docs/comparison-analysis.md

+16-4
Original file line numberDiff line numberDiff line change
@@ -22,13 +22,15 @@ At the core of comparison analysis are the collection of test results for the tw
2222

2323
Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level the analysis performed takes the following form:
2424

25-
Are there 1 or more _significant_ test results that indicate performance changes. If all significant test results indicate regressions (i.e., all percent relative changes are positive), then artifact B represents a performance regression over artifact A. If all significant test results indicate improvements (i.e., all percent relative changes are negative), then artifact B represents a performance improvement over artifact B. If some significant test results indicate improvement and others indicate regressions, then the performance change is mixed.
25+
How many _significant_ test results indicate performance changes? If all significant test results indicate regressions (i.e., all percent relative changes are positive), then artifact B represents a performance regression over artifact A. If all significant test results indicate improvements (i.e., all percent relative changes are negative), then artifact B represents a performance improvement over artifact B. If some significant test results indicate improvement and others indicate regressions, then the performance change is mixed.
2626

27-
* What makes a test result significant?
27+
Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are at least confident that the results are "probably relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
28+
29+
### What makes a test result significant?
2830

2931
A test result is significant if the relative change percentage meets some threshold. What the threshold is depends of whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set to 0.1%.
3032

31-
* What makes a test case "dodgy"?
33+
### What makes a test case "dodgy"?
3234

3335
A test case is "dodgy" if it shows signs of either being noisy or highly variable.
3436

@@ -42,4 +44,14 @@ Any relative delta change that is above a threshold (currently 0.1) is considere
4244

4345
A highly variable test case is one where a certain percentage (currently 5%) of relative delta changes are significant. The logic being that test cases should only display significant relative delta changes a small percentage of the time.
4446

45-
A noisy test case is one where of all the non-significant relative delta changes, the average delta change is still above some threshold (0.001). The logic being that non-significant changes should, on average, being very close to 0. If they are not close to zero, then they are noisy.
47+
A noisy test case is one where of all the non-significant relative delta changes, the average delta change is still above some threshold (0.001). The logic being that non-significant changes should, on average, being very close to 0. If they are not close to zero, then they are noisy.
48+
49+
### How is confidence in whether a test analysis is "relevant" determined?
50+
51+
The confidence in whether a test analysis is relevant depends on the number of significant test results. Depending on that number a confidence level is reached:
52+
53+
* Maybe relevant: 0-3 changes
54+
* Probably relevant: 4-6 changes
55+
* Definitely relevant: >6 changes
56+
57+
Note: changes can be any combination of positive or negative changes.

site/src/comparison.rs

+64-40
Original file line numberDiff line numberDiff line change
@@ -122,24 +122,30 @@ pub async fn handle_compare(
122122

123123
async fn populate_report(comparison: &Comparison, report: &mut HashMap<Direction, Vec<String>>) {
124124
if let Some(summary) = ComparisonSummary::summarize_comparison(comparison) {
125-
if let Some(direction) = summary.direction() {
126-
let entry = report.entry(direction).or_default();
125+
if summary.confidence().is_definitely_relevant() {
126+
if let Some(direction) = summary.direction() {
127+
let entry = report.entry(direction).or_default();
127128

128-
entry.push(summary.write(comparison).await)
129+
entry.push(summary.write(comparison).await)
130+
}
129131
}
130132
}
131133
}
132134

133135
pub struct ComparisonSummary {
134-
hi: Option<TestResultComparison>,
135-
lo: Option<TestResultComparison>,
136+
/// Significant comparisons
137+
comparisons: Vec<TestResultComparison>,
136138
}
137139

138140
impl ComparisonSummary {
139141
pub fn summarize_comparison(comparison: &Comparison) -> Option<ComparisonSummary> {
140-
let mut benchmarks = comparison.get_benchmarks().collect::<Vec<_>>();
142+
let mut comparisons = comparison
143+
.get_benchmarks()
144+
.filter(|c| c.is_significant())
145+
.cloned()
146+
.collect::<Vec<_>>();
141147
// Skip empty commits, sometimes happens if there's a compiler bug or so.
142-
if benchmarks.len() == 0 {
148+
if comparisons.len() == 0 {
143149
return None;
144150
}
145151

@@ -148,27 +154,17 @@ impl ComparisonSummary {
148154
.partial_cmp(&b2.log_change())
149155
.unwrap_or(std::cmp::Ordering::Equal)
150156
};
151-
let lo = benchmarks
152-
.iter()
153-
.enumerate()
154-
.min_by(|&(_, b1), &(_, b2)| cmp(b1, b2))
155-
.filter(|(_, c)| c.is_significant() && !c.is_increase())
156-
.map(|(i, _)| i);
157-
let lo = lo.map(|lo| benchmarks.remove(lo)).cloned();
158-
let hi = benchmarks
159-
.iter()
160-
.enumerate()
161-
.max_by(|&(_, b1), &(_, b2)| cmp(b1, b2))
162-
.filter(|(_, c)| c.is_significant() && c.is_increase())
163-
.map(|(i, _)| i);
164-
let hi = hi.map(|hi| benchmarks.remove(hi)).cloned();
157+
comparisons.sort_by(cmp);
165158

166-
Some(ComparisonSummary { hi, lo })
159+
Some(ComparisonSummary { comparisons })
167160
}
168161

169162
/// The direction of the changes
170163
pub fn direction(&self) -> Option<Direction> {
171-
let d = match (&self.hi, &self.lo) {
164+
let d = match (
165+
self.largest_positive_change(),
166+
&self.largest_negative_change(),
167+
) {
172168
(None, None) => return None,
173169
(Some(b), None) => b.direction(),
174170
(None, Some(b)) => b.direction(),
@@ -178,22 +174,31 @@ impl ComparisonSummary {
178174
Some(d)
179175
}
180176

181-
/// The changes ordered by their signficance (most significant first)
182-
pub fn ordered_changes(&self) -> Vec<&TestResultComparison> {
183-
match (&self.hi, &self.lo) {
184-
(None, None) => Vec::new(),
185-
(Some(b), None) => vec![b],
186-
(None, Some(b)) => vec![b],
187-
(Some(a), Some(b))
188-
if b.log_change()
189-
.abs()
190-
.partial_cmp(&a.log_change().abs())
191-
.unwrap_or(std::cmp::Ordering::Equal)
192-
== std::cmp::Ordering::Greater =>
193-
{
194-
vec![b, a]
195-
}
196-
(Some(a), Some(b)) => vec![a, b],
177+
/// The n largest changes ordered by the magnitude of their change
178+
pub fn largest_changes(&self, n: usize) -> Vec<&TestResultComparison> {
179+
let mut changes: Vec<&TestResultComparison> = self.comparisons.iter().take(n).collect();
180+
changes.sort_by(|a, b| {
181+
b.log_change()
182+
.abs()
183+
.partial_cmp(&a.log_change().abs())
184+
.unwrap_or(std::cmp::Ordering::Equal)
185+
});
186+
changes
187+
}
188+
189+
pub fn largest_positive_change(&self) -> Option<&TestResultComparison> {
190+
self.comparisons.first().filter(|s| s.is_increase())
191+
}
192+
193+
pub fn largest_negative_change(&self) -> Option<&TestResultComparison> {
194+
self.comparisons.last().filter(|s| !s.is_increase())
195+
}
196+
197+
pub fn confidence(&self) -> ComparisonConfidence {
198+
match self.comparisons.len() {
199+
0..=3 => ComparisonConfidence::MaybeRelevant,
200+
4..=6 => ComparisonConfidence::ProbablyRelevant,
201+
_ => ComparisonConfidence::DefinitelyRelevant,
197202
}
198203
}
199204

@@ -213,14 +218,33 @@ impl ComparisonSummary {
213218
let end = &comparison.b.artifact;
214219
let link = &compare_link(start, end);
215220

216-
for change in self.ordered_changes() {
221+
for change in self.largest_changes(2) {
217222
write!(result, "- ").unwrap();
218223
change.summary_line(&mut result, Some(link))
219224
}
220225
result
221226
}
222227
}
223228

229+
/// The amount of confidence we have that a comparison actually represents a real
230+
/// change in the performance characteristics.
231+
#[derive(Clone, Copy)]
232+
pub enum ComparisonConfidence {
233+
MaybeRelevant,
234+
ProbablyRelevant,
235+
DefinitelyRelevant,
236+
}
237+
238+
impl ComparisonConfidence {
239+
pub fn is_definitely_relevant(self) -> bool {
240+
matches!(self, Self::DefinitelyRelevant)
241+
}
242+
243+
pub fn is_atleast_probably_relevant(self) -> bool {
244+
matches!(self, Self::DefinitelyRelevant | Self::ProbablyRelevant)
245+
}
246+
}
247+
224248
/// Compare two bounds on a given stat
225249
///
226250
/// Returns Ok(None) when no data for the end bound is present

site/src/github.rs

+2-2
Original file line numberDiff line numberDiff line change
@@ -606,7 +606,7 @@ async fn categorize_benchmark(
606606
const DISAGREEMENT: &str = "If you disagree with this performance assessment, \
607607
please file an issue in [rust-lang/rustc-perf](https://github.com/rust-lang/rustc-perf/issues/new).";
608608
let (summary, direction) = match ComparisonSummary::summarize_comparison(&comparison) {
609-
Some(s) if s.direction().is_some() => {
609+
Some(s) if s.confidence().is_atleast_probably_relevant() => {
610610
let direction = s.direction().unwrap();
611611
(s, direction)
612612
}
@@ -630,7 +630,7 @@ async fn categorize_benchmark(
630630
"This change led to significant {} in compiler performance.\n",
631631
category
632632
);
633-
for change in summary.ordered_changes() {
633+
for change in summary.largest_changes(2) {
634634
write!(result, "- ").unwrap();
635635
change.summary_line(&mut result, None)
636636
}

0 commit comments

Comments
 (0)