Data extrapolation/intrapolation #253
Merged: 7 commits into rust-lang:master on Jul 14, 2018

Conversation

Mark-Simulacrum
Member

@Mark-Simulacrum commented Jun 29, 2018

Fixes #249.

Extrapolates on a run-based level across history.

@nnethercote
Contributor

Apologies: I'm having trouble imagining the problems you've described above. I find
this stuff a lot easier to think about with concrete examples.

Consider the following.

  • A window of 20 commits, labeled 1..20.
  • Four benchmarks: cargo, futures, parser, serde
    • cargo is added partway through, parser is removed partway through, serde
      breaks intermittently (completely or just for NLL)
  • Two BuildKind+RunKind combinations: Check/Clean, Check/NLL
    • I don't think it's worth worrying about addition/removal of
      BuildKind+RunKind combinations. One day NLL will not be needed but then we
      can just remove it wholesale.
  • Just the "instructions" metric, measured in billions.
    • I don't think it's possible to gather a subset of the metrics -- we either
      get all of them or none of them -- so we don't have to worry about some
      being present and some not.
-----------------------------------------------------------------------------
commit  cargo           futures         parser          serde   
        Clean, NLL      Clean, NLL      Clean, NLL      Clean, NLL
-----------------------------------------------------------------------------
1       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
2       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
3       X,    X         5.0, 6.0        2.0, 3.0        I,   I
4       X,    X         5.0, 6.0        2.0, 3.0        I,   I
5       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
6       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
7       X,    X         5.0, 6.0        X,   X          8.0, 10.0
8       X,    X         5.0, 6.0        X,   X          8.0, 10.0
9       X,    X         5.0, 6.0        X,   X          8.0, 10.0
10      X,    X         5.0, 6.0        X,   X          I,   I
11      X,    X         5.0, 6.0        X,   X          I,   I
12      X,    X         5.0, 6.0        X,   X          I,   I
13      X,    X         5.0, 6.0        X,   X          8.0, 10.0
14      X,    X         5.0, 6.0        X,   X          8.0, 10.0
15      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
16      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
17      30.0, 40.0      5.0, 6.0        X,   X          8.0, I   
18      30.0, 40.0      5.0, 6.0        X,   X          8.0, I   
19      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
20      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
-----------------------------------------------------------------------------

For simplicity, I haven't shown any variation in the measurements for each
column.

So, overall we will have a single graph with two data series (Clean & NLL). Each
data point will be a summary (i.e. average) of 4 values: 1..4 measured values,
and 0..3 interpolated/extrapolated values.

  • Extrapolations (X):

    • cargo: copy the 30.0,40.0 from commit 15 back through commits 1..14
    • parser: copy the 2.0,3.0 from commit 6 forward through commits 7..20
  • Interpolations (I):

    • serde: interpolate both Clean and NLL values for commits 3..4 and 10..12
    • serde: interpolate just the NLL value for commits 17..18

I don't see any particular difficulties but I might be overlooking something.
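
For what it's worth, the fill rules above are simple enough to sketch in code. This is a
hypothetical illustration (made-up types and names, not the code in this PR): copy the nearest
measured value at the edges, interpolate linearly in between, and tag each synthesized point so it
can be told apart from real data later.

```rust
#[derive(Clone, Copy)]
struct Point {
    value: f64,
    /// True for values we synthesized rather than measured.
    interpolated: bool,
}

/// Fill gaps in one benchmark+run series over a commit window.
/// `None` marks commits where the benchmark was missing or broken.
/// Assumes at least one real measurement exists in the window.
fn fill(series: &[Option<f64>]) -> Vec<Point> {
    let measured: Vec<(usize, f64)> = series
        .iter()
        .enumerate()
        .filter_map(|(i, v)| v.map(|v| (i, v)))
        .collect();
    assert!(!measured.is_empty(), "nothing to fill from");
    series
        .iter()
        .enumerate()
        .map(|(i, v)| match v {
            Some(v) => Point { value: *v, interpolated: false },
            None => {
                let prev = measured.iter().rev().find(|&&(j, _)| j < i);
                let next = measured.iter().find(|&&(j, _)| j > i);
                let value = match (prev, next) {
                    // Gap in the middle: linear interpolation by commit index.
                    (Some(&(pi, pv)), Some(&(ni, nv))) => {
                        pv + (nv - pv) * (i - pi) as f64 / (ni - pi) as f64
                    }
                    // Edges: copy the nearest measured value (extrapolation).
                    (Some(&(_, pv)), None) => pv,
                    (None, Some(&(_, nv))) => nv,
                    (None, None) => unreachable!(),
                };
                Point { value, interpolated: true }
            }
        })
        .collect()
}

/// Each summary data point is then just the average of the benchmarks'
/// (measured or filled) values at that commit.
fn summarize(per_benchmark: &[Vec<Point>], commit: usize) -> f64 {
    per_benchmark.iter().map(|s| s[commit].value).sum::<f64>() / per_benchmark.len() as f64
}
```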

@Mark-Simulacrum: can you map your concerns onto this example data?
Feel free to modify the example as necessary to aid your explanation. Thanks!

@nnethercote
Contributor

Also, I'd be interested to see screenshots of the new summary graphs. How do they compare to the existing ones? How do they compare to the per-benchmark graphs?

@Mark-Simulacrum
Member Author

So, a scenario similar to yours, with the minor change that a change making the clean benchmarks
2x slower landed in commit 10. I've eliminated the serde and parser benchmarks for readability.

The effect of such a change is a small blip on the graph -- initially -- and then, once we restart
benchmarking cargo, the jump shows up all the way back at commit 1. Presuming we missed the
regression when it first landed, it's now effectively "hidden": we've propagated cargo back far
enough that we're perhaps no longer seeing it by default. That means we may not see a blip on the
graph at all: a 2x change in performance is just a 2.5-point increase in the summary (32.5 → 35.0)
for this specific case.

This scenario would hide the regression fairly effectively. That trades against today's graph,
where the regression is still fairly obvious (though it pales in comparison to the jump from
re-introducing cargo).

-------------------------------------------------------------------------
commit  cargo            futures        Extrap. summary    Summary today
        Clean, NLL       Clean, NLL     Clean, NLL         Clean, NLL
-------------------------------------------------------------------------
1       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0 
2       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0
3       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
4       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
5       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
6       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
7       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
8       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
9       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
10      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0       
11      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0
12      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
13      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
14      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
15      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0
16      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
17      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
18      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
19      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
20      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
-----------------------------------------------------------------------

However, the problem described above doesn't seem all that serious -- sure, we might hide
regressions because of it, but that could largely be solved by using a geometric instead of an
arithmetic mean.
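
To put rough numbers on that, using the Clean column from the table above (a back-of-the-envelope
sketch, not anything in the PR): the arithmetic mean moves 32.5 → 35.0 (~8%) when futures doubles,
while a geometric mean would move ~17.3 → ~24.5 (~41%), because it weights relative rather than
absolute changes.

```rust
fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Geometric mean via logs; only valid for positive measurements,
/// which instruction counts always are.
fn geometric_mean(xs: &[f64]) -> f64 {
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

fn main() {
    // Clean column from the table above: cargo extrapolated at 60.0,
    // futures going from 5.0 to 10.0 at commit 10.
    let before = [60.0, 5.0];
    let after = [60.0, 10.0];
    // Prints 32.5 -> 35.0 (about an 8% bump).
    println!("arithmetic: {:.1} -> {:.1}", arithmetic_mean(&before), arithmetic_mean(&after));
    // Prints 17.3 -> 24.5 (about a 41% bump).
    println!("geometric:  {:.1} -> {:.1}", geometric_mean(&before), geometric_mean(&after));
}
```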

The concern I initially had was different: it might be confusing for back-propagation to work on
something like the chart below. The addition of the NLL runs would immediately make the first
commit the baseline for "all time". That's perhaps not a bad thing for something like NLL, but the
problem I see is this: given a future rustc that is, for example, near-instantaneous at compiling
huge projects like Servo, adding a new "run"-type benchmark at that point (say, adding
0-println.patch to script-servo) would mean the historical data for the println run is effectively
back-filled from that fast compiler, perhaps making it look far too low.

The chart below also shows that we might get misleadingly constant NLL stats that are unchanged for a long stretch while overall performance is jumping around. I'm also not sure that this is a bad thing -- it might be easy to ignore -- but I worry about flat lines in some runs and highly variable lines in others. It feels odd. (This is the problem I was trying to explain before.)

Note that this PR doesn't currently do anything except whole-benchmark interpolation /
extrapolation, and the screenshot reflects that. The large jump on the 12th of June
(commit ef8cb40c) is actually caused by a change to the script-servo benchmark gaining a println
patch. I've recently landed a patch to master (and deployed) that makes that possible (though
quite hard) to see in the compare.html page -- we can now show dashes in the first column.

I believe that's the question I'm asking here: what do we expect to happen when such a change
occurs? Should we see a jump? Should we see ~nothing? My understanding is that you expect to see nothing; is that correct?

-------------------------------------
commit  cargo            futures     
        Clean, NLL       Clean, NLL  
-------------------------------------
1       10.0, X          5.0, X    
2       10.0, X          5.0, X    
3       10.0, X          5.0, X    
4       10.0, X          5.0, X    
5       10.0, X          5.0, X    
6       10.0, X          5.0, X    
7       X,    X          5.0, X    
8       X,    X          5.0, X    
9       X,    X          5.0, X    
10      X,    X         10.0, 6.0    
11      X,    X         10.0, 6.0    
12      X,    X         10.0, 6.0    
13      X,    X         10.0, 6.0    
14      X,    X         10.0, 6.0    
15      60.0, 40.0      10.0, 6.0    
16      60.0, 40.0      10.0, 6.0    
17      60.0, 40.0      10.0, 6.0    
18      60.0, 40.0      10.0, 6.0    
19      60.0, 40.0      10.0, 6.0    
20      60.0, 40.0      10.0, 6.0    
-------------------------------------

[screenshot]

@nnethercote
Contributor

The first example looks exactly like I'd expect. I can see your concern, but this is an extreme example, containing only two benchmarks, with one much larger than the other. In the real case, where we have lots of benchmarks, the masking effect you described should be much smaller. And if we normalize measurements so that each benchmark has equal weight, that will further reduce the potential for masking.
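
A minimal sketch of that normalization, hypothetical rather than anything in the PR: divide each
benchmark by its own value at a reference commit before averaging, so a big benchmark like cargo
can't drown out a small one.

```rust
/// `benchmarks` holds one already-filled series per benchmark, all the same
/// length; `reference` is the commit index to normalize against.
fn normalized_summary(benchmarks: &[Vec<f64>], reference: usize) -> Vec<f64> {
    let commits = benchmarks[0].len();
    (0..commits)
        .map(|c| {
            let sum: f64 = benchmarks
                .iter()
                .map(|series| series[c] / series[reference])
                .sum();
            sum / benchmarks.len() as f64
        })
        .collect()
}
```

Normalizing against commit 1 in the table above, futures going 5.0 → 10.0 moves the two-benchmark
summary from 1.0 to 1.5 no matter how large cargo's absolute numbers are.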

The concern I initially had was different: it might be confusing for back-propagation to work on
something like the chart below. The addition of the NLL runs would immediately make the first
commit the baseline for "all time". That's perhaps not a bad thing for something like NLL, but the
problem I see is this: given a future rustc that is, for example, near-instantaneous at compiling
huge projects like Servo, adding a new "run"-type benchmark at that point (say, adding
0-println.patch to script-servo) would mean the historical data for the println run is effectively
back-filled from that fast compiler, perhaps making it look far too low.

I'm not sure if I follow this paragraph.

  • On the "all time" point: we're only looking at windows of measurements, such as one month at a time, right? If there are no measurements for a particular benchmark or configuration during that window, I would expect it to not show up. (E.g. parser has finally disappeared.)
  • On the 0-println.patch point: I think a data series should only show up on the summary graph if it potentially has data from every benchmark (see the sketch after this list).
    • "clean" does, as do "baseline incremental" and "clean incremental".
    • "nll" doesn't, because it's currently disabled for a few benchmarks (such as the stylo ones). Until we have at least one NLL measurement for every benchmark in the commit window (which would require enabling it for every benchmark) it shouldn't show up on the graph, because we can't do an apples-to-apples comparison of it with the other data series such as "clean".
    • "patched incremental: println" doesn't. I thought it did, but I see now a couple of benchmarks have no "patched incremental" measurements (e.g. serde), and a few others have a "patched incremental: dummy fn" instead (e.g. unused-warnings). Perhaps we should fix that by adding and renaming patches.
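
A sketch of that rule (hypothetical representation, not rustc-perf's types), using "has at least
one measurement somewhere in the commit window" as a stand-in for "potentially has data": a series
qualifies for the summary only if it is present in every benchmark's set of measured series.

```rust
use std::collections::{HashMap, HashSet};

/// `measurements` maps benchmark name -> set of series names (e.g. "clean",
/// "nll") it has measurements for in the window. The result is the set of
/// series that every benchmark has, i.e. the intersection of those sets.
fn summarizable_series(measurements: &HashMap<String, HashSet<String>>) -> HashSet<String> {
    let mut benchmarks = measurements.values();
    let first = match benchmarks.next() {
        Some(s) => s.clone(),
        None => return HashSet::new(),
    };
    benchmarks.fold(first, |acc, s| acc.intersection(s).cloned().collect())
}
```

Under this rule "clean" would qualify, while "nll" wouldn't until it's enabled for every benchmark
in the window.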

The chart below also shows that we might get misleadingly constant NLL stats that are unchanged for a long stretch while overall performance is jumping around. I'm also not sure that this is a bad thing -- it might be easy to ignore -- but I worry about flat lines in some runs and highly variable lines in others. It feels odd. (This is the problem I was trying to explain before.)

I think this is where the color markers will help? Interpolation/extrapolation over long periods could lead to very flat lines, but if they're marked then I hope it'll be clearer what's going on.

The large jump on the 12th of June (commit ef8cb40c) is actually caused by a change to the script-servo benchmark gaining a println patch... what do we expect to happen when such a change occurs? Should we see a jump? Should we see ~nothing? My understanding is that you expect to see nothing; is that correct?

I think the addition of a println patch shouldn't affect the lines for "clean", "baseline incremental", etc. And if that addition caused us to now have a println measurement for every benchmark, then the "patched incremental: println" line would be present and would be flat at June 12, but all the data points prior to June 12 would be colored to indicate interpolation had occurred.

Hopefully all this makes some sense? It is definitely more difficult than I first thought :)

@Mark-Simulacrum
Member Author

Yes, that all makes sense. I'll look at implementing the within-benchmark extrapolation and colored lines soon-ish.

Another question that came up while I was looking at this is whether we'd like a toggle on the site to enable/disable this -- that should be fairly easy to implement with the current setup (though it'd require a page reload). Not sure if it would be useful, though.

@nnethercote
Contributor

I don't think a toggle would be useful, except possibly for short-term testing purposes.

@Mark-Simulacrum
Member Author

Okay, looks like we have an initial version which I believe should work pretty much as expected; below are some selected screenshots. I'm not ready to merge this yet, as the startup performance is abysmal (that's when we do the computation for this) -- nearly 2 minutes.

[four screenshots]

@nnethercote
Contributor

Cool! Some observations and questions.

  • The summary graphs look promising. Nice flat lines :)
  • I'm surprised to see some non-pink dots for the NLL data series in the "Summary-debug" graph -- that suggests that all the benchmarks compiled successfully, but NLL is explicitly disabled for some of them.
  • What's happening with the upwards-curving pink on the left side of the coercions graphs? If data is missing for that entire period, I would have expected a flat line extending to the left of the first data point.
  • How does html5ever have an (all-pink) NLL line, when it's not enabled for NLL?
  • Will this extend easily to the dashboard view?

@Mark-Simulacrum
Member Author

I'm surprised to see some non-pink dots for the NLL data series in the "Summary-debug" graph -- that suggests that all the benchmarks compiled successfully, but NLL is explicitly disabled for some of them.

This PR didn't change that behavior (it seemed somewhat out of scope) -- we currently generate summary graphs/lines independently of whether all benchmarks actually ran for any given commit. I'm also not sure we have a clear expectation of what we should be doing here; we should discuss it and then implement whatever we decide.

What's happening with the upwards-curving pink on the left side of the coercions graphs? If data is missing for that entire period, I would have expected a flat line extending to the left of the first data point.

We don't limit interpolation/extrapolation to the visible window, so that is actually a smaller section of a larger curve. The reason this data looks somewhat non-linear is subtle: our interpolation works on a commit basis, not a time basis, but the X axis on these graphs is time-based. So where commits are more spread out in time, the slope looks less linear, even though the data is linear on a commit-based X axis.
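
To illustrate with made-up numbers (purely hypothetical, just to show the commit-index vs.
time-axis effect): a value interpolated linearly by commit index looks curved once plotted against
timestamps, because the per-day slope changes wherever commits bunch up in time.

```rust
fn main() {
    // (commit index, timestamp in days) -- commits bunch up towards the end.
    let commits = [(0usize, 0.0f64), (1, 5.0), (2, 9.0), (3, 12.0), (4, 13.0), (5, 14.0)];
    let first = 10.0; // measured at commit 0
    let last = 20.0; // measured at commit 5
    for &(idx, day) in &commits {
        // Linear in commit index...
        let value = first + (last - first) * idx as f64 / 5.0;
        // ...but when plotted against `day`, the per-day slope keeps changing,
        // so the interpolated segment looks curved on the graph.
        println!("day {day:>4}: {value:.1}");
    }
}
```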

[screenshot]

How does html5ever have an (all-pink) NLL line, when it's not enabled for NLL?

I believe it used to be enabled for a short period of time, and then was disabled. The extrapolation/interpolation code does not currently limit itself to some time/commit range so it "reaches" into the past far enough to find those points. I see at least two points around 2018-03-22. The behavior here might not be ideal.

Will this extend easily to the dashboard view?

It should. I haven't yet updated its code in this PR for the visual effects, though. I also haven't updated the comparison, NLL, or status pages -- I'd like suggestions, if you have any, on the correct approach for the dashboard (should we still interpolate/extrapolate? stop showing lines, as in the summary graphs?), as well as on the proper way to indicate data interpolation on the comparison page.

Specifically -- should we "color" the whole section somehow so it stands out in the overall view? Maybe add a Unicode warning sign? Something less fancy?

On a slightly less important note, do you think the color I picked somewhat spontaneously is fine? Should we go for something less visually jarring?

@nnethercote
Contributor

We don't limit interpolation/extrapolation to the visible window,

Should we? That's what I was expecting.

to indicate data interpolation on the comparison page

I don't think we need to do interpolation on the comparison page. It shows data for two commits and doesn't do any summarizing. When data is missing it's already obvious.

Something less fancy?

I'm not really sure. "Jarring" is definitely a good word for the current pink :) It probably is worth distinguishing the following two cases, though:

  • This data point was missing and we have filled it in (for individual benchmark graphs).
  • This data point is a mix of real data and interpolated data (for summary graphs and the dashboard).
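
If it helps, that distinction could be carried explicitly in the data handed to the frontend -- a
hypothetical sketch, not anything in this PR:

```rust
/// How a plotted point relates to real measurements.
enum FillKind {
    /// Individual benchmark graphs: this point was missing and filled in.
    Synthesized,
    /// Summary graphs and the dashboard: a mix of real and filled-in data.
    PartiallyInterpolated,
}
```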

@Mark-Simulacrum
Member Author

We don't limit interpolation/extrapolation to the visible window,

Should we? That's what I was expecting.

This is somewhat harder -- currently we just run the extrapolation once when we load data and then cache that. Doing so on each request is certainly possible but will probably make the UI slower. I'd rather leave it as-is today, especially since that also means that changing the view you're looking at does not affect the data: this seems like a good property to have, overall.
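
Roughly the shape of the current approach, as a sketch with hypothetical types (the real fill
logic is elided): the filled series is computed once over the full history when the data is
loaded, and every request just reads the cached result.

```rust
use std::sync::OnceLock;

struct Series {
    raw: Vec<Option<f64>>,
    filled: OnceLock<Vec<f64>>,
}

impl Series {
    fn new(raw: Vec<Option<f64>>) -> Series {
        Series { raw, filled: OnceLock::new() }
    }

    /// Computed once over the full history and cached; every request after
    /// that reads the cached values, so changing the visible window never
    /// changes the interpolated data.
    fn filled(&self) -> &[f64] {
        self.filled.get_or_init(|| {
            // Stand-in for the real interpolation: carry the last value forward.
            let mut last = 0.0;
            self.raw
                .iter()
                .map(|v| {
                    if let Some(v) = v {
                        last = *v;
                    }
                    last
                })
                .collect()
        })
    }
}
```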

@Mark-Simulacrum
Member Author

I've rebased and I believe I'm ready to merge this -- the main unimplemented features are:

  • dashboard gives no indication that data may be interpolated
    • doesn't seem all that important; it can only affect current master anyway
  • Interpolate within visible range
    • I don't think this is something I'd want to do (yet, at least)

@nnethercote If you think the above are non-blockers then we can merge and deploy this whenever.

@Mark-Simulacrum changed the title from "[WIP] Data extrapolation/intrapolation" to "Data extrapolation/intrapolation" on Jul 14, 2018
@Mark-Simulacrum merged commit 192cfcb into rust-lang:master on Jul 14, 2018
@Mark-Simulacrum deleted the interpolate branch on July 14, 2018 at 13:23
@nnethercote
Contributor

Apologies for the slow response.

Overall, it's great to see this land:

  • It makes it obvious when individual benchmarks were broken for extended periods of time. E.g. serde and tokio-webpush-simple both were broken in June.
  • The summary graphs are much more meaningful now.

There are still some rough edges, though.

NLL had a big jump on the summary-debug graph around July 2. I can't really see why, though it might be related to some missing extrapolations (see below).

Extrapolation is missing on the RHS of graphs, e.g. the NLL results for cargo-debug (and several others). It's also missing from the LHS in places where it's missing from the RHS, e.g. the NLL results for crates.io-debug.

Why can interpolation only affect master on the dashboard? I agree that it's much more likely to affect master (because master is most likely to have bustage), but that doesn't seem guaranteed. I remember when @aturon was first doing these measurements he had a failure for one of the benchmarks on one old release.

Anyway, I think it would be useful to have the pink indications on the dashboard.

On the graphs view, there is no explanation or legend explaining what the pink means. It's probably mysterious to everyone but you and me!

The new shade of pink is better, BTW 👍 It would still be good to have two different colours, though: one meaning "partial data" (for the summary graphs and dashboard) and one meaning "no data" (for the individual benchmarks).

@Mark-Simulacrum
Member Author

Extrapolation is missing on the RHS of graphs, e.g. the NLL results for cargo-debug (and several others). It's also missing from the LHS in places where it's missing from the RHS, e.g. the NLL results for crates.io-debug.

This might be due to the fact that we do not extrapolate runs/benchmarks that are not in the last 20 commits; that behavior is what allows benchmarks to eventually "go away" if they're removed, as with NLL for *-debug and *-opt.
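
A sketch of that rule (hypothetical, not this PR's actual code): only keep carrying a series
forward if it has at least one real measurement in the most recent `window` commits.

```rust
fn should_extrapolate(series: &[Option<f64>], window: usize) -> bool {
    series.iter().rev().take(window).any(|v| v.is_some())
}
```

With `window = 20`, a run like NLL for the *-debug builds stops being carried forward once it has
been absent for 20 commits, which would explain the missing extrapolation on the right-hand side
of those graphs.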

Why can interpolation only affect master on the dashboard?

Well, to clarify: interpolation can only affect master today because it's not implemented to be generic over the dataset (i.e., it only works with commits), but I agree that in theory we could extrapolate into artifact data too. For now, though, I think this will have to be out of scope; ideally we'd have data for all benchmarks across all artifacts -- it might be worth adding some basic checkboxes below the dashboard graphs that confirm that.

On the graphs view, there is no explanation or legend explaining what the pink means. It's probably mysterious to everyone but you and me!

This is generally speaking the case -- do you have any ideas about where to put documentation like this?

It would still be good two have two different colours, though, one meaning "partial data" (for the summary graphs and dashboard) and one meaning "no data" (for the individual benchmarks).

It seems like that distinction is apparent by the location; i.e. on the summary graphs it's ~always partial data and on the individual benchmarks it's always no data, so I'm not quite sure of the usefulness here. Could you perhaps elaborate?

@nnethercote
Contributor

The ideal spot for the explanation of the pink dots would be in the graphs themselves, as part of the legend for the other colored dots. I don't know if that's possible. If not, just a note at the bottom of the page is better than nothing.

It seems like that distinction is apparent by the location; i.e. on the summary graphs it's ~always partial data and on the individual benchmarks it's always no data

That is true, but it's a non-obvious distinction. If we end up with per-graph indications of what the pink dots mean, then a single color is less problematic. But if the pink dots are explained at the bottom of the page, distinguishing the two cases would be a lot clearer.
