Data extrapolation/intrapolation #253
Merged: 7 commits into rust-lang:master on Jul 14, 2018

Conversation

Mark-Simulacrum
Member

@Mark-Simulacrum commented Jun 29, 2018

Fixes #249.

Extrapolates on a run-based level across history.

@nnethercote
Contributor

Apologies: I'm having trouble imagining the problems you've described above. I find
this stuff a lot easier to think about with concrete examples.

Consider the following.

  • A window of 20 commits, labeled 1..20.
  • Four benchmarks: cargo, futures, parser, serde
    • cargo is added partway through, parser is removed partway through, serde
      breaks intermittently (completely or just for NLL)
  • Two BuildKind+RunKind combinations: Check/Clean, Check/NLL
    • I don't think it's worth worrying about addition/removal of
      BuildKind+RunKind combinations. One day NLL will not be needed but then we
      can just remove it wholesale.
  • Just the "instructions" metric, measured in billions.
    • I don't think it's possible to gather a subset of the metrics -- we either
      get all of them or none of them -- so we don't have to worry about some
      being present and some not.
-----------------------------------------------------------------------------
commit  cargo           futures         parser          serde   
        Clean, NLL      Clean, NLL      Clean, NLL      Clean, NLL
-----------------------------------------------------------------------------
1       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
2       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
3       X,    X         5.0, 6.0        2.0, 3.0        I,   I
4       X,    X         5.0, 6.0        2.0, 3.0        I,   I
5       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
6       X,    X         5.0, 6.0        2.0, 3.0        8.0, 10.0
7       X,    X         5.0, 6.0        X,   X          8.0, 10.0
8       X,    X         5.0, 6.0        X,   X          8.0, 10.0
9       X,    X         5.0, 6.0        X,   X          8.0, 10.0
10      X,    X         5.0, 6.0        X,   X          I,   I
11      X,    X         5.0, 6.0        X,   X          I,   I
12      X,    X         5.0, 6.0        X,   X          I,   I
13      X,    X         5.0, 6.0        X,   X          8.0, 10.0
14      X,    X         5.0, 6.0        X,   X          8.0, 10.0
15      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
16      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
17      30.0, 40.0      5.0, 6.0        X,   X          8.0, I   
18      30.0, 40.0      5.0, 6.0        X,   X          8.0, I   
19      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
20      30.0, 40.0      5.0, 6.0        X,   X          8.0, 10.0
-----------------------------------------------------------------------------

For simplicity, I haven't shown any variation in the measurements for each
column.

So, overall we will have a single graph with two data series (Clean & NLL). Each
data point will be a summary (i.e. average) of 4 values: 1..4 measured values,
and 0..3 interpolated/extrapolated values.

  • Extrapolations (X):

    • cargo: copy the 30.0,40.0 from commit 15 back through commits 1..14
    • parser: copy the 2.0,3.0 from commit 6 forward through commits 7..20
  • Interpolations (I):

    • serde: interpolate both Clean and NLL values for commits 3..4 and 10..12
    • serde: interpolate just the NLL value for commits 17..18

I don't see any particular difficulties but I might be overlooking something.
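
For what it's worth, the fill rules above are simple enough to sketch in code. This is a
hypothetical illustration (made-up types and names, not the code in this PR): copy the nearest
measured value at the edges, interpolate linearly in between, and tag each synthesized point so it
can be told apart from real data later.

```rust
#[derive(Clone, Copy)]
struct Point {
    value: f64,
    /// True for values we synthesized rather than measured.
    interpolated: bool,
}

/// Fill gaps in one benchmark+run series over a commit window.
/// `None` marks commits where the benchmark was missing or broken.
/// Assumes at least one real measurement exists in the window.
fn fill(series: &[Option<f64>]) -> Vec<Point> {
    let measured: Vec<(usize, f64)> = series
        .iter()
        .enumerate()
        .filter_map(|(i, v)| v.map(|v| (i, v)))
        .collect();
    assert!(!measured.is_empty(), "nothing to fill from");
    series
        .iter()
        .enumerate()
        .map(|(i, v)| match v {
            Some(v) => Point { value: *v, interpolated: false },
            None => {
                let prev = measured.iter().rev().find(|&&(j, _)| j < i);
                let next = measured.iter().find(|&&(j, _)| j > i);
                let value = match (prev, next) {
                    // Gap in the middle: linear interpolation by commit index.
                    (Some(&(pi, pv)), Some(&(ni, nv))) => {
                        pv + (nv - pv) * (i - pi) as f64 / (ni - pi) as f64
                    }
                    // Edges: copy the nearest measured value (extrapolation).
                    (Some(&(_, pv)), None) => pv,
                    (None, Some(&(_, nv))) => nv,
                    (None, None) => unreachable!(),
                };
                Point { value, interpolated: true }
            }
        })
        .collect()
}

/// Each summary data point is then just the average of the benchmarks'
/// (measured or filled) values at that commit.
fn summarize(per_benchmark: &[Vec<Point>], commit: usize) -> f64 {
    per_benchmark.iter().map(|s| s[commit].value).sum::<f64>() / per_benchmark.len() as f64
}
```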

@Mark-Simulacrum: can you map your concerns onto this example data?
Feel free to modify the example as necessary to aid your explanation. Thanks!

@nnethercote
Contributor

Also, I'd be interested to see screenshots of the new summary graphs. How do they compare to the existing ones? How do they compare to the per-benchmark graphs?

@Mark-Simulacrum
Member Author

So, a scenario similar to yours, with the minor change that a change making the clean benchmarks
2x slower landed in commit 10. I've eliminated the serde and parser benchmarks for readability.

The effect of such a change is a small blip on the graph -- initially -- and then, once we restart
benchmarking cargo, the jump shows up all the way back at commit 1. Presuming we missed the
regression when it first landed, it's now effectively "hidden": we've propagated cargo back far
enough that we're perhaps no longer seeing it by default. That means we may not see a blip on the
graph at all: a 2x change in performance is just a 2.5-point increase in the summary (32.5 → 35.0)
for this specific case.

This scenario would hide the regression fairly effectively. That trades against today's graph,
where the regression is still fairly obvious (though it pales in comparison to the jump from
re-introducing cargo).

-------------------------------------------------------------------------
commit  cargo            futures        Extrap. summary    Summary today
        Clean, NLL       Clean, NLL     Clean, NLL         Clean, NLL
-------------------------------------------------------------------------
1       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0 
2       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0
3       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
4       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
5       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
6       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
7       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
8       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
9       X,    X          5.0, 6.0       32.5, 23.0         5.0, 6.0   
10      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0       
11      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0
12      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
13      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
14      X,    X         10.0, 6.0       35.0, 23.0        10.0, 6.0 
15      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0
16      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
17      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
18      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
19      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
20      60.0, 40.0      10.0, 6.0       35.0, 23.0        35.0, 23.0 
-----------------------------------------------------------------------

However, the problem described above doesn't seem all that serious -- sure, we might hide
regressions because of it, but that could largely be solved by using a geometric instead of an
arithmetic mean.
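
To put rough numbers on that, using the Clean column from the table above (a back-of-the-envelope
sketch, not anything in the PR): the arithmetic mean moves 32.5 → 35.0 (~8%) when futures doubles,
while a geometric mean would move ~17.3 → ~24.5 (~41%), because it weights relative rather than
absolute changes.

```rust
fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Geometric mean via logs; only valid for positive measurements,
/// which instruction counts always are.
fn geometric_mean(xs: &[f64]) -> f64 {
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

fn main() {
    // Clean column from the table above: cargo extrapolated at 60.0,
    // futures going from 5.0 to 10.0 at commit 10.
    let before = [60.0, 5.0];
    let after = [60.0, 10.0];
    // Prints 32.5 -> 35.0 (about an 8% bump).
    println!("arithmetic: {:.1} -> {:.1}", arithmetic_mean(&before), arithmetic_mean(&after));
    // Prints 17.3 -> 24.5 (about a 41% bump).
    println!("geometric:  {:.1} -> {:.1}", geometric_mean(&before), geometric_mean(&after));
}
```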

The concern I initially had was different: it might be confusing for back-propagation to work on
something like the chart below. The addition of the NLL runs would immediately make the first
commit the baseline for "all time". That's perhaps not a bad thing for something like NLL, but the
problem I see is this: given a future rustc that is, for example, near-instantaneous at compiling
huge projects like Servo, adding a new "run"-type benchmark at that point (say, adding
0-println.patch to script-servo) would mean the historical data for the println run is effectively
back-filled from that fast compiler, perhaps making it look far too low.

The chart below also shows that we might get misleadingly constant NLL stats that are unchanged for a long stretch while overall performance is jumping around. I'm also not sure that this is a bad thing -- it might be easy to ignore -- but I worry about flat lines in some runs and highly variable lines in others. It feels odd. (This is the problem I was trying to explain before.)

Note that this PR doesn't currently do anything except whole-benchmark interpolation /
extrapolation, and the screenshot reflects that. The large jump on the 12th of June
(commit ef8cb40c) is actually caused by a change to the script-servo benchmark gaining a println
patch. I've recently landed a patch to master (and deployed) that makes that possible (though
quite hard) to see in the compare.html page -- we can now show dashes in the first column.

I believe that's the question I'm asking here: what do we expect to happen when such a change
occurs? Should we see a jump? Should we see ~nothing? My understanding is that you expect to see nothing; is that correct?

-------------------------------------
commit  cargo            futures     
        Clean, NLL       Clean, NLL  
-------------------------------------
1       10.0, X          5.0, X    
2       10.0, X          5.0, X    
3       10.0, X          5.0, X    
4       10.0, X          5.0, X    
5       10.0, X          5.0, X    
6       10.0, X          5.0, X    
7       X,    X          5.0, X    
8       X,    X          5.0, X    
9       X,    X          5.0, X    
10      X,    X         10.0, 6.0    
11      X,    X         10.0, 6.0    
12      X,    X         10.0, 6.0    
13      X,    X         10.0, 6.0    
14      X,    X         10.0, 6.0    
15      60.0, 40.0      10.0, 6.0    
16      60.0, 40.0      10.0, 6.0    
17      60.0, 40.0      10.0, 6.0    
18      60.0, 40.0      10.0, 6.0    
19      60.0, 40.0      10.0, 6.0    
20      60.0, 40.0      10.0, 6.0    
-------------------------------------

[screenshot]

@nnethercote
Contributor

The first example looks exactly like I'd expect. I can see your concern, but this is an extreme example, containing only two benchmarks, with one much larger than the other. In the real case, where we have lots of benchmarks, the masking effect you described should be much smaller. And if we normalize measurements so that each benchmark has equal weight, that will further reduce the potential for masking.
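
A minimal sketch of that normalization, hypothetical rather than anything in the PR: divide each
benchmark by its own value at a reference commit before averaging, so a big benchmark like cargo
can't drown out a small one.

```rust
/// `benchmarks` holds one already-filled series per benchmark, all the same
/// length; `reference` is the commit index to normalize against.
fn normalized_summary(benchmarks: &[Vec<f64>], reference: usize) -> Vec<f64> {
    let commits = benchmarks[0].len();
    (0..commits)
        .map(|c| {
            let sum: f64 = benchmarks
                .iter()
                .map(|series| series[c] / series[reference])
                .sum();
            sum / benchmarks.len() as f64
        })
        .collect()
}
```

Normalizing against commit 1 in the table above, futures going 5.0 → 10.0 moves the two-benchmark
summary from 1.0 to 1.5 no matter how large cargo's absolute numbers are.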

The concern I initially had was different: it might be confusing for back-propagation to work on
something like the chart below. The addition of the NLL runs would immediately make the first
commit the baseline for "all time". That's perhaps not a bad thing for something like NLL, but the
problem I see is this: given a future rustc that is, for example, near-instantaneous at compiling
huge projects like Servo, adding a new "run"-type benchmark at that point (say, adding
0-println.patch to script-servo) would mean the historical data for the println run is effectively
back-filled from that fast compiler, perhaps making it look far too low.

I'm not sure if I follow this paragraph.

  • On the "all time" point: we're only looking at windows of measurements, such as one month at a time, right? If there are no measurements for a particular benchmark or configuration during that window, I would expect it to not show up. (E.g. parser has finally disappeared.)
  • On the 0-println.patch point: I think a data series should only show up on the summary graph if it potentially has data from every benchmark (see the sketch after this list).
    • "clean" does, as do "baseline incremental" and "clean incremental".
    • "nll" doesn't, because it's currently disabled for a few benchmarks (such as the stylo ones). Until we have at least one NLL measurement for every benchmark in the commit window (which would require enabling it for every benchmark) it shouldn't show up on the graph, because we can't do an apples-to-apples comparison of it with the other data series such as "clean".
    • "patched incremental: println" doesn't. I thought it did, but I see now a couple of benchmarks have no "patched incremental" measurements (e.g. serde), and a few others have a "patched incremental: dummy fn" instead (e.g. unused-warnings). Perhaps we should fix that by adding and renaming patches.
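
A sketch of that rule (hypothetical representation, not rustc-perf's types), using "has at least
one measurement somewhere in the commit window" as a stand-in for "potentially has data": a series
qualifies for the summary only if it is present in every benchmark's set of measured series.

```rust
use std::collections::{HashMap, HashSet};

/// `measurements` maps benchmark name -> set of series names (e.g. "clean",
/// "nll") it has measurements for in the window. The result is the set of
/// series that every benchmark has, i.e. the intersection of those sets.
fn summarizable_series(measurements: &HashMap<String, HashSet<String>>) -> HashSet<String> {
    let mut benchmarks = measurements.values();
    let first = match benchmarks.next() {
        Some(s) => s.clone(),
        None => return HashSet::new(),
    };
    benchmarks.fold(first, |acc, s| acc.intersection(s).cloned().collect())
}
```

Under this rule "clean" would qualify, while "nll" wouldn't until it's enabled for every benchmark
in the window.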

The chart below also shows that we might get misleadingly constant NLL stats that are unchanged for a long stretch while overall performance is jumping around. I'm also not sure that this is a bad thing -- it might be easy to ignore -- but I worry about flat lines in some runs and highly variable lines in others. It feels odd. (This is the problem I was trying to explain before.)

I think this is where the color markers will help? Interpolation/extrapolation over long periods could lead to very flat lines, but if they're marked then I hope it'll be clearer what's going on.

The large jump on the 12th of June (commit ef8cb40c) is actually caused by a change to the script-servo benchmark gaining a println patch... what do we expect to happen when such a change occurs? Should we see a jump? Should we see ~nothing? My understanding is that you expect to see nothing; is that correct?

I think the addition of a println patch shouldn't affect the lines for "clean", "baseline incremental", etc. And if that addition caused us to now have a println measurement for every benchmark, then the "patched incremental: println" line would be present and would be flat at June 12, but all the data points prior to June 12 would be colored to indicate interpolation had occurred.

Hopefully all this makes some sense? It is definitely more difficult than I first thought :)

@Mark-Simulacrum
Member Author

Yes, that all makes sense. I'll look at implementing the within-benchmark extrapolation and colored lines soon-ish.

Another question that came up while I was looking at this is whether we'd like a toggle on the site to enable/disable this -- that should be fairly easy to implement with the current setup (though it'd require a page reload). Not sure if it would be useful, though.

@nnethercote
Contributor

I don't think a toggle would be useful, except possibly for short-term testing purposes.

@Mark-Simulacrum
Member Author

Okay, looks like we have an initial version which I believe should work pretty much as expected; below are some selected screenshots. I'm not ready to merge this yet, as the startup performance is abysmal (that's when we do the computation for this) -- nearly 2 minutes.

[four screenshots]

@nnethercote
Contributor

Cool! Some observations and questions.

  • The summary graphs look promising. Nice flat lines :)
  • I'm surprised to see some non-pink dots for the NLL data series in the "Summary-debug" graph -- that suggests that all the benchmarks compiled successfully, but NLL is explicitly disabled for some of them.
  • What's happening with the upwards-curving pink on the left side of the coercions graphs? If data is missing for that entire period, I would have expected a flat line extending to the left of the first data point.
  • How does html5ever have an (all-pink) NLL line, when it's not enabled for NLL?
  • Will this extend easily to the dashboard view?

@Mark-Simulacrum
Member Author

I'm surprised to see some non-pink dots for the NLL data series in the "Summary-debug" graph -- that suggests that all the benchmarks compiled successfully, but NLL is explicitly disabled for some of them.

This PR didn't change that behavior (it seemed somewhat out of scope) -- we currently generate summary graphs/lines independently of whether all benchmarks actually ran for any given commit. I'm also not sure we have a clear expectation of what we should be doing here; we should discuss it and then implement whatever we decide.

What's happening with the upwards-curving pink on the left side of the coercions graphs? If data is missing for that entire period, I would have expected a flat line extending to the left of the first data point.

We don't limit interpolation/extrapolation to the visible window, so that is actually a smaller section of a larger curve. The reason this data looks somewhat non-linear is subtle: our interpolation works on a commit basis, not a time basis, but the X axis on these graphs is time-based. So where commits are more spread out in time, the slope looks less linear, even though the data is linear on a commit-based X axis.
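
To illustrate with made-up numbers (purely hypothetical, just to show the commit-index vs.
time-axis effect): a value interpolated linearly by commit index looks curved once plotted against
timestamps, because the per-day slope changes wherever commits bunch up in time.

```rust
fn main() {
    // (commit index, timestamp in days) -- commits bunch up towards the end.
    let commits = [(0usize, 0.0f64), (1, 5.0), (2, 9.0), (3, 12.0), (4, 13.0), (5, 14.0)];
    let first = 10.0; // measured at commit 0
    let last = 20.0; // measured at commit 5
    for &(idx, day) in &commits {
        // Linear in commit index...
        let value = first + (last - first) * idx as f64 / 5.0;
        // ...but when plotted against `day`, the per-day slope keeps changing,
        // so the interpolated segment looks curved on the graph.
        println!("day {day:>4}: {value:.1}");
    }
}
```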

[screenshot]

How does html5ever have an (all-pink) NLL line, when it's not enabled for NLL?

I believe it used to be enabled for a short period of time, and then was disabled. The extrapolation/interpolation code does not currently limit itself to some time/commit range so it "reaches" into the past far enough to find those points. I see at least two points around 2018-03-22. The behavior here might not be ideal.

Will this extend easily to the dashboard view?

It should. I haven't yet updated its code in this PR for the visual effects, though. I also haven't updated the comparison, NLL, or status pages -- I'd like suggestions, if you have any, on the correct approach for the dashboard (should we still interpolate/extrapolate? stop showing lines, as in the summary graphs?), as well as on the proper way to indicate data interpolation on the comparison page.

Specifically -- should we "color" the whole section somehow so it stands out in the overall view? Maybe add a Unicode warning sign? Something less fancy?

On a slightly less important note, do you think the color I picked somewhat spontaneously is fine? Should we go for something less visually jarring?

@nnethercote
Contributor

We don't limit interpolation/extrapolation to the visible window,

Should we? That's what I was expecting.

to indicate data interpolation on the comparison page

I don't think we need to do interpolation on the comparison page. It shows data for two commits and doesn't do any summarizing. When data is missing it's already obvious.

Something less fancy?

I'm not really sure. "Jarring" is definitely a good word for the current pink :) It probably is worth distinguishing the following two cases, though:

  • This data point was missing and we have filled it in (for individual benchmark graphs).
  • This data point is a mix of real data and interpolated data (for summary graphs and the dashboard).
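
If it helps, that distinction could be carried explicitly in the data handed to the frontend -- a
hypothetical sketch, not anything in this PR:

```rust
/// How a plotted point relates to real measurements.
enum FillKind {
    /// Individual benchmark graphs: this point was missing and filled in.
    Synthesized,
    /// Summary graphs and the dashboard: a mix of real and filled-in data.
    PartiallyInterpolated,
}
```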

@Mark-Simulacrum
Member Author

We don't limit interpolation/extrapolation to the visible window,

Should we? That's what I was expecting.

This is somewhat harder -- currently we just run the extrapolation once when we load data and then cache that. Doing so on each request is certainly possible but will probably make the UI slower. I'd rather leave it as-is today, especially since that also means that changing the view you're looking at does not affect the data: this seems like a good property to have, overall.
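
Roughly the shape of the current approach, as a sketch with hypothetical types (the real fill
logic is elided): the filled series is computed once over the full history when the data is
loaded, and every request just reads the cached result.

```rust
use std::sync::OnceLock;

struct Series {
    raw: Vec<Option<f64>>,
    filled: OnceLock<Vec<f64>>,
}

impl Series {
    fn new(raw: Vec<Option<f64>>) -> Series {
        Series { raw, filled: OnceLock::new() }
    }

    /// Computed once over the full history and cached; every request after
    /// that reads the cached values, so changing the visible window never
    /// changes the interpolated data.
    fn filled(&self) -> &[f64] {
        self.filled.get_or_init(|| {
            // Stand-in for the real interpolation: carry the last value forward.
            let mut last = 0.0;
            self.raw
                .iter()
                .map(|v| {
                    if let Some(v) = v {
                        last = *v;
                    }
                    last
                })
                .collect()
        })
    }
}
```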

@Mark-Simulacrum
Member Author

I've rebased and I believe I'm ready to merge this -- the main unimplemented features are:

  • dashboard gives no indication that data may be interpolated
    • doesn't seem all that important; it can only affect current master anyway
  • Interpolate within visible range
    • I don't think this is something I'd want to do (yet, at least)

@nnethercote If you think the above are non-blockers then we can merge and deploy this whenever.

@Mark-Simulacrum changed the title from "[WIP] Data extrapolation/intrapolation" to "Data extrapolation/intrapolation" on Jul 14, 2018
@Mark-Simulacrum merged commit 192cfcb into rust-lang:master on Jul 14, 2018
@Mark-Simulacrum deleted the interpolate branch on July 14, 2018 at 13:23
@nnethercote
Contributor

Apologies for the slow response.

Overall, it's great to see this land:

  • It makes it obvious when individual benchmarks were broken for extended periods of time. E.g. serde and tokio-webpush-simple both were broken in June.
  • The summary graphs are much more meaningful now.

There are still some rough edges, though.

NLL had a big jump on the summary-debug graph around July 2. I can't really see why, though it might be related to some missing extrapolations (see below).

Extrapolation is missing on the RHS of graphs, e.g. the NLL results for cargo-debug (and several others). It's also missing from the LHS in places where it's missing from the RHS, e.g. the NLL results for crates.io-debug.

Why can interpolation only affect master on the dashboard? I agree that it's much more likely to affect master (because master is most likely to have bustage), but that doesn't seem guaranteed. I remember when @aturon was first doing these measurements he had a failure for one of the benchmarks on one old release.

Anyway, I think it would be useful to have the pink indications on the dashboard.

On the graphs view, there is no explanation or legend explaining what the pink means. It's probably mysterious to everyone but you and me!

The new shade of pink is better, BTW 👍 It would still be good to have two different colours, though: one meaning "partial data" (for the summary graphs and dashboard) and one meaning "no data" (for the individual benchmarks).

@Mark-Simulacrum
Member Author

Extrapolation is missing on the RHS of graphs, e.g. the NLL results for cargo-debug (and several others). It's also missing from the LHS in places where it's missing from the RHS, e.g. the NLL results for crates.io-debug.

This might be due to the fact that we do not extrapolate runs/benchmarks that are not in the last 20 commits; that behavior is what allows benchmarks to eventually "go away" if they're removed, as with NLL for *-debug and *-opt.
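
A sketch of that rule (hypothetical, not this PR's actual code): only keep carrying a series
forward if it has at least one real measurement in the most recent `window` commits.

```rust
fn should_extrapolate(series: &[Option<f64>], window: usize) -> bool {
    series.iter().rev().take(window).any(|v| v.is_some())
}
```

With `window = 20`, a run like NLL for the *-debug builds stops being carried forward once it has
been absent for 20 commits, which would explain the missing extrapolation on the right-hand side
of those graphs.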

Why can interpolation only affect master on the dashboard?

Well, to clarify: interpolation can only affect master today because it's not implemented to be generic over the dataset (i.e., it only works with commits), but I agree that in theory we could extrapolate into artifact data too. For now, though, I think this will have to be out of scope; ideally we'd have data for all benchmarks across all artifacts -- it might be worth adding some basic checkboxes below the dashboard graphs that confirm that.

On the graphs view, there is no explanation or legend explaining what the pink means. It's probably mysterious to everyone but you and me!

This is generally speaking the case -- do you have any ideas about where to put documentation like this?

It would still be good two have two different colours, though, one meaning "partial data" (for the summary graphs and dashboard) and one meaning "no data" (for the individual benchmarks).

It seems like that distinction is apparent by the location; i.e. on the summary graphs it's ~always partial data and on the individual benchmarks it's always no data, so I'm not quite sure of the usefulness here. Could you perhaps elaborate?

@nnethercote
Contributor

The ideal spot for the explanation of the pink dots would be in the graphs themselves, as part of the legend for the other colored dots. I don't know if that's possible. If not, just a note at the bottom of the page is better than nothing.

It seems like that distinction is apparent by the location; i.e. on the summary graphs it's ~always partial data and on the individual benchmarks it's always no data

That is true, but it's a non-obvious distinction. If we end up with per-graph indications of what the pink dots mean, then a single color is less problematic. But if the pink dots are explained at the bottom of the page, distinguishing the two cases would be a lot clearer.
