Data extrapolation/interpolation #253
Conversation
Apologies: I'm having trouble imagining the problems you've described above. I find […] Consider the following.
For simplicity, I haven't shown any variation in the measurements for each […]. So, overall we will have a single graph with two data series (Clean & Nll). Each […]
I don't see any particular difficulties but I might be overlooking something. @Mark-Simulacrum: can you map your concerns onto this example data?
Also, I'd be interested to see screenshots of the new summary graphs. How do they compare to the existing ones? How do they compare to the per-benchmark graphs?
So a similar scenario to yours, with the minor change that we landed a change making clean […]. The effect of such a change is a small blip on the graph -- initially -- and then once we restart […]. This scenario is one that would hide the regression fairly effectively. This trades today's graph […].
However, the problem described above doesn't seem all that serious -- sure, we might hide […]. The concern I initially had was different: it might be confusing for back-propagation to work on […]. The chart below also shows that we might have misleadingly constant NLL stats that are unchanged for a long stretch while the runtime performance overall may be jumping around. I'm also not sure that this is a bad thing -- it might be easy to ignore -- but I worry about flat lines in some runs and highly variable lines in others. It feels odd. (This is the problem I was trying to explain before.)

Note that this PR doesn't currently do anything except whole-benchmark interpolation/extrapolation […]. I believe that's the question I'm asking here: what do we expect to happen when such a change […]
The first example looks exactly how I'd expect. I can see your concern, but this is an extreme example, containing only two benchmarks, with one much larger than the other. In the real case, where we have lots of benchmarks, the masking effect you described should be much weaker. And if we normalize measurements so that each benchmark has equal weight, that will greatly reduce the potential for masking.
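To make the normalization idea concrete, here is a minimal sketch of an equal-weight summary (in Rust, since rustc-perf is written in Rust); the `Measurement` struct, the baseline map, and the simple arithmetic mean are illustrative assumptions, not the project's actual types or summary formula:

```rust
use std::collections::HashMap;

/// Hypothetical per-benchmark measurement for one commit.
struct Measurement {
    benchmark: &'static str,
    value: f64, // e.g. wall time in seconds or instruction count
}

/// Summarize one commit so that every benchmark carries equal weight:
/// divide each value by that benchmark's baseline before averaging, so a
/// small benchmark regressing 2x moves the summary as much as a large one.
fn normalized_summary(
    measurements: &[Measurement],
    baselines: &HashMap<&str, f64>, // benchmark name -> value at a reference commit
) -> Option<f64> {
    let mut sum = 0.0;
    let mut count = 0usize;
    for m in measurements {
        let base = *baselines.get(m.benchmark)?; // bail out if a baseline is missing
        sum += m.value / base; // 1.0 means "unchanged relative to baseline"
        count += 1;
    }
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```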
I'm not sure if I follow this paragraph.
I think this is where the color markers will help? Interpolation/extrapolation over long periods could lead to very flat lines, but if they're marked then I hope it'll be clearer what's going on.
I think the addition of a println patch shouldn't affect the lines for "clean", "baseline incremental", etc. And if that addition caused us to now have a println measurement for every benchmark, then the "patched incremental: println" line would be present and would be flat at June 12, but all the data points prior to June 12 would be colored to indicate interpolation had occurred. Hopefully all this makes some sense? It is definitely more difficult than I first thought :)
Yes, that all makes sense. I'll look at implementing the within-benchmark extrapolation and colored lines soon-ish. Another question that came up while I was looking at this is whether we'd like a toggle on the site which would enable/disable this -- that should be fairly easy to implement with the current setup (though it'd require a page reload). Not sure if it would be useful, though.
I don't think a toggle would be useful, except possibly for short-term testing purposes.
Cool! Some observations and questions. […]
This PR didn't change that behavior (it seems somewhat out of scope) -- we currently generate summary graphs/lines independently of whether all benchmarks actually ran for any given commit. I'm also not sure we have a clear expectation of what we should be doing here; we should discuss it and then implement whatever we decide.
We don't limit interpolation/extrapolation to the visible window, so that is actually a smaller section of a larger curve. The reason this data looks somewhat non-linear is subtle: our interpolation works on a commit basis, not a time basis, but the x-axis on these graphs is time-based. So if the commits are unevenly spread out in time, the slope looks less linear, even though the data is linear on a commit-based x-axis.
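A rough sketch of what commit-index interpolation means, with hypothetical types rather than this PR's actual code; the point is that the fill is linear in commit index, so on a time-based x-axis the same segment can look bent when commits are unevenly spaced:

```rust
/// One data point per commit, in commit order; `None` where the benchmark
/// failed or was not run.
struct Point {
    value: Option<f64>,
}

/// Fill gaps by interpolating linearly over commit *index*, not timestamp.
/// If the surrounding commits are unevenly spaced in time, the filled-in
/// segment looks bent on a time-based x-axis even though it is a straight
/// line on a commit-based one.
fn interpolate_by_commit(points: &mut [Point]) {
    // Indices and values of the commits that actually have data.
    let known: Vec<(usize, f64)> = points
        .iter()
        .enumerate()
        .filter_map(|(i, p)| p.value.map(|v| (i, v)))
        .collect();

    for pair in known.windows(2) {
        let (i0, v0) = pair[0];
        let (i1, v1) = pair[1];
        for i in i0 + 1..i1 {
            let t = (i - i0) as f64 / (i1 - i0) as f64;
            points[i].value = Some(v0 + t * (v1 - v0));
        }
    }
}
```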
I believe it used to be enabled for a short period of time, and then was disabled. The extrapolation/interpolation code does not currently limit itself to some time/commit range so it "reaches" into the past far enough to find those points. I see at least two points around 2018-03-22. The behavior here might not be ideal.
It should. I haven't updated its code for the visual effects in this PR yet, though. I have also not updated the comparison, NLL, or status pages -- I'd like suggestions, if you have any, on the correct approach for the dashboard (should we still interpolate/extrapolate? stop showing lines, as in the summary graphs?), as well as the proper way to indicate interpolated data on the comparison page. Specifically -- should we "color" the whole section somehow so it stands out in the overall view? Maybe add a Unicode warning sign? Something less fancy? On a slightly less important note, do you think the current, somewhat arbitrarily chosen color is fine? Should we go for something less visually jarring?
Should we? That's what I was expecting.
I don't think we need to do interpolation on the comparison page. It shows data for two commits and doesn't do any summarizing. When data is missing it's already obvious.
I'm not really sure. "Jarring" is definitely a good word for the current pink :) It probably is worth distinguishing the following two cases, though: points where only some of the underlying data is interpolated ("partial data", as on the summary graphs), and points where there is no data at all ("no data", as on the individual benchmark graphs).
This is somewhat harder -- currently we just run the extrapolation once when we load data and then cache that. Doing so on each request is certainly possible but will probably make the UI slower. I'd rather leave it as-is today, especially since that also means that changing the view you're looking at does not affect the data: this seems like a good property to have, overall.
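The trade-off being described might look roughly like this (hypothetical `Server`/`InterpolatedData` names, not the actual rustc-perf server code): interpolate once at load time, then hand every request the same cached result:

```rust
use std::sync::Arc;

/// Hypothetical fully interpolated/extrapolated dataset.
struct InterpolatedData {
    // per-commit, per-benchmark, per-run measurements ...
}

struct Server {
    /// Interpolation runs once, when the data is loaded; every request then
    /// shares the same cached result, so changing the visible window never
    /// changes the underlying (filled-in) data.
    data: Arc<InterpolatedData>,
}

impl Server {
    fn new(loaded: InterpolatedData) -> Self {
        // `loaded` is assumed to have been interpolated during data loading.
        Server { data: Arc::new(loaded) }
    }

    fn handle_request(&self) -> Arc<InterpolatedData> {
        // Cheap pointer clone; no interpolation work per request.
        Arc::clone(&self.data)
    }
}
```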
This will not interpolate runs, but since we add those fairly rarely and they are never compared with each other that should not be a problem at least for now.
I've rebased and I believe I'm ready to merge this -- the main unimplemented features are: […]
@nnethercote If you think the above are non-blockers then we can merge and deploy this whenever.
Apologies for the slow response. Overall, it's great to see this land: […]
There are still some rough edges, though.

- NLL had a big jump on the […]
- Extrapolation is missing on the RHS of graphs, e.g. the NLL results for […]
- Why can interpolation only affect master on the dashboard? I agree that it's much more likely to affect master (because master is most likely to have bustage), but that doesn't seem guaranteed. I remember when @aturon was first doing these measurements he had a failure for one of the benchmarks on one old release. Anyway, I think it would be useful to have the pink indications on the dashboard.
- On the graphs view, there is no explanation or legend explaining what the pink means. It's probably mysterious to everyone but you and me! The new shade of pink is better, BTW 👍 It would still be good to have two different colours, though: one meaning "partial data" (for the summary graphs and dashboard) and one meaning "no data" (for the individual benchmarks).
This might be due to the fact that we do not extrapolate runs/benchmarks that are not in the last 20 commits; that behavior is what allows benchmarks to eventually "go away" if they're removed, as with NLL for *-debug and *-opt.
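A small sketch of that rule, assuming a 20-commit window as described (the function and argument names are made up for illustration):

```rust
/// Decide whether a (benchmark, run) series should still be extrapolated
/// forward: only if it has data in at least one of the most recent `window`
/// commits. Series missing from that whole window are treated as
/// intentionally removed and their line is allowed to end (e.g. the NLL
/// runs for *-debug and *-opt builds).
///
/// `present` has one entry per commit, oldest first; `true` means the
/// series has real data for that commit.
fn should_extrapolate(present: &[bool], window: usize) -> bool {
    present.iter().rev().take(window).any(|&p| p)
}

fn main() {
    // A series measured for a while, then removed 25 commits ago: with a
    // 20-commit window it is no longer extrapolated.
    let mut present = vec![true; 50];
    for p in present.iter_mut().skip(25) {
        *p = false;
    }
    assert!(!should_extrapolate(&present, 20));
}
```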
Well, to clarify, interpolation can only affect master today because it's not implemented to be generic over the dataset (i.e., it only works with commits), but I agree that in theory we could extrapolate into artifact data too. For now, though, I think this'll have to be out of scope; ideally we'd have data for all benchmarks across all artifacts -- it might be worth adding some basic checks below the dashboard graphs that confirm that.
This is generally speaking the case -- do you have any ideas about where to put documentation like this?
It seems like that distinction is apparent from the location; i.e. on the summary graphs it's ~always partial data and on the individual benchmarks it's always no data, so I'm not quite sure of the usefulness here. Could you perhaps elaborate?
The ideal spot for the explanation of the pink dots would be in the graphs themselves, as part of the legend for the other colored dots. I don't know if that's possible. If not, just a note at the bottom of the page is better than nothing.
That is true, but it's a non-obvious distinction. If we end up with per-graph indications of what the pink dots mean, then a single color is less problematic. But if the pink dots are explained at the bottom of the page, distinguishing the two cases would be a lot clearer.
Fixes #249.
Extrapolates at the run level across history.
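As a rough picture of run-level extrapolation across history, here is a simplified sketch under assumed types -- a (benchmark, run) key per series and a forward carry of the nearest known value; the PR's actual algorithm and data structures may differ:

```rust
use std::collections::BTreeMap;

/// Key for one series: (benchmark, run name), e.g. ("clap-rs", "clean").
type SeriesKey = (String, String);

/// Fill missing commits for each run-level series by carrying the nearest
/// earlier value forward. A real implementation would also need a backward
/// pass (for gaps before the first known value) and would mark filled-in
/// points as interpolated so the UI can colour them.
fn extrapolate_runs(history: &mut [BTreeMap<SeriesKey, f64>]) {
    // Every series that appears anywhere in history.
    let mut keys: Vec<SeriesKey> = history
        .iter()
        .flat_map(|commit| commit.keys().cloned())
        .collect();
    keys.sort();
    keys.dedup();

    for key in keys {
        let mut last_known: Option<f64> = None;
        for commit in history.iter_mut() {
            match commit.get(&key).copied() {
                Some(v) => last_known = Some(v),
                None => {
                    if let Some(v) = last_known {
                        commit.insert(key.clone(), v);
                    }
                }
            }
        }
    }
}
```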