Agenda for Feb 3 meeting #54

Closed
foolip opened this issue Feb 2, 2022 · 4 comments
Labels: agenda (Agenda item for the next meeting)

Comments


foolip commented Feb 2, 2022

Here's the agenda for our meeting tomorrow:

Previous meeting: #50

foolip added the agenda label Feb 2, 2022

jensimmons commented:

Under "Update on metrics" I'd like to talk about how the math was handled in 2021 scoring and how we can perhaps do better for 2022:
https://docs.google.com/spreadsheets/d/1gwMmCM_qr5Ew7GPOEWNeWxN_jj0U4wMPcoECH5zzJYI/edit?usp=sharing


foolip commented Feb 3, 2022

@jensimmons I've added #46 to the top of the agenda, since I think that explains a lot of what's going on with Compat 2021 scoring. If there's more to discuss we can do that of course.


foolip commented Feb 4, 2022

Here are the notes I took:

Clarify / clean-up Interop 2021 labeling

Jen: Conversion of test results to score. Everything gets rounded down in the step of converting each area to an integer 0-20.
Philip: I found the same when looking into #46 today.
Una: Makes sense to have equal weight per area.
Sam: Changing the rounding will mean that the numbers don’t add up to the summary.
Philip: Yes, that can probably happen if we have a data grid explaining the numbers.
Sam: If we have extra digits beyond what we show it’s not a big deal.
James: We seem to be in agreement: we don’t want to introduce truncation error. Maybe it’s not exactly the same at the 2nd decimal point, but nobody is going to complain about that. Not deliberately introducing precision loss seems reasonable.
Jen: I like the idea of trying to stick with percentages as much as possible. If someone is so invested that they notice…
Chris: So the conclusion is we won’t round until we sum them up? And when we display, we round off a little at display time.
Philip: To be pedantic, it would be fixed point math at a few levels.
Sam: I think we’re in agreement, but can review when there’s code.
Philip: Note that we can’t get the percentages we’re using from the wpt.fyi results tables, unfortunately.
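
To illustrate the truncation point above, here is a minimal sketch (with hypothetical pass rates; this is not the actual wpt.fyi scoring code) comparing the 2021-style flooring of each area to an integer 0-20 against summing at full precision and rounding only at display time:

```python
# A minimal sketch (hypothetical pass rates, not the actual wpt.fyi scoring
# code) of the truncation issue discussed above.
import math

# Hypothetical per-area pass rates for the five Compat 2021 focus areas.
pass_rates = [0.973, 0.948, 0.991, 0.962, 0.987]

# 2021-style scoring: each area floored to an integer 0-20, then summed.
floored_total = sum(math.floor(rate * 20) for rate in pass_rates)

# Alternative: sum at full precision and round only at display time.
precise_total = sum(rate * 20 for rate in pass_rates)

print(floored_total)            # 94
print(round(precise_total, 1))  # 97.2 -- over 3 points lost to flooring
```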

Score "Investigate" progress as part of the overall metric

Not revisiting the positions spreadsheet, it hasn’t changed.

Jen: Let’s say we make “investigate” a 16th bucket. If scored 0-100%, will that be the same for all browsers? Or does it somewhat depend on which browser’s representatives have done more of the work?
James: The proposal is it’s the same score for everyone, based on investigation being a collaborative effort. For interop, it’s not about one browser being better than another. The reason we think this is important is there are already a lot of incentives for implementing features, but much less incentive structure for cleanup work and making sure all browsers are interoperable. Without some visibility for this kind of work, the interop metric doesn’t add as much value as we’d like.
Chris: James, would you be OK with a separate metric along the lines of what Jen is suggesting?
James: We haven’t had time to discuss it internally.
Note: #49 (comment) posted before meeting
Sam: To summarize, including it in the main metric doesn’t provide the same incentive for all vendors to work towards it. Our preference is to have it as a separate thing as far as 2022 is concerned. That’s not necessarily an absolute objection to including it, but it’s not a definite position either.
Jen: Is this 1/16th of the total, a side number…?
James: The issue has a fairly concrete proposal, which was “some percentage of the score” divided between the investigate areas. The proposal was 15%, but we can definitely discuss details. But if it’s so little that it can be ignored, it won’t be worth it; it needs to have a meaningful chance of affecting people’s incentives.
Chris: To clarify, you think that if it’s not included in the main metric, there will be no incentive to do this work?
James: Value proposition for Interop 2022 for us is the kind of work that affects interop and web compat.
Chris: As a specific proposal, 1/16th of the score?
James: I can suggest it, but it feels low. Compat 2021 was considered a success because everyone got above 90%. That suggests that 4-5% isn’t very much, but 10% is.
Chris: 10% then? Jen?
Jen: I think that scoring as one additional bucket makes more sense than carving out a chunk of the total. I understand the argument, but we don’t agree with it. Originally Mozilla didn’t want to include things that aren’t spec’d, and this is about things that aren’t.
James: We didn’t want to score the implementation of things that aren’t specified.
Tantek: I sympathize with Jen’s observation, but we are drawing a distinction: these investigate items are about hard problems with existing implementations, which is hard work and severely under-incentivized.

Philip: If we do agree to investigate and score, are we happy with the 4 in #49?
Sam: We have no objections to any of those.
Jen: Let’s not bikeshed the weighting of the 4 areas. This is all investigation, manual testing, spec work, etc.
Sam: One could argue that Pointer Events and MouseEvent go together.

Philip: If we score the investigation efforts, any constraint on what 100% means?
Sam: I think we can figure out the scoring. The contention is around whether we include it in the main score.
Chris: How about we all see if we agree that we will score these areas, they’ll be in one combined score, and that score will appear on the dashboard in some form. We should be able to agree on that. Agreeing on that would be substantial progress.
Jen: Apple’s position is we would be fine with including investigation in the main score. The investigate bucket will be 10%, the rest will be 90% (and 15*6=90). And each of the 4 items has a group with a person driving it, and that group reports back to “the keeper of the scores”. So we’d be comfortable with 10/90%, not a higher percentage.
Sam: If we have 3 investigate categories, each would be 3.33%, about half the weight of a main area, which seems reasonable.
Jen: I also want to be clear that investigate scores are the same number for every browser, no competition on that.
James: I need to go back and vet this, but it seems very close to our proposal. I’m inclined to think we’ll be happy with this; it sounds good.
Philip: Substantial progress. Let’s revisit next week. I can look at metrics.
Chris: Can you update the RFC to restart the clock on the feedback process?
James: That would be ideal, yes.
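
For reference, a minimal sketch (hypothetical function name and numbers; not the agreed implementation) of the 90/10 split discussed above: 15 focus areas worth 6% each, plus a 10% investigate bucket whose value is the same for every browser:

```python
# A minimal sketch (hypothetical names and numbers, not the agreed
# implementation) of the 90/10 split discussed above.

NUM_FOCUS_AREAS = 15
FOCUS_WEIGHT = 0.90 / NUM_FOCUS_AREAS  # 6% per focus area (15 * 6 = 90)
INVESTIGATE_WEIGHT = 0.10              # shared investigate bucket

def interop_2022_score(area_pass_rates, investigate_progress):
    """area_pass_rates: 15 per-area pass rates in [0, 1] for one browser.
    investigate_progress: investigation progress in [0, 1], identical for
    every browser (no competition on that component)."""
    assert len(area_pass_rates) == NUM_FOCUS_AREAS
    focus = sum(rate * FOCUS_WEIGHT for rate in area_pass_rates)
    investigate = investigate_progress * INVESTIGATE_WEIGHT
    return (focus + investigate) * 100  # express as a percentage

# Hypothetical example: every area at 95% and investigations half done.
print(round(interop_2022_score([0.95] * 15, 0.5), 1))  # 90.5
```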

Dashboard update

Feb 14 launch date

Jen: That’s a week and a few days away.
James: Agree it’s a bit tight.
Philip: Understood, we’ll get a move on with the dashboard.
Jen: We’ll want to have screenshots and numbers.
Una: Targeting next week is reasonable. Aim to launch a dashboard internally for us very soon.
Jen: Can Mozilla make a decision about “investigate” today/tomorrow so that there’s time to adapt the dashboard?
James: We will ASAP. We will also need at least a week from having the RFC/dashboard ready to being ready to go and approved.


foolip commented Feb 4, 2022

I have updated web-platform-tests/rfcs#99 with the proposed 90/10% split of the metric as discussed in the meeting.
