Revised metrics proposal #15
Thanks for putting this together! This is very thorough and explains the difficulties very well. I hope my comments make sense :-)
api/Updated-Metrics.md (outdated):

We would provide operations to calculate differences in various ways (all of these results are a Pandas Series):
```python
>>> result.differences()
```
Very confusing to me. Defaults to `relative_to="max"`? Why not let the user specify? It reads a lot better with the explicit specification as well.
Do you mean you'd like to have no default at all or a different one than "max"?
The user can specify, but I believe doing relative to `max` matches our current behaviour. I would be happy with not having a default too, if that was the consensus.
Our current behavior is that we are taking differences relative to the minimum, not the maximum (that's why all of our differences are positive). That's how I wrote the proposal here. In other words, the "relative to" part is whatever you subtract from all the group-level metric values.
But maybe it's confusing? An alternative term could be `origin`. Another idea would be to say `from` instead of `relative_to`?
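To make those semantics concrete, here is a minimal sketch of the behaviour described in this thread; the helper below is purely illustrative, not the proposed implementation:

```python
import pandas as pd

by_group = pd.Series({'group_a': 0.60, 'group_b': 0.75})

def differences(values, relative_to='min'):
    # 'relative_to' selects the reference value that is subtracted from
    # every group-level metric value (assumed names, illustration only)
    reference = {'min': values.min(), 'max': values.max()}[relative_to]
    return values - reference

differences(by_group, relative_to='min')
# group_a    0.00
# group_b    0.15    <- all non-negative, matching the current behaviour
```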
api/Updated-Metrics.md (outdated):
```python
>>> result.differences(relative_to='overall', aggregate='max')
0.1870
>>> result.differences(relative_to='overall', abs=True, aggregate='max')
0.2436
```
Why is this different from the line above with no `abs`? The value was positive, so that should be identical, right?
Without `abs` the values were -0.2436 and 0.1870, so the `max` aggregate returns 0.1870. With `abs` the values are 0.2436 and 0.1870, so it returns 0.2436.
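The arithmetic, spelled out:

```python
diffs = [-0.2436, 0.1870]
max(diffs)                   # 0.1870: -0.2436 loses the comparison
max(abs(d) for d in diffs)   # 0.2436: the sign is stripped first
```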
api/Updated-Metrics.md (outdated):

There would be a similar method called `ratios()` on the `GroupedMetric` object:
```python
>>> result.ratios()
```
same thing, don't like the default at all :-(
And the same from me: I'm happy to change if there's agreement that requiring the argument makes more sense.
api/Updated-Metrics.md (outdated):

- `ratio_order=` determines how to build the ratio. Values are
  - `sub_unity` to make the larger value the denominator
  - `super_unity` to make the larger value the numerator
  - `from_relative` to make the value specified by `relative_to=` the denominator
Both this line and the next say `relative_to`; is that intended?
I am a bit worried about the complexity of the `ratio_order` argument, especially for new users.
Yes. The change is between numerator and denominator.
Is this a question of naming, or whether we need all the cases? I'm not overly happy with the names myself, but I do think that all four cases might be useful.
A bit of both, I guess... but I think you're right that you need all four cases.
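For reference, the four orderings under discussion could look like the sketch below; `to_relative` is a guessed name for the fourth option, since the quoted list above is truncated:

```python
hi, lo = 0.8, 0.6   # larger and smaller group-level metric values
ref = 0.8           # the value selected by relative_to=

sub_unity = lo / hi       # 0.75, always <= 1
super_unity = hi / lo     # 1.33..., always >= 1
from_relative = lo / ref  # relative_to value as the denominator
to_relative = ref / lo    # relative_to value as the numerator (assumed name)
```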
api/Updated-Metrics.md (outdated):

This would make the following all equivalent:
```python
>>> r1 = group_summary(sklearn.metrics.accuracy_score, y_true, y_pred, sensitive_features=A_sex)
>>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_sex)
```
I presume this maps to the sklearn metrics? What if I want to define my own metric? Will that also work (although not with the string, of course)?
Let me add an 'in addition' to the introductory sentence.
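As a sketch of what that would mean in practice (assuming the `metric(y_true, y_pred)` callable convention used elsewhere in this proposal):

```python
# A hypothetical user-defined metric; any such callable should be usable
# wherever group_summary() accepts a function, just not via a string name
def fraction_positive(y_true, y_pred):
    return sum(y_pred) / len(y_pred)

r3 = group_summary(fraction_positive, y_true, y_pred, sensitive_features=A_sex)
```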
api/Updated-Metrics.md (outdated):

The question is whether users prefer:
```python
>>> diff = group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A).difference(aggregate='max')
```
Obviously I'm just one data point, but I strongly prefer this over the version below.
+1
Indeed - people were leaning that way, but since there's always the question of 'Do we really need a new concept?' I just wanted to call this out.
Thanks for the proposal. I think it's a good step. I'd wrap the lines though 🙈
api/Updated-Metrics.md (outdated):
```python
y_true = [0, 1, 0, 0, 1, 1, ...]
y_pred = [1, 0, 0, 1, 0, 1, ...]
```
Since we're dealing with metrics, should we separate floating-point predictions (`decision_function` or `predict`) from discrete ones? (Assuming some metrics need soft values and some work with hard values.)

Maybe simply add:
```python
# or soft values, depending on what the underlying metric expects
y_pred = [1, 0, 0, 1, 0, 1, ...]
```
I don't think that we want separate `group_summary()` functions for different types of underlying metrics, but I agree about the need to highlight that we just slice up the `y_pred` list regardless of the types of its items, if that's what you meant. I was trying to write that below, but I'll make it more explicit.
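A minimal sketch of that slicing behaviour (illustrative only; the real implementation details are not part of this proposal):

```python
import numpy as np

# y_pred is split purely by group membership; the elements could equally
# be hard labels or soft scores, since they are never inspected
y_pred = np.asarray([0.9, 0.2, 0.4, 0.8])
groups = np.asarray(['a', 'a', 'b', 'b'])

for g in np.unique(groups):
    print(g, y_pred[groups == g])  # each metric call sees only this slice
```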
api/Updated-Metrics.md
Outdated
weights = [ 1, 2, 3, 2, 2, 1, ...] | ||
``` | ||
|
||
We actually seek to be very agnostic as to the contents of the `y_true` and `y_pred` arrays. |
agnostic in what way? I think the proposal later explains this, but it may be easier to follow if that's moved up here.
I was meaning in exactly the same way as your suggested comment. Let me rephrase.
api/Updated-Metrics.md (outdated):

We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument.

Suppose we have a DataFrame, `A_3`, with three sensitive features: Sex, Race and Income Band (the latter having values 'Low' and 'High').
This could represent a loan scenario where discrimination based on income is allowed, but within the income bands, other sensitive groups must be treated equally.
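A hedged sketch of how that might be invoked, using the names from the quoted passage (the exact signature is still under discussion):

```python
result = group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A_3)

# Differences computed only within each income band, never across bands
diffs = result.differences(condition_on='Income Band')
```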
Or use abstract names and avoid mentioning `sex`, `race` or `income` altogether.
api/Updated-Metrics.md (outdated):

This should generalise to the other methods described above.

One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`.
you can look at https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html and scikit-learn/scikit-learn#16079 to get some ideas.
If I'm reading that correctly, we would ditch the `**kwargs` and have an explicit `props=` dictionary, and probably rename `indexed_params=` to `sample_props=`. When evaluating multiple metrics at once, I still think we'd need a list of dictionaries and a list of lists for these, though.
api/Updated-Metrics.md (outdated):

We do not intend to change the API invoked by the user.
What will change is the return type.
Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer richer functionality.
I'm not sure whether we're already at the stage of discussing naming, but since we decided to discard `.eval()`, we might want to think about a name that better reflects that the class currently called `GroupedMetric` contains results/scores. The current name suggests that the class corresponds to some grouped metric itself rather than to results/scores. Similarly, I'm not sure about `group_summary()`, because in my view `GroupedMetric` is not really a summary? My suggestion would be to come up with a few alternatives and ask some people who are not as deeply involved with the metrics API so far which one they find most intuitive.
This is a great time to be discussing naming, I think..... I'll see if I can come up with a few possibilities.
Meeting 2020-09-08

Thanks to all for attending today. The main agreement is that we should think about a 'data exploration' interface (provided by the object) and a 'scorer' interface (provided by wrapper functions for the objects). More minor points:

Next Steps

I will take @MiroDudik's list of scenarios from above, and start to rough out how things might look in each API:

One thing which had been confusing me from before: these are not all expected to be one-liners, but may well be more complex. The idea is to identify which pieces are relevant to the metrics API, and whether we like how that looks.
Meeting 2020-09-14

Thank you @hildeweerts @adrinjalali @MiroDudik and @amueller for our meeting today. We agreed that:

It also sounded like we should keep the current set of functions built with … According to @adrinjalali, the PR for SLEP006 Option 4 will probably be in the release of scikit-learn next April. We should aim to be compatible with that in our scorer objects.

Next Steps

I will update this proposal based on the above points. I will also start a rough implementation of `GroupedMetric`.
Meeting 2020-09-21

Thanks to @adrinjalali @hildeweerts @amueller and @MiroDudik for our meeting today. Points agreed:

Still TBD:

Yet to be discussed:
I was under the impression we're okay with restricting the metrics' signature and asking users to use `functools.partial`.
Meeting 2020-09-18

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our meeting today. We started by discussing the `overall` property:

```python
>>> target.overall
         recall_score  precision_score
overall      0.733333         0.717391
```

The motivation behind this decision was to ensure that we have a consistent type returned. However, doing so comes at the cost of introducing a 'magic' row name of 'overall'. Making this choice also gives some trouble doing operations such as `target.by_group - target.overall`:

```python
         recall_score  precision_score
overall           NaN              NaN
pp                NaN              NaN
q                 NaN              NaN
```

The intended operation in this case is actually:

```python
>>> target.by_group - target.overall.iloc[0]
      recall_score  precision_score
SF 0
pp       0.0057971      -0.00905797
q      -0.00606061       0.00988142
```

Continuing with the `to_overall` case:

```python
>>> target.difference(method='to_overall')
         recall_score  precision_score
overall    0.00606061       0.00988142

>>> (target.by_group - target.overall.iloc[0]).abs().max()
recall_score       0.006061
precision_score    0.009881
dtype: float64
```

These are not a problem when there are conditional features, which ensure that there are matching portions of the DataFrame index. Together, these suggest that when there are no conditional features, the return type of … We agreed that for now, we should drop the …

When specifying multiple functions in the `GroupedMetric` constructor:

```python
fns = [skm.recall_score, skm.precision_score, skm.fbeta_score]
sample_params = [{'sample_weight': wgts},
                 {'sample_weight': wgts},
                 {'sample_weight': wgts}]

result = metrics.GroupedMetric(fns, Y_test, Y_pred,
                               sensitive_features=...,
                               conditional_features=...,
                               sample_params=sample_params)
```

We agreed that using dictionaries instead of arrays would make more sense. Something like:

```python
fns = {
    'recall': skm.recall_score,
    'prec': skm.precision_score,
    'fbeta_0.5': functools.partial(skm.fbeta_score, beta=0.5)
}

sample_params = {
    'recall': {'sample_weight': wgts},
    'prec': {'sample_weight': wgts},
    'fbeta_0.5': {'sample_weight': wgts}
}
```

There is one question as to whether …

Now that the API is pretty much finalised, I will open a new issue to discuss the naming.
Meeting 2020-10-06

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our discussion today. We had a long talk about … One question which did arise: how to specify the … Related to these, although not essential for API design: do we provide all the …

We agreed to keep 'empty' combinations of features in the result tables as NaNs; users can easily drop those rows if they want. In future, we might add an optional argument to drop them when the objects are created.

When it comes to naming, we still have issue #17. I have the impression that no-one particularly likes `GroupedMetric`.

Next steps:
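The 'drop those rows' step is a one-liner for the user; a sketch, assuming `by_group` is the pandas DataFrame discussed above:

```python
# Rows for empty feature combinations are NaN and can simply be dropped
populated = result.by_group.dropna()
```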
Meeting 2020-10-12

Thanks to @MiroDudik and @adrinjalali for our meeting today. We agreed on the following:

We also discussed …

Next steps:
Personally, I would even suggest that …
Meeting 2020-10-19

Thanks to @adrinjalali, @hildeweerts and @MiroDudik for our meeting today. We were reviewing the current active PR implementing the new metrics API.

For the PR itself, the following issues were raised:

@hildeweerts raised other issues with the documentation format, pointing out that the entire … @MiroDudik and @riedgar-ms will work to resolve the issues raised above, and get the PR merged. @adrinjalali and @hildeweerts will also review, but the merge will not be delayed for that (we can make further tweaks next week).
Meeting 2020-10-26

Thanks to @hildeweerts @adrinjalali and @MiroDudik for another productive meeting. We discussed the following:

What to do with control_levels when there are none

If the user wants to do their own manipulations of the … We decided that …

Whether to allow strings for level names

This is related to the previous point. The …

How to handle single callables

@MiroDudik pointed out that if a user makes a call such as:

```python
acc = MetricsFrame(skm.accuracy_score, y_true, y_pred, sensitive_features=A)
```

then having to write things like

```python
acc.overall['accuracy_score']
acc.by_group['accuracy_score']['group_0']
acc.group_max()['accuracy_score']
```

is tedious for people messing around with notebooks. It would be good to drop the extra redirection. However, @riedgar-ms was concerned that if a more general library is attempting to consume Fairlearn in a general way, then any logic we put in to handle the single-function case would then have to be undone by that higher-level library (introducing maintenance overhead and places for bugs to hide).

We agreed that if the user passes in a dictionary of callables, then we keep the same behaviour, even if there's only one entry in the dictionary. However, if the user passes in a callable, then DataFrames become Series, and Series become scalars, so that users don't have to keep typing `['accuracy_score']`.

The initial implementation of this is likely to be messy (at this point in time, just set a flag and process all output before it's returned to the user). However, @riedgar-ms will provide a full set of tests, so that the internals can be improved in the future.
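A sketch of the agreed behaviour (the class and argument names follow the discussion above; 'group_0' is illustrative):

```python
# Bare callable: one level of indexing disappears
acc = MetricsFrame(skm.accuracy_score, y_true, y_pred, sensitive_features=A)
acc.overall              # scalar
acc.by_group['group_0']  # scalar

# Dictionary, even with a single entry: DataFrame behaviour is kept
accd = MetricsFrame({'accuracy_score': skm.accuracy_score},
                    y_true, y_pred, sensitive_features=A)
accd.overall['accuracy_score']  # Series indexed by metric name
```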
Meeting 2020-11-02

Thank you to @adrinjalali @hildeweerts and @MiroDudik for another productive discussion. We agreed that the best way forward is to release what we have as … We also discussed …
Create a fresh proposal for metrics, based on discussions in meetings, and focusing on 'what does the user type?' rather than the implementation.