Revised metrics proposal #15

Status: Open. Wants to merge 44 commits into base: master. Changes shown from 22 commits.

Commits (44), all by riedgar-ms:
- 8c96695 Starting work (Aug 25, 2020)
- 531c195 More text (Aug 25, 2020)
- 66eab2f Working through next bit (Aug 25, 2020)
- cf8ae38 Adding another example (Aug 25, 2020)
- 1d8987c Add in multiple sensitive features and segmentation (Aug 25, 2020)
- 6d5c5e3 Sketch out the multiple metrics (Aug 25, 2020)
- 00e0a1a Forgot to add a line (Aug 25, 2020)
- 5ce78ec Think a bit about multiple metrics and getting the names (Aug 26, 2020)
- 9061c70 Add a note about some convenience wrappers (Aug 26, 2020)
- 5cd4e88 Some notes on pitfalls (Aug 26, 2020)
- 75e1d1d Some more questions (Aug 27, 2020)
- aac662c Change Segmented Metrics to be Conditional Metrics (Aug 27, 2020)
- bf96d2b Working through some of the suggested changes (Sep 1, 2020)
- 416d427 Some more fixes (Sep 1, 2020)
- fbb634f More fixes (Sep 1, 2020)
- 77777ba Fix an errant sex (Sep 1, 2020)
- 6d69fa9 Typo fix (Sep 1, 2020)
- 13107f4 Update the metric_ property (Sep 2, 2020)
- 048e484 Add extra clarifying note about datatypes for intersections (Sep 2, 2020)
- 9ff4e91 Expand on note for conditional parity input types (Sep 2, 2020)
- 5040649 Further updates to the text (Sep 2, 2020)
- 210872e Add some suggestions for alternative names (Sep 2, 2020)
- 3ba87aa Adding notebook of samples (Sep 8, 2020)
- c13da58 More examples in notebook (Sep 9, 2020)
- 13069d5 Put in remaining comparisons (Sep 9, 2020)
- 2e10c46 Starting to change over to constructor method etc. (Sep 10, 2020)
- 8296f3a More on methods (Sep 10, 2020)
- f755711 More changes based on prior discussion (Sep 10, 2020)
- 0f28fe3 Another correction (Sep 10, 2020)
- 2ea2037 Some fixes to remove `group_summary()` (not yet complete) (Sep 11, 2020)
- 41bb325 Some extensive edits (Sep 11, 2020)
- 205c239 Add make_grouped_scorer() (Sep 11, 2020)
- 70d8811 Errant group_summary (Sep 14, 2020)
- 96703c6 Add link to SLEP006 (Sep 14, 2020)
- 7521261 Minor update to notebook (Sep 14, 2020)
- f9ed32c Some small updates to the proposal (Sep 14, 2020)
- 091c47e Add make_derived_metric back in (Sep 14, 2020)
- 180370b Add note about meetings (Sep 17, 2020)
- 16983bf Merge remote-tracking branch 'upstream/master' into riedgar-ms/revise… (Sep 21, 2020)
- acfacab Update after today's discussion (Sep 21, 2020)
- 697eee0 Merge remote-tracking branch 'upstream/master' into riedgar-ms/revise… (Oct 15, 2020)
- f0cdb5c Update to reflect reality (Oct 15, 2020)
- 4feb19f Remove uneeded notebook (Oct 15, 2020)
- 5649f92 Fix the odd typo (Oct 15, 2020)
394 changes: 394 additions & 0 deletions api/Updated-Metrics.md
# Updates for Metrics

This is an update for the existing metrics document, which is being left in place for now as a point of comparison.

## Assumed data

In the following we assume that we have variables of the following form defined:

```python
y_true = [0, 1, 0, 0, 1, 1, ...]
y_pred = [1, 0, 0, 1, 0, 1, ...] # Content can be different for other metrics (see below)
A_1 = [ 'C', 'B', 'B', 'C', ...]
A_2 = [ 'M', 'N', 'N', 'P', ...]
A = pd.DataFrame(np.transpose([A_1, A_2]), columns=['SF 1', 'SF 2'])

weights = [ 1, 2, 3, 2, 2, 1, ...]
```

We actually seek to be very agnostic as to the contents of the `y_true` and `y_pred` arrays; the meaning is imposed on them by the underlying metrics.
Here we have shown binary values for a simple classification problem, but they could be floating point values from a regression, or even collections of classes and associated probabilities.

## Basic Calls

### Existing Syntax

Our basic method is `group_summary()`:

```python
>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1)
>>> print(result)
{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}}
>>> print(type(result))
<class 'sklearn.utils.Bunch'>
```
The `Bunch` is an object which can be accessed in two ways: as a dictionary (`result['overall']`) or via properties named by the dictionary keys (`result.overall`).
Note that the `by_group` key accesses another `Bunch`.
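For example, both access styles return the same values from the summary shown above:
```python
>>> result['overall']
0.4
>>> result.overall
0.4
>>> result.by_group['B']
0.6536
```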

We allow for sample weights (and other arguments which require slicing) via `indexed_params`, and passing through other arguments to the underlying metric function (in this case, `normalize`):
```python
>>> flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, indexed_params=['sample_weight'], sample_weight=weights, normalize=False)
{'overall': 20, 'by_group': {'B': 60, 'C': 21}}
```

We also provide some wrappers for common metrics from SciKit-Learn:
```python
>>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_1)
{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}}
```

### Proposed Change

We do not intend to change the API invoked by the user.
What will change is the return type.
Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer richer functionality.

> **Reviewer:** I'm not sure whether we're already at the stage of discussing naming, but since we decided to discard `.eval()`, we might want to think about a name that better reflects that the class that's currently called `GroupedMetric` contains results/scores. The current name suggests that the class corresponds to some grouped metric itself rather than results/scores. Similarly, I'm not sure about `group_summary()`, because in my view `GroupedMetric` is not really a summary? My suggestion would be to come up with a few alternatives and ask some people who are not as deeply involved with the metrics API so far which one they find most intuitive.
>
> **Author (riedgar-ms):** This is a great time to be discussing naming, I think..... I'll see if I can come up with a few possibilities.

At this basic level, there is only a slight change to the results seen by the user.
There are still properties `overall` and `by_groups`, with the same semantics.
However, the `by_groups` result is now a Pandas Series, and we also provide a `metric_` property to store a reference to the underlying metric:
```python
>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1)
>>> result.metric_
<function accuracy_score at 0x...>
>>> result.overall
0.4
>>> result.by_groups
B 0.6536
C 0.2130
Name: accuracy_score, dtype: float64
>>> print(type(result.by_groups))
<class 'pandas.core.series.Series'>
```
Constructing the name of the Series could be an issue.
In the example above, it is the name of the underlying metric function.
Something as short as the `__name__` could end up being ambiguous, but using the `__module__` property to disambiguate might not match the user's expectations:
```python
>>> import sklearn.metrics as skm
>>> skm.accuracy_score.__name__
'accuracy_score'
>>> skm.accuracy_score.__qualname__
'accuracy_score'
>>> skm.accuracy_score.__module__
'sklearn.metrics._classification'
```
Here we are seeing some of SciKit-Learn's internal structure, which the user might not be expecting.

We would continue to provide convenience wrappers such as `accuracy_score_group_summary` for users, and support passing through arguments along with `indexed_params`.
There is little advantage to the change at this point.
This will change in the next section.

## Obtaining Scalars

### Existing Syntax

We provide helper functions for turning the `Bunch`es returned from `group_summary()` into scalars:
```python
>>> difference_from_summary(result)
0.4406
>>> ratio_from_summary(result)
0.3259
>>> group_max_from_summary(result)
0.6536
>>> group_min_from_summary(result)
0.2130
```
We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_min()` for user convenience.

One capability which these helpers lack (although it could be added) is the ability to select an alternative reference point for the difference and ratio.
For example, the user might not be interested in the difference between the maximum and minimum, but rather in the difference from the overall value, or perhaps from a particular group.
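For example (a sketch reusing the wrapper names from the previous paragraph and the values from the summary above):
```python
>>> flm.accuracy_score_difference(y_true, y_pred, sensitive_features=A_1)
0.4406
>>> flm.accuracy_score_ratio(y_true, y_pred, sensitive_features=A_1)
0.3259
```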

### Proposed Change

The `GroupedMetric` object would have methods for calculating the required scalars.
First, let us consider the differences.

We would provide operations to calculate differences in various ways (all of these results are a Pandas Series):
```python
>>> result.differences()
B 0.0
C 0.4406
Name: TBD, dtype: float64
>>> result.differences(relative_to='min')
B -0.4406
C 0.0
Name: TBD, dtype: float64
>>> result.differences(relative_to='min', abs=True)
B 0.4406
C 0.0
Name: TBD, dtype: float64
>>> result.differences(relative_to='overall')
B -0.2436
C 0.1870
Name: TBD, dtype: float64
>>> result.differences(relative_to='overall', abs=True)
B 0.2436
C 0.1870
Name: TBD, dtype: float64
>>> result.differences(relative_to='group', group='C', abs=True)
B 0.4406
C 0.0
Name: TBD, dtype: float64
```

> **Reviewer:** Very confusing to me. Defaults to `relative_to="max"`? Why not let the user specify? Reads a lot better with the explicit specification as well.
>
> **Reviewer:** Do you mean you'd like to have no default at all or a different one than "max"?
>
> **Author (riedgar-ms):** The user can specify, but I believe doing relative to max matches our current behaviour. I would be happy with not having a default too, if that was the consensus.
>
> **Reviewer:** Our current behavior is that we are taking differences relative to the minimum not maximum (that's why all of our differences are positive). That's how I wrote the proposal here. In other words, the "relative to" part is whatever you subtract from all the group-level metric values. But maybe it's confusing? An alternative term could be "origin". Another idea would be to say "from" instead of "relative_to"?
The arguments introduced so far for the `differences()` method:
- `relative_to=` to decide the common point for the differences. Possible values are `'max'` (the default), `'min'`, `'overall'` and `'group'`
- `group=` to select a group name, used only when `relative_to` is set to `'group'`. Default is `None`
- `abs=` to indicate whether to take the absolute value of each entry. Default is `False`

The user could then use the Pandas methods `max()` and `min()` to reduce these Series objects to scalars.
However, this runs into issues when the `relative_to` argument points at the maximum or minimum group itself, which will have a difference of zero.
That zero could then be the maximum or minimum value of the set of differences, but it is probably not what the user wants.
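For example (values taken from the `differences()` outputs above):
```python
>>> result.differences(relative_to='overall', abs=True).max()
0.2436
>>> result.differences().min()  # always 0.0, since the 'max' group is its own reference point
0.0
```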

To address this case, we should add an extra argument `aggregate=` to the `differences()` method:
```python
>>> result.differences(aggregate='max')
0.4406
>>> result.differences(relative_to='overall', aggregate='max')
0.1870
>>> result.differences(relative_to='overall', abs=True, aggregate='max')
0.2436
```

> **Reviewer:** why is this different from the line above with no abs? The value was positive, so that should be identical, right?
>
> **Author (riedgar-ms):** Without abs the values were -0.2436 and 0.1870. With abs the values are 0.2436 and 0.1870.
If `aggregate=None` (which would be the default), then the result is a Series, as shown above.

There would be a similar method called `ratios()` on the `GroupedMetric` object:
```python
>>> result.ratios()
B 1.0
C 0.3259
Name: TBD, dtype: float64
```

> **Reviewer:** same thing, don't like the default at all :-(
>
> **Author (riedgar-ms):** And the same from me: I'm happy to change if there's agreement that requiring the argument makes more sense.
The `ratios()` method will take the following arguments:
- `relative_to=` similar to `differences()`
- `group=` similar to `differences()`
- `ratio_order=` determines how to build the ratio. Values are
  - `sub_unity` to make the larger value the denominator
  - `super_unity` to make the larger value the numerator
  - `from_relative` to make the value specified by `relative_to=` the denominator
  - `to_relative` to make the value specified by `relative_to=` the numerator
- `aggregate=` similar to `differences()`

> **Reviewer:** both this line and the next say relative_to, is that intended?
>
> **Reviewer:** I am a bit worried about the complexity of the ratio_order argument, esp for new users.
>
> **Author (riedgar-ms):** Yes. The change is between numerator and denominator.
>
> **Author (riedgar-ms):** Is this a question of naming, or whether we need all the cases? I'm not overly happy with the names myself, but I do think that all four cases might be useful.
>
> **Reviewer:** A bit of both, I guess... but I think you're right that you need all four cases.
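As a sketch of how these arguments might combine (the exact calls are illustrative; the values follow from the accuracy summary above, where B is 0.6536 and C is 0.2130):
```python
>>> result.ratios(ratio_order='sub_unity')
B 1.0000
C 0.3259
Name: TBD, dtype: float64
>>> result.ratios(ratio_order='sub_unity', aggregate='min')
0.3259
```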

We would also provide the same wrappers such as `accuracy_score_difference()` but expose the extra arguments discussed here.
One question is whether the default aggregation should be `None` (to match the method), or whether it should default to scalar results similar to the existing methods.

In the section on Conditional Metrics below, we shall discuss one extra optional argument for `differences()` and `ratios()`.

## Intersections of Sensitive Features

### Existing Syntax

Our current API does not support evaluating metrics on intersections of sensitive features (e.g. "black and female", "black and male", "white and female", "white and male").
To achieve this, users currently need to write something along the lines of:
```python
>>> A_combined = A['SF 1'] + '-' + A['SF 2']

>>> accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_combined)
{'overall': 0.4, 'by_group': {'B-M': 0.4, 'B-N': 0.5, 'B-P': 0.5, 'C-M': 0.5, 'C-N': 0.6, 'C-P': 0.7}}
```
This is unnecessarily cumbersome.
It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were not represented in the dataset would be tedious.


### Proposed Change

If `sensitive_features=` is a DataFrame (or a list of Series; the exact supported types are TBD), we can generate our results in terms of a MultiIndex. Using the `A` DataFrame defined above, a user might write:
```python
>>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A)
>>> result.by_groups
SF 1 SF 2
B M 0.5
N 0.7
P 0.6
C M 0.4
N 0.5
P 0.5
Name: sklearn.metrics.accuracy_score, dtype: float64
```
If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series.
Although this example has passed a DataFrame in for `sensitive_features=`, we should aim to support lists of Series and `numpy.ndarray` as well.

The `differences()` and `ratios()` methods would act on this Series as before.
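As a point of reference, the intersectional `by_groups` Series can already be computed directly with Pandas. The following is a sketch only (not the proposed API), assuming the variables from the 'Assumed data' section without the elisions; it shows the MultiIndex shape we are aiming for:
```python
import pandas as pd
import sklearn.metrics as skm

# Sketch: compute a per-intersection accuracy Series with a ('SF 1', 'SF 2') MultiIndex
frame = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred,
                      'SF 1': A['SF 1'], 'SF 2': A['SF 2']})
by_groups = frame.groupby(['SF 1', 'SF 2']).apply(
    lambda g: skm.accuracy_score(g['y_true'], g['y_pred']))
by_groups.name = 'accuracy_score'
```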

## Conditional (or Segmented) Metrics

For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) do not return single values when aggregation is requested in a call to `differences()` or `ratios()`; instead, they provide one result for each unique value of the specified condition feature(s).

### Existing Syntax

Not supported.
Users would have to devise the required code themselves.

### Proposed Change

We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument.

Suppose we have a DataFrame `A_3` with three sensitive features: SF 1, SF 2 and Income Band (the latter having values 'Low' and 'High').
This could represent a loan scenario where decisions can be based on income, but within each income band, the other sensitive groups must be treated equally.
When `differences()` is invoked with `condition_on=`, the result will not be a scalar, but a Series.
A user might make calls:
```python
>>> result = accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_3)
>>> result.differences(aggregate='min', condition_on='Income Band')
Income Band
Low 0.3
High 0.4
Name: TBD, dtype: float64
```
We can also allow `condition_on=` to be a list of names:
```python
>>> result.differences(aggregate='min', condition_on=['Income Band', 'SF 1'])
Income Band SF 1
Low B 0.3
Low C 0.35
High B 0.4
High C 0.5
```
There may be demand for allowing the sensitive features to be supplied as a `numpy.ndarray` or even a list of `Series` (similar to how the `sensitive_features=` argument need not be a DataFrame).
To support this, `condition_on=` would need to accept integers (and lists of integers) as inputs, to index the columns.
If the user specifies a list for `condition_on=`, then we should probably be nice and detect cases where a feature is listed twice (especially if we are allowing both names and column indices).
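The call forms for the integer-index variant might look as follows (a sketch, assuming the sensitive features were passed as a `numpy.ndarray` whose column 2 holds the income band):
```python
>>> result.differences(aggregate='min', condition_on=2)        # single column index
>>> result.differences(aggregate='min', condition_on=[2, 0])   # list of column indices
```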

## Multiple Metrics

Finally, we can also allow for the evaluation of multiple metrics at once.

### Existing Syntax

This is not supported.
Users would have to devise their own method.

### Proposed Change

We allow a list of metric functions in the call to `group_summary()`.
Results become DataFrames, with one column for each metric:
```python
>>> result = group_summary([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_1)
>>> result.overall
sklearn.metrics.accuracy_score sklearn.metrics.precision_score
0 0.3 0.5
>>> result.by_groups
sklearn.metrics.accuracy_score sklearn.metrics.precision_score
'B' 0.4 0.7
'C' 0.6 0.75
```
This should generalise to the other methods described above.
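For instance, `differences()` might then also return a DataFrame with one column per metric; the illustrative values below are derived from the `by_groups` table above (using the default `relative_to='max'`):
```python
>>> result.differences()
    sklearn.metrics.accuracy_score  sklearn.metrics.precision_score
B                              0.2                             0.05
C                              0.0                             0.00
```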

One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`.
> **Author (riedgar-ms):** If I'm reading that correctly, we would ditch the `**kwargs`, and have an explicit `props=` dictionary - and probably rename `indexed_params=` to `sample_props=`. When evaluating multiple metrics at once, I still think we'd need a list of dictionaries and a list of lists for these, though.
A possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()`.
For example, for `indexed_params=` we would have:
```python
indexed_params = [['sample_weight'], ['sample_weight']]
```
In the `**kwargs` a single `extra_args=` argument would be accepted (although not required), which would contain the individual `**kwargs` for each metric:
```python
extra_args = [
    {
        'sample_weight': [1, 2, 1, 1, 3, ...],
        'normalize': False
    },
    {
        'sample_weight': [1, 2, 1, 1, 3, ...],
        'pos_label': 'Granted'
    }
]
```
If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug.

## Naming

The names `group_summary()` and `GroupedMetric` are not necessarily inspired, and there may well be better alternatives.
Changes to these would ripple throughout the module, so agreeing on these is an important first step.

Some possibilities for the function:
- `group_summary()`
- `metric_by_groups()`
- `calculate_group_metric()`
- ?

And for the result object:
- `GroupedMetric`
- `GroupMetricResult`
- `MetricByGroups`
- ?

Other names are also up for debate.
However, things like the wrappers `accuracy_score_group_summary()` will hinge on the names chosen above.
Arguments such as `indexed_params=` and `ratio_order=` (along with the allowed values of the latter) are important, but narrower in impact.

## User Conveniences

In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to `group_summary()` to be represented by a string.
We would provide a mapping of strings to suitable functions.
This would make the following all equivalent:
```python
>>> r1 = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1)
>>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_1)
>>> r3 = accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_1)
```
We would also allow mixtures of strings and functions in the multiple metric case.
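A minimal sketch of how the string-to-function mapping could work (the names `_METRIC_REGISTRY` and `_resolve_metric` are purely illustrative, not part of the proposal):
```python
import sklearn.metrics as skm

# Illustrative registry of the strings we would accept
_METRIC_REGISTRY = {
    'accuracy_score': skm.accuracy_score,
    'precision_score': skm.precision_score,
    'recall_score': skm.recall_score,
}

def _resolve_metric(metric):
    """Accept either a callable or one of the registered string names."""
    if callable(metric):
        return metric
    return _METRIC_REGISTRY[metric]
```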

## Generality

Throughout this document, we have been describing the case of classification metrics.
However, we do not actually require this.
It is the underlying metric function which gives meaning to the `y_true` and `y_pred` lists.
So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `group_summary()` does not actually care about their datatypes.
For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities.
Or the user might be working on a regression problem, and both `y_true` and `y_pred` would be floating point numbers (or `y_pred` might even be a tuple of predicted value and error).
So long as the underlying metric understands the data structures, `group_summary()` will not care.

There will be an effect on the `GroupedMetric` result object.
Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not.
After all, what does "take the ratio of two confusion matrices" even mean?
We should try to trap these cases, and throw a meaningful exception (rather than propagating whatever exception happens to emerge from the underlying libraries).
Since `differences()` and `ratios()` will only work when the metric has produced scalar results, this should be a straightforward test using [`isscalar()` from Numpy](https://numpy.org/doc/stable/reference/generated/numpy.isscalar.html).
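A possible shape for that check (a sketch; the function name and error message are illustrative):
```python
import numpy as np

def _check_scalar_metric_results(by_groups):
    # Raise a clear error before differences()/ratios() try to operate on
    # non-scalar results such as confusion matrices
    if not all(np.isscalar(v) for v in by_groups):
        raise ValueError(
            "differences() and ratios() require the underlying metric "
            "to return scalar values")
```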

## Pitfalls

There are some potential pitfalls which could trap the unwary.

The biggest of these are related to missing classes in the subgroups.
To take an extreme case, suppose that members of the `B` group specified by `SF 1` were only ever predicted classes H or J, while members of the `C` group were only ever predicted classes K or L.
The user could request precision scores, but the results would not really be comparable between the two groups.
With intersections of sensitive features, cases like this become more likely.

Metrics in SciKit-Learn usually have arguments such as `pos_label=` and `labels=` to allow the user to specify the expected labels, and adjust their behaviour accordingly.
However, we do not require that users stick to the metrics defined in SciKit-Learn.
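For metrics which do accept them, such arguments can simply be passed through `group_summary()` as described earlier; a sketch (the label values and `average=` choice are illustrative):
```python
flm.group_summary(skm.recall_score, y_true, y_pred,
                  sensitive_features=A_1,
                  labels=[0, 1], average='macro')
```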

If we implement the convenience strings-for-functions piece mentioned above, then _when the user specifies one of those strings_ we can log warnings if the appropriate arguments (such as `labels=`) are not specified.
We could even generate the argument ourselves if the user does not specify it.

However, this risks tying Fairlearn to particular versions of SciKit-Learn.

Unfortunately, the generality of `group_summary()` means that we cannot solve this for the user.
It cannot even tell if it is evaluating a classification or regression problem.

> **Reviewer:** I probably would not include this in a first version, because it seems quite tricky. But I think we can add some sort of convenience function for this later on.
>
> **Author (riedgar-ms):** I agree that it's going to be a thorny function to write (because it will need to be kept in sync with SciKit-Learn), and perhaps best deferred until later.

## The Wrapper Functions

In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_group_min()`.
These wrappers allow the metrics to be passed to SciKit-Learn subroutines such as `make_scorer()`, and they all accept arguments for both the aggregation (as described above) and the underlying metric.

We also provide wrappers for specific fairness metrics used in the literature, such as `demographic_parity_difference()` and `equalized_odds_difference()` (although even then we should add the extra `relative_to=` and `group=` arguments).


## Methods or Functions

Since the `GroupedMetric` object contains no private members, it is not clear that it needs to be a class of its own.
We could continue to use a `Bunch`, but make the `by_groups` entry/property return a Pandas Series (which would embed all the other information we might need).
In the multiple metric case, we would still return a single `Bunch`, but the properties would both be DataFrames.

The question is whether users prefer:
```python
>>> diff = group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A).differences(aggregate='max')
```

> **Reviewer:** Obviously I'm just one data point, but I strongly prefer this over the version below.
>
> **Reviewer:** +1
>
> **Author (riedgar-ms):** Indeed - people were leaning that way, but since there's always the question of 'Do we really need a new concept?' I just wanted to call this out.
or
```python
>>> diff = difference(group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A), aggregate='max')
```