From 8c966952465bf5725d813aa6bbd4d9052cfc8387 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 12:33:24 -0400 Subject: [PATCH 01/42] Starting work Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 api/Updated-Metrics.md diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md new file mode 100644 index 0000000..b3eb760 --- /dev/null +++ b/api/Updated-Metrics.md @@ -0,0 +1,32 @@ +# Updates for Metrics + +This is an update for the existing metrics document, which is being left in place for now as a point of comparison. + +## Assumed data + +In the following we assume that we have variables of the following form defined: + +```python +y_true = [0, 1, 0, 0, 1, 1, ...] +y_pred = [1, 0, 0, 1, 0, 1, ...] +A_sex = [ 'male', 'female', 'female', 'male', ...] +A_race = [ 'black', 'white', 'hispanic', 'black', ...] +A = pd.DataFrame(np.transpose([A_sex, A_race]), columns=['Sex', 'Race']) + +weights = [ 1, 2, 3, 2, 2, 1, ...] +``` + +We actually seek to be very agnostic as to the contents of the `y_true` and `y_pred` arrays. +Meaning is imposed on them by the underlying metrics. + +## Basic Calls + +### Existing Syntax + +```python +result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +print(result) +>>> {'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} +print(type(result)) +>>> +``` \ No newline at end of file From 531c1958c4ad704b611a6b5d4bcf2a02182caf5c Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 14:22:05 -0400 Subject: [PATCH 02/42] More text Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 51 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 45 insertions(+), 6 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index b3eb760..060bca6 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -23,10 +23,49 @@ Meaning is imposed on them by the underlying metrics. ### Existing Syntax +Our basic method is `group_summary()` + +```python +>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> print(result) +{'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} +>>> print(type(result)) + +``` +The `Bunch` is an object which can be accessed in two ways - either as a dictionary - `result['overall']` - or via properties named by the dictionary keys - `result.overall`. +Note that the `by_group` key accesses another `Bunch`. + +We allow for sample weights (and other arguments which require slicing) via `indexed_params`, and passing through other arguments to the underlying metric function (in this case, `normalize`): +```python +>>> flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex, indexed_params=['sample_weight'], sample_weight=weights, normalize=False) +{'overall': 20, 'by_group': {'male': 60, 'female': 21}} +``` + +We also provide some wrappers for common metrics from SciKit-Learn: ```python -result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) -print(result) ->>> {'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} -print(type(result)) ->>> -``` \ No newline at end of file +>>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_sex) +{'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} +``` + +### Proposed Change + +We do not intend to change the API invoked by the user. +What will change is the return type. 
+Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer richer functionality. + +At this basic level, there is only a slight change to the results seen by the user. There are still properties `overall` and `by_groups`, with the same semantics. +However, the `by_groups` result is now a Pandas Series: +```python +>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> result.overall +0.4 +>>> result.by_groups +Male 0.6536 +Female 0.2130 +dtype: float64 +>>> print(type(result.by_groups)) + +``` +We would continue to provide convenience wrappers such as `accuracy_score_group_summary` for users, and support passing through arguments along with `indexed_params`. +There is little advantage to the change at this point. +This will change in the next section. \ No newline at end of file From 66eab2f0578668b4e1b505d7fb43e0aa20c0bbf0 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 15:02:36 -0400 Subject: [PATCH 03/42] Working through next bit Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 38 ++++++++++++++++++++++++++++++++++---- 1 file changed, 34 insertions(+), 4 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 060bca6..9dbc5e7 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -53,19 +53,49 @@ We do not intend to change the API invoked by the user. What will change is the return type. Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer richer functionality. -At this basic level, there is only a slight change to the results seen by the user. There are still properties `overall` and `by_groups`, with the same semantics. -However, the `by_groups` result is now a Pandas Series: +At this basic level, there is only a slight change to the results seen by the user. +There are still properties `overall` and `by_groups`, with the same semantics. +However, the `by_groups` result is now a Pandas Series, and we also provide a `metric` property to record the name of the underlying metric: ```python >>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> result.metric +"sklearn.metrics.accuracy_score" >>> result.overall 0.4 >>> result.by_groups Male 0.6536 Female 0.2130 -dtype: float64 +Name: sklearn.metrics.accuracy_score dtype: float64 >>> print(type(result.by_groups)) ``` We would continue to provide convenience wrappers such as `accuracy_score_group_summary` for users, and support passing through arguments along with `indexed_params`. There is little advantage to the change at this point. -This will change in the next section. \ No newline at end of file +This will change in the next section. + +## Obtaining Scalars + +### Existing Syntax + +We provide methods for turning the `Bunch`es returned from `group_summary()` into scalars: +```python +>>> difference_from_summary(result) +0.4406 +>>> ratio_from_summary(result) +0.3259 +>>> group_max_from_summary(result) +0.6536 +>>> group_min_from_summary(result) +0.2130 +``` +We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_min()` for user convenience. 
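As a concrete illustration (a sketch only; the wrapper names are the ones mentioned above, and they are assumed to accept the same arguments as `group_summary()`, including `indexed_params=` and pass-through arguments):
```python
import fairlearn.metrics as flm

# Each wrapper reduces straight to a scalar, using the assumed data from the top of this document
acc_difference = flm.accuracy_score_difference(y_true, y_pred, sensitive_features=A_sex)
acc_ratio = flm.accuracy_score_ratio(y_true, y_pred, sensitive_features=A_sex)
acc_min = flm.accuracy_score_min(y_true, y_pred, sensitive_features=A_sex)
```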
+ +One + +## Multiple Sensitive Features + +## Conditional (or Segemented) Metrics + +For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) do not return single + +## Multiple Metrics \ No newline at end of file From cf8ae38ee61be7b5a61987b34fa383e7c8a771b2 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 15:56:49 -0400 Subject: [PATCH 04/42] Adding another example Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 37 ++++++++++++++++++++++++++++++++++++- 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 9dbc5e7..992a3c5 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -90,7 +90,42 @@ We provide methods for turning the `Bunch`es returned from `group_summary()` int ``` We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_min()` for user convenience. -One +One point which these helpers lack (although it could be added) is the ability to select alternative values for measuring the difference and ratio. +For example, the user might not be interested in the difference between the maximum and minimum, but the difference from the overall value. +Or perhaps the difference from a particular group. + +### Proposed Change + +The `GroupedMetric` object would have methods for calculating the required scalars. +First, let us consider the differences. + +We would provide operations to calculate differences in various ways (all of these results are a Pandas Series): +```python +>>> result.differences() +Male 0.0 +Female 0.4406 +Name: TBD dtype: float64 +>>> result.differences(relative_to=min) +Male -0.4406 +Female 0.0 +Name: TBD dtype: float64 +>>> result.differences(relative_to=min, abs=True) +Male 0.4406 +Female 0.0 +Name: TBD dtype: float64 +>>> result.differences(relative_to=overall) +Male -0.2436 +Female 0.1870 +Name: TBD dtype: float64 +>>> result.differences(relative_to=overall, abs=True) +Male 0.2436 +Female 0.1870 +Name: TBD dtype: float64 +>>> result.differences(relative_to=group, group='Female', abs=True) +Male 0.4406 +Female 0.0 +Name: TBD dtype: float64 +``` ## Multiple Sensitive Features From 1d8987ca0149c3377d929c34aed1a15d22fad021 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 18:11:32 -0400 Subject: [PATCH 05/42] Add in multiple sensitive features and segmentation Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 118 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 109 insertions(+), 9 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 992a3c5..8c141ce 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -105,32 +105,132 @@ We would provide operations to calculate differences in various ways (all of the Male 0.0 Female 0.4406 Name: TBD dtype: float64 ->>> result.differences(relative_to=min) +>>> result.differences(relative_to='min') Male -0.4406 Female 0.0 Name: TBD dtype: float64 ->>> result.differences(relative_to=min, abs=True) +>>> result.differences(relative_to='min', abs=True) Male 0.4406 Female 0.0 Name: TBD dtype: float64 ->>> result.differences(relative_to=overall) +>>> result.differences(relative_to='overall') Male -0.2436 Female 0.1870 Name: TBD dtype: float64 ->>> result.differences(relative_to=overall, abs=True) +>>> result.differences(relative_to='overall', abs=True) Male 0.2436 Female 0.1870 Name: TBD dtype: float64 ->>> result.differences(relative_to=group, group='Female', abs=True) +>>> 
result.differences(relative_to='group', group='Female', abs=True) Male 0.4406 Female 0.0 Name: TBD dtype: float64 ``` +The arguments introduced so far for the `differences()` method: +- `relative_to=` to decide the common point for the differences. Possible values are `'max'` (the default), `'min'`, `'overall'` and `'group'` +- `group=` to select a group name, only when `relative_to` is set to `'group'`. Default is `None` +- `abs` to indicate whether to take the absolute value of each entry (defaults to false) -## Multiple Sensitive Features +The user could then use the Pandas methods `max()` and `min()` to reduce these Series objects to scalars. +However, this will run into issues where the `relative_to` argument ends up pointing to either the maximum or minimum group, which will have a difference of zero. +That could then be the maximum or minimum value of the set of difference, but probably won't be what the user wants. -## Conditional (or Segemented) Metrics +To address this case, we should add an extra argument `aggregate` to the `differences()` method: +```python +>>> result.differences(aggregate='max') +0.4406 +>>> result.differences(relative_to='overall', aggregate='max') +0.1870 +>>> result.differences(relative_to='overall', abs=True, aggregate='max') +0.2436 +``` + +There would be a similar method called `ratios()` on the `GroupedMetric` object: +```python +>>> result.ratios() +Male 1.0 +Female 0.3259 +Name: TBD dtype: float64 +``` +The `ratios()` method will take the following arguments: +- `relative_to=` similar to `differences()` +- `group=` similar to `differences()` +- `ratio_order=` determines how to build the ratio. Values are + - `sub_unity` to make larger value the denominator + - `super_unity` to make larger value the numerator + - `from_relative` to make the value specified by `relative_to=` the denominator + - `to_relative` to make the value specified by `relative_to=` the numerator +- `aggregate=` similar to `differences()` + +## Intersections of Sensitive Features + +### Existing Syntax + +Our current API does not support evaluating metrics on intersections of sensitive features (e.g. "black and female", "black and male", "white and female", "white and male"). +To achieve this, users currently need to write something along the lines of: +```python +>>> A_combined = A['Sex'] + '-' + A['Race'] + +>>> accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_combined) +{ 'overall': 0.4, by_groups : { 'Female-Black':0.4, 'Female-Hispanic':0.5, 'Female-White':0.5, 'Male-Black':0.5, 'Male-Hispanic': 0.6, 'Male-White':0.7 } } +``` +This is unecessarily cumbersome. +It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were missing would be tedious. + + +### Proposed Change + +If `sensitive_features=` is a DataFrame, we can generate our results in terms of a MultiIndex: +```python +>>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A) +>>> result.by_groups +Sex Race +Male Black 0.5 + White 0.7 + Hispanic 0.6 +Female Black 0.4 + White 0.5 + Hispanic 0.5 +Name: sklearn.metrics.accuracy_score, dtype: float64 +``` +If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. + +The `differences()` and `ratio()` methods would act on this Series as before. 
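Because `by_groups` would be an ordinary Pandas Series, standard Pandas operations already give useful views over the intersections without any new API. A sketch, using the `result` from the example above:
```python
by_groups = result.by_groups

worst_value = by_groups.min()                        # smallest accuracy over all intersections
worst_group = by_groups.idxmin()                     # e.g. ('Female', 'Black')
female_only = by_groups.xs('Female', level='Sex')    # accuracy for each Race within the Female group
```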
+ +## Segmented (or Conditional) Metrics + +For our purposes, Segmented Metrics (alternatively known as Conditional Metrics) do not return single values when aggregation is requested in a call to `differences()` or `ratios()` but instead provide one result for each unique value of the specified segmentation feature(s). + +### Existing Syntax + +Not supported. +Users would have to devise the required code themselves + +### Proposed Change + +We propose adding an extra argument to `differences()` and `ratios()`, to provide a `segment_by=` argument. + +Suppose we have a DataFrame, `A_3` with three sensitive features: Sex, Race and Income Band (the latter having values 'Low' and 'High'). +This could represent a loan scenario where discrimination based on income is allowed, but within the income bands, other sensitive groups must be treated equally. +When `differences()` is invoked with `segment_by=`, the result will not be a scalar, but a Series. +A user might make calls: +```python +>>> result = accuracy_score_group_summary(y_true, y_test, sensitive_features=A_3) +>>> result.differences(aggregate=min, segment_by='Income Band') +Income Band +Low 0.3 +High 0.4 +Name: TBD, dtype: float64 +``` +We can also allow `segment_by=` to be a list of names: +```python +>>> result.differences(aggregate=min, segment_by=['Income Band', 'Sex']) +Income Band Sex +Low Female 0.3 +Low Male 0.35 +High Female 0.4 +High Male 0.5 +``` -For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) do not return single +## Multiple Metrics -## Multiple Metrics \ No newline at end of file From 6d5c5e35ecae2f8a17ad24fd4459e97053e7b715 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 18:21:48 -0400 Subject: [PATCH 06/42] Sketch out the multiple metrics Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 8c141ce..91890a0 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -234,3 +234,26 @@ High Male 0.5 ## Multiple Metrics +Finally, we can also allow for the evaluation of multiple metrics at once. + +### Existing Syntax + +This is not supported. +Users would have to devise their own method + +### Proposed Change + +We allow a list of metric functions in the call to group summary. +Results become DataFrames, with one column for each metric: +```python +>>> result = group_summary([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_sex) +>>> result.overall + sklearn.metrics.accuracy_score sklearn.metrics.precision_score + 0 0.3 0.5 +>>> result.by_groups + sklearn.metrics.accuracy_score sklearn.metrics.precision_score +'Female' 0.4 0.7 +``` +This should generalise to the other methods described above. + +One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. 
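Since `overall` and `by_groups` would then be DataFrames, per-metric summaries again fall out of ordinary Pandas usage. A sketch (the column names are taken from the example above and are not final):
```python
per_metric_min = result.by_groups.min()    # worst group value, one entry per metric column
accuracy_only = result.by_groups['sklearn.metrics.accuracy_score']    # a single metric as a Series
```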
\ No newline at end of file From 00e0a1a47c2e1e35062cab6714d4d9facdc2f0b6 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 25 Aug 2020 18:27:10 -0400 Subject: [PATCH 07/42] Forgot to add a line Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 1 + 1 file changed, 1 insertion(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 91890a0..352f2f0 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -253,6 +253,7 @@ Results become DataFrames, with one column for each metric: >>> result.by_groups sklearn.metrics.accuracy_score sklearn.metrics.precision_score 'Female' 0.4 0.7 +'Male' 0.6 0.75 ``` This should generalise to the other methods described above. From 5ce78ec419bc39e74b8ba1179f6e22da6a55455c Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 26 Aug 2020 09:37:54 -0400 Subject: [PATCH 08/42] Think a bit about multiple metrics and getting the names Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 35 ++++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 352f2f0..40f73f4 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -69,6 +69,20 @@ Name: sklearn.metrics.accuracy_score dtype: float64 >>> print(type(result.by_groups)) ``` +The `metric` property (and using it as the name of the Series) could prove troublesome. +This is because the fully qualified function name, as reconstructed from `__name__`, `__qualname` and `__module__` might not match the user's expectation. +For example: +```python +>>> import sklearn.metrics as skm +>>> skm.accuracy_score.__name__ +'accuracy_score' +>>> skm.accuracy_score.__qualname__ +'accuracy_score' +>>> skm.accuracy_score.__module__ +'sklearn.metrics._classification' +``` +We are seeing some of the actual internal structure here of SciKit-Learn, and the user might not be expecting that. + We would continue to provide convenience wrappers such as `accuracy_score_group_summary` for users, and support passing through arguments along with `indexed_params`. There is little advantage to the change at this point. This will change in the next section. @@ -257,4 +271,23 @@ Results become DataFrames, with one column for each metric: ``` This should generalise to the other methods described above. -One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. \ No newline at end of file +One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. +On possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` +For example, for `index_params=` we would have: +```python +indexed_params = [['sample_weight'], ['sample_weight']] +``` +In the `**kwargs` a single `extra_args=` argument would be accepted (although not required), which would contain the individual `**kwargs` for each metric: +```python +extra_args = [ + { + 'sample_weight': [1,2,1,1,3, ...], + 'normalize': False + }, + { + 'sample_weight': [1,2,1,1,3, ... ], + 'pos_label' = 'Granted' + } +] +``` +If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug. 
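For concreteness, a complete multiple-metric call under this list-based convention might look as follows. This is a sketch of the convention just described, not a settled signature:
```python
import sklearn.metrics as skm
from fairlearn.metrics import group_summary

result = group_summary(
    [skm.accuracy_score, skm.precision_score],
    y_true, y_pred,
    sensitive_features=A_sex,
    # positions in indexed_params and extra_args line up with the metric list
    indexed_params=[['sample_weight'], ['sample_weight']],
    extra_args=[
        {'sample_weight': weights, 'normalize': False},
        {'sample_weight': weights, 'pos_label': 1},
    ],
)
```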
\ No newline at end of file From 9061c702a251f1e0818b8c15ae39acdc3c1cf90d Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 26 Aug 2020 09:48:05 -0400 Subject: [PATCH 09/42] Add a note about some convenience wrappers Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 40f73f4..3c7a87a 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -149,7 +149,7 @@ The user could then use the Pandas methods `max()` and `min()` to reduce these S However, this will run into issues where the `relative_to` argument ends up pointing to either the maximum or minimum group, which will have a difference of zero. That could then be the maximum or minimum value of the set of difference, but probably won't be what the user wants. -To address this case, we should add an extra argument `aggregate` to the `differences()` method: +To address this case, we should add an extra argument `aggregate=` to the `differences()` method: ```python >>> result.differences(aggregate='max') 0.4406 @@ -158,6 +158,7 @@ To address this case, we should add an extra argument `aggregate` to the `differ >>> result.differences(relative_to='overall', abs=True, aggregate='max') 0.2436 ``` +If `aggregate=None` (which would be the default), then the result is a Series, as shown above. There would be a similar method called `ratios()` on the `GroupedMetric` object: ```python @@ -176,6 +177,11 @@ The `ratios()` method will take the following arguments: - `to_relative` to make the value specified by `relative_to=` the numerator - `aggregate=` similar to `differences()` +We would also provide the same wrappers such as `accuracy_score_difference()` but expose the extra arguments discussed here. +One question is whether the default aggregation should be `None` (to match the method), or whether it should default to scalar results similar to the existing methods. + +In the section on Segmented Metrics below, we shall discuss one extra optional argument for `differences()` and `ratios()`. + ## Intersections of Sensitive Features ### Existing Syntax @@ -189,12 +195,12 @@ To achieve this, users currently need to write something along the lines of: { 'overall': 0.4, by_groups : { 'Female-Black':0.4, 'Female-Hispanic':0.5, 'Female-White':0.5, 'Male-Black':0.5, 'Male-Hispanic': 0.6, 'Male-White':0.7 } } ``` This is unecessarily cumbersome. -It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were missing would be tedious. +It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were not represented in the dataset would be tedious. ### Proposed Change -If `sensitive_features=` is a DataFrame, we can generate our results in terms of a MultiIndex: +If `sensitive_features=` is a DataFrame (or list of Series.... exact supported types are TBD), we can generate our results in terms of a MultiIndex. 
Using the `A` DataFrame defined above, a user might write: ```python >>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A) >>> result.by_groups From 5cd4e88830b980abe5080796d3efd11cadd05906 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 26 Aug 2020 10:22:37 -0400 Subject: [PATCH 10/42] Some notes on pitfalls Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 49 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 3c7a87a..d61905a 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -296,4 +296,51 @@ extra_args = [ } ] ``` -If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug. \ No newline at end of file +If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug. + +## User Conveniences + +We can consider allowing the metric function given to `group_summary()` to be represented by a string. +We would provide a mapping of strings to suitable functions. +This would make the following all equivalent: +```python +>>> r1 = group_summary(sklearn.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_sex) +>>> r3 = accuracy_score_group_summary( y_true, y_pred, sensitive_features=A_sex) +``` +We would also allow mixtures of strings and functions in the multiple metric case. + +## Generality + +Throughout this document, we have been describing the case of classification metrics. +However, we do not actually require this. +It is the underlying metric function which gives meaning to the `y_true` and `y_pred` lists. +So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `group_summary()` does not actually care about their datatypes. +For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities. +Or the user might be working on a regression problem, and both `y_true` and `y_pred` would be floating point numbers (or `y_pred` might even be a tuple of predicted value and error). +So long as the underlying metric understands the datastructures, `group_summary()` will not care. + +There will be an effect on the `GroupedMetric` result object. +Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not. +After all, what does "take the ratio of two confusion matrices" even mean? +We should try to trap these cases, and throw a meaningful exception (rather than propagating whatever exception happens to emerge from the underlying libraries). +Since we know that `differences()` and `ratios()` will only work when the metric has produced scalar results, this should be a straightforward test. + +## Pitfalls + +There are some potential pitfalls which could trap the unwary. + +The biggest of these are related to missing classes in the subgroups. +To take an extreme case, suppose that males were always being predicted classes A or B, while females were always predicted classes C or D. +The user could request precision scores, but the results would not really be comparable between the two groups. +With intersections of sensitive features, cases like this become more likely. 
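Where the underlying metric supports it, pinning the label set explicitly is one way to keep the per-group values comparable. A sketch, using the hypothetical classes A-D from the scenario above:
```python
# Fix the set of classes so every group is scored over the same labels,
# even if some labels never occur within a particular group.
flm.group_summary(
    skm.precision_score, y_true, y_pred,
    sensitive_features=A_sex,
    labels=['A', 'B', 'C', 'D'],   # passed through to the underlying metric
    average='macro',
)
```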
+ +Metrics in SciKit-Learn usually have arguments such as `pos_label=` and `labels=` to allow the user to specify the expected labels, and adjust their behaviour accordingly. +However, we do not require that users stick to the metrics defined in SciKit-Learn. + +If we implement the convenience strings-for-functions piece mentioned above, then _when the user specifies one of those strings_ we can log warnings if the appropriate arguments (such as `labels=`) are not specified. +We could even generate the argument ourselves if the user does not specify it. +However, this risks tying Fairlearn to particular versions of SciKit-Learn. + +Unfortunately, the generality of `group_summary()` means that we cannot solve this for the user. +It cannot even tell if it is evaluating a classification or regression problem. \ No newline at end of file From 75e1d1db1879ec914f4f57166a65209a13c3fad7 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 27 Aug 2020 14:46:40 -0400 Subject: [PATCH 11/42] Some more questions Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index d61905a..03847ad 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -343,4 +343,26 @@ We could even generate the argument ourselves if the user does not specify it. However, this risks tying Fairlearn to particular versions of SciKit-Learn. Unfortunately, the generality of `group_summary()` means that we cannot solve this for the user. -It cannot even tell if it is evaluating a classification or regression problem. \ No newline at end of file +It cannot even tell if it is evaluating a classification or regression problem. + +## The Wrapper Functions + +In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracys_score_group_min()`. Do these wrappers add value, or do they end up just polluting our namespace and confusing users? + +The wrappers such as `demographic_parity_difference()` and `equalized_odds_difference()` are arguably useful, since they are specific metrics used in the literature (although even then we might want to add the extra `relative_to=` and `group=` arguments). +The case for `accuracy_score_group_summary()` and related functions is less clear. + +## Methods or Functions + +Since the `GroupMetric` object contains no private members, it is not clear that it needs to be its own oject. +We could continue to use a `Bunch` but make the `group_by` entry/property return a Pandas Series (which would embed all the other information we might need). +In the multiple metric case, we would still return a single `Bunch` but the properties would both be DataFrames. 
+ +The question is whether users prefer: +```python +>>> diff = group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A).difference(aggregate='max') +``` +or +```python +>>> diff = difference(group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A), aggregate='max') +``` From aac662c7b39ec50129cc061e57cef1b81999ff99 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 27 Aug 2020 14:49:03 -0400 Subject: [PATCH 12/42] Change Segmented Metrics to be Conditional Metrics Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 03847ad..c630c9a 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -180,7 +180,7 @@ The `ratios()` method will take the following arguments: We would also provide the same wrappers such as `accuracy_score_difference()` but expose the extra arguments discussed here. One question is whether the default aggregation should be `None` (to match the method), or whether it should default to scalar results similar to the existing methods. -In the section on Segmented Metrics below, we shall discuss one extra optional argument for `differences()` and `ratios()`. +In the section on Conditional Metrics below, we shall discuss one extra optional argument for `differences()` and `ratios()`. ## Intersections of Sensitive Features @@ -217,9 +217,9 @@ If a particular combination of sensitive features had no representatives, then w The `differences()` and `ratio()` methods would act on this Series as before. -## Segmented (or Conditional) Metrics +## Conditional (or Segmented) Metrics -For our purposes, Segmented Metrics (alternatively known as Conditional Metrics) do not return single values when aggregation is requested in a call to `differences()` or `ratios()` but instead provide one result for each unique value of the specified segmentation feature(s). +For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) do not return single values when aggregation is requested in a call to `differences()` or `ratios()` but instead provide one result for each unique value of the specified condition feature(s). ### Existing Syntax @@ -228,23 +228,23 @@ Users would have to devise the required code themselves ### Proposed Change -We propose adding an extra argument to `differences()` and `ratios()`, to provide a `segment_by=` argument. +We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument. Suppose we have a DataFrame, `A_3` with three sensitive features: Sex, Race and Income Band (the latter having values 'Low' and 'High'). This could represent a loan scenario where discrimination based on income is allowed, but within the income bands, other sensitive groups must be treated equally. -When `differences()` is invoked with `segment_by=`, the result will not be a scalar, but a Series. +When `differences()` is invoked with `condition_on=`, the result will not be a scalar, but a Series. 
A user might make calls: ```python >>> result = accuracy_score_group_summary(y_true, y_test, sensitive_features=A_3) ->>> result.differences(aggregate=min, segment_by='Income Band') +>>> result.differences(aggregate=min, condition_on='Income Band') Income Band Low 0.3 High 0.4 Name: TBD, dtype: float64 ``` -We can also allow `segment_by=` to be a list of names: +We can also allow `condition_on=` to be a list of names: ```python ->>> result.differences(aggregate=min, segment_by=['Income Band', 'Sex']) +>>> result.differences(aggregate=min, condition_on=['Income Band', 'Sex']) Income Band Sex Low Female 0.3 Low Male 0.35 From bf96d2bbec76df5289a1d488607c967b7edcee6a Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 1 Sep 2020 09:36:52 -0400 Subject: [PATCH 13/42] Working through some of the suggested changes Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 76 +++++++++++++++++++++--------------------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index c630c9a..0d6504f 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -8,16 +8,16 @@ In the following we assume that we have variables of the following form defined: ```python y_true = [0, 1, 0, 0, 1, 1, ...] -y_pred = [1, 0, 0, 1, 0, 1, ...] -A_sex = [ 'male', 'female', 'female', 'male', ...] -A_race = [ 'black', 'white', 'hispanic', 'black', ...] -A = pd.DataFrame(np.transpose([A_sex, A_race]), columns=['Sex', 'Race']) +y_pred = [1, 0, 0, 1, 0, 1, ...] # Content can be different for other metrics (see below) +A_1 = [ 'C', 'B', 'B', 'C', ...] +A_2 = [ 'M', 'N', 'N', 'P', ...] +A = pd.DataFrame(np.transpose([A_1, A_2]), columns=['SF 1', 'SF 2']) weights = [ 1, 2, 3, 2, 2, 1, ...] ``` -We actually seek to be very agnostic as to the contents of the `y_true` and `y_pred` arrays. -Meaning is imposed on them by the underlying metrics. +We actually seek to be very agnostic as to the contents of the `y_true` and `y_pred` arrays; the meaning is imposed on them by the underlying metrics. +Here we have shown binary values for a simple classification problem, but they could be floating point values from a regression, or even collections of classes and associated probabilities. ## Basic Calls @@ -26,9 +26,9 @@ Meaning is imposed on them by the underlying metrics. Our basic method is `group_summary()` ```python ->>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) >>> print(result) -{'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} +{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}} >>> print(type(result)) ``` @@ -37,14 +37,14 @@ Note that the `by_group` key accesses another `Bunch`. 
We allow for sample weights (and other arguments which require slicing) via `indexed_params`, and passing through other arguments to the underlying metric function (in this case, `normalize`): ```python ->>> flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex, indexed_params=['sample_weight'], sample_weight=weights, normalize=False) -{'overall': 20, 'by_group': {'male': 60, 'female': 21}} +>>> flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, indexed_params=['sample_weight'], sample_weight=weights, normalize=False) +{'overall': 20, 'by_group': {'B': 60, 'C': 21}} ``` We also provide some wrappers for common metrics from SciKit-Learn: ```python >>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_sex) -{'overall': 0.4, 'by_group': {'male': 0.6536, 'female': 0.213}} +{'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}} ``` ### Proposed Change @@ -57,14 +57,14 @@ At this basic level, there is only a slight change to the results seen by the us There are still properties `overall` and `by_groups`, with the same semantics. However, the `by_groups` result is now a Pandas Series, and we also provide a `metric` property to record the name of the underlying metric: ```python ->>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_sex) +>>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) >>> result.metric "sklearn.metrics.accuracy_score" >>> result.overall 0.4 >>> result.by_groups -Male 0.6536 -Female 0.2130 +B 0.6536 +C 0.2130 Name: sklearn.metrics.accuracy_score dtype: float64 >>> print(type(result.by_groups)) @@ -116,28 +116,28 @@ First, let us consider the differences. We would provide operations to calculate differences in various ways (all of these results are a Pandas Series): ```python >>> result.differences() -Male 0.0 -Female 0.4406 +B 0.0 +C 0.4406 Name: TBD dtype: float64 >>> result.differences(relative_to='min') -Male -0.4406 -Female 0.0 +B -0.4406 +C 0.0 Name: TBD dtype: float64 >>> result.differences(relative_to='min', abs=True) -Male 0.4406 -Female 0.0 +B 0.4406 +C 0.0 Name: TBD dtype: float64 >>> result.differences(relative_to='overall') -Male -0.2436 -Female 0.1870 +B -0.2436 +C 0.1870 Name: TBD dtype: float64 >>> result.differences(relative_to='overall', abs=True) -Male 0.2436 -Female 0.1870 +B 0.2436 +C 0.1870 Name: TBD dtype: float64 ->>> result.differences(relative_to='group', group='Female', abs=True) -Male 0.4406 -Female 0.0 +>>> result.differences(relative_to='group', group='C', abs=True) +B 0.4406 +C 0.0 Name: TBD dtype: float64 ``` The arguments introduced so far for the `differences()` method: @@ -163,8 +163,8 @@ If `aggregate=None` (which would be the default), then the result is a Series, a There would be a similar method called `ratios()` on the `GroupedMetric` object: ```python >>> result.ratios() -Male 1.0 -Female 0.3259 +B 1.0 +C 0.3259 Name: TBD dtype: float64 ``` The `ratios()` method will take the following arguments: @@ -189,10 +189,10 @@ In the section on Conditional Metrics below, we shall discuss one extra optional Our current API does not support evaluating metrics on intersections of sensitive features (e.g. "black and female", "black and male", "white and female", "white and male"). 
To achieve this, users currently need to write something along the lines of: ```python ->>> A_combined = A['Sex'] + '-' + A['Race'] +>>> A_combined = A['SF 1'] + '-' + A['SF 2'] >>> accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_combined) -{ 'overall': 0.4, by_groups : { 'Female-Black':0.4, 'Female-Hispanic':0.5, 'Female-White':0.5, 'Male-Black':0.5, 'Male-Hispanic': 0.6, 'Male-White':0.7 } } +{ 'overall': 0.4, by_groups : { 'B-M':0.4, 'B-N':0.5, 'B-P':0.5, 'C-M':0.5, 'C-N': 0.6, 'C-P':0.7 } } ``` This is unecessarily cumbersome. It is also possible that some combinations might not appear in the data (especially as more sensitive features are combined), but identifying which ones were not represented in the dataset would be tedious. @@ -204,13 +204,13 @@ If `sensitive_features=` is a DataFrame (or list of Series.... exact supported t ```python >>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A) >>> result.by_groups -Sex Race -Male Black 0.5 - White 0.7 - Hispanic 0.6 -Female Black 0.4 - White 0.5 - Hispanic 0.5 +SF 1 SF 2 +B M 0.5 + N 0.7 + P 0.6 +C M 0.4 + N 0.5 + P 0.5 Name: sklearn.metrics.accuracy_score, dtype: float64 ``` If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. From 416d427369263820a1267ed9d2390b11fed50456 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 1 Sep 2020 09:40:36 -0400 Subject: [PATCH 14/42] Some more fixes Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 0d6504f..b9c3897 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -43,7 +43,7 @@ We allow for sample weights (and other arguments which require slicing) via `ind We also provide some wrappers for common metrics from SciKit-Learn: ```python ->>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_sex) +>>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_1) {'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}} ``` @@ -230,7 +230,7 @@ Users would have to devise the required code themselves We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument. -Suppose we have a DataFrame, `A_3` with three sensitive features: Sex, Race and Income Band (the latter having values 'Low' and 'High'). +Suppose we have a DataFrame, `A_3` with three sensitive features: SF 1, SF 2 and Income Band (the latter having values 'Low' and 'High'). This could represent a loan scenario where discrimination based on income is allowed, but within the income bands, other sensitive groups must be treated equally. When `differences()` is invoked with `condition_on=`, the result will not be a scalar, but a Series. A user might make calls: @@ -304,9 +304,9 @@ We can consider allowing the metric function given to `group_summary()` to be re We would provide a mapping of strings to suitable functions. 
This would make the following all equivalent: ```python ->>> r1 = group_summary(sklearn.accuracy_score, y_true, y_pred, sensitive_features=A_sex) ->>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_sex) ->>> r3 = accuracy_score_group_summary( y_true, y_pred, sensitive_features=A_sex) +>>> r1 = group_summary(sklearn.accuracy_score, y_true, y_pred, sensitive_features=A_1) +>>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_1) +>>> r3 = accuracy_score_group_summary( y_true, y_pred, sensitive_features=A_1) ``` We would also allow mixtures of strings and functions in the multiple metric case. @@ -331,7 +331,7 @@ Since we know that `differences()` and `ratios()` will only work when the metric There are some potential pitfalls which could trap the unwary. The biggest of these are related to missing classes in the subgroups. -To take an extreme case, suppose that males were always being predicted classes A or B, while females were always predicted classes C or D. +To take an extreme case, suppose that the `B` group speciified by `SF 1` were always being predicted classes H or J, while the `C` group was always predicted classes K or L. The user could request precision scores, but the results would not really be comparable between the two groups. With intersections of sensitive features, cases like this become more likely. From fbb634fc8a15555077895ddf284c28a2fd96daaf Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 1 Sep 2020 10:16:26 -0400 Subject: [PATCH 15/42] More fixes Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index b9c3897..cc91d88 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -215,7 +215,7 @@ Name: sklearn.metrics.accuracy_score, dtype: float64 ``` If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. -The `differences()` and `ratio()` methods would act on this Series as before. +The `differences()` and `ratios()` methods would act on this Series as before. ## Conditional (or Segmented) Metrics @@ -231,7 +231,7 @@ Users would have to devise the required code themselves We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument. Suppose we have a DataFrame, `A_3` with three sensitive features: SF 1, SF 2 and Income Band (the latter having values 'Low' and 'High'). -This could represent a loan scenario where discrimination based on income is allowed, but within the income bands, other sensitive groups must be treated equally. +This could represent a loan scenario where decisions can be based on income, but within the income bands, other sensitive groups must be treated equally. When `differences()` is invoked with `condition_on=`, the result will not be a scalar, but a Series. A user might make calls: ```python @@ -244,13 +244,16 @@ Name: TBD, dtype: float64 ``` We can also allow `condition_on=` to be a list of names: ```python ->>> result.differences(aggregate=min, condition_on=['Income Band', 'Sex']) +>>> result.differences(aggregate=min, condition_on=['Income Band', 'SF 1']) Income Band Sex -Low Female 0.3 -Low Male 0.35 -High Female 0.4 -High Male 0.5 +Low B 0.3 +Low C 0.35 +High B 0.4 +High C 0.5 ``` +There may be demand for allowing the sensitive features to be supplied as a `numpy.ndarray` or even a list of `Series`. 
+To support this, `condition_on=` would need to allow integers (and lists of integers) as inputs, to index the columns. +If the user is specifying a list for `condition_on=` then we should probably be nice and detect cases where a feature is listed twice (especially if we're allowing both names and column indices). ## Multiple Metrics @@ -266,19 +269,19 @@ Users would have to devise their own method We allow a list of metric functions in the call to group summary. Results become DataFrames, with one column for each metric: ```python ->>> result = group_summary([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_sex) +>>> result = group_summary([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_1) >>> result.overall sklearn.metrics.accuracy_score sklearn.metrics.precision_score 0 0.3 0.5 >>> result.by_groups sklearn.metrics.accuracy_score sklearn.metrics.precision_score -'Female' 0.4 0.7 -'Male' 0.6 0.75 +'B' 0.4 0.7 +'C' 0.6 0.75 ``` This should generalise to the other methods described above. One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. -On possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` +Onek possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` For example, for `index_params=` we would have: ```python indexed_params = [['sample_weight'], ['sample_weight']] @@ -300,7 +303,7 @@ If users had a lot of functions with a lot of custom arguments, this could get e ## User Conveniences -We can consider allowing the metric function given to `group_summary()` to be represented by a string. +In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to `group_summary()` to be represented by a string. We would provide a mapping of strings to suitable functions. This would make the following all equivalent: ```python @@ -318,7 +321,7 @@ It is the underlying metric function which gives meaning to the `y_true` and `y_ So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `group_summary()` does not actually care about their datatypes. For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities. Or the user might be working on a regression problem, and both `y_true` and `y_pred` would be floating point numbers (or `y_pred` might even be a tuple of predicted value and error). -So long as the underlying metric understands the datastructures, `group_summary()` will not care. +So long as the underlying metric understands the data structures, `group_summary()` will not care. There will be an effect on the `GroupedMetric` result object. Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not. @@ -354,7 +357,7 @@ The case for `accuracy_score_group_summary()` and related functions is less clea ## Methods or Functions -Since the `GroupMetric` object contains no private members, it is not clear that it needs to be its own oject. +Since the `GroupMetric` object contains no private members, it is not clear that it needs to be its own object. 
We could continue to use a `Bunch` but make the `group_by` entry/property return a Pandas Series (which would embed all the other information we might need). In the multiple metric case, we would still return a single `Bunch` but the properties would both be DataFrames. From 77777ba3efc8e159ced864b968b4b2b1a6f6d2ee Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 1 Sep 2020 10:16:51 -0400 Subject: [PATCH 16/42] Fix an errant sex Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index cc91d88..82dd659 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -245,7 +245,7 @@ Name: TBD, dtype: float64 We can also allow `condition_on=` to be a list of names: ```python >>> result.differences(aggregate=min, condition_on=['Income Band', 'SF 1']) -Income Band Sex +Income Band SF 1 Low B 0.3 Low C 0.35 High B 0.4 From 6d69fa92cfc74f106c168152cc4f3486e1a7ac8d Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 1 Sep 2020 15:50:45 -0400 Subject: [PATCH 17/42] Typo fix Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 82dd659..89a3ba4 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -281,7 +281,7 @@ Results become DataFrames, with one column for each metric: This should generalise to the other methods described above. One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. -Onek possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` +A possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` For example, for `index_params=` we would have: ```python indexed_params = [['sample_weight'], ['sample_weight']] From 13107f433de7ff8ab697ee4540383775db6a6a3d Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 2 Sep 2020 13:05:11 -0400 Subject: [PATCH 18/42] Update the metric_ property Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 89a3ba4..82e84c1 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -55,23 +55,23 @@ Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer At this basic level, there is only a slight change to the results seen by the user. There are still properties `overall` and `by_groups`, with the same semantics. -However, the `by_groups` result is now a Pandas Series, and we also provide a `metric` property to record the name of the underlying metric: +However, the `by_groups` result is now a Pandas Series, and we also provide a `metric_` property to store a reference to the underlying metric: ```python >>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) ->>> result.metric -"sklearn.metrics.accuracy_score" +>>> result.metric_ + >>> result.overall 0.4 >>> result.by_groups B 0.6536 C 0.2130 -Name: sklearn.metrics.accuracy_score dtype: float64 +Name: accuracy_score dtype: float64 >>> print(type(result.by_groups)) ``` -The `metric` property (and using it as the name of the Series) could prove troublesome. 
-This is because the fully qualified function name, as reconstructed from `__name__`, `__qualname` and `__module__` might not match the user's expectation. -For example: +Constructing the name of the Series could be an issue. +In the example above, it is the name of the underlying metric function. +Something as short as the `__name__` could end up being ambiguous, but using the `__module__` property to disambiguate might not match the user's expectations: ```python >>> import sklearn.metrics as skm >>> skm.accuracy_score.__name__ From 048e484b3ed0ff66c883625b47d27604a69a1e93 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 2 Sep 2020 13:08:27 -0400 Subject: [PATCH 19/42] Add extra clarifying note about datatypes for intersections Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 1 + 1 file changed, 1 insertion(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 82e84c1..ef861e1 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -214,6 +214,7 @@ C M 0.4 Name: sklearn.metrics.accuracy_score, dtype: float64 ``` If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. +Although this example has passed a DataFrame in for `sensitive_features=` we should aim to support lists of Series and `numpy.ndarray` as well. The `differences()` and `ratios()` methods would act on this Series as before. From 9ff4e91b6f055e6943838fef7d85e358a6f8f793 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 2 Sep 2020 13:10:00 -0400 Subject: [PATCH 20/42] Expand on note for conditional parity input types Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index ef861e1..348ad44 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -252,7 +252,7 @@ Low C 0.35 High B 0.4 High C 0.5 ``` -There may be demand for allowing the sensitive features to be supplied as a `numpy.ndarray` or even a list of `Series`. +There may be demand for allowing the sensitive features to be supplied as a `numpy.ndarray` or even a list of `Series` (similar to how the `sensitive_features=` argument may not be a DataFrame). To support this, `condition_on=` would need to allow integers (and lists of integers) as inputs, to index the columns. If the user is specifying a list for `condition_on=` then we should probably be nice and detect cases where a feature is listed twice (especially if we're allowing both names and column indices). From 50406491b9d8258c51e99d552179d3a20e64ed59 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 2 Sep 2020 13:17:15 -0400 Subject: [PATCH 21/42] Further updates to the text Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 348ad44..a193dbb 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -328,7 +328,7 @@ There will be an effect on the `GroupedMetric` result object. Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not. After all, what does "take the ratio of two confusion matrices" even mean? We should try to trap these cases, and throw a meaningful exception (rather than propagating whatever exception happens to emerge from the underlying libraries). 
-Since we know that `differences()` and `ratios()` will only work when the metric has produced scalar results, this should be a straightforward test. +Since we know that `differences()` and `ratios()` will only work when the metric has produced scalar results, which should be a straightforward test using [`isscalar()` from Numpy](https://numpy.org/doc/stable/reference/generated/numpy.isscalar.html). ## Pitfalls @@ -351,10 +351,11 @@ It cannot even tell if it is evaluating a classification or regression problem. ## The Wrapper Functions -In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracys_score_group_min()`. Do these wrappers add value, or do they end up just polluting our namespace and confusing users? +In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_group_min()`. +These wrappers allow the metrics to be passed to SciKit-Learn subroutines such as `make_scorer()`, and they all accept arguments for both the aggregation (as described above) and the underlying metric. + +We also provide wrappers for specific fairness metrics used in the literature such `demographic_parity_difference()` and `equalized_odds_difference()` (although even then we should add the extra `relative_to=` and `group=` arguments). -The wrappers such as `demographic_parity_difference()` and `equalized_odds_difference()` are arguably useful, since they are specific metrics used in the literature (although even then we might want to add the extra `relative_to=` and `group=` arguments). -The case for `accuracy_score_group_summary()` and related functions is less clear. ## Methods or Functions From 210872e33a096e5058d6178e35ef4f4463c69d8f Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 2 Sep 2020 13:33:47 -0400 Subject: [PATCH 22/42] Add some suggestions for alternative names Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index a193dbb..2278184 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -302,6 +302,27 @@ extra_args = [ ``` If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug. +## Naming + +The names `group_summary()` and `GroupedMetric` are not necessarily inspired, and there may well be better alternatives. +Changes to these would ripple throughout the module, so agreeing on these is an important first step. + +Some possibilities for the function: + - `group_summary()` + - `metric_by_groups()` + - `calculate_group_metric()` + - ? + +And for the result object: + - `GroupedMetric` + - `GroupMetricResult` + - `MetricByGroups` + - ? + +Other names are also up for debate. +However, things like the wrappers `accuracy_score_group_summary()` will hinge on the names chosen above. +Arguments such as `index_params=` and `ratio_order=` (along with the allowed values of the latter) are important, but narrower in impact. + ## User Conveniences In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to `group_summary()` to be represented by a string. 
From 3ba87aa9c2818ef596f84e5776f1d39e607ecf75 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 8 Sep 2020 14:25:14 -0400 Subject: [PATCH 23/42] Adding notebook of samples Signed-off-by: Richard Edgar --- api/Metrics API Samples.ipynb | 412 ++++++++++++++++++++++++++++++++++ 1 file changed, 412 insertions(+) create mode 100644 api/Metrics API Samples.ipynb diff --git a/api/Metrics API Samples.ipynb b/api/Metrics API Samples.ipynb new file mode 100644 index 0000000..2ff0f4f --- /dev/null +++ b/api/Metrics API Samples.ipynb @@ -0,0 +1,412 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn import svm\n", + "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "import pandas as pd\n", + "import shap" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "X_raw, Y = shap.datasets.adult()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "A = X_raw[['Sex','Race']]\n", + "X = X_raw.drop(labels=['Sex', 'Race'],axis = 1)\n", + "X = pd.get_dummies(X)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "sc = StandardScaler()\n", + "X_scaled = sc.fit_transform(X)\n", + "X_scaled = pd.DataFrame(X_scaled, columns=X.columns)\n", + "\n", + "le = LabelEncoder()\n", + "Y = le.fit_transform(Y)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X_scaled, \n", + " Y, \n", + " A,\n", + " test_size = 0.2,\n", + " random_state=0,\n", + " stratify=Y)\n", + "\n", + "# Work around indexing issue\n", + "X_train = X_train.reset_index(drop=True)\n", + "A_train = A_train.reset_index(drop=True)\n", + "X_test = X_test.reset_index(drop=True)\n", + "A_test = A_test.reset_index(drop=True)\n", + "\n", + "# Improve labels\n", + "A_test.Sex.loc[(A_test['Sex'] == 0)] = 'female'\n", + "A_test.Sex.loc[(A_test['Sex'] == 1)] = 'male'\n", + "\n", + "\n", + "A_test.Race.loc[(A_test['Race'] == 0)] = 'Amer-Indian-Eskimo'\n", + "A_test.Race.loc[(A_test['Race'] == 1)] = 'Asian-Pac-Islander'\n", + "A_test.Race.loc[(A_test['Race'] == 2)] = 'Black'\n", + "A_test.Race.loc[(A_test['Race'] == 3)] = 'Other'\n", + "A_test.Race.loc[(A_test['Race'] == 4)] = 'White'" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "lr_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n", + "\n", + "lr_predictor.fit(X_train, Y_train)\n", + "Y_pred_lr = lr_predictor.predict(X_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "svm_predictor = svm.SVC()\n", + 
"\n", + "svm_predictor.fit(X_train, Y_train)\n", + "Y_pred_svm = svm_predictor.predict(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Sample APIs" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import accuracy_score, f1_score, fbeta_score\n", + "from fairlearn.metrics import group_summary, make_derived_metric, difference_from_summary" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Report one disaggregated metric in a data frame" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Amer-Indian-Eskimo 0.923077\n", + "Asian-Pac-Islander 0.840796\n", + "Black 0.914826\n", + "Other 0.851064\n", + "White 0.826492\n", + "dtype: float64\n", + "=======================\n", + "Amer-Indian-Eskimo 0.923077\n", + "Asian-Pac-Islander 0.840796\n", + "Black 0.914826\n", + "Other 0.851064\n", + "White 0.826492\n", + "overall 0.836481\n", + "dtype: float64\n" + ] + } + ], + "source": [ + "# Current\n", + "bunch = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", + "frame = pd.Series(bunch.by_group)\n", + "frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})\n", + "print(frame)\n", + "print(\"=======================\")\n", + "print(frame_o)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "6513\n" + ] + } + ], + "source": [ + "# Proposed\n", + "result = GroupedMetric(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", + "frame = result.by_group\n", + "frame_o = result.to_df() # Throw if there is a group called 'overall'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Report several disaggregated metrics in a data frame." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " accuracy f1\n", + "Amer-Indian-Eskimo 0.923077 0.666667\n", + "Asian-Pac-Islander 0.840796 0.652174\n", + "Black 0.914826 0.550000\n", + "Other 0.851064 0.000000\n", + "White 0.826492 0.612800\n", + "=======================\n", + " accuracy f1\n", + "Amer-Indian-Eskimo 0.923077 0.666667\n", + "Asian-Pac-Islander 0.840796 0.652174\n", + "Black 0.914826 0.550000\n", + "Other 0.851064 0.000000\n", + "White 0.826492 0.612800\n", + "overall 0.836481 0.610033\n" + ] + } + ], + "source": [ + "# Current\n", + "bunch1 = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", + "bunch2 = group_summary(f1_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", + "frame = pd.DataFrame({\n", + " 'accuracy': bunch1.by_group, 'f1': bunch2.by_group})\n", + "frame_o = pd.DataFrame({\n", + " 'accuracy': {**bunch1.by_group, 'overall': bunch1.overall},\n", + " 'f1': {**bunch2.by_group, 'overall': bunch2.overall}})\n", + "\n", + "print(frame)\n", + "print(\"=======================\")\n", + "print(frame_o)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "result = GroupedMetric({ 'accuracy':accuracy_score, 'f1':f1_score}, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", + "frame = result.by_group\n", + "frame_o = result.to_df() # Throw if there is a group called 'overall'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Report metrics for intersecting sensitive features" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Amer-Indian-Eskimo-female 0.937500\n", + "Amer-Indian-Eskimo-male 0.916667\n", + "Asian-Pac-Islander-female 0.879310\n", + "Asian-Pac-Islander-male 0.825175\n", + "Black-female 0.962382\n", + "Black-male 0.866667\n", + "Other-female 0.909091\n", + "Other-male 0.833333\n", + "White-female 0.917824\n", + "White-male 0.785510\n", + "dtype: float64\n", + "=======================\n", + "Amer-Indian-Eskimo-female 0.937500\n", + "Amer-Indian-Eskimo-male 0.916667\n", + "Asian-Pac-Islander-female 0.879310\n", + "Asian-Pac-Islander-male 0.825175\n", + "Black-female 0.962382\n", + "Black-male 0.866667\n", + "Other-female 0.909091\n", + "Other-male 0.833333\n", + "White-female 0.917824\n", + "White-male 0.785510\n", + "overall 0.836481\n", + "dtype: float64\n" + ] + } + ], + "source": [ + "# Current\n", + "sf = A_test['Race']+'-'+A_test['Sex'] # User builds new column manually\n", + "\n", + "bunch = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=sf)\n", + "frame = pd.Series(bunch.by_group)\n", + "frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})\n", + "\n", + "print(frame)\n", + "print(\"=======================\")\n", + "print(frame_o)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "result = GroupedMetric(accuracy_score, Y_test, Y_pred_lr, sensitive_features=[A['Race'], A['Sex']])\n", + "frame = result.by_group # Will have a MultiIndex built from the two sensitive feature columns\n", + "frame_o = result.to_def() # Not sure how to handle adding the extra 'overall' row" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Report several 
performance and fairness metrics of several models in a data frame" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'make_group_summary' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 2\u001b[0m custom_difference1 = make_derived_metric(\n\u001b[0;32m 3\u001b[0m \u001b[0mdifference_from_summary\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 4\u001b[1;33m make_group_summary(fbeta_score, beta=0.5))\n\u001b[0m\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mcustom_difference2\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;31mNameError\u001b[0m: name 'make_group_summary' is not defined" + ] + } + ], + "source": [ + "# Current\n", + "custom_difference1 = make_derived_metric(\n", + " difference_from_summary,\n", + " make_group_summary(fbeta_score, beta=0.5))\n", + "\n", + "def custom_difference2(y_true, y_pred, sf):\n", + " bunch = group_summary(fbeta_score, y_true, y_pred, sf, beta=0.5)\n", + " frame = pd.Series(bunch.by_group)\n", + " return (frame-frame['B']).min()\n", + "\n", + "fairness_metrics = {\n", + " 'Custom difference 1': custom_difference,\n", + " 'Demographic parity difference': demographic_parity_difference,\n", + " 'Worst-case balanced accuracy': balanced_accuracy_group_min}\n", + "perfomance_metrics = {\n", + " 'FPR': false_positive_rate,\n", + " 'FNR': false_negative_rate}\n", + "predictions_by_estimator = {\n", + " 'logreg': y_pred_lr,\n", + " 'svm': y_pred_svm}\n", + "\n", + "df = pd.DataFrame()\n", + "for pred_key, y_pred in predictions_by_estimator.items():\n", + " for fairm_key, fairm in fairness_metrics.items():\n", + " df.loc[fairm_key, pred_key] = fairm(y_true, y_pred, sf)\n", + " for perfm_key, perfm in performance_metrics.items():\n", + " df.loc[perfm_key, pred_key] = perfm(y_true, y_pred)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From c13da58b8c75d7574ca9a6d701dff1f787c6e1a0 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Tue, 8 Sep 2020 20:44:01 -0400 Subject: [PATCH 24/42] More examples in notebook Signed-off-by: Richard Edgar --- api/Metrics API Samples.ipynb | 250 +++++++++++++++++++++++++++------- 1 file changed, 200 insertions(+), 50 deletions(-) diff --git a/api/Metrics API Samples.ipynb b/api/Metrics API Samples.ipynb index 2ff0f4f..4c4651c 100644 --- a/api/Metrics API Samples.ipynb +++ b/api/Metrics API Samples.ipynb @@ -35,7 +35,7 @@ }, { "cell_type": "code", - "execution_count": 6, + 
"execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -49,21 +49,18 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 25, "metadata": {}, "outputs": [ { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "A value is trying to be set on a copy of a slice from a DataFrame\n", - "\n", - "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", - "\n", - "A value is trying to be set on a copy of a slice from a DataFrame\n", - "\n", - "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n" + "ename": "AttributeError", + "evalue": "'numpy.ndarray' object has no attribute 'reset_index'", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[0mA_train\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mA_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 12\u001b[0m \u001b[0mX_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 13\u001b[1;33m \u001b[0mY_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mY_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 14\u001b[0m \u001b[0mA_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mA_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 15\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'reset_index'" ] } ], @@ -96,7 +93,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -108,7 +105,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -127,12 +124,14 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, f1_score, fbeta_score\n", - "from fairlearn.metrics import group_summary, make_derived_metric, difference_from_summary" + "from fairlearn.metrics import group_summary, make_derived_metric, difference_from_summary, make_metric_group_summary\n", + "from fairlearn.metrics import demographic_parity_difference, balanced_accuracy_score_group_min\n", + "from fairlearn.metrics import false_negative_rate, false_positive_rate" ] }, { @@ -144,7 +143,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ 
-180,17 +179,9 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "6513\n" - ] - } - ], + "outputs": [], "source": [ "# Proposed\n", "result = GroupedMetric(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", @@ -207,7 +198,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -267,7 +258,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -335,49 +326,208 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 12, "metadata": {}, "outputs": [ { - "ename": "NameError", - "evalue": "name 'make_group_summary' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 2\u001b[0m custom_difference1 = make_derived_metric(\n\u001b[0;32m 3\u001b[0m \u001b[0mdifference_from_summary\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 4\u001b[1;33m make_group_summary(fbeta_score, beta=0.5))\n\u001b[0m\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mcustom_difference2\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;31mNameError\u001b[0m: name 'make_group_summary' is not defined" + "name": "stdout", + "output_type": "stream", + "text": [ + " logreg svm\n", + "Custom difference 1 0.688073 0.769231\n", + "Custom difference 2 -0.673229 -0.704020\n", + "Demographic parity difference 0.166402 0.196253\n", + "Worst-case balanced accuracy 0.476190 0.476190\n", + "FPR 0.066734 0.054803\n", + "FNR 0.468750 0.462372\n" ] } ], "source": [ "# Current\n", + "fb_s = lambda y_t, y_p: fbeta_score(y_t, y_p, beta=0.5)\n", "custom_difference1 = make_derived_metric(\n", " difference_from_summary,\n", - " make_group_summary(fbeta_score, beta=0.5))\n", + " make_metric_group_summary(fb_s))\n", "\n", - "def custom_difference2(y_true, y_pred, sf):\n", - " bunch = group_summary(fbeta_score, y_true, y_pred, sf, beta=0.5)\n", + "def custom_difference2(y_true, y_pred, sensitive_features):\n", + " bunch = group_summary(fbeta_score, y_true, y_pred, sensitive_features=sensitive_features, beta=0.5)\n", " frame = pd.Series(bunch.by_group)\n", - " return (frame-frame['B']).min()\n", + " return (frame-frame['White']).min()\n", "\n", "fairness_metrics = {\n", - " 'Custom difference 1': custom_difference,\n", + " 'Custom difference 1': custom_difference1,\n", + " 'Custom difference 2': custom_difference2,\n", " 'Demographic parity difference': demographic_parity_difference,\n", - " 'Worst-case balanced accuracy': balanced_accuracy_group_min}\n", - "perfomance_metrics = {\n", + " 'Worst-case balanced accuracy': balanced_accuracy_score_group_min}\n", + "performance_metrics = {\n", " 'FPR': false_positive_rate,\n", " 'FNR': false_negative_rate}\n", "predictions_by_estimator = {\n", - " 'logreg': y_pred_lr,\n", - " 'svm': y_pred_svm}\n", + " 'logreg': Y_pred_lr,\n", + " 'svm': Y_pred_svm}\n", 
"\n", "df = pd.DataFrame()\n", "for pred_key, y_pred in predictions_by_estimator.items():\n", " for fairm_key, fairm in fairness_metrics.items():\n", - " df.loc[fairm_key, pred_key] = fairm(y_true, y_pred, sf)\n", + " df.loc[fairm_key, pred_key] = fairm(Y_test, y_pred, sensitive_features=A_test['Race'])\n", " for perfm_key, perfm in performance_metrics.items():\n", - " df.loc[perfm_key, pred_key] = perfm(y_true, y_pred)" + " df.loc[perfm_key, pred_key] = perfm(Y_test, y_pred)\n", + " \n", + "print(df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "custom_difference1 = make_derived_metric('difference', fbeta_score, parms={'beta', 0.5})\n", + "\n", + "def custom_difference2(y_true, y_pred, sensitive_features):\n", + " tmp = GroupedMetric(fbeta_score, y_true, y_pred, sensitive_features=sensitive_features, parms={'beta':0.5})\n", + " return tmp.differences(relative_to='group', group='White', aggregate='min')\n", + "\n", + "# The remainder as before" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a fairness-performance raster plot of several models" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEGCAYAAAB/+QKOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAdDUlEQVR4nO3df5CdVZ3n8ffHTjIkKAmBxtUkmjATcaI7BmyjFmoURBrXEHCsnWR0QN01xiEqbJkx0Slda2eswXZdnSkEIyComBS/icqYMIxERcHcQMgPYjRm0XSC0lSMjCFrSPjuH89puHSfe/t2p5++fZPPq+qpfp5zzvPc7+mk+9vPr3MUEZiZmfX1vGYHYGZmo5MThJmZZTlBmJlZlhOEmZllOUGYmVnWmGYHMJxOPvnkmD59erPDMDNrGRs2bHg8ItpzdUdVgpg+fTqVSqXZYZiZtQxJv6pV50tMZmaW5QRhZmZZThBmZpZVaoKQ1Clpu6QdkpZl6pdK2piWLZIOS5os6bSq8o2SnpB0aZmxmpnZc5V2k1pSG3AFcA7QDayXtDoiHu5tExFdQFdqPw+4LCL2AnuB2VXH2Q3cVlasZmbWX5lPMc0BdkTETgBJq4D5wMM12i8EVmbKzwZ+GRE177Qfidsf3E3Xmu3s2XeAF08az9JzT+OC06eU8VFmZi2lzEtMU4BdVdvdqawfSROATuCWTPUC8omjd99FkiqSKj09PYMK8PYHd7P81s3s3neAAHbvO8DyWzdz+4O7B3UcM7OjUZkJQpmyWmOLzwPuTZeXnj2ANA44H7ip1odExIqI6IiIjvb27LseNXWt2c6Bpw4/p+zAU4fpWrN9UMcxMzsalZkguoFpVdtTgT012tY6SzgPeCAifjvMsQGwZ9+BQZWbmR1LykwQ64GZkmakM4EFwOq+jSRNBOYCd2SOUeu+xLB48aTxgyo3MzuWlJYgIuIQsARYA2wDboyIrZIWS1pc1fRCYG1E7K/eP92XOAe4tawYl557GuPHtj2nbPzYNpaee1pZH2lm1jJ0NE052tHREYMdi8lPMZnZsUzShojoyNUdVYP1DcUFp09xQjAzy/BQG2ZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaWVWqCkNQpabukHZKWZeqXStqYli2SDkuanOomSbpZ0s8kbZP0+jJjNTOz5yotQUhqA64AzgNmAQslzapuExFdETE7ImYDy4F1EbE3VX8J+F5EvBx4FcW0pWZmNkLKPIOYA+yIiJ0RcRBYBcyv034hsBJA0gnAm4BrACLiYETsKzFWMzPro8wEMQXYVbXdncr6kTQB6ARuSUWnAj3A1yQ9KOlqScfX2HeRpIqkSk9Pz/BFb2Z2jCszQShTFjXazgPurbq8NAY4A7gyIk4H9gP97mEARMSKiOiIiI729vYjjdnMzJIyE0Q3MK1qeyqwp0bbBaTLS1X7dkfE/Wn7ZoqEYWZmI6TMBLEemClphqRxFElgdd9GkiYCc4E7essi4jfALkmnpaKzgYdLjNXMzPoYU9aBI+KQpCXAGqANuDYitkpanOqvSk0vBNZGxP4+h/gwcENKLjuB95UVq5mZ9aeIWrcFWk9HR0dUKpVmh2Fm1jIkbYiIjlyd36Q2M7MsJwgzM8tygjAzsywnCDMzy3KCMDOzLCcIMzPLcoIwM7MsJwgzM8tygjAzsywnCDMzy3KCMDOzLCcIMzPLcoIwM7MsJwgzM8tygjAzs6xSE4SkTknbJe2Q1G9OaUlLJW1MyxZJhyVNTnWPSNqc6jzJg5nZCCttRjlJbcAVwDkUc0yvl7Q6Ip6ZOjQiuoCu1H4ecFlE7K06zFsi4vGyYjQzs9rKPIOYA+yIiJ0RcRBYBcyv034hsLLEeMzMbBDKTBBTgF1V292prB9JE4BO4Jaq4gDWStogaVGtD5G0SFJFUqWnp2cYwjYzMyg3QShTVmsC7HnAvX0uL50ZEWcA5wGXSHpTbseIWBERHRHR0d7efmQ
Rm5nZM8pMEN3AtKrtqcCeGm0X0OfyUkTsSV8fA26juGRlZmYjpMwEsR6YKWmGpHEUSWB130aSJgJzgTuqyo6X9ILedeBtwJYSYzUzsz5Ke4opIg5JWgKsAdqAayNiq6TFqf6q1PRCYG1E7K/a/YXAbZJ6Y/xWRHyvrFjNzKw/RdS6LdB6Ojo6olLxKxNmZo2StCEiOnJ1fpPazMyynCDMzCzLCcLMzLKcIMzMLMsJwszMspwgzMwsywnCzMyynCDMzCzLCcLMzLIGTBCSLpE0qWr7REl/W2pUZmbWdI2cQXwgIvb1bkTE74APlBaRmZmNCo0kiOcpjZoHz0wlOq68kMzMbDRoZDTXNcCNkq6imPBnMeCRVc3MjnKNJIiPAx8EPkQxS9xa4OoygzIzs+YbMEFExNPAlWkxM7NjRM17EJJuTF83S9rUd2nk4JI6JW2XtEPSskz9Ukkb07JF0mFJk6vq2yQ9KOk7Q+mcmZkNXb0ziI+mr+8YyoHTzewrgHMo5qdeL2l1RDzc2yYiuoCu1H4ecFlE7O0TwzbghKHEYGZmQ1fzDCIiHk2/5K+JiF/1XRo49hxgR0TsjIiDwCpgfp32C4GVvRuSpgL/Bd/vMDNrirqPuUbEYeBJSROHcOwpwK6q7e5U1o+kCUAncEtV8ReBvwOervchkhZJqkiq9PT0DCFMMzPLaeQppv8HbJZ0F7C/tzAiPjLAfsqU1ZoAex5wb+/lJUnvAB6LiA2S3lzvQyJiBbACijmpB4jJzMwa1EiC+G5aqjXyi7gbmFa1PRXYU6PtAqouLwFnAudLejtwHHCCpG9GxHsa+FwzMxsGjSSISRHxpeoCSR+t1bjKemCmpBnAbook8Nd9G6XLV3OBZ375R8RyYHmqfzPwMScHM7OR1chQGxdnyt470E4RcQhYQvEm9jbgxojYKmmxpMVVTS8E1kbE/txxzMysORSRv1okaSHFX/xvAH5YVfUC4HBEvLX88Aano6MjKpVKs8MwM2sZkjZEREeurt4lph8DjwInA/+7qvw/gIZelDMzs9ZVM0Gkdx1+Bbxe0kuBmRHxb5LGA+MpEoWZmR2lGpkw6APAzcBXUtFU4PYSYzIzs1GgkZvUl1A8dvoEQET8AjilzKDMzKz5GkkQf0xDZQAgaQyNvQdhZmYtrJEEsU7SJ4Dxks4BbgK+XW5YZmbWbI0kiGVAD7CZYuKgO4G/LzMoMzNrvkYnDPpqWszM7BhRM0EMNClQRPzF8IdjZmajRb0ziKcpbkZ/i+Kew4ERicjMzEaFehMGzaaYxOf5FEniH4FXALsbnDDIzMxa2EATBv0sIj4dEWdQnEV8HbhsRCIzM7OmqnuTWtIUimG6LwR+R5EcbhuBuMzMrMnq3aReRzFy640Uw3vvTVXjJE3unf3NzMyOTvXOIF5KcZP6g8CiqnKl8lNLjMvMzJqs3miu00cwDjMzG2UaeZN6yCR1StouaYekZZn6pZI2pmWLpMOSJks6TtJPJT0kaaukz5QZp5mZ9VdagpDUBlwBnAfMAhZKmlXdJiK6ImJ2eqR2ObAu3dv4I3BWRLwKmA10SnpdWbGamVl/ZZ5BzAF2RMTONBrsKmB+nfYLgZUAUfhDKh+bFo8ga2Y2ghqZMOjzkl4xhGNPAXZVbXenstxnTAA6gVuqytokbQQeA+6KiPtr7LtIUkVSpaenZwhhmplZTiNnED8DVki6X9JiSRMbPLYyZbXOAuYB91Y/OhsRh9Olp6nAHEmvzO0YESsioiMiOtrb2xsMzczMBjJggoiIqyPiTOAiYDqwSdK3JL1lgF27gWlV21OBPTXaLiBdXsp8/j7gHoozDDMzGyEN3YNIN5xfnpbHgYeA/yFpVZ3d1gMzJc2QNI4iCazOHHsiMBe4o6qsXdKktD4eeCvFmYyZmY2QAeeDkPQFiktA/w58NiJ+mqoul7S91n4RcUjSEmAN0AZcGxFbJS1O9VelphcCayNif9XuLwKuT4npecCNEfGdQfbNzMyOgCLqPxwk6f3Aqoh4MlM3MSJ+X1Zwg9XR0RGVSqXZYZiZtQxJGyKiI1fXyCWmd/dNDpLuBhhNycHMzIZXvcH6jgMmACdLOpFnn0o6AXjxCMRmZmZNVO8exAeBSymSwQNV5U9QvCFtZmZHsXqD9X0J+JKkD0fEv4xgTGZmNgrUu8R0VkT8O7Bb0jv71kfEraVGZmZmTVXvEtNcikdb52XqAnCCMDM7itW7xPRpSc8D/jUibhzBmMzMbBSo+5hrRDwNLBmhWMzMbBRp5D2IuyR9TNK0NJnPZEmTS4/MzMyaasChNoD3p6+XVJV5Tmozs6PcgAkiImaMRCBmZja6NHIGQZqLYRZwXG9ZRHy9rKDMzKz5GhnN9dPAmykSxJ0Uc0z/CHCCMDM7ijVyk/pdwNnAbyLifcCrgD8pNSozM2u6RhLEgfS46yFJJ1DMEe0b1GZmR7lGEkQlze72VWADxcB9P627RyKpU9J2STskLcvUL5W0MS1bJB1Oj9FOk/R9SdskbZX00cF0yszMjtyAEwY9p7E0HTghIjY10LYN+DlwDsX81OuBhRHxcI3284DLIuIsSS8CXhQRD0h6AUViuqDWvr08YZCZ2eDUmzCo0aeY3gm8geL9hx8BAyYIYA6wIyJ2pmOsAuYDtX7JLwRWAkTEo8Cjaf0/JG0DptTZ18zMhtmAl5gkfRlYDGwGtgAflNTIfBBTgF1V292pLPcZE4BO4JZM3XTgdOD+GvsuklSRVOnp6WkgLDMza0QjZxBzgVdGuhYl6XqKZDEQZcpqXc+aB9wbEXufcwDp+RRJ49KIeCK3Y0SsAFZAcYmpgbjMzKwBjdyk3g68pGp7Go1dYupObXtNBfbUaLuAdHmpl6SxFMnhBs89YWY28hpJECcB2yTdI+keivsA7ZJWS1pdZ7/1wExJMySNo0gC/dpLmkhxlnJHVZmAa4BtEfGFhntjZmbDppFLTJ8ayoEj4pCkJcAaoA24NiK2Slqc6q9KTS8E1kbE/qrdzwT+BtgsaWMq+0RE3DmUWMzMbPAGfMxV0vGkl+UkvQx4OcUkQk+NRICD4cdczcwGp95jro1cYvoBcJykKcDdwPuA64YvPDMzG40aSRCKiCeBdwL/EhEXAq8oNywzM2u2hhKEpNcD7wa+m8raygvJzMxGg0YSxKXAcuC2dJP5VOD7pUZlZmZN18iMcuuAdVXbO4GPlBmUmZk1X80EIemLEXGppG+TeQM6Is4vNTIzM2uqemcQ30hfPz8SgZiZ2ehSM0FExIb0dZ2k9rTu0fDMzI4RNW9Sq/A/JT0O/Az4uaQeSUN6s9rMzFpLvaeYLqUY8uI1EXFSRJwIvBY4U9JlIxGcmZk1T70EcRHFDHD/t7cgPcH0nlRnZmZHsXoJYmxEPN63MN2HGFteSGZmNhrUSxAHh1hnZmZHgXqPub5KUm4WNwHHlRSPmZmNEvUec/V4S2Zmx7BGxmIaMkmdkrZL2iFpWaZ+qaSNadki6bCkyanuWkmPSdpSZoxmZpZXWoKQ1AZcAZwHzAIWSppV3SYiuiJidk
TMphgQcF1E7E3V1wGdZcVnZmb1lXkGMQfYERE7I+IgsAqYX6f9QmBl70ZE/ADYW7u5mZmVqcwEMQXYVbXdncr6kTSB4mzhlhLjMTOzQSgzQShTVmsC7HnAvVWXlxr/EGmRpIqkSk+Ph4oyMxsuZSaIbmBa1fZUYE+Ntguourw0GBGxIiI6IqKjvb19KIcwM7OMMhPEemCmpBmSxlEkgdV9G0maCMwF7igxFjMzG6TSEkREHAKWAGuAbcCNacrSxZIWVzW9EFgbEfur95e0EvgJcJqkbkn/raxYzcysP0XUui3Qejo6OqJSqTQ7DDOzliFpQ0R05OpKfVHOzMxalxOEmZllOUGYmVmWE4SZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWWVmiAkdUraLmmHpGWZ+qWSNqZli6TDkiY3sq+ZmZWrtAQhqQ24AjgPmAUslDSruk1EdEXE7IiYDSwH1kXE3kb2NTOzcpV5BjEH2BEROyPiILAKmF+n/UJg5RD3NTOzYVZmgpgC7Kra7k5l/UiaAHQCtwxh30WSKpIqPT09Rxy0mZkVykwQypRFjbbzgHsjYu9g942IFRHREREd7e3tQwjTzMxyykwQ3cC0qu2pwJ4abRfw7OWlwe5rZmYlKDNBrAdmSpohaRxFEljdt5GkicBc4I7B7mtmZuUZU9aBI+KQpCXAGqANuDYitkpanOqvSk0vBNZGxP6B9i0rVjMz608RtW4LtJ6Ojo6oVCrNDsPMrGVI2hARHbk6v0ltZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZlmlJghJnZK2S9ohaVmNNm+WtFHSVknrqso/KmlLKr+0zDjNzKy/0maUk9QGXAGcQzHH9HpJqyPi4ao2k4AvA50R8WtJp6TyVwIfAOYAB4HvSfpuRPyirHjNzOy5yjyDmAPsiIidEXEQWAXM79Pmr4FbI+LXABHxWCr/c+C+iHgyIg4B6yimJjUzsxFSZoKYAuyq2u5OZdVeBpwo6R5JGyRdlMq3AG+SdJKkCcDbgWm5D5G0SFJFUqWnp2eYu2Bmduwq7RIToExZ3wmwxwCvBs4GxgM/kXRfRGyTdDlwF/AH4CHgUO5DImIFsAKKOamHKXYzs2NemQmim+f+1T8V2JNp83hE7Af2S/oB8Crg5xFxDXANgKTPprZmZpbc/uBuutZsZ8++A7x40niWnnsaF5ze90LN0JV5iWk9MFPSDEnjgAXA6j5t7gDeKGlMupT0WmAbQNUN65cA7wRWlhirmVlLuf3B3Sy/dTO79x0ggN37DrD81s3c/uDuYfuM0s4gIuKQpCXAGqANuDYitkpanOqvSpeSvgdsAp4Gro6ILekQt0g6CXgKuCQifldWrGZmraZrzXYOPHX4OWUHnjpM15rtw3YWUeYlJiLiTuDOPmVX9dnuAroy+76xzNjMzFrZnn0HBlU+FH6T2sysBb140vhBlQ+FE4SZWQtaeu5pjB/b9pyy8WPbWHruacP2GaVeYjIzs3L03mco8ykmJwgzsxZ1welThjUh9OVLTGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpaliKNnAFRJPcCvmh3HIJ0MPN7sIIaJ+zI6uS+j02jpy0sjoj1XcVQliFYkqRIRHc2OYzi4L6OT+zI6tUJffInJzMyynCDMzCzLCaL5VjQ7gGHkvoxO7svoNOr74nsQZmaW5TMIMzPLcoIwM7MsJ4hhJqlT0nZJOyQty9RL0j+n+k2Szqiqe0TSZkkbJVX67PfhdNytkj7Xqn2RNFvSfb3lkua0QF8mSbpZ0s8kbZP0+lQ+WdJdkn6Rvp7Ywn3pSmWbJN0maVKr9qWq/mOSQtLJrdyXZvzsPyMivAzTQjH39i+BU4FxwEPArD5t3g78KyDgdcD9VXWPACdnjvsW4N+AP0nbp7RwX9YC51Xtf08L9OV64L+n9XHApLT+OWBZWl8GXN7CfXkbMCatX97KfUnb04A1FC/O9vt/2Cp9acbPfvXiM4jhNQfYERE7I+IgsAqY36fNfODrUbgPmCTpRQMc90PAP0XEHwEi4rHhDjyjrL4EcEJanwjsGc6gaxhyXySdALwJuAYgIg5GxL6qfa5P69cDF5TbDaCkvkTE2og4lPa/D5jaqn1J/g/wdxT/30ZCWX1pxs/+M5wghtcUYFfVdncqa7RNAGslbZC0qKrNy4A3Srpf0jpJrxnmuHPK6sulQJekXcDngeXDGXQNR9KXU4Ee4GuSHpR0taTjU5sXRsSjAOnrKWUE32CcjbSp15dq76f4S7dspfRF0vnA7oh4qLTI+yvr36UZP/vPcIIYXsqU9f0Lpl6bMyPiDOA84BJJb0rlY4ATKU5LlwI3SsodZziV1ZcPAZdFxDTgMtJfTSU7kr6MAc4AroyI04H9FJeTmqXUvkj6JHAIuOHIQx3QsPdF0gTgk8CnhjPQBpT179KMn/1nOEEMr26Ka5+9ptL/EkrNNhHR+/Ux4DaK09befW5Np6Y/BZ6mGOirTGX15WLg1rR+U1V5mY6kL91Ad0Tcn8pvpvhhBvht7yW19HUkTv/L6guSLgbeAbw70gXvkpXRlz8FZgAPSXoktX9A0n8a9ugbi7ORNvX+XZrxs/8MJ4jhtR6YKWmGpHHAAmB1nzargYvSEw2vA34fEY9KOl7SCwDS6eXbgC1pn9uBs1LdyyhuYpU9CmRZfdkDzE3rZwG/KLkfcAR9iYjfALsknZbanQ08XLXPxWn9YuCOUntRKKUvkjqBjwPnR8STI9APKKEvEbE5Ik6JiOkRMZ3iF+wZqX1L9SWt387I/+w/a6Tuhh8rC8WTCj+neKLhk6lsMbA4rQu4ItVvBjpS+akUTz48BGzt3TeefarhmxS/ZB8AzmrhvrwB2JDq7gdePZr7kupmAxVgE8UP7Imp/CTgbookdzcwuYX7soPi+vjGtFzVqn3pc/xHGIGnmEr8d2nKz37v4qE2zMwsy5eYzMwsywnCzMyynCDMzCzLCcLMzLKcIMzMLMsJwlqCpMMqRoDdIumm9MbsYPbvSqNhdpUV40iQdJ2kJ3vfM0llX1IDo5ZK+sQA9XdqhEZxtdbgx1ytJUj6Q0Q8P63fAGyIiC80sN+YiDgk6QmgPdKgZ43ud2RRDz9J11G8Zfu5iPimpOdRvLcwGZgdETVfoqr+HvYpF8XvgqfLidpalc8grBX9EPiz9Mb2tZLWp0HO5gNIem86y/g2xYCBq4Hjgfsl/ZWkl0q6W8WY/HdLekna7zpJX5D0feDytH2lpO9L2ilpbvq8bekXNWm/K1XMbbFV0meqyh+R9BlJD6iYG+Plqfz5kr6WyjZJ+stU/jZJP0ntb5LU75d5shL4q7T+ZuBeivGTej/3PZJ+ms64viKpTdI/AeNT2Q2Spqd+fJniBaxpKd6T0zEuSrE9JOkbR/jvZ
a1qJN/K8+JlqAvwh/R1DMWQFh8CPgu8J5VPoniL9XjgvRRDLEzuu39a/zZwcVp/P3B7Wr8O+A7QVrW9iuIN2PnAE8B/pvjDagPFX+z0fg7FnAD3AH+Rth8BPpzW/xa4Oq1fDnyxKp4TKcbX+QFwfCr7OPCpzPfhOuBdFENynwh8lWLokkfSMf489W9sav9l4KLM92A6xbg+r6sq6z3GK4DtpDeQGaE3xL2MvmXMAPnDbLQYL2ljWv8hxSiwPwbOl/SxVH4c8JK0fldE7K1xrNcD70zr36CY+KfXTRFxuGr72xERkjYDv42IzQCStlL8kt0I/FcVQ5qPAV4EzKIYMgGeHZhwQ9VnvpVirB4AIuJ3kt6R9ru3uOLDOOAntb4Z6bgLgNcCH6wqPxt4NbA+HWc8tQcR/FUU8xL0dRZwc6TLVXW+j3aUc4KwVnEgImZXF6Rr538ZEdv7lL+WYsjkRlXfiOu7X+89i6er1nu3x0iaAXwMeE36RX8dRaLqu/9hnv15E/mhoO+KiIUNxryK4tLQ9RHxtJ4dAVqprJF5Nmp9j3Lx2THI9yCsla0BPpwSBZJOb3C/H/PsX/DvBn50BDGcQPGL9veSXkgx/8VA1gJLejdUzGV9H3CmpD9LZRPS6J1ZEfFrinkPvtyn6m7gXZJOSceZLOmlqe4pSWMbiO9uirOik3qP0cA+dhRygrBW9r+AscAmSVvSdiM+ArxP0ibgb4CPDjWAKGYte5Bi1NprKW4YD+QfgBPTI7sPAW+JiB6KeycrU1z3AS8f4LO/EhG/7FP2MPD3FDfnNwF3UVz2AlhB8b2qOxlQRGwF/hFYl+Ib8GkxOzr5MVczM8vyGYSZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWX9fwmPi4R/kYmPAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Current\n", + "my_disparity_metric=custom_difference1\n", + "my_performance_metric=false_positive_rate\n", + "\n", + "xs = [my_performance_metric(Y_test, y_pred) for y_pred in predictions_by_estimator.values()]\n", + "ys = [my_disparity_metric(Y_test, y_pred, sensitive_features=A_test['Race']) \n", + " for y_pred in predictions_by_estimator.values()]\n", + "\n", + "plt.scatter(xs,ys)\n", + "plt.xlabel('Performance Metric')\n", + "plt.ylabel('Disparity Metric')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "\n", + "# Would also reuse the definition of custom_difference1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run sklearn.model_selection.cross_validate\n", + "\n", + "Use demographic parity and precision score as the metrics" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "6513 6513 6513\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import cross_validate\n", + "from sklearn.metrics import make_scorer, precision_score\n", + "\n", + "print(len(A_test['Race']), len(Y_test), len(X_test))" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Array sensitive_features is not the same size as y_true", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mscoring\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;34m'prec'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[0mprecision_scorer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'dp'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[0mdp_scorer\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mclf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msvm\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSVC\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkernel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'linear'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mC\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mrandom_state\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcross_validate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mY_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscoring\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mscoring\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mscores\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36minner_f\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 70\u001b[0m FutureWarning)\n\u001b[0;32m 71\u001b[0m 
\u001b[0mkwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0mk\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0marg\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mk\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0marg\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msig\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 72\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 73\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0minner_f\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 74\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36mcross_validate\u001b[1;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)\u001b[0m\n\u001b[0;32m 246\u001b[0m \u001b[0mreturn_times\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreturn_estimator\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mreturn_estimator\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 247\u001b[0m error_score=error_score)\n\u001b[1;32m--> 248\u001b[1;33m for train, test in cv.split(X, y, groups))\n\u001b[0m\u001b[0;32m 249\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 250\u001b[0m \u001b[0mzipped_scores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mzip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0mscores\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, iterable)\u001b[0m\n\u001b[0;32m 1002\u001b[0m \u001b[1;31m# remaining jobs.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1003\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mFalse\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1004\u001b[1;33m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdispatch_one_batch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1005\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_original_iterator\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1006\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36mdispatch_one_batch\u001b[1;34m(self, 
iterator)\u001b[0m\n\u001b[0;32m 833\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[1;32mFalse\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 834\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 835\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_dispatch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtasks\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 836\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 837\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m_dispatch\u001b[1;34m(self, batch)\u001b[0m\n\u001b[0;32m 752\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_lock\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 753\u001b[0m \u001b[0mjob_idx\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 754\u001b[1;33m \u001b[0mjob\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply_async\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 755\u001b[0m \u001b[1;31m# A job can complete so quickly than its callback is\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 756\u001b[0m \u001b[1;31m# called before we get here, causing self._jobs to\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\_parallel_backends.py\u001b[0m in \u001b[0;36mapply_async\u001b[1;34m(self, func, callback)\u001b[0m\n\u001b[0;32m 207\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mapply_async\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 208\u001b[0m \u001b[1;34m\"\"\"Schedule a func to be run\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 209\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mImmediateResult\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 210\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 211\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\_parallel_backends.py\u001b[0m in 
\u001b[0;36m__init__\u001b[1;34m(self, batch)\u001b[0m\n\u001b[0;32m 588\u001b[0m \u001b[1;31m# Don't delay the application, to avoid keeping the input\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 589\u001b[0m \u001b[1;31m# arguments in memory\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 590\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mresults\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbatch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 591\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 592\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 254\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mparallel_backend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_n_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 255\u001b[0m return [func(*args, **kwargs)\n\u001b[1;32m--> 256\u001b[1;33m for func, args, kwargs in self.items]\n\u001b[0m\u001b[0;32m 257\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 258\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 254\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mparallel_backend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_n_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 255\u001b[0m return [func(*args, **kwargs)\n\u001b[1;32m--> 256\u001b[1;33m for func, args, kwargs in self.items]\n\u001b[0m\u001b[0;32m 257\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 258\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36m_fit_and_score\u001b[1;34m(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)\u001b[0m\n\u001b[0;32m 558\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 559\u001b[0m \u001b[0mfit_time\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0mtime\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mstart_time\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 560\u001b[1;33m \u001b[0mtest_scores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_score\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 561\u001b[0m \u001b[0mscore_time\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mstart_time\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mfit_time\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 562\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36m_score\u001b[1;34m(estimator, X_test, y_test, scorer)\u001b[0m\n\u001b[0;32m 605\u001b[0m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 606\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 607\u001b[1;33m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 608\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 609\u001b[0m error_msg = (\"scoring must return a number, got %s (%s) \"\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\metrics\\_scorer.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, estimator, *args, **kwargs)\u001b[0m\n\u001b[0;32m 86\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mscorer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_BaseScorer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 87\u001b[0m score = scorer._score(cached_call, estimator,\n\u001b[1;32m---> 88\u001b[1;33m *args, **kwargs)\n\u001b[0m\u001b[0;32m 89\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[0mscore\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\metrics\\_scorer.py\u001b[0m in \u001b[0;36m_score\u001b[1;34m(self, method_caller, estimator, X, y_true, 
sample_weight)\u001b[0m\n\u001b[0;32m 211\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 212\u001b[0m return self._sign * self._score_func(y_true, y_pred,\n\u001b[1;32m--> 213\u001b[1;33m **self._kwargs)\n\u001b[0m\u001b[0;32m 214\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 215\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_disparities.py\u001b[0m in \u001b[0;36mdemographic_parity_difference\u001b[1;34m(y_true, y_pred, sensitive_features, sample_weight)\u001b[0m\n\u001b[0;32m 25\u001b[0m \"\"\"\n\u001b[0;32m 26\u001b[0m return selection_rate_difference(\n\u001b[1;32m---> 27\u001b[1;33m y_true, y_pred, sensitive_features=sensitive_features, sample_weight=sample_weight)\n\u001b[0m\u001b[0;32m 28\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 29\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, y_true, y_pred, sensitive_features, **metric_params)\u001b[0m\n\u001b[0;32m 166\u001b[0m \u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 167\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 168\u001b[1;33m **metric_params))\n\u001b[0m\u001b[0;32m 169\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 170\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, y_true, y_pred, sensitive_features, **metric_params)\u001b[0m\n\u001b[0;32m 134\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 135\u001b[0m \u001b[0mindexed_params\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_indexed_params\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 136\u001b[1;33m **metric_params)\n\u001b[0m\u001b[0;32m 137\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 138\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36mgroup_summary\u001b[1;34m(metric_function, y_true, y_pred, sensitive_features, indexed_params, **metric_params)\u001b[0m\n\u001b[0;32m 51\u001b[0m \"\"\"\n\u001b[0;32m 52\u001b[0m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_true'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_pred'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 53\u001b[1;33m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_true'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'sensitive_features'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 
54\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 55\u001b[0m \u001b[1;31m# Make everything a numpy array\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m_check_array_sizes\u001b[1;34m(a, b, a_name, b_name)\u001b[0m\n\u001b[0;32m 262\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mb\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0ma_name\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mb_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 263\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 264\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0m_MESSAGE_SIZE_MISMATCH\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb_name\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0ma_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 265\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 266\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", + "\u001b[1;31mValueError\u001b[0m: Array sensitive_features is not the same size as y_true" + ] + } + ], + "source": [ + "# Current\n", + "\n", + "precision_scorer = make_scorer(precision_score)\n", + "dp_scorer = make_scorer(demographic_parity_difference, sensitive_features=A_test['Race'])\n", + "\n", + "scoring = {'prec':precision_scorer, 'dp':dp_scorer}\n", + "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n", + "scores = cross_validate(clf, X_test, Y_test, scoring=scoring)\n", + "scores" ] }, { From 13069d506fe5395e5a564b78d1428624453801cf Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Wed, 9 Sep 2020 10:41:03 -0400 Subject: [PATCH 25/42] Put in remaining comparisons Signed-off-by: Richard Edgar --- api/Metrics API Samples.ipynb | 292 ++++++++++++---------------------- 1 file changed, 103 insertions(+), 189 deletions(-) diff --git a/api/Metrics API Samples.ipynb b/api/Metrics API Samples.ipynb index 4c4651c..be77330 100644 --- a/api/Metrics API Samples.ipynb +++ b/api/Metrics API Samples.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -15,7 +15,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -24,7 +24,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -35,7 +35,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -49,21 +49,9 @@ }, { "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "ename": "AttributeError", - "evalue": "'numpy.ndarray' object has no attribute 'reset_index'", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mAttributeError\u001b[0m 
Traceback (most recent call last)", - "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[0mA_train\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mA_train\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 12\u001b[0m \u001b[0mX_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 13\u001b[1;33m \u001b[0mY_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mY_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 14\u001b[0m \u001b[0mA_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mA_test\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset_index\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 15\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'reset_index'" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X_scaled, \n", @@ -93,7 +81,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -105,7 +93,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -124,7 +112,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -143,30 +131,9 @@ }, { "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Amer-Indian-Eskimo 0.923077\n", - "Asian-Pac-Islander 0.840796\n", - "Black 0.914826\n", - "Other 0.851064\n", - "White 0.826492\n", - "dtype: float64\n", - "=======================\n", - "Amer-Indian-Eskimo 0.923077\n", - "Asian-Pac-Islander 0.840796\n", - "Black 0.914826\n", - "Other 0.851064\n", - "White 0.826492\n", - "overall 0.836481\n", - "dtype: float64\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", "bunch = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", @@ -198,30 +165,9 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " accuracy f1\n", - "Amer-Indian-Eskimo 0.923077 0.666667\n", - "Asian-Pac-Islander 0.840796 0.652174\n", - "Black 0.914826 0.550000\n", - "Other 0.851064 0.000000\n", - "White 0.826492 0.612800\n", - "=======================\n", - " accuracy f1\n", - "Amer-Indian-Eskimo 0.923077 0.666667\n", - "Asian-Pac-Islander 0.840796 0.652174\n", - "Black 0.914826 0.550000\n", - "Other 0.851064 0.000000\n", - "White 0.826492 0.612800\n", - "overall 0.836481 0.610033\n" - ] - } - 
], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", "bunch1 = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", @@ -258,40 +204,9 @@ }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Amer-Indian-Eskimo-female 0.937500\n", - "Amer-Indian-Eskimo-male 0.916667\n", - "Asian-Pac-Islander-female 0.879310\n", - "Asian-Pac-Islander-male 0.825175\n", - "Black-female 0.962382\n", - "Black-male 0.866667\n", - "Other-female 0.909091\n", - "Other-male 0.833333\n", - "White-female 0.917824\n", - "White-male 0.785510\n", - "dtype: float64\n", - "=======================\n", - "Amer-Indian-Eskimo-female 0.937500\n", - "Amer-Indian-Eskimo-male 0.916667\n", - "Asian-Pac-Islander-female 0.879310\n", - "Asian-Pac-Islander-male 0.825175\n", - "Black-female 0.962382\n", - "Black-male 0.866667\n", - "Other-female 0.909091\n", - "Other-male 0.833333\n", - "White-female 0.917824\n", - "White-male 0.785510\n", - "overall 0.836481\n", - "dtype: float64\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", "sf = A_test['Race']+'-'+A_test['Sex'] # User builds new column manually\n", @@ -326,23 +241,9 @@ }, { "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " logreg svm\n", - "Custom difference 1 0.688073 0.769231\n", - "Custom difference 2 -0.673229 -0.704020\n", - "Demographic parity difference 0.166402 0.196253\n", - "Worst-case balanced accuracy 0.476190 0.476190\n", - "FPR 0.066734 0.054803\n", - "FNR 0.468750 0.462372\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", "fb_s = lambda y_t, y_p: fbeta_score(y_t, y_p, beta=0.5)\n", @@ -402,7 +303,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -412,22 +313,9 @@ }, { "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYgAAAEGCAYAAAB/+QKOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAdDUlEQVR4nO3df5CdVZ3n8ffHTjIkKAmBxtUkmjATcaI7BmyjFmoURBrXEHCsnWR0QN01xiEqbJkx0Slda2eswXZdnSkEIyComBS/icqYMIxERcHcQMgPYjRm0XSC0lSMjCFrSPjuH89puHSfe/t2p5++fZPPq+qpfp5zzvPc7+mk+9vPr3MUEZiZmfX1vGYHYGZmo5MThJmZZTlBmJlZlhOEmZllOUGYmVnWmGYHMJxOPvnkmD59erPDMDNrGRs2bHg8ItpzdUdVgpg+fTqVSqXZYZiZtQxJv6pV50tMZmaW5QRhZmZZThBmZpZVaoKQ1Clpu6QdkpZl6pdK2piWLZIOS5os6bSq8o2SnpB0aZmxmpnZc5V2k1pSG3AFcA7QDayXtDoiHu5tExFdQFdqPw+4LCL2AnuB2VXH2Q3cVlasZmbWX5lPMc0BdkTETgBJq4D5wMM12i8EVmbKzwZ+GRE177Qfidsf3E3Xmu3s2XeAF08az9JzT+OC06eU8VFmZi2lzEtMU4BdVdvdqawfSROATuCWTPUC8omjd99FkiqSKj09PYMK8PYHd7P81s3s3neAAHbvO8DyWzdz+4O7B3UcM7OjUZkJQpmyWmOLzwPuTZeXnj2ANA44H7ip1odExIqI6IiIjvb27LseNXWt2c6Bpw4/p+zAU4fpWrN9UMcxMzsalZkguoFpVdtTgT012tY6SzgPeCAifjvMsQGwZ9+BQZWbmR1LykwQ64GZkmakM4EFwOq+jSRNBOYCd2SOUeu+xLB48aTxgyo3MzuWlJYgIuIQsARYA2wDboyIrZIWS1pc1fRCYG1E7K/eP92XOAe4tawYl557GuPHtj2nbPzYNpaee1pZH2lm1jJ0NE052tHREYMdi8lPMZnZsUzShojoyNUdVYP1DcUFp09xQjAzy/BQG2ZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaWVWqCkNQpabukHZKWZeqXStqYli2SDkuanOomSbpZ0s8kbZP0+jJjNTOz5yotQUhqA64AzgNmAQslzapuExFdETE7ImYDy4F1EbE3VX8J+F5EvBx4FcW0pWZmNkLKPIOYA+yIiJ0RcRBYBcyv034hsBJA0gnAm4BrACLiYETsKzFWMzPro8wEMQXYVbXdncr6kTQB6ARuSUWnAj3A1yQ9KOlqScfX2HeRpIqkSk9Pz/BFb2Z2jCszQShTFjXazgPurbq8NAY4A7gyIk4H9gP97mEARMSKiOiIiI729vYjjdnMzJIyE0Q3MK1qeyqwp0bbBaTLS1X7dkfE/Wn7ZoqEYWZmI6TMBLEemClphqRxFElgdd9GkiYCc4E7essi4jfALkmnpaKzgYdLjNXMzPoYU9aBI+KQpCXAGqANuDYitkpanOqvSk0vBNZGxP4+h/gwcENKLjuB95UVq5mZ9aeIWrcFWk9HR0dUKpVmh2Fm1jIkbYiIjlyd36Q2M7MsJwgzM8tygjAzsywnCDMzy3KCMDOzLCcIMzPLcoIwM7MsJwgzM8tygjAzsywnCDMzy3KCMDOzLCcIMzPLcoIwM7MsJwgzM8tygjAzs6xSE4SkTknbJe2Q1G9OaUlLJW1MyxZJhyVNTnWPSNqc6jzJg5nZCCttRjlJbcAVwDkUc0yvl7Q6Ip6ZOjQiuoCu1H4ecFlE7K06zFsi4vGyYjQzs9rKPIOYA+yIiJ0RcRBYBcyv034hsLLEeMzMbBDKTBBTgF1V292prB9JE4BO4Jaq4gDWStogaVGtD5G0SFJFUqWnp2cYwjYzMyg3QShTVmsC7HnAvX0uL50ZEWcA5wGXSHpTbseIWBERHRHR0d7efmQRm5nZM8pMEN3AtKrtqcCeGm0X0OfyUkTsSV8fA26juGRlZmYjpMwEsR6YKWmGpHEUSWB130aSJgJzgTuqyo6X9ILedeBtwJYSYzUzsz5Ke4opIg5JWgKsAdqAayNiq6TFqf6q1PRCYG1E7K/a/YXAbZJ6Y/xWRHyvrFjNzKw/RdS6LdB6Ojo6olLxKxNmZo2StCEiOnJ1fpPazMyynCDMzCzLCcLMzLKcIMzMLMsJwszMspwgzMwsywnCzMyynCDMzCzLCcLMzLIGTBCSLpE0qWr7REl/W2pUZmbWdI2cQXwgIvb1bkTE74APlBaRmZmNCo0kiOcpjZoHz0wlOq68kMzMbDRoZDTXNcCNkq6imPBnMeCRVc3MjnKNJIiPAx8EPkQxS9xa4OoygzIzs+YbMEFExNPAlWkxM7NjRM17EJJuTF83S9rUd2nk4JI6JW2XtEPSskz9Ukkb07JF0mFJk6vq2yQ9KOk7Q+mcmZkNXb0ziI+mr+8YyoHTzewrgHMo5qdeL2l1RDzc2yYiuoCu1H4ecFlE7O0TwzbghKHEYGZmQ1fzDCIiHk2/5K+JiF/1XRo49hxgR0TsjIiDwCpgfp32C4GVvRuSpgL/Bd/vMDNrirqPuUbEYeBJSROHcOwpwK6q7e5U1o+kCUAncEtV8ReBvwOervchkhZJqkiq9PT0DCFMMzPLaeQppv8HbJZ0F7C/tzAiPjLAfsqU1ZoAex5wb+/lJUnvAB6LiA2S3lzvQyJiBbACijmpB4jJzMwa1EiC+G5aqjXyi7gbmFa1PRXYU6PtAqouLwFnAudLejtwHHCCpG9GxHsa+FwzMxsGjSSISRHxpeoCSR+t1bjKemCmpBnAbook8Nd9G6XLV3OBZ375R8RyYHmqfzPwMScHM7OR1chQGxdnyt470E4RcQhYQvEm9jbgxojYKmmxpMVVTS8E1kbE/txxzMysORSRv1okaSHFX/xvAH5YVfUC4HBEvLX88Aano6MjKpVKs8MwM2sZkjZEREeurt4lph8DjwInA/+7qvw/gIZelDMzs9ZVM0Gkdx1+Bbxe0kuBmRHxb5LGA+MpEoWZmR2lGpkw6APAzcBXUtFU4PYSYzIzs1GgkZvUl1A8dvoEQET8AjilzKDMzKz5GkkQf0xDZQAgaQyNvQdhZmYtrJEEsU7SJ4Dxks4BbgK+XW5YZmbWbI0kiGVAD7CZYuKgO4G/LzMoMzNrvkYnDPpqWszM7BhRM0EMNClQRPzF8IdjZmajRb0ziKcpbkZ/i+Kew4ERicjMzEaFehMGzaaYxOf5FEniH4FXALsbnDDIzMxa2EATBv0sIj4dEWdQnEV8HbhsRCIzM7OmqnuTWtIUimG6LwR+R5EcbhuBuMzMrMnq3aReRzFy640Uw3vvTVXjJE3unf3NzMyOTvXOIF5KcZP6g8CiqnKl8lNLjMvMzJqs3miu00cwDjMzG2UaeZN6yCR1StouaYekZZn6pZ
I2pmWLpMOSJks6TtJPJT0kaaukz5QZp5mZ9VdagpDUBlwBnAfMAhZKmlXdJiK6ImJ2eqR2ObAu3dv4I3BWRLwKmA10SnpdWbGamVl/ZZ5BzAF2RMTONBrsKmB+nfYLgZUAUfhDKh+bFo8ga2Y2ghqZMOjzkl4xhGNPAXZVbXenstxnTAA6gVuqytokbQQeA+6KiPtr7LtIUkVSpaenZwhhmplZTiNnED8DVki6X9JiSRMbPLYyZbXOAuYB91Y/OhsRh9Olp6nAHEmvzO0YESsioiMiOtrb2xsMzczMBjJggoiIqyPiTOAiYDqwSdK3JL1lgF27gWlV21OBPTXaLiBdXsp8/j7gHoozDDMzGyEN3YNIN5xfnpbHgYeA/yFpVZ3d1gMzJc2QNI4iCazOHHsiMBe4o6qsXdKktD4eeCvFmYyZmY2QAeeDkPQFiktA/w58NiJ+mqoul7S91n4RcUjSEmAN0AZcGxFbJS1O9VelphcCayNif9XuLwKuT4npecCNEfGdQfbNzMyOgCLqPxwk6f3Aqoh4MlM3MSJ+X1Zwg9XR0RGVSqXZYZiZtQxJGyKiI1fXyCWmd/dNDpLuBhhNycHMzIZXvcH6jgMmACdLOpFnn0o6AXjxCMRmZmZNVO8exAeBSymSwQNV5U9QvCFtZmZHsXqD9X0J+JKkD0fEv4xgTGZmNgrUu8R0VkT8O7Bb0jv71kfEraVGZmZmTVXvEtNcikdb52XqAnCCMDM7itW7xPRpSc8D/jUibhzBmMzMbBSo+5hrRDwNLBmhWMzMbBRp5D2IuyR9TNK0NJnPZEmTS4/MzMyaasChNoD3p6+XVJV5Tmozs6PcgAkiImaMRCBmZja6NHIGQZqLYRZwXG9ZRHy9rKDMzKz5GhnN9dPAmykSxJ0Uc0z/CHCCMDM7ijVyk/pdwNnAbyLifcCrgD8pNSozM2u6RhLEgfS46yFJJ1DMEe0b1GZmR7lGEkQlze72VWADxcB9P627RyKpU9J2STskLcvUL5W0MS1bJB1Oj9FOk/R9SdskbZX00cF0yszMjtyAEwY9p7E0HTghIjY10LYN+DlwDsX81OuBhRHxcI3284DLIuIsSS8CXhQRD0h6AUViuqDWvr08YZCZ2eDUmzCo0aeY3gm8geL9hx8BAyYIYA6wIyJ2pmOsAuYDtX7JLwRWAkTEo8Cjaf0/JG0DptTZ18zMhtmAl5gkfRlYDGwGtgAflNTIfBBTgF1V292pLPcZE4BO4JZM3XTgdOD+GvsuklSRVOnp6WkgLDMza0QjZxBzgVdGuhYl6XqKZDEQZcpqXc+aB9wbEXufcwDp+RRJ49KIeCK3Y0SsAFZAcYmpgbjMzKwBjdyk3g68pGp7Go1dYupObXtNBfbUaLuAdHmpl6SxFMnhBs89YWY28hpJECcB2yTdI+keivsA7ZJWS1pdZ7/1wExJMySNo0gC/dpLmkhxlnJHVZmAa4BtEfGFhntjZmbDppFLTJ8ayoEj4pCkJcAaoA24NiK2Slqc6q9KTS8E1kbE/qrdzwT+BtgsaWMq+0RE3DmUWMzMbPAGfMxV0vGkl+UkvQx4OcUkQk+NRICD4cdczcwGp95jro1cYvoBcJykKcDdwPuA64YvPDMzG40aSRCKiCeBdwL/EhEXAq8oNywzM2u2hhKEpNcD7wa+m8raygvJzMxGg0YSxKXAcuC2dJP5VOD7pUZlZmZN18iMcuuAdVXbO4GPlBmUmZk1X80EIemLEXGppG+TeQM6Is4vNTIzM2uqemcQ30hfPz8SgZiZ2ehSM0FExIb0dZ2k9rTu0fDMzI4RNW9Sq/A/JT0O/Az4uaQeSUN6s9rMzFpLvaeYLqUY8uI1EXFSRJwIvBY4U9JlIxGcmZk1T70EcRHFDHD/t7cgPcH0nlRnZmZHsXoJYmxEPN63MN2HGFteSGZmNhrUSxAHh1hnZmZHgXqPub5KUm4WNwHHlRSPmZmNEvUec/V4S2Zmx7BGxmIaMkmdkrZL2iFpWaZ+qaSNadki6bCkyanuWkmPSdpSZoxmZpZXWoKQ1AZcAZwHzAIWSppV3SYiuiJidkTMphgQcF1E7E3V1wGdZcVnZmb1lXkGMQfYERE7I+IgsAqYX6f9QmBl70ZE/ADYW7u5mZmVqcwEMQXYVbXdncr6kTSB4mzhlhLjMTOzQSgzQShTVmsC7HnAvVWXlxr/EGmRpIqkSk+Ph4oyMxsuZSaIbmBa1fZUYE+Ntguourw0GBGxIiI6IqKjvb19KIcwM7OMMhPEemCmpBmSxlEkgdV9G0maCMwF7igxFjMzG6TSEkREHAKWAGuAbcCNacrSxZIWVzW9EFgbEfur95e0EvgJcJqkbkn/raxYzcysP0XUui3Qejo6OqJSqTQ7DDOzliFpQ0R05OpKfVHOzMxalxOEmZllOUGYmVmWE4SZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWWVmiAkdUraLmmHpGWZ+qWSNqZli6TDkiY3sq+ZmZWrtAQhqQ24AjgPmAUslDSruk1EdEXE7IiYDSwH1kXE3kb2NTOzcpV5BjEH2BEROyPiILAKmF+n/UJg5RD3NTOzYVZmgpgC7Kra7k5l/UiaAHQCtwxh30WSKpIqPT09Rxy0mZkVykwQypRFjbbzgHsjYu9g942IFRHREREd7e3tQwjTzMxyykwQ3cC0qu2pwJ4abRfw7OWlwe5rZmYlKDNBrAdmSpohaRxFEljdt5GkicBc4I7B7mtmZuUZU9aBI+KQpCXAGqANuDYitkpanOqvSk0vBNZGxP6B9i0rVjMz608RtW4LtJ6Ojo6oVCrNDsPMrGVI2hARHbk6v0ltZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZllOEGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpblBGFmZlmlJghJnZK2S9ohaVmNNm+WtFHSVknrqso/KmlLKr+0zDjNzKy/0maUk9QGXAGcQzHH9HpJqyPi4ao2k4AvA50R8WtJp6TyVwIfAOYAB4HvSfpuRPyirHjNzOy5yjyDmAPsiIidEXEQWAXM79Pmr4FbI+LXABHxWCr/c+C+iHgyIg4B6yimJjUzsxFSZoKYAuyq2u5OZdVeBpwo6R5JGyRdlMq3AG+SdJKkCcDbgWm5D5G0SFJFUqWnp2eYu2Bmduwq7RIToExZ3wmwxwCvBs4GxgM/kXRfRGyTdDlwF/AH4CHgUO5DImIFsAKKOamHKXYzs2NemQmim+f+1T8V2JNp83hE7Af2S/oB8Crg5xFxDXANgKTPprZmZpbc/uBuutZsZ8++A7x40niWnnsaF5ze90LN0JV5iWk9MFPSDEnjgAXA6j5t7gDeKGlMupT0WmAbQNUN65cA7wRWlhirmVlLuf3B3Sy/dTO79x0ggN37DrD81s3c/uDuYfuM0s4gIuKQpCXAGqANuDYitkpanOqvSpeSvgdsAp4Gro6IL
ekQt0g6CXgKuCQifldWrGZmraZrzXYOPHX4OWUHnjpM15rtw3YWUeYlJiLiTuDOPmVX9dnuAroy+76xzNjMzFrZnn0HBlU+FH6T2sysBb140vhBlQ+FE4SZWQtaeu5pjB/b9pyy8WPbWHruacP2GaVeYjIzs3L03mco8ykmJwgzsxZ1welThjUh9OVLTGZmluUEYWZmWU4QZmaW5QRhZmZZThBmZpaliKNnAFRJPcCvmh3HIJ0MPN7sIIaJ+zI6uS+j02jpy0sjoj1XcVQliFYkqRIRHc2OYzi4L6OT+zI6tUJffInJzMyynCDMzCzLCaL5VjQ7gGHkvoxO7svoNOr74nsQZmaW5TMIMzPLcoIwM7MsJ4hhJqlT0nZJOyQty9RL0j+n+k2Szqiqe0TSZkkbJVX67PfhdNytkj7Xqn2RNFvSfb3lkua0QF8mSbpZ0s8kbZP0+lQ+WdJdkn6Rvp7Ywn3pSmWbJN0maVKr9qWq/mOSQtLJrdyXZvzsPyMivAzTQjH39i+BU4FxwEPArD5t3g78KyDgdcD9VXWPACdnjvsW4N+AP0nbp7RwX9YC51Xtf08L9OV64L+n9XHApLT+OWBZWl8GXN7CfXkbMCatX97KfUnb04A1FC/O9vt/2Cp9acbPfvXiM4jhNQfYERE7I+IgsAqY36fNfODrUbgPmCTpRQMc90PAP0XEHwEi4rHhDjyjrL4EcEJanwjsGc6gaxhyXySdALwJuAYgIg5GxL6qfa5P69cDF5TbDaCkvkTE2og4lPa/D5jaqn1J/g/wdxT/30ZCWX1pxs/+M5wghtcUYFfVdncqa7RNAGslbZC0qKrNy4A3Srpf0jpJrxnmuHPK6sulQJekXcDngeXDGXQNR9KXU4Ee4GuSHpR0taTjU5sXRsSjAOnrKWUE32CcjbSp15dq76f4S7dspfRF0vnA7oh4qLTI+yvr36UZP/vPcIIYXsqU9f0Lpl6bMyPiDOA84BJJb0rlY4ATKU5LlwI3SsodZziV1ZcPAZdFxDTgMtJfTSU7kr6MAc4AroyI04H9FJeTmqXUvkj6JHAIuOHIQx3QsPdF0gTgk8CnhjPQBpT179KMn/1nOEEMr26Ka5+9ptL/EkrNNhHR+/Ux4DaK09befW5Np6Y/BZ6mGOirTGX15WLg1rR+U1V5mY6kL91Ad0Tcn8pvpvhhBvht7yW19HUkTv/L6guSLgbeAbw70gXvkpXRlz8FZgAPSXoktX9A0n8a9ugbi7ORNvX+XZrxs/8MJ4jhtR6YKWmGpHHAAmB1nzargYvSEw2vA34fEY9KOl7SCwDS6eXbgC1pn9uBs1LdyyhuYpU9CmRZfdkDzE3rZwG/KLkfcAR9iYjfALsknZbanQ08XLXPxWn9YuCOUntRKKUvkjqBjwPnR8STI9APKKEvEbE5Ik6JiOkRMZ3iF+wZqX1L9SWt387I/+w/a6Tuhh8rC8WTCj+neKLhk6lsMbA4rQu4ItVvBjpS+akUTz48BGzt3TeefarhmxS/ZB8AzmrhvrwB2JDq7gdePZr7kupmAxVgE8UP7Imp/CTgbookdzcwuYX7soPi+vjGtFzVqn3pc/xHGIGnmEr8d2nKz37v4qE2zMwsy5eYzMwsywnCzMyynCDMzCzLCcLMzLKcIMzMLMsJwlqCpMMqRoDdIumm9MbsYPbvSqNhdpUV40iQdJ2kJ3vfM0llX1IDo5ZK+sQA9XdqhEZxtdbgx1ytJUj6Q0Q8P63fAGyIiC80sN+YiDgk6QmgPdKgZ43ud2RRDz9J11G8Zfu5iPimpOdRvLcwGZgdETVfoqr+HvYpF8XvgqfLidpalc8grBX9EPiz9Mb2tZLWp0HO5gNIem86y/g2xYCBq4Hjgfsl/ZWkl0q6W8WY/HdLekna7zpJX5D0feDytH2lpO9L2ilpbvq8bekXNWm/K1XMbbFV0meqyh+R9BlJD6iYG+Plqfz5kr6WyjZJ+stU/jZJP0ntb5LU75d5shL4q7T+ZuBeivGTej/3PZJ+ms64viKpTdI/AeNT2Q2Spqd+fJniBaxpKd6T0zEuSrE9JOkbR/jvZa1qJN/K8+JlqAvwh/R1DMWQFh8CPgu8J5VPoniL9XjgvRRDLEzuu39a/zZwcVp/P3B7Wr8O+A7QVrW9iuIN2PnAE8B/pvjDagPFX+z0fg7FnAD3AH+Rth8BPpzW/xa4Oq1fDnyxKp4TKcbX+QFwfCr7OPCpzPfhOuBdFENynwh8lWLokkfSMf489W9sav9l4KLM92A6xbg+r6sq6z3GK4DtpDeQGaE3xL2MvmXMAPnDbLQYL2ljWv8hxSiwPwbOl/SxVH4c8JK0fldE7K1xrNcD70zr36CY+KfXTRFxuGr72xERkjYDv42IzQCStlL8kt0I/FcVQ5qPAV4EzKIYMgGeHZhwQ9VnvpVirB4AIuJ3kt6R9ru3uOLDOOAntb4Z6bgLgNcCH6wqPxt4NbA+HWc8tQcR/FUU8xL0dRZwc6TLVXW+j3aUc4KwVnEgImZXF6Rr538ZEdv7lL+WYsjkRlXfiOu7X+89i6er1nu3x0iaAXwMeE36RX8dRaLqu/9hnv15E/mhoO+KiIUNxryK4tLQ9RHxtJ4dAVqprJF5Nmp9j3Lx2THI9yCsla0BPpwSBZJOb3C/H/PsX/DvBn50BDGcQPGL9veSXkgx/8VA1gJLejdUzGV9H3CmpD9LZRPS6J1ZEfFrinkPvtyn6m7gXZJOSceZLOmlqe4pSWMbiO9uirOik3qP0cA+dhRygrBW9r+AscAmSVvSdiM+ArxP0ibgb4CPDjWAKGYte5Bi1NprKW4YD+QfgBPTI7sPAW+JiB6KeycrU1z3AS8f4LO/EhG/7FP2MPD3FDfnNwF3UVz2AlhB8b2qOxlQRGwF/hFYl+Ib8GkxOzr5MVczM8vyGYSZmWU5QZiZWZYThJmZZTlBmJlZlhOEmZllOUGYmVmWE4SZmWX9fwmPi4R/kYmPAAAAAElFTkSuQmCC\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", "my_disparity_metric=custom_difference1\n", @@ -465,71 +353,97 @@ }, { "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "6513 6513 6513\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", - "from sklearn.metrics import make_scorer, precision_score\n", - "\n", - "print(len(A_test['Race']), len(Y_test), len(X_test))" + "from sklearn.metrics import make_scorer, precision_score" ] }, { "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "Array sensitive_features is not the same size as y_true", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[0mscoring\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;34m'prec'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[0mprecision_scorer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'dp'\u001b[0m\u001b[1;33m:\u001b[0m\u001b[0mdp_scorer\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mclf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msvm\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSVC\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mkernel\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'linear'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mC\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mrandom_state\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcross_validate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mY_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mscoring\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mscoring\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mscores\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36minner_f\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 70\u001b[0m FutureWarning)\n\u001b[0;32m 71\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m{\u001b[0m\u001b[0mk\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0marg\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mk\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0marg\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msig\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 72\u001b[1;33m \u001b[1;32mreturn\u001b[0m 
\u001b[0mf\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 73\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0minner_f\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 74\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36mcross_validate\u001b[1;34m(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)\u001b[0m\n\u001b[0;32m 246\u001b[0m \u001b[0mreturn_times\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreturn_estimator\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mreturn_estimator\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 247\u001b[0m error_score=error_score)\n\u001b[1;32m--> 248\u001b[1;33m for train, test in cv.split(X, y, groups))\n\u001b[0m\u001b[0;32m 249\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 250\u001b[0m \u001b[0mzipped_scores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mzip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0mscores\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, iterable)\u001b[0m\n\u001b[0;32m 1002\u001b[0m \u001b[1;31m# remaining jobs.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1003\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mFalse\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1004\u001b[1;33m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdispatch_one_batch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1005\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_iterating\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_original_iterator\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1006\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36mdispatch_one_batch\u001b[1;34m(self, iterator)\u001b[0m\n\u001b[0;32m 833\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[1;32mFalse\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 834\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 835\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_dispatch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtasks\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 836\u001b[0m \u001b[1;32mreturn\u001b[0m 
\u001b[1;32mTrue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 837\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m_dispatch\u001b[1;34m(self, batch)\u001b[0m\n\u001b[0;32m 752\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_lock\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 753\u001b[0m \u001b[0mjob_idx\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 754\u001b[1;33m \u001b[0mjob\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply_async\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mbatch\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 755\u001b[0m \u001b[1;31m# A job can complete so quickly than its callback is\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 756\u001b[0m \u001b[1;31m# called before we get here, causing self._jobs to\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\_parallel_backends.py\u001b[0m in \u001b[0;36mapply_async\u001b[1;34m(self, func, callback)\u001b[0m\n\u001b[0;32m 207\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mapply_async\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 208\u001b[0m \u001b[1;34m\"\"\"Schedule a func to be run\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 209\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mImmediateResult\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 210\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 211\u001b[0m \u001b[0mcallback\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\_parallel_backends.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, batch)\u001b[0m\n\u001b[0;32m 588\u001b[0m \u001b[1;31m# Don't delay the application, to avoid keeping the input\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 589\u001b[0m \u001b[1;31m# arguments in memory\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 590\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mresults\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0mbatch\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 591\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 592\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 254\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mparallel_backend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_n_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 255\u001b[0m return [func(*args, **kwargs)\n\u001b[1;32m--> 256\u001b[1;33m for func, args, kwargs in self.items]\n\u001b[0m\u001b[0;32m 257\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 258\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\joblib\\parallel.py\u001b[0m in \u001b[0;36m\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 254\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mparallel_backend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_backend\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_n_jobs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 255\u001b[0m return [func(*args, **kwargs)\n\u001b[1;32m--> 256\u001b[1;33m for func, args, kwargs in self.items]\n\u001b[0m\u001b[0;32m 257\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 258\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m__len__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36m_fit_and_score\u001b[1;34m(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)\u001b[0m\n\u001b[0;32m 558\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 559\u001b[0m \u001b[0mfit_time\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mstart_time\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 560\u001b[1;33m \u001b[0mtest_scores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_score\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[1;33m,\u001b[0m 
\u001b[0mscorer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 561\u001b[0m \u001b[0mscore_time\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mstart_time\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mfit_time\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 562\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\model_selection\\_validation.py\u001b[0m in \u001b[0;36m_score\u001b[1;34m(estimator, X_test, y_test, scorer)\u001b[0m\n\u001b[0;32m 605\u001b[0m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 606\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 607\u001b[1;33m \u001b[0mscores\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mX_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_test\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 608\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 609\u001b[0m error_msg = (\"scoring must return a number, got %s (%s) \"\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\metrics\\_scorer.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, estimator, *args, **kwargs)\u001b[0m\n\u001b[0;32m 86\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mscorer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_BaseScorer\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 87\u001b[0m score = scorer._score(cached_call, estimator,\n\u001b[1;32m---> 88\u001b[1;33m *args, **kwargs)\n\u001b[0m\u001b[0;32m 89\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[0mscore\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mscorer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\appdata\\local\\continuum\\miniconda3\\envs\\py-36\\lib\\site-packages\\sklearn\\metrics\\_scorer.py\u001b[0m in \u001b[0;36m_score\u001b[1;34m(self, method_caller, estimator, X, y_true, sample_weight)\u001b[0m\n\u001b[0;32m 211\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 212\u001b[0m return self._sign * self._score_func(y_true, y_pred,\n\u001b[1;32m--> 213\u001b[1;33m **self._kwargs)\n\u001b[0m\u001b[0;32m 214\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 215\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - 
"\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_disparities.py\u001b[0m in \u001b[0;36mdemographic_parity_difference\u001b[1;34m(y_true, y_pred, sensitive_features, sample_weight)\u001b[0m\n\u001b[0;32m 25\u001b[0m \"\"\"\n\u001b[0;32m 26\u001b[0m return selection_rate_difference(\n\u001b[1;32m---> 27\u001b[1;33m y_true, y_pred, sensitive_features=sensitive_features, sample_weight=sample_weight)\n\u001b[0m\u001b[0;32m 28\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 29\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, y_true, y_pred, sensitive_features, **metric_params)\u001b[0m\n\u001b[0;32m 166\u001b[0m \u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 167\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 168\u001b[1;33m **metric_params))\n\u001b[0m\u001b[0;32m 169\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 170\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, y_true, y_pred, sensitive_features, **metric_params)\u001b[0m\n\u001b[0;32m 134\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 135\u001b[0m \u001b[0mindexed_params\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_indexed_params\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 136\u001b[1;33m **metric_params)\n\u001b[0m\u001b[0;32m 137\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 138\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36mgroup_summary\u001b[1;34m(metric_function, y_true, y_pred, sensitive_features, indexed_params, **metric_params)\u001b[0m\n\u001b[0;32m 51\u001b[0m \"\"\"\n\u001b[0;32m 52\u001b[0m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_true'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_pred'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 53\u001b[1;33m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_true\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msensitive_features\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'y_true'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'sensitive_features'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 54\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 55\u001b[0m \u001b[1;31m# Make everything a numpy array\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;32mc:\\users\\riedgar\\source\\repos\\fairlearn\\fairlearn\\metrics\\_metrics_engine.py\u001b[0m in \u001b[0;36m_check_array_sizes\u001b[1;34m(a, b, a_name, b_name)\u001b[0m\n\u001b[0;32m 262\u001b[0m 
\u001b[1;32mdef\u001b[0m \u001b[0m_check_array_sizes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mb\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0ma_name\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mb_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 263\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 264\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0m_MESSAGE_SIZE_MISMATCH\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb_name\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0ma_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 265\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 266\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", - "\u001b[1;31mValueError\u001b[0m: Array sensitive_features is not the same size as y_true" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "# Current\n", - "\n", "precision_scorer = make_scorer(precision_score)\n", - "dp_scorer = make_scorer(demographic_parity_difference, sensitive_features=A_test['Race'])\n", + "\n", + "y_t = pd.Series(Y_test)\n", + "def dpd_wrapper(y_t, y_p, sensitive_features):\n", + " # We need to slice up the sensitive feature to match y_t and y_p\n", + " # See Adrin's reply to:\n", + " # https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function\n", + " sf_slice = sensitive_features.loc[y_t.index.values].values.reshape(-1)\n", + " return demographic_parity_difference(y_t, y_p, sensitive_features=sf_slice)\n", + "dp_scorer = make_scorer(dpd_wrapper, sensitive_features=A_test['Race'])\n", "\n", "scoring = {'prec':precision_scorer, 'dp':dp_scorer}\n", "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n", - "scores = cross_validate(clf, X_test, Y_test, scoring=scoring)\n", + "scores = cross_validate(clf, X_test, y_t, scoring=scoring)\n", "scores" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "\n", + "# Would be the same, until Adrin's SLEP/PR are accepted to help with input slicing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TASK 7: Run GridSearchCV\n", + "\n", + "Use demographic parity and precision score where the goal is to find the lowest-error model whose demographic parity is <= 0.05." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Current\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "param_grid = [\n", + " {'C': [1, 10, 100, 1000], 'kernel': ['linear']},\n", + " {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},\n", + " ]\n", + "scoring = {'prec':precision_scorer, 'dp':dp_scorer}\n", + "\n", + "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n", + "\n", + "gscv = GridSearchCV(clf, param_grid=param_grid, scoring=scoring, refit='prec', verbose=1)\n", + "gscv.fit(X_test, y_t)\n", + "\n", + "print(\"Best parameters set found on development set:\") \n", + "print(gscv.best_params_)\n", + "print(\"Best score:\", gscv.best_score_)\n", + "print()\n", + "print(\"Overall results\")\n", + "print(gscv.cv_results_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Proposed\n", + "\n", + "# Would be the same, until Adrin's SLEP/PR are accepted to help with input slicing" + ] + }, { "cell_type": "code", "execution_count": null, @@ -554,7 +468,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.10" + "version": "3.7.3" } }, "nbformat": 4, From 2e10c46889df594ed6ed999d75bf8f9055d94af4 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 10 Sep 2020 11:19:33 -0400 Subject: [PATCH 26/42] Starting to change over to constructor method etc. Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 47 ++++++++++++++---------------------------- 1 file changed, 15 insertions(+), 32 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 2278184..535d56e 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -49,43 +49,26 @@ We also provide some wrappers for common metrics from SciKit-Learn: ### Proposed Change -We do not intend to change the API invoked by the user. -What will change is the return type. -Rather than a `Bunch`, we will return a `GroupedMetric` object, which can offer richer functionality. - -At this basic level, there is only a slight change to the results seen by the user. -There are still properties `overall` and `by_groups`, with the same semantics. -However, the `by_groups` result is now a Pandas Series, and we also provide a `metric_` property to store a reference to the underlying metric: +We propose to introduce a new object, the `GroupedMetric` (name discussion below). +Users will compute metrics by passing arguments into the constructor: ```python ->>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) ->>> result.metric_ - ->>> result.overall +>>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) + +>>> metrics.overall 0.4 ->>> result.by_groups -B 0.6536 -C 0.2130 -Name: accuracy_score dtype: float64 ->>> print(type(result.by_groups)) - +>>> metrics.by_group + accuracy_score +B 0.4 +C 0.8 ``` -Constructing the name of the Series could be an issue. -In the example above, it is the name of the underlying metric function. -Something as short as the `__name__` could end up being ambiguous, but using the `__module__` property to disambiguate might not match the user's expectations: +The `by_group` property is a Pandas DataFrame, with the column name set to the `__name__` property of the given metric function, and the rows set to the unique values of the `sensitive_feature=` argument. 
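
Purely as an illustration of the intended behaviour (a sketch only, not the proposed implementation; the class name `GroupedMetricSketch` and its internals are invented here), the two properties could be computed along these lines:
```python
import numpy as np
import pandas as pd
import sklearn.metrics as skm


class GroupedMetricSketch:
    """Illustrative sketch only: exposes `overall` and `by_group` as DataFrames."""

    def __init__(self, metric_function, y_true, y_pred, *, sensitive_features):
        name = metric_function.__name__
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        sf = pd.Series(sensitive_features)

        # Single row, one column named after the metric function
        self.overall = pd.DataFrame({name: [metric_function(y_true, y_pred)]})

        # One row per unique value of the sensitive feature
        grouped = {}
        for group in sorted(sf.unique()):
            mask = (sf == group).to_numpy()
            grouped[group] = metric_function(y_true[mask], y_pred[mask])
        self.by_group = pd.DataFrame({name: grouped})


# Toy data, invented for the sketch
y_true = [0, 1, 0, 0, 1, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
A_1 = ['B', 'C', 'C', 'B', 'B', 'C', 'B', 'C']

m = GroupedMetricSketch(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1)
print(m.overall)
print(m.by_group)
```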
+ +Any extra arguments for the metric function would be passed via a `params=` dictionary (and an `indexed_params=` list): ```python ->>> import sklearn.metrics as skm ->>> skm.accuracy_score.__name__ -'accuracy_score' ->>> skm.accuracy_score.__qualname__ -'accuracy_score' ->>> skm.accuracy_score.__module__ -'sklearn.metrics._classification' +>>> acc_params = { 'sample_weight': weights, 'normalize':False } +>>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, + indexed_params=['sample_weight'], params=acc_params) ``` -We are seeing some of the actual internal structure here of SciKit-Learn, and the user might not be expecting that. - -We would continue to provide convenience wrappers such as `accuracy_score_group_summary` for users, and support passing through arguments along with `indexed_params`. -There is little advantage to the change at this point. -This will change in the next section. ## Obtaining Scalars From 8296f3abf0abe0046f5d4ba0e3c18ff010f6a4da Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 10 Sep 2020 11:42:18 -0400 Subject: [PATCH 27/42] More on methods Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 92 ++++++++++-------------------------------- 1 file changed, 21 insertions(+), 71 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 535d56e..45129fd 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -58,8 +58,8 @@ Users will compute metrics by passing arguments into the constructor: 0.4 >>> metrics.by_group accuracy_score -B 0.4 -C 0.8 +B 0.6536 +C 0.213 ``` The `by_group` property is a Pandas DataFrame, with the column name set to the `__name__` property of the given metric function, and the rows set to the unique values of the `sensitive_feature=` argument. @@ -69,6 +69,7 @@ Any extra arguments for the metric function would be passed via a `params=` dict >>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, indexed_params=['sample_weight'], params=acc_params) ``` +We would _not_ provide the basic wrappers such as `accuracy_score_group_summary()`. ## Obtaining Scalars @@ -87,83 +88,32 @@ We provide methods for turning the `Bunch`es returned from `group_summary()` int ``` We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_min()` for user convenience. -One point which these helpers lack (although it could be added) is the ability to select alternative values for measuring the difference and ratio. -For example, the user might not be interested in the difference between the maximum and minimum, but the difference from the overall value. -Or perhaps the difference from a particular group. - ### Proposed Change -The `GroupedMetric` object would have methods for calculating the required scalars. -First, let us consider the differences. 
- -We would provide operations to calculate differences in various ways (all of these results are a Pandas Series): -```python ->>> result.differences() -B 0.0 -C 0.4406 -Name: TBD dtype: float64 ->>> result.differences(relative_to='min') -B -0.4406 -C 0.0 -Name: TBD dtype: float64 ->>> result.differences(relative_to='min', abs=True) -B 0.4406 -C 0.0 -Name: TBD dtype: float64 ->>> result.differences(relative_to='overall') -B -0.2436 -C 0.1870 -Name: TBD dtype: float64 ->>> result.differences(relative_to='overall', abs=True) -B 0.2436 -C 0.1870 -Name: TBD dtype: float64 ->>> result.differences(relative_to='group', group='C', abs=True) -B 0.4406 -C 0.0 -Name: TBD dtype: float64 -``` -The arguments introduced so far for the `differences()` method: -- `relative_to=` to decide the common point for the differences. Possible values are `'max'` (the default), `'min'`, `'overall'` and `'group'` -- `group=` to select a group name, only when `relative_to` is set to `'group'`. Default is `None` -- `abs` to indicate whether to take the absolute value of each entry (defaults to false) - -The user could then use the Pandas methods `max()` and `min()` to reduce these Series objects to scalars. -However, this will run into issues where the `relative_to` argument ends up pointing to either the maximum or minimum group, which will have a difference of zero. -That could then be the maximum or minimum value of the set of difference, but probably won't be what the user wants. +The functionality of the `group_max_from_summary()` and `group_min_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`. +Providing top level methods for these seems redundant. -To address this case, we should add an extra argument `aggregate=` to the `differences()` method: +For `difference_from_summary()` and `ratio_from_summary()` we propose to add appropriate methods. +First for computing the difference: ```python ->>> result.differences(aggregate='max') -0.4406 ->>> result.differences(relative_to='overall', aggregate='max') -0.1870 ->>> result.differences(relative_to='overall', abs=True, aggregate='max') -0.2436 +>>> metrics.difference() +accuracy_score 0.4406 +dtype: float64 +>>> metrics.difference(relative_to='overall') +accuracy_score 0.2563 # max(abs(0.6536-0.4), abs(0.213-0.4)) +dtype: float64 ``` -If `aggregate=None` (which would be the default), then the result is a Series, as shown above. +Note that the result type is a DataFrame (for reasons which will become clear below), and we are adding a `relative_to=` argument. +This has valid values of `min`, `max` and `overall`. -There would be a similar method called `ratios()` on the `GroupedMetric` object: +We would similarly have a `ratio()` method: ```python ->>> result.ratios() -B 1.0 -C 0.3259 -Name: TBD dtype: float64 +>>> metrics.ratio() +accuracy_score 0.3259 +dtype: float64 +>>> metrics.ratio(relative_to='overall') +accuracy_score 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) ``` -The `ratios()` method will take the following arguments: -- `relative_to=` similar to `differences()` -- `group=` similar to `differences()` -- `ratio_order=` determines how to build the ratio. 
Values are - - `sub_unity` to make larger value the denominator - - `super_unity` to make larger value the numerator - - `from_relative` to make the value specified by `relative_to=` the denominator - - `to_relative` to make the value specified by `relative_to=` the numerator -- `aggregate=` similar to `differences()` - -We would also provide the same wrappers such as `accuracy_score_difference()` but expose the extra arguments discussed here. -One question is whether the default aggregation should be `None` (to match the method), or whether it should default to scalar results similar to the existing methods. - -In the section on Conditional Metrics below, we shall discuss one extra optional argument for `differences()` and `ratios()`. ## Intersections of Sensitive Features From f7557117be5cf3e3d57be0a7d51c92daa9755f48 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 10 Sep 2020 12:02:47 -0400 Subject: [PATCH 28/42] More changes based on prior discussion Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 79 +++++++++++++++++++++--------------------- 1 file changed, 39 insertions(+), 40 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 45129fd..d6e95c8 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -137,14 +137,14 @@ If `sensitive_features=` is a DataFrame (or list of Series.... exact supported t ```python >>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A) >>> result.by_groups -SF 1 SF 2 -B M 0.5 - N 0.7 - P 0.6 -C M 0.4 - N 0.5 - P 0.5 -Name: sklearn.metrics.accuracy_score, dtype: float64 + accuracy_score +SF 1 SF 2 +B M 0.50 + N 0.40 + P 0.55 +C M 0.45 + N 0.70 + P 0.63 ``` If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. Although this example has passed a DataFrame in for `sensitive_features=` we should aim to support lists of Series and `numpy.ndarray` as well. @@ -153,7 +153,8 @@ The `differences()` and `ratios()` methods would act on this Series as before. ## Conditional (or Segmented) Metrics -For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) do not return single values when aggregation is requested in a call to `differences()` or `ratios()` but instead provide one result for each unique value of the specified condition feature(s). +For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) are specified separately from the sensitive features, since for users, they add columns to the result. +Mathematically, they behave like additional sensitive features. ### Existing Syntax @@ -162,32 +163,30 @@ Users would have to devise the required code themselves ### Proposed Change -We propose adding an extra argument to `differences()` and `ratios()`, to provide a `condition_on=` argument. - -Suppose we have a DataFrame, `A_3` with three sensitive features: SF 1, SF 2 and Income Band (the latter having values 'Low' and 'High'). -This could represent a loan scenario where decisions can be based on income, but within the income bands, other sensitive groups must be treated equally. -When `differences()` is invoked with `condition_on=`, the result will not be a scalar, but a Series. 
-A user might make calls: -```python ->>> result = accuracy_score_group_summary(y_true, y_test, sensitive_features=A_3) ->>> result.differences(aggregate=min, condition_on='Income Band') -Income Band -Low 0.3 -High 0.4 -Name: TBD, dtype: float64 -``` -We can also allow `condition_on=` to be a list of names: +The `GroupedMetric` constructor will need an additional argument `conditional_features=` to specify the conditional features. +It will accept similar types to the `sensitive_features=` argument. +Suppose we have another column called `income_level` with unique values 'Low' and 'High' ```python ->>> result.differences(aggregate=min, condition_on=['Income Band', 'SF 1']) -Income Band SF 1 -Low B 0.3 -Low C 0.35 -High B 0.4 -High C 0.5 +>>> metric = GroupedMetric(skm.accuracy_score, y_true, y_pred, + sensitive_features=A_1, + conditional_features=income_level) +>>> metric.overall +high 0.6 +low 0.4 +dtype: float64 +>>> metric.by_group + accuracy_score + high low +B 0.40 0.50 +C 0.55 0.65 ``` -There may be demand for allowing the sensitive features to be supplied as a `numpy.ndarray` or even a list of `Series` (similar to how the `sensitive_features=` argument may not be a DataFrame). -To support this, `condition_on=` would need to allow integers (and lists of integers) as inputs, to index the columns. -If the user is specifying a list for `condition_on=` then we should probably be nice and detect cases where a feature is listed twice (especially if we're allowing both names and column indices). +The `overall` property is now a Pandas series, indexed by the conditional feature values. +Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) conditional feature. + +Note that it is possible to have multiple sensitive features, and multiple conditional features. +Operations such as `.max()` and `.differences()` will act on each column. +Furthermore, the `relative_to=` argument for `.differences()` and `.ratios()` will be relative to the +relevant value for each column. ## Multiple Metrics @@ -201,16 +200,16 @@ Users would have to devise their own method ### Proposed Change We allow a list of metric functions in the call to group summary. -Results become DataFrames, with one column for each metric: +The properties then add columns to their DataFrames: ```python ->>> result = group_summary([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_1) +>>> result = GroupedMetric([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_1) >>> result.overall - sklearn.metrics.accuracy_score sklearn.metrics.precision_score - 0 0.3 0.5 + accuracy_score precision_score + 0 0.3 0.5 >>> result.by_groups - sklearn.metrics.accuracy_score sklearn.metrics.precision_score -'B' 0.4 0.7 -'C' 0.6 0.75 + accuracy_score precision_score +'B' 0.4 0.7 +'C' 0.6 0.75 ``` This should generalise to the other methods described above. 
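
To make that generalisation concrete, a column-wise reduction in pandas would give one scalar per metric; the frame and numbers below are invented purely for illustration:
```python
import pandas as pd

# Hypothetical by_group frame for two metrics (toy numbers, not real results)
by_group = pd.DataFrame(
    {'accuracy_score': [0.4, 0.6], 'precision_score': [0.70, 0.75]},
    index=['B', 'C'])

# A max-minus-min 'difference' applied column by column,
# giving one value per metric rather than a single scalar
print(by_group.max() - by_group.min())
# accuracy_score     0.20
# precision_score    0.05
# dtype: float64
```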
From 0f28fe30dfedb98bd648a535b21edbc2c557ffe4 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 10 Sep 2020 12:04:14 -0400 Subject: [PATCH 29/42] Another correction Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index d6e95c8..16c59df 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -219,9 +219,9 @@ For example, for `index_params=` we would have: ```python indexed_params = [['sample_weight'], ['sample_weight']] ``` -In the `**kwargs` a single `extra_args=` argument would be accepted (although not required), which would contain the individual `**kwargs` for each metric: +Similarly, the `params=` argument would become a list of dictionaries: ```python -extra_args = [ +params = [ { 'sample_weight': [1,2,1,1,3, ...], 'normalize': False From 2ea2037b334af7cd50d9dbd647c4112059b3c574 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Fri, 11 Sep 2020 09:53:48 -0400 Subject: [PATCH 30/42] Some fixes to remove `group_summary()` (not yet complete) Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 27 +++++++++------------------ 1 file changed, 9 insertions(+), 18 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 16c59df..19d869c 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -214,7 +214,7 @@ The properties then add columns to their DataFrames: This should generalise to the other methods described above. One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. -A possible solution is to have lists, with indices corresponding to the list of functions supplied to `group_summary()` +A possible solution is to have lists, with indices corresponding to the list of functions supplied to the `GroupedMetric` constructor. For example, for `index_params=` we would have: ```python indexed_params = [['sample_weight'], ['sample_weight']] @@ -236,24 +236,15 @@ If users had a lot of functions with a lot of custom arguments, this could get e ## Naming -The names `group_summary()` and `GroupedMetric` are not necessarily inspired, and there may well be better alternatives. -Changes to these would ripple throughout the module, so agreeing on these is an important first step. - -Some possibilities for the function: - - `group_summary()` - - `metric_by_groups()` - - `calculate_group_metric()` - - ? - -And for the result object: +The name `GroupedMetric` is not especially inspired. +Some possible alternatives: - `GroupedMetric` - `GroupMetricResult` - `MetricByGroups` - ? Other names are also up for debate. -However, things like the wrappers `accuracy_score_group_summary()` will hinge on the names chosen above. -Arguments such as `index_params=` and `ratio_order=` (along with the allowed values of the latter) are important, but narrower in impact. +Arguments such as `index_params=` and `relative_to=` (along with the allowed values of the latter) are important, but narrower in impact. ## User Conveniences @@ -272,16 +263,16 @@ We would also allow mixtures of strings and functions in the multiple metric cas Throughout this document, we have been describing the case of classification metrics. However, we do not actually require this. It is the underlying metric function which gives meaning to the `y_true` and `y_pred` lists. 
-So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `group_summary()` does not actually care about their datatypes. +So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `GroupedMetric` does not actually care about their datatypes. For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities. Or the user might be working on a regression problem, and both `y_true` and `y_pred` would be floating point numbers (or `y_pred` might even be a tuple of predicted value and error). -So long as the underlying metric understands the data structures, `group_summary()` will not care. +So long as the underlying metric understands the data structures, `GroupedMetric` will not care. -There will be an effect on the `GroupedMetric` result object. +There will be an effect on the `difference()` and `ratio()` methods. Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not. After all, what does "take the ratio of two confusion matrices" even mean? We should try to trap these cases, and throw a meaningful exception (rather than propagating whatever exception happens to emerge from the underlying libraries). -Since we know that `differences()` and `ratios()` will only work when the metric has produced scalar results, which should be a straightforward test using [`isscalar()` from Numpy](https://numpy.org/doc/stable/reference/generated/numpy.isscalar.html). +Since we know that `difference()` and `ratio()` will only work when the metric has produced scalar results, which should be a straightforward test using [`isscalar()` from Numpy](https://numpy.org/doc/stable/reference/generated/numpy.isscalar.html). ## Pitfalls @@ -299,7 +290,7 @@ If we implement the convenience strings-for-functions piece mentioned above, the We could even generate the argument ourselves if the user does not specify it. However, this risks tying Fairlearn to particular versions of SciKit-Learn. -Unfortunately, the generality of `group_summary()` means that we cannot solve this for the user. +Unfortunately, the generality of `GroupedMetric` means that we cannot solve this for the user. It cannot even tell if it is evaluating a classification or regression problem. ## The Wrapper Functions From 41bb325193d9d3d2816e5707c45cea9756bebef6 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Fri, 11 Sep 2020 16:35:26 -0400 Subject: [PATCH 31/42] Some extensive edits Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 108 ++++++++++++++++++++++++++--------------- 1 file changed, 70 insertions(+), 38 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 19d869c..3ec883d 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -26,7 +26,9 @@ Here we have shown binary values for a simple classification problem, but they c Our basic method is `group_summary()` ```python ->>> result = flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) +>>> result = flm.group_summary(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1) >>> print(result) {'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}} >>> print(type(result)) @@ -37,13 +39,18 @@ Note that the `by_group` key accesses another `Bunch`. 
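
For example, the following self-contained snippet (building the `Bunch` by hand rather than calling `group_summary()`, values copied from the output above) shows the two access styles side by side:
```python
from sklearn.utils import Bunch

# Hand-built stand-in for the result shown above, purely for illustration
result = Bunch(overall=0.4, by_group=Bunch(B=0.6536, C=0.213))

print(result['overall'] == result.overall)   # True: dictionary and attribute access agree
print(result.by_group['B'])                  # 0.6536
print(result['by_group'].C)                  # 0.213 - the nested Bunch behaves the same way
```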
We allow for sample weights (and other arguments which require slicing) via `indexed_params`, and passing through other arguments to the underlying metric function (in this case, `normalize`): ```python ->>> flm.group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, indexed_params=['sample_weight'], sample_weight=weights, normalize=False) +>>> flm.group_summary(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1, + indexed_params=['sample_weight'], + sample_weight=weights, normalize=False) {'overall': 20, 'by_group': {'B': 60, 'C': 21}} ``` We also provide some wrappers for common metrics from SciKit-Learn: ```python ->>> flm.accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_1) +>>> flm.accuracy_score_group_summary(y_true, y_pred, + sensitive_features=A_1) {'overall': 0.4, 'by_group': {'B': 0.6536, 'C': 0.213}} ``` @@ -52,22 +59,29 @@ We also provide some wrappers for common metrics from SciKit-Learn: We propose to introduce a new object, the `GroupedMetric` (name discussion below). Users will compute metrics by passing arguments into the constructor: ```python ->>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1) +>>> metrics = GroupedMetrics(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1) >>> metrics.overall -0.4 + accuracy_score +0 0.4 >>> metrics.by_group accuracy_score B 0.6536 C 0.213 ``` -The `by_group` property is a Pandas DataFrame, with the column name set to the `__name__` property of the given metric function, and the rows set to the unique values of the `sensitive_feature=` argument. +Both properties are now Pandas DataFrames, with the column name set to the `__name__` property of the given metric function. +The rows of the `by_group` property are set to the unique values of the `sensitive_feature=` argument. Any extra arguments for the metric function would be passed via a `params=` dictionary (and an `indexed_params=` list): ```python >>> acc_params = { 'sample_weight': weights, 'normalize':False } ->>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, - indexed_params=['sample_weight'], params=acc_params) +>>> metrics = GroupedMetrics(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1, + indexed_params=['sample_weight'], + params=acc_params) ``` We would _not_ provide the basic wrappers such as `accuracy_score_group_summary()`. @@ -124,7 +138,8 @@ To achieve this, users currently need to write something along the lines of: ```python >>> A_combined = A['SF 1'] + '-' + A['SF 2'] ->>> accuracy_score_group_summary(y_true, y_pred, sensitive_features=A_combined) +>>> accuracy_score_group_summary(y_true, y_pred, + sensitive_features=A_combined) { 'overall': 0.4, by_groups : { 'B-M':0.4, 'B-N':0.5, 'B-P':0.5, 'C-M':0.5, 'C-N': 0.6, 'C-P':0.7 } } ``` This is unecessarily cumbersome. @@ -135,7 +150,9 @@ It is also possible that some combinations might not appear in the data (especia If `sensitive_features=` is a DataFrame (or list of Series.... exact supported types are TBD), we can generate our results in terms of a MultiIndex. 
Using the `A` DataFrame defined above, a user might write: ```python ->>> result = group_summary(skm.accuracy_score, y_true, y_pred, sensitive_features=A) +>>> result = group_summary(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A) >>> result.by_groups accuracy_score SF 1 SF 2 @@ -151,7 +168,7 @@ Although this example has passed a DataFrame in for `sensitive_features=` we sho The `differences()` and `ratios()` methods would act on this Series as before. -## Conditional (or Segmented) Metrics +## Conditional Metrics For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) are specified separately from the sensitive features, since for users, they add columns to the result. Mathematically, they behave like additional sensitive features. @@ -167,20 +184,22 @@ The `GroupedMetric` constructor will need an additional argument `conditional_fe It will accept similar types to the `sensitive_features=` argument. Suppose we have another column called `income_level` with unique values 'Low' and 'High' ```python ->>> metric = GroupedMetric(skm.accuracy_score, y_true, y_pred, +>>> metric = GroupedMetric(skm.accuracy_score, + y_true, y_pred, sensitive_features=A_1, conditional_features=income_level) >>> metric.overall -high 0.6 -low 0.4 + accuracy_score +high 0.46 +low 0.61 dtype: float64 >>> metric.by_group - accuracy_score + accuracy_score high low B 0.40 0.50 C 0.55 0.65 ``` -The `overall` property is now a Pandas series, indexed by the conditional feature values. +The `overall` property now has rows corresponding to the unique values of the conditional feature(s). Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) conditional feature. Note that it is possible to have multiple sensitive features, and multiple conditional features. @@ -188,6 +207,19 @@ Operations such as `.max()` and `.differences()` will act on each column. Furthermore, the `relative_to=` argument for `.differences()` and `.ratios()` will be relative to the relevant value for each column. +As a final note, it would also be possible to put the conditional features into the rows, at a 'higher' level than the sensitive features. +The resultant DataFrame would look like: +```python +>>> metric.by_group + accuracy_score +high B 0.40 + C 0.55 +low B 0.50 + C 0.65 +``` +This might be more natural for some purposes, and we have not finally decided on which pattern to use. +However, the `.stack()` and `.unstack()` methods of DataFrame can be used to flip between them. + ## Multiple Metrics Finally, we can also allow for the evaluation of multiple metrics at once. @@ -202,7 +234,9 @@ Users would have to devise their own method We allow a list of metric functions in the call to group summary. The properties then add columns to their DataFrames: ```python ->>> result = GroupedMetric([skm.accuracy_score, skm.precision_score], y_true, y_pred, sensitive_features=A_1) +>>> result = GroupedMetric([skm.accuracy_score, skm.precision_score], + y_true, y_pred, + sensitive_features=A_1) >>> result.overall accuracy_score precision_score 0 0.3 0.5 @@ -248,13 +282,16 @@ Arguments such as `index_params=` and `relative_to=` (along with the allowed val ## User Conveniences -In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to `group_summary()` to be represented by a string. 
+In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to the `GroupedMetric` constructor to be represented by a string. We would provide a mapping of strings to suitable functions. -This would make the following all equivalent: +This would make the following equivalent: ```python ->>> r1 = group_summary(sklearn.accuracy_score, y_true, y_pred, sensitive_features=A_1) ->>> r2 = group_summary('accuracy_score', y_true, y_pred, sensitive_features=A_1) ->>> r3 = accuracy_score_group_summary( y_true, y_pred, sensitive_features=A_1) +>>> r1 = GroupedMetric(sklearn.accuracy_score, + y_true, y_pred, + sensitive_features=A_1) +>>> r2 = group_summary('accuracy_score', + y_true, y_pred, + sensitive_features=A_1) ``` We would also allow mixtures of strings and functions in the multiple metric case. @@ -293,25 +330,20 @@ However, this risks tying Fairlearn to particular versions of SciKit-Learn. Unfortunately, the generality of `GroupedMetric` means that we cannot solve this for the user. It cannot even tell if it is evaluating a classification or regression problem. -## The Wrapper Functions +## Convenience Functions -In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_group_min()`. -These wrappers allow the metrics to be passed to SciKit-Learn subroutines such as `make_scorer()`, and they all accept arguments for both the aggregation (as described above) and the underlying metric. +We currently provide functions for evaluating common fairness metrics (where `X` can be `ratio` or `difference`): -We also provide wrappers for specific fairness metrics used in the literature such `demographic_parity_difference()` and `equalized_odds_difference()` (although even then we should add the extra `relative_to=` and `group=` arguments). +- `demographic_parity_X()` +- `true_positive_rate_X()` +- `false_positive_rate_X()` +- `equalized_odds_X()` +We will continue to provide these wrappers, based on `GroupedMetric` objects internally. -## Methods or Functions +## Support for `make_scorer()` -Since the `GroupMetric` object contains no private members, it is not clear that it needs to be its own object. -We could continue to use a `Bunch` but make the `group_by` entry/property return a Pandas Series (which would embed all the other information we might need). -In the multiple metric case, we would still return a single `Bunch` but the properties would both be DataFrames. -The question is whether users prefer: -```python ->>> diff = group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A).difference(aggregate='max') -``` -or -```python ->>> diff = difference(group_summary(skm.recall_score, y_true, y_pred, sensitive_features=A), aggregate='max') -``` + +In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_group_min()`. +These wrappers allow the metrics to be passed to SciKit-Learn subroutines such as `make_scorer()`, and they all accept arguments for both the aggregation (as described above) and the underlying metric. 
From 205c2394ce19613abfc9355f100ba5f9ea903d97 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Fri, 11 Sep 2020 16:59:40 -0400 Subject: [PATCH 32/42] Add make_grouped_scorer() Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 3ec883d..aac9be3 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -343,7 +343,28 @@ We will continue to provide these wrappers, based on `GroupedMetric` objects int ## Support for `make_scorer()` - - -In the above, we have assumed that we will provide both `group_summary()` and wrappers such as `accuracy_score_group_summary()`, `accuracy_score_difference()`, `accuracy_score_ratio()` and `accuracy_score_group_min()`. -These wrappers allow the metrics to be passed to SciKit-Learn subroutines such as `make_scorer()`, and they all accept arguments for both the aggregation (as described above) and the underlying metric. +SciKit-Learn has the concept of 'scorer' functions, which take arguments of `y_true` and `y_pred` and return a scalar score. +These are used in higher level algorithms such as `GridSearchCV`. +In order to use these in combination with metrics which can use other arguments (such as the `normalize=` argument on `accuracy_score` above), SciKit-Learn has a `make_scorer()` function , which takes a metric function, along with a list of other arguments, and binds them together to provide a function which just accepts `y_true` and `y_pred` arguments, but will invoke the underlying metric function with the specified extra arguments. +The higher level algorithms take folds of the input data, and ask the generated scoring function to evaluate these. + +There is one problem with this: if a user has a per-sample input (such as sample weights), how do we select the correct values to match the fold? +When the generated scorer is invoked, the `y_true` and `y_pred` arrays will be a subset of the `sample_weights` bound into the scorer by `make_scorer()`, so the problem is to work out the subset. +Currently, there is no good way to do this through SciKit-Learn (although a proposed solution is under development). +There is a [work around described by Adrin on StackOverflow](https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function), which relies on DataFrames being sliced 'in-place' by SciKit-Learn. +If all arguments are DataFrames (or Series) when when the generated scorer is invoked, the `index` property of `y_true` can be examined, and used as a mask on the sample weights column (which is bound into the generated scorer). + +Our grouped metrics will always face this problem, since we always have the sensitive feature which will need to be passed along. +We can provide a `make_grouped_scorer()` method with a signature like: +```python +make_grouped_scorer(metric_function, + sensitive_feature, + indexed_params, + params, + disparity_measure='difference', + relative_to='min') +``` +We will only support a single sensitive feature for this, since we need to produce a single scalar result. +The function will verify that it has been passed a Pandas Series or DataFrame, so that the `index` is available. +The `disparity_measure=` argument specifies whether the disparity should be measured via the difference or ratio (corresponding to the methods on the `GroupedMetric` object). 
+Similarly, the `relative_to=` argument can also be set to `overall` - although in this case it is important to note that this will be the overall value for the fold, and *not* the overall value on the entire dataset. From 70d881102fe0aaf1e7482aa12c5772d71e974261 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 14 Sep 2020 09:28:40 -0400 Subject: [PATCH 33/42] Errant group_summary Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index aac9be3..0ec761d 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -289,7 +289,7 @@ This would make the following equivalent: >>> r1 = GroupedMetric(sklearn.accuracy_score, y_true, y_pred, sensitive_features=A_1) ->>> r2 = group_summary('accuracy_score', +>>> r2 = GroupedMetric('accuracy_score', y_true, y_pred, sensitive_features=A_1) ``` From 96703c63770efd4d190075f4a49354dad3f9fadc Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 14 Sep 2020 09:31:16 -0400 Subject: [PATCH 34/42] Add link to SLEP006 Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 1 + 1 file changed, 1 insertion(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 0ec761d..b533522 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -353,6 +353,7 @@ When the generated scorer is invoked, the `y_true` and `y_pred` arrays will be a Currently, there is no good way to do this through SciKit-Learn (although a proposed solution is under development). There is a [work around described by Adrin on StackOverflow](https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function), which relies on DataFrames being sliced 'in-place' by SciKit-Learn. If all arguments are DataFrames (or Series) when when the generated scorer is invoked, the `index` property of `y_true` can be examined, and used as a mask on the sample weights column (which is bound into the generated scorer). +A more general solution is [under discussion within the SciKit-Learn community](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html). Our grouped metrics will always face this problem, since we always have the sensitive feature which will need to be passed along. 
We can provide a `make_grouped_scorer()` method with a signature like: From 752126179909d42843aa50a5dcf3980a2b905639 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 14 Sep 2020 11:31:57 -0400 Subject: [PATCH 35/42] Minor update to notebook Signed-off-by: Richard Edgar --- api/Metrics API Samples.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api/Metrics API Samples.ipynb b/api/Metrics API Samples.ipynb index be77330..28ef9df 100644 --- a/api/Metrics API Samples.ipynb +++ b/api/Metrics API Samples.ipynb @@ -468,7 +468,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.3" + "version": "3.6.10" } }, "nbformat": 4, From f9ed32c2c8db2a9da1616db98dcf8c2ef841489c Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 14 Sep 2020 11:43:04 -0400 Subject: [PATCH 36/42] Some small updates to the proposal Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 49 +++++++++++++++++++++--------------------- 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index b533522..8632bb0 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -64,8 +64,8 @@ Users will compute metrics by passing arguments into the constructor: sensitive_features=A_1) >>> metrics.overall - accuracy_score -0 0.4 + accuracy_score +overall 0.4 >>> metrics.by_group accuracy_score B 0.6536 @@ -104,28 +104,26 @@ We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ ### Proposed Change -The functionality of the `group_max_from_summary()` and `group_min_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`. -Providing top level methods for these seems redundant. +Although the functionality of the `group_max_from_summary()` and `group_min_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`, we will provide `.group_min()` and `.group_max()` methods for completeness. -For `difference_from_summary()` and `ratio_from_summary()` we propose to add appropriate methods. +For `difference_from_summary()` and `ratio_from_summary()` we will add methods to calculate the values as we do now, and also relative to the `overall` value. First for computing the difference: ```python >>> metrics.difference() accuracy_score 0.4406 dtype: float64 ->>> metrics.difference(relative_to='overall') +>>> metrics.difference_to_overall() accuracy_score 0.2563 # max(abs(0.6536-0.4), abs(0.213-0.4)) dtype: float64 ``` -Note that the result type is a DataFrame (for reasons which will become clear below), and we are adding a `relative_to=` argument. -This has valid values of `min`, `max` and `overall`. +Note that the result type is a DataFrame (for reasons which will become clear below). -We would similarly have a `ratio()` method: +We would similarly have `ratio()` and `ratio_to_overall()`methods: ```python >>> metrics.ratio() accuracy_score 0.3259 dtype: float64 ->>> metrics.ratio(relative_to='overall') +>>> metrics.ratio_to_overall() accuracy_score 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) ``` @@ -203,9 +201,8 @@ The `overall` property now has rows corresponding to the unique values of the co Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) conditional feature. Note that it is possible to have multiple sensitive features, and multiple conditional features. 
-Operations such as `.max()` and `.differences()` will act on each column. -Furthermore, the `relative_to=` argument for `.differences()` and `.ratios()` will be relative to the -relevant value for each column. +Operations such as `.group_max()` and `.difference_to_overall()` will act on each column. + As a final note, it would also be possible to put the conditional features into the rows, at a 'higher' level than the sensitive features. The resultant DataFrame would look like: @@ -238,8 +235,8 @@ The properties then add columns to their DataFrames: y_true, y_pred, sensitive_features=A_1) >>> result.overall - accuracy_score precision_score - 0 0.3 0.5 + accuracy_score precision_score +overall 0.3 0.5 >>> result.by_groups accuracy_score precision_score 'B' 0.4 0.7 @@ -278,7 +275,7 @@ Some possible alternatives: - ? Other names are also up for debate. -Arguments such as `index_params=` and `relative_to=` (along with the allowed values of the latter) are important, but narrower in impact. +Arguments such as `index_params=` are important, but narrower in impact. ## User Conveniences @@ -330,16 +327,8 @@ However, this risks tying Fairlearn to particular versions of SciKit-Learn. Unfortunately, the generality of `GroupedMetric` means that we cannot solve this for the user. It cannot even tell if it is evaluating a classification or regression problem. -## Convenience Functions -We currently provide functions for evaluating common fairness metrics (where `X` can be `ratio` or `difference`): -- `demographic_parity_X()` -- `true_positive_rate_X()` -- `false_positive_rate_X()` -- `equalized_odds_X()` - -We will continue to provide these wrappers, based on `GroupedMetric` objects internally. ## Support for `make_scorer()` @@ -369,3 +358,15 @@ We will only support a single sensitive feature for this, since we need to produ The function will verify that it has been passed a Pandas Series or DataFrame, so that the `index` is available. The `disparity_measure=` argument specifies whether the disparity should be measured via the difference or ratio (corresponding to the methods on the `GroupedMetric` object). Similarly, the `relative_to=` argument can also be set to `overall` - although in this case it is important to note that this will be the overall value for the fold, and *not* the overall value on the entire dataset. + + +## Convenience Functions + +We currently provide functions for evaluating common fairness metrics (where `X` can be `ratio` or `difference`): + +- `demographic_parity_X()` +- `true_positive_rate_X()` +- `false_positive_rate_X()` +- `equalized_odds_X()` + +We will continue to provide these wrappers, based on `GroupedMetric` objects internally. From 091c47eadbbb69f23b9a335f7f2f16153f04bacf Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 14 Sep 2020 12:01:04 -0400 Subject: [PATCH 37/42] Add make_derived_metric back in Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 8632bb0..7e61388 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -327,7 +327,38 @@ However, this risks tying Fairlearn to particular versions of SciKit-Learn. Unfortunately, the generality of `GroupedMetric` means that we cannot solve this for the user. It cannot even tell if it is evaluating a classification or regression problem. 
+## Creating Derived Metrics +Rather than a `GroupedMetric` object, users will often want to have a function which yields a scalar. + +### Existing Syntax + +We currently provide a `make_derived_metric()` function which can build a callable object which does this: +```python +fhalf_score = functools.partial(skm.fbeta_score, beta=0.5) + +custom_difference1 = make_derived_metric( + difference_from_summary, + make_metric_group_summary(fhalf_score)) +``` +Notice that we have had to put a wrapper around `fbeta_score` since `beta=` is a required parameter, but we do not support anything beyond `y_true`, `y_pred` and `sensitive_features` as arguments. + +We use this to provide helper functions such as `accuracy_score_difference()` and `accuracy_score_group_min()` + +### Proposed Syntax + +We should be able to provide a function builder of the following form: +```python +fbeta_diff = make_derived_metric( + 'difference', + skm.fbeta_score, + index_params=['sample_weight'] +) + +print(fbeta_diff(y_true, y_pred, sensitive_features=A_1, sample_weight=weights, beta=0.5)) +``` +Since the goal of this function is to produce scalars, we would not support supplying multiple underlying metrics. +The derived metrics would correspond to the various methods described above which compute scalars from the `GroupedMetrics` object. ## Support for `make_scorer()` From 180370bfba622d415e59a6fcc3f36bd196fd39a9 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 17 Sep 2020 12:57:18 -0400 Subject: [PATCH 38/42] Add note about meetings Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 7e61388..0a9a029 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -2,6 +2,9 @@ This is an update for the existing metrics document, which is being left in place for now as a point of comparison. +We are having meetings to discuss this proposal. +Please reach out to `riedgar@microsoft.com` if you would like to join. + ## Assumed data In the following we assume that we have variables of the following form defined: From acfacabbf032ac7426ed98bf6141d780d7646db6 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Mon, 21 Sep 2020 16:06:45 -0400 Subject: [PATCH 39/42] Update after today's discussion Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 110 ++++++++++++++++++++--------------------- 1 file changed, 54 insertions(+), 56 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 0a9a029..8f310ad 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -77,15 +77,16 @@ C 0.213 Both properties are now Pandas DataFrames, with the column name set to the `__name__` property of the given metric function. The rows of the `by_group` property are set to the unique values of the `sensitive_feature=` argument. -Any extra arguments for the metric function would be passed via a `params=` dictionary (and an `indexed_params=` list): +Extra parameters for the metric function are passed in via the `sample_params=` and `params=` arguments. +The `params=` arguments are broadcast, while the `sample_params=` arguments will be sliced along with the `y_true` and `y_pred` arguments in accordance with the sensitive feature vector. 
```python ->>> acc_params = { 'sample_weight': weights, 'normalize':False } >>> metrics = GroupedMetrics(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, - indexed_params=['sample_weight'], - params=acc_params) + sample_params={'sample_weight': weight}, + params={'normalize': False}) ``` +A key which appears in both `sample_params=` and `params=` will be an error. We would _not_ provide the basic wrappers such as `accuracy_score_group_summary()`. ## Obtaining Scalars @@ -109,25 +110,27 @@ We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ Although the functionality of the `group_max_from_summary()` and `group_min_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`, we will provide `.group_min()` and `.group_max()` methods for completeness. -For `difference_from_summary()` and `ratio_from_summary()` we will add methods to calculate the values as we do now, and also relative to the `overall` value. +For differences and ratios, we will provide `.difference()` and `.ratio()` methods. +These will take an optional argument of `method=`, to indicate how the values are to be calculated (defaulting to `minmax`) First for computing the difference: ```python >>> metrics.difference() -accuracy_score 0.4406 + accuracy_score +all 0.4406 dtype: float64 ->>> metrics.difference_to_overall() -accuracy_score 0.2563 # max(abs(0.6536-0.4), abs(0.213-0.4)) +>>> metrics.difference(method='to_overall') +accuracy_score 0.2563 # max(abs(0.6536-0.4), abs(0.213-0.4)) dtype: float64 ``` -Note that the result type is a DataFrame (for reasons which will become clear below). - -We would similarly have `ratio()` and `ratio_to_overall()`methods: +Note that the result type is a Series (for reasons which will become clear below). +The `ratio()` method would behave in a similar way: ```python >>> metrics.ratio() -accuracy_score 0.3259 -dtype: float64 ->>> metrics.ratio_to_overall() -accuracy_score 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) + accuracy_score +all 0.3259 +>>> metrics.ratio(method='to_overall') + accuracy_score +all 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) ``` ## Intersections of Sensitive Features @@ -165,9 +168,10 @@ C M 0.45 P 0.63 ``` If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. +Alternatively, we could skip that entry in the DataFrame - this point is **still TBD**. Although this example has passed a DataFrame in for `sensitive_features=` we should aim to support lists of Series and `numpy.ndarray` as well. -The `differences()` and `ratios()` methods would act on this Series as before. +The `differences()` and `ratios()` methods would act on this DataFrame as before. ## Conditional Metrics @@ -191,34 +195,34 @@ Suppose we have another column called `income_level` with unique values 'Low' an conditional_features=income_level) >>> metric.overall accuracy_score -high 0.46 -low 0.61 +High 0.46 +Low 0.61 dtype: float64 >>> metric.by_group - accuracy_score - high low -B 0.40 0.50 -C 0.55 0.65 + accuracy_score +High B 0.40 + C 0.55 +Low B 0.55 + C 0.65 ``` The `overall` property now has rows corresponding to the unique values of the conditional feature(s). Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) conditional feature. Note that it is possible to have multiple sensitive features, and multiple conditional features. 
-Operations such as `.group_max()` and `.difference_to_overall()` will act on each column. - - -As a final note, it would also be possible to put the conditional features into the rows, at a 'higher' level than the sensitive features. -The resultant DataFrame would look like: +Operations such as `.group_max()` and `.difference()` will act on each combination of conditional feature values, and aggregate across the sensitive features. +So for example ```python ->>> metric.by_group +>>> metric.difference(method='minmax') + accuracy_score +High 0.15 +Low 0.10 +>>> metric.difference(method='overall') accuracy_score -high B 0.40 - C 0.55 -low B 0.50 - C 0.65 +High 0.09 +Low 0.06 ``` -This might be more natural for some purposes, and we have not finally decided on which pattern to use. -However, the `.stack()` and `.unstack()` methods of DataFrame can be used to flip between them. + +If it users found it more convenient to have the conditional features be sub-columns on the metrics, then the `unstack()` method of the pandas DataFrame can be used. ## Multiple Metrics @@ -227,7 +231,7 @@ Finally, we can also allow for the evaluation of multiple metrics at once. ### Existing Syntax This is not supported. -Users would have to devise their own method +Users would have to devise their own means. ### Proposed Change @@ -245,28 +249,21 @@ overall 0.3 0.5 'B' 0.4 0.7 'C' 0.6 0.75 ``` -This should generalise to the other methods described above. +This should generalise to the other methods described above - extra metric functions add extra columns to the resultant DataFrames (this is why we made all results to be DataFrames, even if the actual result was a scalar). -One open question is how extra arguments should be passed to the individual metric functions, including how to handle the `indexed_params=`. -A possible solution is to have lists, with indices corresponding to the list of functions supplied to the `GroupedMetric` constructor. -For example, for `index_params=` we would have: -```python -indexed_params = [['sample_weight'], ['sample_weight']] -``` -Similarly, the `params=` argument would become a list of dictionaries: +When users wish to use the `sample_params=` and `params=` arguments, then they should pass in lists of dictionaries, matching the functions by index: ```python -params = [ - { - 'sample_weight': [1,2,1,1,3, ...], - 'normalize': False - }, - { - 'sample_weight': [1,2,1,1,3, ... ], - 'pos_label' = 'Granted' - } -] +metric_fns = [skm.accuracy_score, skm.precision_score] +sample_params = [{'sample_weight':weight}], [{'sample_weight':weight}]] +params = [{ 'normalize': False }, {'pos_label' = 'Granted'}] +result = GroupedMetric(metric_fns, + y_true, y_pred, + sensitive_features=A_1, + sample_params=sample_params, + params=params) ``` -If users had a lot of functions with a lot of custom arguments, this could get error-prone and difficult to debug. +The length of the lists would be required to match. +This is somewhat repetitious (see the `sample_weight` above), but trying to share some arguments between functions is likely to lead to a worse mess. ## Naming @@ -278,7 +275,7 @@ Some possible alternatives: - ? Other names are also up for debate. -Arguments such as `index_params=` are important, but narrower in impact. +Arguments such as `method='minmax'` are important, but narrower in impact. 
## User Conveniences @@ -355,12 +352,13 @@ We should be able to provide a function builder of the following form: fbeta_diff = make_derived_metric( 'difference', skm.fbeta_score, - index_params=['sample_weight'] + sample_param_names=['sample_weight'] ) print(fbeta_diff(y_true, y_pred, sensitive_features=A_1, sample_weight=weights, beta=0.5)) ``` Since the goal of this function is to produce scalars, we would not support supplying multiple underlying metrics. +We also do not propose to support conditional features at the present time (the appropriate behaviour is not clear). The derived metrics would correspond to the various methods described above which compute scalars from the `GroupedMetrics` object. From f0cdb5cad8c073021d168f54a0239eb549e7e9ad Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 15 Oct 2020 11:01:56 -0400 Subject: [PATCH 40/42] Update to reflect reality Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 232 +++++++++++------------------------------ 1 file changed, 63 insertions(+), 169 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 8f310ad..51f891f 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -59,35 +59,34 @@ We also provide some wrappers for common metrics from SciKit-Learn: ### Proposed Change -We propose to introduce a new object, the `GroupedMetric` (name discussion below). +We propose to introduce a new object, the `MetricFrame` (name discussion below). Users will compute metrics by passing arguments into the constructor: ```python ->>> metrics = GroupedMetrics(skm.accuracy_score, - y_true, y_pred, - sensitive_features=A_1) - +>>> metrics = MetricFrame(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1) + >>> metrics.overall - accuracy_score -overall 0.4 +accuracy_score 0.4 >>> metrics.by_group accuracy_score B 0.6536 C 0.213 ``` -Both properties are now Pandas DataFrames, with the column name set to the `__name__` property of the given metric function. +The `overall` property is a Pandas Series, indexed by the name of the underlying metric. +The `by_group` property is a Pandas DataFrame, with a column named by the underlying metric. The rows of the `by_group` property are set to the unique values of the `sensitive_feature=` argument. -Extra parameters for the metric function are passed in via the `sample_params=` and `params=` arguments. -The `params=` arguments are broadcast, while the `sample_params=` arguments will be sliced along with the `y_true` and `y_pred` arguments in accordance with the sensitive feature vector. +Sample based parameters (such as sample weights) can be passed in using the `sample_params=` +argument: ```python ->>> metrics = GroupedMetrics(skm.accuracy_score, - y_true, y_pred, - sensitive_features=A_1, - sample_params={'sample_weight': weight}, - params={'normalize': False}) +>>> metrics = MetricFrame(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A_1, + sample_params={'sample_weight': weight}) ``` -A key which appears in both `sample_params=` and `params=` will be an error. -We would _not_ provide the basic wrappers such as `accuracy_score_group_summary()`. +If the underlying requires other arguments (such as the `beta=` argument to `fbeta_score()`), +then `functools.partial()` must be used. 
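For example, following the constructor shown above, a metric with a required extra argument could be bound up roughly as follows (a sketch reusing the assumed data):
```python
import functools
import sklearn.metrics as skm

# Fix beta=0.5 so that only y_true and y_pred remain to be supplied
fbeta_05 = functools.partial(skm.fbeta_score, beta=0.5)

metrics = MetricFrame(fbeta_05,
                      y_true, y_pred,
                      sensitive_features=A_1)
```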
## Obtaining Scalars @@ -111,12 +110,14 @@ We also provide wrappers such as `accuracy_score_difference()`, `accuracy_score_ Although the functionality of the `group_max_from_summary()` and `group_min_from_summary()` can be accessed by calling `metrics.by_group.min()` and `metrics.by_group.max()`, we will provide `.group_min()` and `.group_max()` methods for completeness. For differences and ratios, we will provide `.difference()` and `.ratio()` methods. -These will take an optional argument of `method=`, to indicate how the values are to be calculated (defaulting to `minmax`) +These will take an optional argument of `method=`, to indicate how the values are to be calculated. +For now, the valid values of this argument will be `between_groups` (indicating that just the values in the +`by_groups` property should be used) and `to_overall` (indicating that all results should be calculated +relative to the appropriate values in the `overall` property). First for computing the difference: ```python ->>> metrics.difference() - accuracy_score -all 0.4406 +>>> metrics.difference(method='between_groups') +accuracy_score 0.4406 dtype: float64 >>> metrics.difference(method='to_overall') accuracy_score 0.2563 # max(abs(0.6536-0.4), abs(0.213-0.4)) @@ -125,12 +126,12 @@ dtype: float64 Note that the result type is a Series (for reasons which will become clear below). The `ratio()` method would behave in a similar way: ```python ->>> metrics.ratio() - accuracy_score -all 0.3259 +>>> metrics.ratio(method='between_groups') +accuracy_score 0.3259 +dtype: float64 >>> metrics.ratio(method='to_overall') - accuracy_score -all 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) +accuracy_score 0.6120 # min(abs(0.4/0.6536), abs(0.213/0.6536)) +dtype: float64 ``` ## Intersections of Sensitive Features @@ -152,11 +153,11 @@ It is also possible that some combinations might not appear in the data (especia ### Proposed Change -If `sensitive_features=` is a DataFrame (or list of Series.... exact supported types are TBD), we can generate our results in terms of a MultiIndex. Using the `A` DataFrame defined above, a user might write: +If `sensitive_features=` is a DataFrame (or list of Series, or list of numpy arrays, or a 2D numpy array etc.), we can generate our results in terms of a MultiIndex. Using the `A` DataFrame defined above, a user might write: ```python ->>> result = group_summary(skm.accuracy_score, - y_true, y_pred, - sensitive_features=A) +>>> result = MetricFrame(skm.accuracy_score, + y_true, y_pred, + sensitive_features=A) >>> result.by_groups accuracy_score SF 1 SF 2 @@ -167,16 +168,16 @@ C M 0.45 N 0.70 P 0.63 ``` -If a particular combination of sensitive features had no representatives, then we would return `None` for that entry in the Series. -Alternatively, we could skip that entry in the DataFrame - this point is **still TBD**. +If a particular combination of sensitive features had no representatives, then we would return `NaN` for that entry in the Series. Although this example has passed a DataFrame in for `sensitive_features=` we should aim to support lists of Series and `numpy.ndarray` as well. The `differences()` and `ratios()` methods would act on this DataFrame as before. -## Conditional Metrics +## Control Metrics -For our purposes, Conditional Metrics (alternatively known as Segmented Metrics) are specified separately from the sensitive features, since for users, they add columns to the result. -Mathematically, they behave like additional sensitive features. 
+Control Metrics (alternatively known as Conditional Metrics) are specified separately from the sensitive features, since +the aggregation functions discussed above do not act across them. +Within the `by_group` property, they behave like additional sensitive features. ### Existing Syntax @@ -185,14 +186,14 @@ Users would have to devise the required code themselves ### Proposed Change -The `GroupedMetric` constructor will need an additional argument `conditional_features=` to specify the conditional features. +The `MetricFrame` constructor will need an additional argument `control_features=` to specify the control features. It will accept similar types to the `sensitive_features=` argument. Suppose we have another column called `income_level` with unique values 'Low' and 'High' ```python ->>> metric = GroupedMetric(skm.accuracy_score, +>>> metric = MetricFrame(skm.accuracy_score, y_true, y_pred, sensitive_features=A_1, - conditional_features=income_level) + control_features=income_level) >>> metric.overall accuracy_score High 0.46 @@ -205,11 +206,11 @@ High B 0.40 Low B 0.55 C 0.65 ``` -The `overall` property now has rows corresponding to the unique values of the conditional feature(s). -Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) conditional feature. +The `overall` property is now a DataFrame, with rows corresponding to the unique values of the control feature(s). +Similarly, the result DataFrame now uses a Pandas MultiIndex for the columns, giving one column for each (combination of) control feature. -Note that it is possible to have multiple sensitive features, and multiple conditional features. -Operations such as `.group_max()` and `.difference()` will act on each combination of conditional feature values, and aggregate across the sensitive features. +Note that it is possible to have multiple sensitive features, and multiple control features. +Operations such as `.group_max()` and `.difference()` will act on each combination of control feature values, and aggregate across the sensitive features. So for example ```python >>> metric.difference(method='minmax') @@ -235,72 +236,45 @@ Users would have to devise their own means. ### Proposed Change -We allow a list of metric functions in the call to group summary. -The properties then add columns to their DataFrames: +We allow a dictionary of metric functions in the call to group summary. +The properties then extend themselves: ```python ->>> result = GroupedMetric([skm.accuracy_score, skm.precision_score], +>>> result = MetricFrame({'accuracy':skm.accuracy_score, 'precision':skm.precision_score}, y_true, y_pred, sensitive_features=A_1) >>> result.overall - accuracy_score precision_score -overall 0.3 0.5 +accuracy 0.3 +precision 0.5 +dtype: float64 >>> result.by_groups - accuracy_score precision_score -'B' 0.4 0.7 -'C' 0.6 0.75 + accuracy precision +'B' 0.4 0.7 +'C' 0.6 0.75 ``` -This should generalise to the other methods described above - extra metric functions add extra columns to the resultant DataFrames (this is why we made all results to be DataFrames, even if the actual result was a scalar). +Note that we use the dictionary keys, rather than the function names in the output. +This should generalise to the other methods described above. 
-When users wish to use the `sample_params=` and `params=` arguments, then they should pass in lists of dictionaries, matching the functions by index: +When users wish to use the `sample_params=` arguments, then they should pass in a dictionary of dictionaries, matching the functions by key: ```python -metric_fns = [skm.accuracy_score, skm.precision_score] -sample_params = [{'sample_weight':weight}], [{'sample_weight':weight}]] -params = [{ 'normalize': False }, {'pos_label' = 'Granted'}] -result = GroupedMetric(metric_fns, +metric_fns = { 'accuracy':skm.accuracy_score, 'precision':skm.precision_score} +sample_params = { 'accuracy':{'sample_weight':weight}], 'precision':{'sample_weight':weight}} +result = MetricFrame(metric_fns, y_true, y_pred, sensitive_features=A_1, - sample_params=sample_params, - params=params) + sample_params=sample_params) ``` -The length of the lists would be required to match. +The outer set of dictionary keys given to `sample_params=` should be a subset of the keys of the metric function dictioary. This is somewhat repetitious (see the `sample_weight` above), but trying to share some arguments between functions is likely to lead to a worse mess. -## Naming - -The name `GroupedMetric` is not especially inspired. -Some possible alternatives: - - `GroupedMetric` - - `GroupMetricResult` - - `MetricByGroups` - - ? - -Other names are also up for debate. -Arguments such as `method='minmax'` are important, but narrower in impact. - -## User Conveniences - -In addition to having the underlying metric be passed as a function, we can consider allowing the metric function given to the `GroupedMetric` constructor to be represented by a string. -We would provide a mapping of strings to suitable functions. -This would make the following equivalent: -```python ->>> r1 = GroupedMetric(sklearn.accuracy_score, - y_true, y_pred, - sensitive_features=A_1) ->>> r2 = GroupedMetric('accuracy_score', - y_true, y_pred, - sensitive_features=A_1) -``` -We would also allow mixtures of strings and functions in the multiple metric case. - ## Generality Throughout this document, we have been describing the case of classification metrics. However, we do not actually require this. It is the underlying metric function which gives meaning to the `y_true` and `y_pred` lists. -So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `GroupedMetric` does not actually care about their datatypes. +So long as these are of equal length (and equal in length to the sensitive feature list - which _will_ be treated as a categorical), then `MetricFrame` does not actually care about their datatypes. For example, each entry in `y_pred` could be a dictionary of predicted classes and accompanying probabilities. Or the user might be working on a regression problem, and both `y_true` and `y_pred` would be floating point numbers (or `y_pred` might even be a tuple of predicted value and error). -So long as the underlying metric understands the data structures, `GroupedMetric` will not care. +So long as the underlying metric understands the data structures, `MetricFrame` will not care. There will be an effect on the `difference()` and `ratio()` methods. Although the `overall` and `by_groups` properties will work fine, the `differences()` and `ratios()` methods may not. @@ -320,85 +294,5 @@ With intersections of sensitive features, cases like this become more likely. 
Metrics in SciKit-Learn usually have arguments such as `pos_label=` and `labels=` to allow the user to specify the expected labels, and adjust their behaviour accordingly. However, we do not require that users stick to the metrics defined in SciKit-Learn. -If we implement the convenience strings-for-functions piece mentioned above, then _when the user specifies one of those strings_ we can log warnings if the appropriate arguments (such as `labels=`) are not specified. -We could even generate the argument ourselves if the user does not specify it. -However, this risks tying Fairlearn to particular versions of SciKit-Learn. - -Unfortunately, the generality of `GroupedMetric` means that we cannot solve this for the user. +Unfortunately, the generality of `MetricFrame` means that we cannot solve this for the user. It cannot even tell if it is evaluating a classification or regression problem. - -## Creating Derived Metrics - -Rather than a `GroupedMetric` object, users will often want to have a function which yields a scalar. - -### Existing Syntax - -We currently provide a `make_derived_metric()` function which can build a callable object which does this: -```python -fhalf_score = functools.partial(skm.fbeta_score, beta=0.5) - -custom_difference1 = make_derived_metric( - difference_from_summary, - make_metric_group_summary(fhalf_score)) -``` -Notice that we have had to put a wrapper around `fbeta_score` since `beta=` is a required parameter, but we do not support anything beyond `y_true`, `y_pred` and `sensitive_features` as arguments. - -We use this to provide helper functions such as `accuracy_score_difference()` and `accuracy_score_group_min()` - -### Proposed Syntax - -We should be able to provide a function builder of the following form: -```python -fbeta_diff = make_derived_metric( - 'difference', - skm.fbeta_score, - sample_param_names=['sample_weight'] -) - -print(fbeta_diff(y_true, y_pred, sensitive_features=A_1, sample_weight=weights, beta=0.5)) -``` -Since the goal of this function is to produce scalars, we would not support supplying multiple underlying metrics. -We also do not propose to support conditional features at the present time (the appropriate behaviour is not clear). -The derived metrics would correspond to the various methods described above which compute scalars from the `GroupedMetrics` object. - - -## Support for `make_scorer()` - -SciKit-Learn has the concept of 'scorer' functions, which take arguments of `y_true` and `y_pred` and return a scalar score. -These are used in higher level algorithms such as `GridSearchCV`. -In order to use these in combination with metrics which can use other arguments (such as the `normalize=` argument on `accuracy_score` above), SciKit-Learn has a `make_scorer()` function , which takes a metric function, along with a list of other arguments, and binds them together to provide a function which just accepts `y_true` and `y_pred` arguments, but will invoke the underlying metric function with the specified extra arguments. -The higher level algorithms take folds of the input data, and ask the generated scoring function to evaluate these. - -There is one problem with this: if a user has a per-sample input (such as sample weights), how do we select the correct values to match the fold? -When the generated scorer is invoked, the `y_true` and `y_pred` arrays will be a subset of the `sample_weights` bound into the scorer by `make_scorer()`, so the problem is to work out the subset. 
-Currently, there is no good way to do this through SciKit-Learn (although a proposed solution is under development). -There is a [work around described by Adrin on StackOverflow](https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function), which relies on DataFrames being sliced 'in-place' by SciKit-Learn. -If all arguments are DataFrames (or Series) when when the generated scorer is invoked, the `index` property of `y_true` can be examined, and used as a mask on the sample weights column (which is bound into the generated scorer). -A more general solution is [under discussion within the SciKit-Learn community](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html). - -Our grouped metrics will always face this problem, since we always have the sensitive feature which will need to be passed along. -We can provide a `make_grouped_scorer()` method with a signature like: -```python -make_grouped_scorer(metric_function, - sensitive_feature, - indexed_params, - params, - disparity_measure='difference', - relative_to='min') -``` -We will only support a single sensitive feature for this, since we need to produce a single scalar result. -The function will verify that it has been passed a Pandas Series or DataFrame, so that the `index` is available. -The `disparity_measure=` argument specifies whether the disparity should be measured via the difference or ratio (corresponding to the methods on the `GroupedMetric` object). -Similarly, the `relative_to=` argument can also be set to `overall` - although in this case it is important to note that this will be the overall value for the fold, and *not* the overall value on the entire dataset. - - -## Convenience Functions - -We currently provide functions for evaluating common fairness metrics (where `X` can be `ratio` or `difference`): - -- `demographic_parity_X()` -- `true_positive_rate_X()` -- `false_positive_rate_X()` -- `equalized_odds_X()` - -We will continue to provide these wrappers, based on `GroupedMetric` objects internally. 
From 4feb19fb3536e1496e51d8a1468527fbfd4e5163 Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 15 Oct 2020 11:03:24 -0400 Subject: [PATCH 41/42] Remove uneeded notebook Signed-off-by: Richard Edgar --- api/Metrics API Samples.ipynb | 476 ---------------------------------- 1 file changed, 476 deletions(-) delete mode 100644 api/Metrics API Samples.ipynb diff --git a/api/Metrics API Samples.ipynb b/api/Metrics API Samples.ipynb deleted file mode 100644 index 28ef9df..0000000 --- a/api/Metrics API Samples.ipynb +++ /dev/null @@ -1,476 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn import svm\n", - "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", - "from sklearn.linear_model import LogisticRegression\n", - "import pandas as pd\n", - "import shap" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X_raw, Y = shap.datasets.adult()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "A = X_raw[['Sex','Race']]\n", - "X = X_raw.drop(labels=['Sex', 'Race'],axis = 1)\n", - "X = pd.get_dummies(X)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sc = StandardScaler()\n", - "X_scaled = sc.fit_transform(X)\n", - "X_scaled = pd.DataFrame(X_scaled, columns=X.columns)\n", - "\n", - "le = LabelEncoder()\n", - "Y = le.fit_transform(Y)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X_scaled, \n", - " Y, \n", - " A,\n", - " test_size = 0.2,\n", - " random_state=0,\n", - " stratify=Y)\n", - "\n", - "# Work around indexing issue\n", - "X_train = X_train.reset_index(drop=True)\n", - "A_train = A_train.reset_index(drop=True)\n", - "X_test = X_test.reset_index(drop=True)\n", - "A_test = A_test.reset_index(drop=True)\n", - "\n", - "# Improve labels\n", - "A_test.Sex.loc[(A_test['Sex'] == 0)] = 'female'\n", - "A_test.Sex.loc[(A_test['Sex'] == 1)] = 'male'\n", - "\n", - "\n", - "A_test.Race.loc[(A_test['Race'] == 0)] = 'Amer-Indian-Eskimo'\n", - "A_test.Race.loc[(A_test['Race'] == 1)] = 'Asian-Pac-Islander'\n", - "A_test.Race.loc[(A_test['Race'] == 2)] = 'Black'\n", - "A_test.Race.loc[(A_test['Race'] == 3)] = 'Other'\n", - "A_test.Race.loc[(A_test['Race'] == 4)] = 'White'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "lr_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n", - "\n", - "lr_predictor.fit(X_train, Y_train)\n", - "Y_pred_lr = lr_predictor.predict(X_test)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "svm_predictor = svm.SVC()\n", - "\n", - "svm_predictor.fit(X_train, Y_train)\n", - "Y_pred_svm = svm_predictor.predict(X_test)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Sample APIs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.metrics import accuracy_score, f1_score, fbeta_score\n", - "from fairlearn.metrics import group_summary, make_derived_metric, difference_from_summary, make_metric_group_summary\n", - "from fairlearn.metrics 
import demographic_parity_difference, balanced_accuracy_score_group_min\n", - "from fairlearn.metrics import false_negative_rate, false_positive_rate" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Report one disaggregated metric in a data frame" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - "bunch = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", - "frame = pd.Series(bunch.by_group)\n", - "frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})\n", - "print(frame)\n", - "print(\"=======================\")\n", - "print(frame_o)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "result = GroupedMetric(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", - "frame = result.by_group\n", - "frame_o = result.to_df() # Throw if there is a group called 'overall'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Report several disaggregated metrics in a data frame." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - "bunch1 = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", - "bunch2 = group_summary(f1_score, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", - "frame = pd.DataFrame({\n", - " 'accuracy': bunch1.by_group, 'f1': bunch2.by_group})\n", - "frame_o = pd.DataFrame({\n", - " 'accuracy': {**bunch1.by_group, 'overall': bunch1.overall},\n", - " 'f1': {**bunch2.by_group, 'overall': bunch2.overall}})\n", - "\n", - "print(frame)\n", - "print(\"=======================\")\n", - "print(frame_o)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "result = GroupedMetric({ 'accuracy':accuracy_score, 'f1':f1_score}, Y_test, Y_pred_lr, sensitive_features=A_test['Race'])\n", - "frame = result.by_group\n", - "frame_o = result.to_df() # Throw if there is a group called 'overall'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Report metrics for intersecting sensitive features" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - "sf = A_test['Race']+'-'+A_test['Sex'] # User builds new column manually\n", - "\n", - "bunch = group_summary(accuracy_score, Y_test, Y_pred_lr, sensitive_features=sf)\n", - "frame = pd.Series(bunch.by_group)\n", - "frame_o = pd.Series({**bunch.by_group, 'overall': bunch.overall})\n", - "\n", - "print(frame)\n", - "print(\"=======================\")\n", - "print(frame_o)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "result = GroupedMetric(accuracy_score, Y_test, Y_pred_lr, sensitive_features=[A['Race'], A['Sex']])\n", - "frame = result.by_group # Will have a MultiIndex built from the two sensitive feature columns\n", - "frame_o = result.to_def() # Not sure how to handle adding the extra 'overall' row" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Report several performance and fairness metrics of several models in a data frame" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - 
"fb_s = lambda y_t, y_p: fbeta_score(y_t, y_p, beta=0.5)\n", - "custom_difference1 = make_derived_metric(\n", - " difference_from_summary,\n", - " make_metric_group_summary(fb_s))\n", - "\n", - "def custom_difference2(y_true, y_pred, sensitive_features):\n", - " bunch = group_summary(fbeta_score, y_true, y_pred, sensitive_features=sensitive_features, beta=0.5)\n", - " frame = pd.Series(bunch.by_group)\n", - " return (frame-frame['White']).min()\n", - "\n", - "fairness_metrics = {\n", - " 'Custom difference 1': custom_difference1,\n", - " 'Custom difference 2': custom_difference2,\n", - " 'Demographic parity difference': demographic_parity_difference,\n", - " 'Worst-case balanced accuracy': balanced_accuracy_score_group_min}\n", - "performance_metrics = {\n", - " 'FPR': false_positive_rate,\n", - " 'FNR': false_negative_rate}\n", - "predictions_by_estimator = {\n", - " 'logreg': Y_pred_lr,\n", - " 'svm': Y_pred_svm}\n", - "\n", - "df = pd.DataFrame()\n", - "for pred_key, y_pred in predictions_by_estimator.items():\n", - " for fairm_key, fairm in fairness_metrics.items():\n", - " df.loc[fairm_key, pred_key] = fairm(Y_test, y_pred, sensitive_features=A_test['Race'])\n", - " for perfm_key, perfm in performance_metrics.items():\n", - " df.loc[perfm_key, pred_key] = perfm(Y_test, y_pred)\n", - " \n", - "print(df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "custom_difference1 = make_derived_metric('difference', fbeta_score, parms={'beta', 0.5})\n", - "\n", - "def custom_difference2(y_true, y_pred, sensitive_features):\n", - " tmp = GroupedMetric(fbeta_score, y_true, y_pred, sensitive_features=sensitive_features, parms={'beta':0.5})\n", - " return tmp.differences(relative_to='group', group='White', aggregate='min')\n", - "\n", - "# The remainder as before" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a fairness-performance raster plot of several models" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - "my_disparity_metric=custom_difference1\n", - "my_performance_metric=false_positive_rate\n", - "\n", - "xs = [my_performance_metric(Y_test, y_pred) for y_pred in predictions_by_estimator.values()]\n", - "ys = [my_disparity_metric(Y_test, y_pred, sensitive_features=A_test['Race']) \n", - " for y_pred in predictions_by_estimator.values()]\n", - "\n", - "plt.scatter(xs,ys)\n", - "plt.xlabel('Performance Metric')\n", - "plt.ylabel('Disparity Metric')\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "\n", - "# Would also reuse the definition of custom_difference1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run sklearn.model_selection.cross_validate\n", - "\n", - "Use demographic parity and precision score as the metrics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import cross_validate\n", - "from sklearn.metrics import make_scorer, precision_score" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - 
"precision_scorer = make_scorer(precision_score)\n", - "\n", - "y_t = pd.Series(Y_test)\n", - "def dpd_wrapper(y_t, y_p, sensitive_features):\n", - " # We need to slice up the sensitive feature to match y_t and y_p\n", - " # See Adrin's reply to:\n", - " # https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function\n", - " sf_slice = sensitive_features.loc[y_t.index.values].values.reshape(-1)\n", - " return demographic_parity_difference(y_t, y_p, sensitive_features=sf_slice)\n", - "dp_scorer = make_scorer(dpd_wrapper, sensitive_features=A_test['Race'])\n", - "\n", - "scoring = {'prec':precision_scorer, 'dp':dp_scorer}\n", - "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n", - "scores = cross_validate(clf, X_test, y_t, scoring=scoring)\n", - "scores" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "\n", - "# Would be the same, until Adrin's SLEP/PR are accepted to help with input slicing" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TASK 7: Run GridSearchCV\n", - "\n", - "Use demographic parity and precision score where the goal is to find the lowest-error model whose demographic parity is <= 0.05." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Current\n", - "from sklearn.model_selection import GridSearchCV\n", - "\n", - "param_grid = [\n", - " {'C': [1, 10, 100, 1000], 'kernel': ['linear']},\n", - " {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},\n", - " ]\n", - "scoring = {'prec':precision_scorer, 'dp':dp_scorer}\n", - "\n", - "clf = svm.SVC(kernel='linear', C=1, random_state=0)\n", - "\n", - "gscv = GridSearchCV(clf, param_grid=param_grid, scoring=scoring, refit='prec', verbose=1)\n", - "gscv.fit(X_test, y_t)\n", - "\n", - "print(\"Best parameters set found on development set:\") \n", - "print(gscv.best_params_)\n", - "print(\"Best score:\", gscv.best_score_)\n", - "print()\n", - "print(\"Overall results\")\n", - "print(gscv.cv_results_)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Proposed\n", - "\n", - "# Would be the same, until Adrin's SLEP/PR are accepted to help with input slicing" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From 5649f921bd7925448f77d76755555374a3c36c7a Mon Sep 17 00:00:00 2001 From: Richard Edgar Date: Thu, 15 Oct 2020 11:10:34 -0400 Subject: [PATCH 42/42] Fix the odd typo Signed-off-by: Richard Edgar --- api/Updated-Metrics.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/api/Updated-Metrics.md b/api/Updated-Metrics.md index 51f891f..d2244e3 100644 --- a/api/Updated-Metrics.md +++ b/api/Updated-Metrics.md @@ -191,9 +191,9 @@ It will accept similar types to the `sensitive_features=` argument. 
Suppose we have another column called `income_level` with unique values 'Low' and 'High':
```python
>>> metric = MetricFrame(skm.accuracy_score,
-                        y_true, y_pred,
-                        sensitive_features=A_1,
-                        control_features=income_level)
+                         y_true, y_pred,
+                         sensitive_features=A_1,
+                         control_features=income_level)
>>> metric.overall
accuracy_score
High 0.46
@@ -240,8 +240,8 @@ We allow a dictionary of metric functions in the call to group summary.
The properties then extend themselves:
```python
>>> result = MetricFrame({'accuracy':skm.accuracy_score, 'precision':skm.precision_score},
-                        y_true, y_pred,
-                        sensitive_features=A_1)
+                         y_true, y_pred,
+                         sensitive_features=A_1)
>>> result.overall
accuracy 0.3
precision 0.5
@@ -257,11 +257,11 @@ This should generalise to the other methods described above.
When users wish to use the `sample_params=` arguments, then they should pass in a dictionary of dictionaries, matching the functions by key:
```python
metric_fns = { 'accuracy':skm.accuracy_score, 'precision':skm.precision_score}
-sample_params = { 'accuracy':{'sample_weight':weight}], 'precision':{'sample_weight':weight}}
+sample_params = { 'accuracy':{'sample_weight':weight}, 'precision':{'sample_weight':weight}}
result = MetricFrame(metric_fns,
-                    y_true, y_pred,
-                    sensitive_features=A_1,
-                    sample_params=sample_params)
+                     y_true, y_pred,
+                     sensitive_features=A_1,
+                     sample_params=sample_params)
```
The outer set of dictionary keys given to `sample_params=` should be a subset of the keys of the metric function dictionary.
This is somewhat repetitious (note the duplicated `sample_weight` above), but attempting to share arguments between the metric functions would quickly become more confusing than the repetition it avoids.
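As a concrete illustration of how these pieces are intended to combine, here is a short, self-contained sketch that exercises a dictionary of metrics, per-metric `sample_params=`, and a control feature together under the proposed `MetricFrame` API. The toy data, the `Income Level` values, and the comments about the expected shapes are invented purely for illustration; only the constructor arguments already shown in this document are assumed.

```python
# Sketch only: assumes the proposed MetricFrame constructor arguments shown
# above (a dict of metrics, sensitive_features=, control_features= and
# sample_params=).  All data below are made up for the example.
import pandas as pd
import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame  # the class proposed in this document

y_true = [0, 1, 0, 0, 1, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
A_1 = pd.Series(['male', 'female', 'female', 'male',
                 'female', 'male', 'female', 'male'], name='Sex')
income_level = pd.Series(['Low', 'High', 'Low', 'High',
                          'High', 'Low', 'High', 'Low'], name='Income Level')
weights = [1, 2, 3, 2, 2, 1, 1, 2]

metric_fns = {'accuracy': skm.accuracy_score, 'precision': skm.precision_score}
# One dictionary of per-sample arguments for each metric, matched by key
sample_params = {'accuracy': {'sample_weight': weights},
                 'precision': {'sample_weight': weights}}

result = MetricFrame(metric_fns,
                     y_true, y_pred,
                     sensitive_features=A_1,
                     control_features=income_level,
                     sample_params=sample_params)

# overall is evaluated separately for each income level, one column per metric
print(result.overall)
# by_group is indexed by (income level, sex) pairs, again one column per metric
print(result.by_group)
```

Because `sample_params=` is keyed by metric name, each metric can in principle receive different per-sample arguments; the price is the repeated `sample_weight` entry noted above.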