Integrating insight into fairlens #108

Hilly12 · 2021-08-18T13:11:28Z

We have to integrate metrics from insight into fairlens and move metrics in fairlens to insight.

I'm thinking we could wrap statistical distance and correlation metrics using the following structure.

def mean_distance(x: pd.Series, y: pd.Series) -> float:
    """The difference between the means of two distributions.

    Args:
        x (pd.Series):
            The data in the column representing the first group.
        y (pd.Series):
            The data in the column representing the second group.

    Returns:
        float:
            The computed distance.
    """

    # Call to custom checker instance, throw error if necessary
    # Call to distance metric from insight
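
To make the structure above concrete, here is a minimal sketch of how the two commented steps might be filled in. `check_numeric_series` is a hypothetical checker, not an existing fairlens or insight function, and the absolute mean difference is computed inline as a stand-in for the real call into insight.

```python
import pandas as pd


def check_numeric_series(s: pd.Series, name: str) -> pd.Series:
    """Hypothetical checker: validate the input and drop missing values."""
    if not isinstance(s, pd.Series):
        raise TypeError(f"{name} must be a pandas Series")
    if not pd.api.types.is_numeric_dtype(s):
        raise TypeError(f"{name} must have a numeric dtype")
    s = s.dropna()
    if s.empty:
        raise ValueError(f"{name} has no non-null values")
    return s


def mean_distance(x: pd.Series, y: pd.Series) -> float:
    """The absolute difference between the means of two groups."""
    x = check_numeric_series(x, "x")
    y = check_numeric_series(y, "y")
    # Assumption: stand-in for the call to the distance metric from insight.
    return float(abs(x.mean() - y.mean()))
```

Keeping validation in a separate checker means every wrapper can share the same error behaviour while the metric itself stays a one-line delegation.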

For hypothesis tests and other metrics that return a p-value, we could use the following structure.

def ks_test(x: pd.Series, y: pd.Series) -> DistanceResult:
    """Carries out a two-sample Kolmogorov-Smirnov test.

    Args:
        x (pd.Series):
            The data in the column representing the first group.
        y (pd.Series):
            The data in the column representing the second group.

    Returns:
        DistanceResult:
            (Dataclass from insight)
    """

    # Call to custom checker
    # Call to class from insight

def bootstrap_test(x: pd.Series, y: pd.Series, metric: Callable[[pd.Series, pd.Series], float]) -> DistanceResult:
    """Bootstraps the data in x and y to compute a distribution for the metric (test statistic) which
    is used to test its significance when applied to x and y.

    Args:
        x (pd.Series):
            The data in the column representing the first group.
        y (pd.Series):
            The data in the column representing the second group.
        metric (Callable[[pd.Series, pd.Series], float]):
            The metric or test statistic to use.

    Returns:
        DistanceResult:
            (Dataclass from insight)
    """

    # Call to custom checker
    # Bootstrap the statistic using the correct method
    # Return the bootstrapped p-value, confidence interval in DistanceResult
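
As a rough illustration of the bootstrap step, the sketch below resamples from the pooled data to approximate the distribution of the metric under the null hypothesis that both groups come from the same distribution. This is only one of several possible bootstrap schemes and is an assumption here, as are the `BootstrapResult` dataclass (a stand-in for the dataclass from insight), the `n_resamples` and `seed` parameters, and the two-sided comparison. The confidence interval is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np
import pandas as pd


@dataclass
class BootstrapResult:
    """Hypothetical stand-in for the result dataclass from insight."""

    distance: float
    p_value: float


def bootstrap_test(
    x: pd.Series,
    y: pd.Series,
    metric: Callable[[pd.Series, pd.Series], float],
    n_resamples: int = 1000,
    seed: Optional[int] = None,
) -> BootstrapResult:
    rng = np.random.default_rng(seed)
    observed = metric(x, y)

    # Pool both groups: under the null hypothesis they share one distribution.
    pooled = pd.concat([x, y], ignore_index=True).to_numpy()
    n_x, n_total = len(x), len(pooled)

    extreme = 0
    for _ in range(n_resamples):
        # Draw a resample with replacement and split it into two pseudo-groups.
        sample = rng.choice(pooled, size=n_total, replace=True)
        stat = metric(pd.Series(sample[:n_x]), pd.Series(sample[n_x:]))
        if abs(stat) >= abs(observed):
            extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return BootstrapResult(distance=float(observed), p_value=p_value)
```

Because `metric` only needs the `(x, y) -> float` signature, any of the plain distance wrappers could be passed in unchanged.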

It may be worth renaming DistanceResult to TestResult, or defining something similar in fairlens.
For methods like stat_distance, we can loop through the functions in distance.py, use their names as the mode, and remove the p-value option. Alternatively, we could rename stat_distance to stat_similarity and have it always return a p-value.
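
The name-based dispatch could look roughly like the sketch below. The two metrics and the explicit tuple are placeholders for whatever distance.py actually contains, and stripping the `_distance` suffix to form the mode name is an assumption. In the real module, the tuple could be built by iterating over its functions, for example with `inspect.getmembers`.

```python
from typing import Callable, Dict

import pandas as pd


def mean_distance(x: pd.Series, y: pd.Series) -> float:
    return float(abs(x.mean() - y.mean()))


def median_distance(x: pd.Series, y: pd.Series) -> float:
    return float(abs(x.median() - y.median()))


_SUFFIX = "_distance"

# Placeholder for looping over the functions defined in distance.py.
_METRICS = (mean_distance, median_distance)

# Map each mode name (function name without the suffix) to its function.
DISTANCES: Dict[str, Callable[[pd.Series, pd.Series], float]] = {
    fn.__name__[: -len(_SUFFIX)]: fn for fn in _METRICS
}


def stat_distance(x: pd.Series, y: pd.Series, mode: str = "mean") -> float:
    """Dispatch to the distance metric registered under `mode`."""
    try:
        metric = DISTANCES[mode]
    except KeyError:
        valid = ", ".join(sorted(DISTANCES))
        raise ValueError(f"Unknown mode {mode!r}; expected one of: {valid}") from None
    return metric(x, y)
```

With this layout, adding a new metric to the module makes it available as a mode without touching `stat_distance`.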

Thoughts?

bogdansurdu · 2021-08-18T13:33:24Z

I think this is a good way of integrating it. The _distance and _test naming also makes the distinction clear between functions that should only return a numerical distance and functions where we care about significance (p-values, confidence intervals).

Additionally, by keeping the prototypes for distances strictly as

(x: pd.Series, y: pd.Series) -> float

and the prototypes for tests as

(x: pd.Series, y: pd.Series) -> DistanceResult

I imagine it will not be very difficult to make use of the metrics from Insight, as they have consistent types throughout.

A similar thing can be done for correlations. Some of the methods used to calculate coefficients are actually based on hypothesis tests (e.g. the Kruskal-Wallis H test), and we simply use the p-value (or rather 1 - p_value) as the coefficient. This works quite well in practice, but may not be the most theoretically grounded approach.
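
For reference, the 1 - p_value idea can be sketched as follows for a categorical column against a numeric one. This uses `scipy.stats.kruskal`. The function name `kruskal_coefficient` and the fallback of 0.0 when fewer than two groups are present are assumptions for illustration, not the existing fairlens implementation.

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_coefficient(categories: pd.Series, values: pd.Series) -> float:
    """Use 1 - p from a Kruskal-Wallis H test as an association score in [0, 1]."""
    # Split the numeric values into one sample per category, dropping nulls.
    groups = [g.dropna().to_numpy() for _, g in values.groupby(categories)]
    groups = [g for g in groups if len(g) > 0]

    # Assumption: with fewer than two groups there is nothing to compare.
    if len(groups) < 2:
        return 0.0

    _, p_value = kruskal(*groups)
    return float(1.0 - p_value)
```

Note that the p-value depends on sample size as well as on how strongly the groups differ, so scores from datasets of different sizes are not directly comparable.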
