v0.1.0 (#32)
* Stop using SDV during testing

* Make the sdv dependency optional

* Make sdv a test dependency and curate dependencies

* Bumpversion minor

* Fix test

* Fix test-tutorials

* Drop Python3.5 support

* Revert bumpversion

* Fix sphinx build

* Update badge to point at travis.com

* Reorganize metrics by data modality

* Reorganize by data modality (#19)

* Reorganize metrics by data modality with a new compute API

* Disable readme and tutorial testing

* Disable docs testing on test-devel

* Re add old sdmetrics (#20)

* Add more multi-table metrics and tests

* Restore bivariate metrics as column_pairs metrics

* Fix error on windows

* Remove unused methods

* Bump version: 0.0.5.dev0 → 0.1.0.dev0

* Add sdgym metrics (#24)

* Add more multi-table metrics and tests

* Restore bivariate metrics as column_pairs metrics

* Fix error on windows

* Remove unused methods

* Added conda support

* Fixed typo

* Empty commit

* Fixes readme mistake (#22)

* Fix docstring

* Fix NestedAttrsMeta

* Add BayesianNetwork Likelihood metrics

* Add GaussianMixture Likelihood metric

* Add Machine Learning Efficacy metrics

* Add dependencies

* Update version number

* Rename GMLikelihood to GMLogLikelihood

* Fix name and range for BNLogLikelihood

* Remove nan if no columns match the supported dtypes

* Allow being passed a predefined BN structure and add logging

* Allow passing scorers to ML Efficacy metrics

* Improve ML Detection and Efficacy pipelines

Co-authored-by: Felipe Alex Hofmann <[email protected]>

* Add metadata argument (#25)

* Added conda support

* Fixed typo

* Empty commit

* Fixes readme mistake (#22)

* Add optional metadata dict argument to all metrics

* Fix KSTestExtended

Co-authored-by: Felipe Alex Hofmann <[email protected]>

* Dynamic MultiSingleTable metrics (#26)

* Fix error when working on integer only data

* Allow passing an entire serialized BN instead of just the structure

* Allow defining MST metrics by passing the ST metric as an argument

* Fix lint

* Organize Imports and add Generic MLEfficacy and get_subclasses (#27)

* Organize and standardize imports across all the project

* Increase sample size to make tests more stable

* Update readme and docs (#28)

* Add method to load demo data

* Make get_subclasses return only usable metrics and skip parents

* Add READMEs

* Add DAI Logo and move SDV Logo to end

* Bump version: 0.1.0.dev0 → 0.1.0.dev1

* Allow passing non dict metadata

* Add demos by data modality and compute_metrics function (#30)

* Bump version: 0.1.0.dev1 → 0.1.0.dev2

* Add documentation (#31)

* Update readme and add docstrings

* Add timeseries demo

* Update installation instructions

Co-authored-by: Felipe Alex Hofmann <[email protected]>
csala and fealho authored Dec 18, 2020
1 parent 5d9251b commit ab6c68b
Showing 102 changed files with 3,484 additions and 3,593 deletions.
60 changes: 60 additions & 0 deletions INSTALL.md
@@ -0,0 +1,60 @@
# Installing SDMetrics

## Requirements

**SDMetrics** has been developed and tested on [Python 3.6, 3.7 and 3.8](https://www.python.org/downloads/)

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **SDMetrics** is run.
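
For example, a minimal sketch of creating and activating such a virtualenv with the standard `venv` module (the environment name `sdmetrics-env` is only illustrative):

```bash
# Create and activate a virtualenv using the standard library venv module
python3 -m venv sdmetrics-env
source sdmetrics-env/bin/activate
```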

## Install with pip

The easiest and recommended way to install **SDMetrics** is using [pip](
https://pip.pypa.io/en/stable/):

```bash
pip install sdmetrics
```

This will pull and install the latest stable release from [PyPi](https://pypi.org/).

## Install with conda

**SDMetrics** can also be installed using [conda](https://docs.conda.io/en/latest/):

```bash
conda install -c sdv-dev -c conda-forge sdmetrics
```

This will pull and install the latest stable release from [Anaconda](https://anaconda.org/).

## Install from source

If you want to install **SDMetrics** from source you need to first clone the repository
and then execute the `make install` command inside the `stable` branch. Note that this
command works only on Unix based systems like GNU/Linux and macOS:

```bash
git clone https://github.com/sdv-dev/SDMetrics
cd SDMetrics
git checkout stable
make install
```

## Install for development

If you intend to modify the source code or contribute to the project you will need to
install it from the source using the `make install-develop` command. In this case, we
recommend you to branch from `master` first:

```bash
git clone [email protected]:sdv-dev/SDMetrics
cd SDMetrics
git checkout master
git checkout -b <your-branch-name>
make install-develop
```

For more details about how to contribute to the project please visit the [Contributing Guide](
CONTRIBUTING.rst).
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -3,6 +3,7 @@ include CONTRIBUTING.rst
include HISTORY.md
include LICENSE
include README.md
include sdmetrics/demos/*.pkl

recursive-include tests *
recursive-exclude * __pycache__
254 changes: 82 additions & 172 deletions README.md
@@ -1,6 +1,8 @@
<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt=“DAI-Lab” />
<i>An open source project from Data to AI Lab at MIT.</i>
<a href="https://dai.lids.mit.edu">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
</a>
<i>An Open Source Project from the <a href="https://dai.lids.mit.edu">Data to AI Lab, at MIT</a></i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
@@ -9,218 +11,126 @@
[![Tests](https://github.com/sdv-dev/SDMetrics/workflows/Run%20Tests/badge.svg)](https://github.com/sdv-dev/SDMetrics/actions?query=workflow%3A%22Run+Tests%22+branch%3Amaster)
[![Coverage Status](https://codecov.io/gh/sdv-dev/SDMetrics/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/SDMetrics)

<p>
<img width=15% src="docs/resources/header.png">
</p>
<img align="center" width=30% src="docs/resources/header.png">

Metrics for Synthetic Data Generation Projects

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV
* Repository: https://github.com/sdv-dev/SDMetrics
* License: [MIT](https://github.com/sdv-dev/SDMetrics/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
* Documentation: https://sdv-dev.github.io/SDMetrics
* Homepage: https://github.com/sdv-dev/SDMetrics

# Overview

The **SDMetrics** library provides a set of **dataset-agnostic tools** for evaluating the **quality of a synthetic database** by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:

- **Statistical metrics** which use statistical tests to compare the distributions of the real and synthetic distributions.
- **Detection metrics** which use machine learning to try to distinguish between real and synthetic data.
- **Descriptive metrics** which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.

The **SDMetrics** library provides a set of **dataset-agnostic tools** for evaluating the **quality
of a synthetic database** by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

* **Single Columns**: Compare 1 dimensional `numpy` arrays representing individual columns.
* **Column Pairs**: Compare how columns in a `pandas.DataFrame` relate to each other, in groups of 2.
* **Single Table**: Compare an entire table, represented as a `pandas.DataFrame`.
* **Multi Table**: Compare multi-table and relational datasets represented as a python `dict` with
  multiple tables passed as `pandas.DataFrame`s.
* **Time Series**: Compare tables representing ordered sequences of events.
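
To make the modalities more concrete, here is a small sketch of what the inputs look like in each case; the column and table names are only illustrative:

```python3
import numpy as np
import pandas as pd

# Single Column: a 1 dimensional numpy array per column
real_column = np.array([0.1, 0.4, 0.3, 0.9])
synthetic_column = np.array([0.2, 0.5, 0.2, 0.8])

# Column Pairs / Single Table: a pandas.DataFrame with one or more columns
real_table = pd.DataFrame({'age': [32, 45, 27], 'salary': [40000, 65000, 52000]})
synthetic_table = pd.DataFrame({'age': [30, 47, 25], 'salary': [42000, 61000, 50000]})

# Multi Table: a python dict mapping table names to pandas.DataFrames
real_tables = {'users': real_table}
synthetic_tables = {'users': synthetic_table}
```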

It includes a variety of metrics such as:

* **Statistical metrics** which use statistical tests to compare the distributions of the real
  and synthetic distributions.
* **Detection metrics** which use machine learning to try to distinguish between real and synthetic data.
* **Efficacy metrics** which compare the performance of machine learning models when run on the synthetic and real data.
* **Bayesian Network and Gaussian Mixture metrics** which learn the distribution of the real data
  and evaluate the likelihood of the synthetic data belonging to the learned distribution.
* **Privacy metrics** which evaluate whether the synthetic data is leaking information about the real data.
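
All of these families are implemented as metric classes that can be discovered programmatically. As a hedged sketch, assuming that every modality mirrors the Multi Table API shown later in this README (a base class exposing `get_subclasses()`), the available Single Table metrics could be listed like this:

```python3
import sdmetrics

# Assumption: mirrors sdmetrics.multi_table.MultiTableMetric.get_subclasses(),
# which is shown in the Standalone usage example below.
single_table_metrics = sdmetrics.single_table.SingleTableMetric.get_subclasses()
print(list(single_table_metrics))
```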

# Install

## Requirements

**SDMetrics** has been developed and tested on [Python 3.6, 3.7 and 3.8](https://www.python.org/downloads/)

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **SDMetrics** is run.

**SDMetrics** is part of the **SDV** project and is automatically installed alongside it. For
details about this process please visit the [SDV Installation Guide](
https://sdv.dev/SDV/getting_started/install.html)

Optionally, **SDMetrics** can also be installed as a standalone library using the following commands:

## Install with pip

The easiest and recommended way to install **SDMetrics** is using [pip](
https://pip.pypa.io/en/stable/):
**Using `pip`:**

```bash
pip install sdmetrics
```

This will pull and install the latest stable release from [PyPi](https://pypi.org/).

If you want to install from source or contribute to the project please read the
[Contributing Guide](https://sdv-dev.github.io/SDMetrics/contributing.html#get-started).

## Install with conda

**SDMetrics** can also be installed using [conda](https://docs.conda.io/en/latest/):
**Using `conda`:**

```bash
conda install -c sdv-dev -c conda-forge sdmetrics
```

This will pull and install the latest stable release from [Anaconda](https://anaconda.org/).

For more installation options please visit the [SDMetrics installation Guide](INSTALL.md)

# Basic Usage

Let's run the demo code from **SDV** to generate a simple synthetic dataset:

```python3
from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)

synthetic_tables = sdv.sample_all(20)
```

Now that we have a synthetic dataset, we can evaluate it using **SDMetrics** by calling the `evaluate` function which returns an instance of `MetricsReport` with the default metrics:

```python3
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
```

## Examining Metrics

This `report` object makes it easy to examine the metrics at different levels of granularity. For example, the `overall` method returns a single scalar value which functions as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.

```python3
print(report.overall())
```

In addition, the `report` provides a `highlights` method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (i.e. which tables/columns/relationships are not being modeled properly).

```python3
print(report.highlights())
```

## Visualizing Metrics

Finally, the `report` object provides a `visualize` method which generates a figure showing some of the key metrics.

```python3
figure = report.visualize()
figure.savefig("sdmetrics-report.png")
```

<p align="center">
<img style="width:100%" src="docs/resources/visualize.png">
</p>

# Advanced Usage

## Specifying Metrics

Instead of running all the default metrics, you can specify exactly what metrics you
want to run by creating an empty `MetricsReport` and adding the metrics yourself. For
example, the following code only computes the machine learning detection-based metrics.

The `MetricsReport` object includes a `details` method which returns all of the
metrics that were computed.

```python3
from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))
```

## Creating Metrics

Suppose you want to add some new metrics to this library. To do this, you simply
need to write a function which yields instances of the `Metric` object:

```python3
from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"

    for table_name in metadata.get_tables():

        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = set([
            "priority:high",
            "table:%s" % table_name
        ])

        yield Metric(name, value, tags)
```

To attach your metrics to a `MetricsReport` object, you can use the `add_metrics`
method and provide your custom metrics iterator:

```python3
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))
```

See `sdmetrics.detection`, `sdmetrics.efficacy`, and `sdmetrics.statistical` for
more examples of how to implement metrics.

## Filtering Metrics

The `MetricsReport` object includes a `details` method which returns all of the
metrics that were computed.

```python3
from sdmetrics.report import MetricsReport

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()
```

To filter these metrics, you can provide a filter function. For example, to only
see metrics that are associated with the `users` table, you can run

```python3
def my_custom_filter(metric):
    if "table:users" in metric.tags:
        return True
    return False

report.details(my_custom_filter)
```

Examples of standard tags implemented by the built-in metrics are shown below.

<table>
<tr>
<th style="width:14em;">Tag</th>
<th>Description</th>
</tr>
<tr>
<td><code>priority:high</code></td>
<td>This tag tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values).</td>
</tr>
<tr>
<td><code>table:TABLE_NAME</code></td>
<td>This tag indicates that the metric involves the table specified by <code>TABLE_NAME</code>.</td>
</tr>
<tr>
<td><code>column:COL_NAME</code></td>
<td>This tag indicates that the metric involves the column specified by <code>COL_NAME</code>. If the column names are not unique across the entire database, then it needs to be combined with the <code>table:TABLE_NAME</code> tag to uniquely identify a specific column.</td>
</tr>
</table>

As this library matures, we will define additional standard tags and/or promote them to
first class attributes.

# What's next?

For more details about **SDMetrics** and all its possibilities and features, please check
the [documentation site](https://sdv-dev.github.io/SDMetrics/).

# Usage

**SDMetrics** is included as part of the framework offered by SDV to evaluate the quality of
your synthetic dataset. For more details about how to use it please visit the corresponding
User Guides:

* [Evaluating Single Table Data](https://sdv.dev/SDV/user_guides/single_table/evaluation.html)
* Evaluating Multi Table Data (Coming soon)
* Evaluating Time Series Data (Coming soon)

## Standalone usage

**SDMetrics** can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its
synthetic replica by running all the compatible multi-table metrics on it:

```python3
import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)
```

The output will be a table with all the details about the executed metrics and their score:

| metric | name | score | min_value | max_value | goal |
|------------------------------|-----------------------------------------|------------|-------------|-------------|----------|
| CSTest | Chi-Squared | 0.76651 | 0 | 1 | MAXIMIZE |
| KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.75 | 0 | 1 | MAXIMIZE |
| KSTestExtended | Inverted Kolmogorov-Smirnov D statistic | 0.777778 | 0 | 1 | MAXIMIZE |
| LogisticDetection | LogisticRegression Detection | 0.882716 | 0 | 1 | MAXIMIZE |
| SVCDetection | SVC Detection | 0.833333 | 0 | 1 | MAXIMIZE |
| BNLikelihood | BayesianNetwork Likelihood | nan | 0 | 1 | MAXIMIZE |
| BNLogLikelihood | BayesianNetwork Log Likelihood | nan | -inf | 0 | MAXIMIZE |
| LogisticParentChildDetection | LogisticRegression Detection | 0.619444 | 0 | 1 | MAXIMIZE |
| SVCParentChildDetection | SVC Detection | 0.916667 | 0 | 1 | MAXIMIZE |
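
The metrics listed in this table can also be run one at a time through the same `compute` API. A minimal sketch, assuming that the classes returned by `get_subclasses` accept the same arguments as `compute_metrics` above, and using `CSTest` (one of the names shown in the table) as an example:

```python3
# Run a single metric from the `metrics` dict obtained above.
score = metrics['CSTest'].compute(real_data, synthetic_data, metadata=metadata)
print(score)
```
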
# What's next?

If you want to read more about each individual metric, please visit the following folders:

* Single Column Metrics: [sdmetrics/single_column](sdmetrics/single_column)
* Single Table Metrics: [sdmetrics/single_table](sdmetrics/single_table)
* Multi Table Metrics: [sdmetrics/multi_table](sdmetrics/multi_table)

# The Synthetic Data Vault

<p>
<a href="https://sdv.dev">
<img width=30% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDV-Logo-Color-Tagline.png?raw=true">
</a>
<p><i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a></i></p>
</p>

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV
4 changes: 4 additions & 0 deletions conda/meta.yaml
@@ -1,5 +1,9 @@
{% set name = 'sdmetrics' %}
<<<<<<< HEAD
{% set version = '0.1.0.dev2' %}
=======
{% set version = '0.0.5.dev0' %}
>>>>>>> master

package:
name: "{{ name|lower }}"
Binary file added resources/visualize.png