v0.1.0 (#32)
* Stop using SDV during testing

* Make the sdv dependency optional

* Make sdv a test dependency and curate dependencies

* Bumpversion minor

* Fix test

* Fix test-tutorials

* Drop Python3.5 support

* Revert bumpversion

* Fix sphinx build

* Update badge to point at travis.com

* Reorganize metrics by data modality

* Reorganize by data modality (#19)

* Reorganize metrics by data modality with a new compute API

* Disable readme and tutorial testing

* Disable docs testing on test-devel

* Re add old sdmetrics (#20)

* Add more multi-table metrics and tests

* Restore bivariate metrics as column_pairs metrics

* Fix error on windows

* Remove unused methods

* Bump version: 0.0.5.dev0 → 0.1.0.dev0

* Add sdgym metrics (#24)

* Add more multi-table metrics and tests

* Restore bivariate metrics as column_pairs metrics

* Fix error on windows

* Remove unused methods

* Added conda support

* Fixed typo

* Empty commit

* Fixes readme mistake (#22)

* Fix docstring

* Fix NestedAttrsMeta

* Add BayesianNetwork Likelihood metrics

* Add GaussianMixture Likelihood metric

* Add Machine Learning Efficacy metrics

* Add dependencies

* Update version number

* Rename GMLikelihood to GMLogLikelihood

* Fix name and range for BNLogLikelihood

* Remove nan if no columns match the supported dtypes

* Allow being passed a predefined BN structure and add logging

* Allow passing scorers to ML Efficacy metrics

* Improve ML Detection and Efficacy pipelines

Co-authored-by: Felipe Alex Hofmann <[email protected]>

* Add metadata argument (#25)

* Added conda support

* Fixed typo

* Empty commit

* Fixes readme mistake (#22)

* Add optional metadata dict argument to all metrics

* Fix KSTestExtended

Co-authored-by: Felipe Alex Hofmann <[email protected]>

* Dynamic MultiSingleTable metrics (#26)

* Fix error when working on integer only data

* Allow passing an entire serialized BN instead of just the structure

* Allow defining MST metrics by passing the ST metric as an argument

* Fix lint

* Organize Imports and add Generic MLEfficacy and get_subclasses (#27)

* Organize and standardize imports across all the project

* Increase sample size to make tests more stable

* Update readme and docs (#28)

* Add method to load demo data

* Make get_subclasses return only usable metrics and skip parents

* Add READMEs

* Add DAI Logo and move SDV Logo to end

* Bump version: 0.1.0.dev0 → 0.1.0.dev1

* Allow passing non dict metadata

* Add demos by data modality and compute_metrics function (#30)

* Bump version: 0.1.0.dev1 → 0.1.0.dev2

* Add documentation (#31)

* Update readme and add docstrings

* Add timeseries demo

* Update installation instructions

Co-authored-by: Felipe Alex Hofmann <[email protected]>
csala and fealho authored Dec 18, 2020
1 parent 5d9251b commit ab6c68b
Showing 102 changed files with 3,484 additions and 3,593 deletions.
60 changes: 60 additions & 0 deletions INSTALL.md
@@ -0,0 +1,60 @@
# Installing SDMetrics

## Requirements

**SDMetrics** has been developed and tested on [Python 3.6, 3.7 and 3.8](https://www.python.org/downloads/)

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **SDMetrics** is run.
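
For example, a minimal sketch of creating and activating such a virtualenv with the standard `venv` module (the environment name `sdmetrics-env` is only illustrative):

```bash
# Create and activate a virtualenv using the standard library venv module
python3 -m venv sdmetrics-env
source sdmetrics-env/bin/activate
```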

## Install with pip

The easiest and recommended way to install **SDMetrics** is using [pip](
https://pip.pypa.io/en/stable/):

```bash
pip install sdmetrics
```

This will pull and install the latest stable release from [PyPi](https://pypi.org/).

## Install with conda

**SDMetrics** can also be installed using [conda](https://docs.conda.io/en/latest/):

```bash
conda install -c sdv-dev -c conda-forge sdmetrics
```

This will pull and install the latest stable release from [Anaconda](https://anaconda.org/).

## Install from source

If you want to install **SDMetrics** from source you need to first clone the repository
and then execute the `make install` command inside the `stable` branch. Note that this
command works only on Unix based systems like GNU/Linux and macOS:

```bash
git clone https://github.com/sdv-dev/SDMetrics
cd SDMetrics
git checkout stable
make install
```

## Install for development

If you intend to modify the source code or contribute to the project you will need to
install it from the source using the `make install-develop` command. In this case, we
recommend you to branch from `master` first:

```bash
git clone [email protected]:sdv-dev/SDMetrics
cd SDMetrics
git checkout master
git checkout -b <your-branch-name>
make install-develop
```

For more details about how to contribute to the project please visit the [Contributing Guide](
CONTRIBUTING.rst).
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -3,6 +3,7 @@ include CONTRIBUTING.rst
include HISTORY.md
include LICENSE
include README.md
include sdmetrics/demos/*.pkl

recursive-include tests *
recursive-exclude * __pycache__
254 changes: 82 additions & 172 deletions README.md
@@ -1,6 +1,8 @@
<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt=“DAI-Lab” />
<i>An open source project from Data to AI Lab at MIT.</i>
<a href="https://dai.lids.mit.edu">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
</a>
<i>An Open Source Project from the <a href="https://dai.lids.mit.edu">Data to AI Lab, at MIT</a></i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
@@ -9,218 +11,126 @@
[![Tests](https://github.com/sdv-dev/SDMetrics/workflows/Run%20Tests/badge.svg)](https://github.com/sdv-dev/SDMetrics/actions?query=workflow%3A%22Run+Tests%22+branch%3Amaster)
[![Coverage Status](https://codecov.io/gh/sdv-dev/SDMetrics/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/SDMetrics)

<p>
<img width=15% src="docs/resources/header.png">
</p>
<img align="center" width=30% src="docs/resources/header.png">

Metrics for Synthetic Data Generation Projects

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV
* Repository: https://github.com/sdv-dev/SDMetrics
* License: [MIT](https://github.com/sdv-dev/SDMetrics/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
* Documentation: https://sdv-dev.github.io/SDMetrics
* Homepage: https://github.com/sdv-dev/SDMetrics

# Overview

The **SDMetrics** library provides a set of **dataset-agnostic tools** for evaluating the **quality of a synthetic database** by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:

- **Statistical metrics** which use statistical tests to compare the distributions of the real and synthetic distributions.
- **Detection metrics** which use machine learning to try to distinguish between real and synthetic data.
- **Descriptive metrics** which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.

The **SDMetrics** library provides a set of **dataset-agnostic tools** for evaluating the **quality
of a synthetic database** by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

* **Single Columns**: Compare 1 dimensional `numpy` arrays representing individual columns.
* **Column Pairs**: Compare how columns in a `pandas.DataFrame` relate to each other, in groups of 2.
* **Single Table**: Compare an entire table, represented as a `pandas.DataFrame`.
* **Multi Table**: Compare multi-table and relational datasets represented as a python `dict` with
  multiple tables passed as `pandas.DataFrame`s.
* **Time Series**: Compare tables representing ordered sequences of events.
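
To make the modalities more concrete, here is a small sketch of what the inputs look like in each case; the column and table names are only illustrative:

```python3
import numpy as np
import pandas as pd

# Single Column: a 1 dimensional numpy array per column
real_column = np.array([0.1, 0.4, 0.3, 0.9])
synthetic_column = np.array([0.2, 0.5, 0.2, 0.8])

# Column Pairs / Single Table: a pandas.DataFrame with one or more columns
real_table = pd.DataFrame({'age': [32, 45, 27], 'salary': [40000, 65000, 52000]})
synthetic_table = pd.DataFrame({'age': [30, 47, 25], 'salary': [42000, 61000, 50000]})

# Multi Table: a python dict mapping table names to pandas.DataFrames
real_tables = {'users': real_table}
synthetic_tables = {'users': synthetic_table}
```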

It includes a variety of metrics such as:

* **Statistical metrics** which use statistical tests to compare the distributions of the real
  and synthetic distributions.
* **Detection metrics** which use machine learning to try to distinguish between real and synthetic data.
* **Efficacy metrics** which compare the performance of machine learning models when run on the synthetic and real data.
* **Bayesian Network and Gaussian Mixture metrics** which learn the distribution of the real data
  and evaluate the likelihood of the synthetic data belonging to the learned distribution.
* **Privacy metrics** which evaluate whether the synthetic data is leaking information about the real data.
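
All of these families are implemented as metric classes that can be discovered programmatically. As a hedged sketch, assuming that every modality mirrors the Multi Table API shown later in this README (a base class exposing `get_subclasses()`), the available Single Table metrics could be listed like this:

```python3
import sdmetrics

# Assumption: mirrors sdmetrics.multi_table.MultiTableMetric.get_subclasses(),
# which is shown in the Standalone usage example below.
single_table_metrics = sdmetrics.single_table.SingleTableMetric.get_subclasses()
print(list(single_table_metrics))
```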

# Install

## Requirements

**SDMetrics** has been developed and tested on [Python 3.6, 3.7 and 3.8](https://www.python.org/downloads/)

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **SDMetrics** is run.

**SDMetrics** is part of the **SDV** project and is automatically installed alongside it. For
details about this process please visit the [SDV Installation Guide](
https://sdv.dev/SDV/getting_started/install.html)

Optionally, **SDMetrics** can also be installed as a standalone library using the following commands:

## Install with pip

The easiest and recommended way to install **SDMetrics** is using [pip](
https://pip.pypa.io/en/stable/):
**Using `pip`:**

```bash
pip install sdmetrics
```

This will pull and install the latest stable release from [PyPi](https://pypi.org/).

If you want to install from source or contribute to the project please read the
[Contributing Guide](https://sdv-dev.github.io/SDMetrics/contributing.html#get-started).

## Install with conda

**SDMetrics** can also be installed using [conda](https://docs.conda.io/en/latest/):
**Using `conda`:**

```bash
conda install -c sdv-dev -c conda-forge sdmetrics
```

This will pull and install the latest stable release from [Anaconda](https://anaconda.org/).

For more installation options please visit the [SDMetrics installation Guide](INSTALL.md)

# Basic Usage

Let's run the demo code from **SDV** to generate a simple synthetic dataset:

```python3
from sdv import load_demo, SDV

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)

synthetic_tables = sdv.sample_all(20)
```

Now that we have a synthetic dataset, we can evaluate it using **SDMetrics** by calling the `evaluate` function which returns an instance of `MetricsReport` with the default metrics:

```python3
from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
```

## Examining Metrics

This `report` object makes it easy to examine the metrics at different levels of granularity. For example, the `overall` method returns a single scalar value which functions as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g. to tune the hyperparameters in a model) and minimized in order to obtain higher quality synthetic data.

```python3
print(report.overall())
```

In addition, the `report` provides a `highlights` method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (i.e. which tables/columns/relationships are not being modeled properly).

```python3
print(report.highlights())
```

## Visualizing Metrics

Finally, the `report` object provides a `visualize` method which generates a figure showing some of the key metrics.

```python3
figure = report.visualize()
figure.savefig("sdmetrics-report.png")
```

<p align="center">
<img style="width:100%" src="docs/resources/visualize.png">
</p>

# Advanced Usage

## Specifying Metrics

Instead of running all the default metrics, you can specify exactly what metrics you
want to run by creating an empty `MetricsReport` and adding the metrics yourself. For
example, the following code only computes the machine learning detection-based metrics.

The `MetricsReport` object includes a `details` method which returns all of the
metrics that were computed.

```python3
from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))
```

## Creating Metrics

Suppose you want to add some new metrics to this library. To do this, you simply
need to write a function which yields instances of the `Metric` object:

```python3
from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"

    for table_name in metadata.get_tables():

        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = set([
            "priority:high",
            "table:%s" % table_name
        ])

        yield Metric(name, value, tags)
```

To attach your metrics to a `MetricsReport` object, you can use the `add_metrics`
method and provide your custom metrics iterator:

```python3
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))
```

See `sdmetrics.detection`, `sdmetrics.efficacy`, and `sdmetrics.statistical` for
more examples of how to implement metrics.

## Filtering Metrics

The `MetricsReport` object includes a `details` method which returns all of the
metrics that were computed.

```python3
from sdmetrics.report import MetricsReport

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()
```

To filter these metrics, you can provide a filter function. For example, to only
see metrics that are associated with the `users` table, you can run

```python3
def my_custom_filter(metric):
    if "table:users" in metric.tags:
        return True
    return False

report.details(my_custom_filter)
```

Examples of standard tags implemented by the built-in metrics are shown below.

<table>
<tr>
<th style="width:14em;">Tag</th>
<th>Description</th>
</tr>
<tr>
<td><code>priority:high</code></td>
<td>This tag tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values).</td>
</tr>
<tr>
<td><code>table:TABLE_NAME</code></td>
<td>This tag indicates that the metric involves the table specified by <code>TABLE_NAME</code>.</td>
</tr>
<tr>
<td><code>column:COL_NAME</code></td>
<td>This tag indicates that the metric involves the column specified by <code>COL_NAME</code>. If the column names are not unique across the entire database, then it needs to be combined with the <code>table:TABLE_NAME</code> tag to uniquely identify a specific column.</td>
</tr>
</table>

As this library matures, we will define additional standard tags and/or promote them to
first class attributes.

# What's next?

For more details about **SDMetrics** and all its possibilities and features, please check
the [documentation site](https://sdv-dev.github.io/SDMetrics/).

# Usage

**SDMetrics** is included as part of the framework offered by SDV to evaluate the quality of
your synthetic dataset. For more details about how to use it please visit the corresponding
User Guides:

* [Evaluating Single Table Data](https://sdv.dev/SDV/user_guides/single_table/evaluation.html)
* Evaluating Multi Table Data (Coming soon)
* Evaluating Time Series Data (Coming soon)

## Standalone usage

**SDMetrics** can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its
synthetic replica by running all the compatible multi-table metrics on it:

```python3
import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)
```

The output will be a table with all the details about the executed metrics and their score:

| metric | name | score | min_value | max_value | goal |
|------------------------------|-----------------------------------------|------------|-------------|-------------|----------|
| CSTest | Chi-Squared | 0.76651 | 0 | 1 | MAXIMIZE |
| KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.75 | 0 | 1 | MAXIMIZE |
| KSTestExtended | Inverted Kolmogorov-Smirnov D statistic | 0.777778 | 0 | 1 | MAXIMIZE |
| LogisticDetection | LogisticRegression Detection | 0.882716 | 0 | 1 | MAXIMIZE |
| SVCDetection | SVC Detection | 0.833333 | 0 | 1 | MAXIMIZE |
| BNLikelihood | BayesianNetwork Likelihood | nan | 0 | 1 | MAXIMIZE |
| BNLogLikelihood | BayesianNetwork Log Likelihood | nan | -inf | 0 | MAXIMIZE |
| LogisticParentChildDetection | LogisticRegression Detection | 0.619444 | 0 | 1 | MAXIMIZE |
| SVCParentChildDetection | SVC Detection | 0.916667 | 0 | 1 | MAXIMIZE |
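
The metrics listed in this table can also be run one at a time through the same `compute` API. A minimal sketch, assuming that the classes returned by `get_subclasses` accept the same arguments as `compute_metrics` above, and using `CSTest` (one of the names shown in the table) as an example:

```python3
# Run a single metric from the `metrics` dict obtained above.
score = metrics['CSTest'].compute(real_data, synthetic_data, metadata=metadata)
print(score)
```
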
# What's next?

If you want to read more about each individual metric, please visit the following folders:

* Single Column Metrics: [sdmetrics/single_column](sdmetrics/single_column)
* Single Table Metrics: [sdmetrics/single_table](sdmetrics/single_table)
* Multi Table Metrics: [sdmetrics/multi_table](sdmetrics/multi_table)

# The Synthetic Data Vault

<p>
<a href="https://sdv.dev">
<img width=30% src="https://github.com/sdv-dev/SDV/blob/master/docs/images/SDV-Logo-Color-Tagline.png?raw=true">
</a>
<p><i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a></i></p>
</p>

* Website: https://sdv.dev
* Documentation: https://sdv.dev/SDV
4 changes: 4 additions & 0 deletions conda/meta.yaml
@@ -1,5 +1,9 @@
{% set name = 'sdmetrics' %}
<<<<<<< HEAD
{% set version = '0.1.0.dev2' %}
=======
{% set version = '0.0.5.dev0' %}
>>>>>>> master

package:
name: "{{ name|lower }}"
Binary file added resources/visualize.png