Commit efbe9f9 (1 parent: 9eb87ec). Showing 1 changed file with 198 additions and 103 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,163 +1,258 @@ | ||
|
||
<p> | ||
<img src="./images/desbordante-preview.png"/> | ||
<img src="https://github.com/Mstrutov/Desbordante/assets/88928096/d687809b-5a3b-420e-a192-a1a2b6697b2a"/> | ||
</p> | ||
|
||
--- | ||
# FOR EDBT REVIEWERS | ||
|
||
If you are coming from our EDBT Industrial submission, please check out the `edbt` branch to access the following features, which are not merged into main yet: | ||
- Discovery of association rules using ECLAT and FP-Growth algorithms adapted from Christian Borgelt’s implementations | ||
- Discovery of conditional functional dependencies using the CTANE algorithm and its variations | ||
# General | ||
|
||
--- | ||
Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. The currently supported data patterns are: | ||
* Functional dependencies, both exact and approximate (discovery and validation) | ||
* Conditional functional dependencies (discovery) | ||
* Metric functional dependencies (validation) | ||
* Fuzzy algebraic constraints (discovery) | ||
* Association rules (discovery) | ||
|
||
# About | ||
The discovered patterns can have many uses: | ||
* For scientific data, especially data obtained experimentally, an interesting pattern allows one to formulate a hypothesis that could lead to a scientific discovery. In some cases, if there is enough data, it even allows one to draw conclusions immediately. At the very least, the found pattern can provide a direction for further study. | |
* For business data it is also possible to obtain a hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: cleaning up errors in data, finding and removing inexact duplicates, performing schema matching, and many more. | |
* For training data used in machine learning applications, the found patterns can help in feature engineering and in choosing the direction for an ablation study. | |
* For database data, found patterns can help with defining (recovering) primary and foreign keys and with setting up (checking) all kinds of integrity constraints. | |
|
||
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. The currently supported data patterns are: | ||
* Functional dependencies, both exact and approximate | ||
* Conditional functional dependencies | ||
* Association rules | ||
Desbordante can be used via three interfaces: | ||
* **Console application.** This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file. | ||
* **Python bindings.** Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns. | ||
* **Web application.** There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo. | ||
|
||
It also allows running data cleaning scenarios using these algorithms. At the moment, we have implemented a typo detection scenario using an exact and approximate functional dependency discovery algorithm. | |
A brief introduction to the tool and its use cases is presented [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian). Also, a list of various articles and guides can be found [here](https://desbordante.unidata-platform.ru/papers). | |
|
||
The algorithms of Desbordante are implemented in C++ to maximize the resulting performance. They can be run using either a console version or a web-application that features an easy-to-use web interface. | ||
# Console | ||
|
||
You can try the deployed version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that, due to high demand, various time and memory limits are enforced (and a task is killed if it goes outside of acceptable ranges). | |
Usage examples: | ||
1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used. | ||
|
||
A brief introduction to the tool and its use-cases is presented [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian). | |
```sh | ||
python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True | ||
``` | ||
|
||
# Installation guide | ||
```text | ||
( 1 3 ) -> 0 | ||
( 1 3 ) -> 2 | ||
( 0 ) -> 2 | ||
( 0 3 ) -> 1 | ||
( 2 ) -> 0 | ||
( 2 3 ) -> 1 | ||
``` | ||
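The output lines above can be turned into structured data for further processing. The following is a minimal sketch, assuming every line has the shape `( i j ... ) -> k` seen in the sample output; `parse_fd` and `FD_LINE` are hypothetical helpers and not part of Desbordante.

```python
import re

# Assumption: the line format is inferred from the sample output only;
# this is not an official Desbordante parser.
FD_LINE = re.compile(r"^\(\s*([\d\s]*?)\s*\)\s*->\s*(\d+)$")


def parse_fd(line):
    """Parse '( 1 3 ) -> 0' into ((1, 3), 0): LHS column indices, RHS column index."""
    match = FD_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an FD line: {line!r}")
    lhs = tuple(int(token) for token in match.group(1).split())
    return lhs, int(match.group(2))


# The six lines printed by the example above.
sample = """( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1"""

fds = [parse_fd(line) for line in sample.splitlines()]
print(fds)
```

Each parsed pair holds zero-based column indices, so it can be mapped back to column names from the header row of the .csv file.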
|
||
## Ubuntu | ||
The following instructions were tested on Ubuntu 18.04.4 LTS. | ||
### Dependencies | ||
Prior to cloning the repository and attempting to build the project, ensure that you have the following software: | ||
2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used. | ||
|
||
- GNU g++ compiler, version 10+ | ||
- CMake, version 3.13+ | ||
- Boost library, version 1.72.0+ | ||
```sh | ||
python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1 | ||
``` | ||
|
||
To use test datasets you will need: | ||
- Git Large File Storage, version 3.0.2+ | ||
```text | ||
( 0 ) -> 1 | ||
( 0 ) -> 2 | ||
( 1 ) -> 2 | ||
``` | ||
|
||
### Building the project | ||
Firstly, navigate to a desired directory. | ||
Then, clone the repository, cd into the project directory and launch the build script: | ||
3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used. | ||
|
||
```sh | ||
python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5 | ||
``` | ||
git clone https://github.com/Mstrutov/Desbordante/ | ||
cd Desbordante | ||
./build.sh | ||
|
||
```text | ||
True | ||
``` | ||
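To make the meaning of the check concrete: under the usual definition, a metric functional dependency holds when any two rows that agree on the left-hand side have right-hand side values no further apart than the given parameter. The sketch below illustrates this for a single numeric column; it is not the BRUTE algorithm itself, `mfd_holds` is a hypothetical function, and the rows are made up rather than taken from `theatres_mfd.csv`.

```python
from collections import defaultdict


def mfd_holds(rows, lhs, rhs, parameter):
    """Return True if, in every group of rows sharing the same LHS value,
    all RHS values lie within `parameter` of each other (1-D Euclidean metric)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    # For numbers on a line, the largest pairwise distance is max - min.
    return all(max(values) - min(values) <= parameter for values in groups.values())


# Made-up rows (title, theatre, duration); an assumption for illustration only.
rows = [
    ("Film A", "Cinema 1", 100),
    ("Film A", "Cinema 2", 103),
    ("Film B", "Cinema 1", 90),
    ("Film B", "Cinema 3", 94),
]

print(mfd_holds(rows, lhs=0, rhs=2, parameter=5))  # True: the spreads are 3 and 4
print(mfd_holds(rows, lhs=0, rhs=2, parameter=3))  # False: Film B spreads by 4
```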
|
||
For more information, consult the documentation and help files. | |
|
||
## Python bindings | ||
|
||
Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios: | ||
1) [Typo detection](https://colab.research.google.com/drive/1h5mQAIIxSb6Sgc_Ep8AYZlgt4BGXN6A9) | ||
2) [Data deduplication](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing) | ||
3) [Anomaly detection](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing) | ||
|
||
There is also an [interactive demo](https://desbordante.streamlit.app/) for all of them, and all of these Python scripts are available [here](https://github.com/Mstrutov/Desbordante/tree/main/examples). The ideas behind them are briefly discussed in this [preprint](https://arxiv.org/abs/2307.14935) (Section 3). | |
|
||
Simple usage examples: | ||
1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used. | ||
|
||
### Getting the sample data | ||
```python | ||
import desbordante | ||
|
||
It is possible to fetch additional csv tables to try out Desbordante functionality right away. | ||
It should happen automatically via Git LFS during repository cloning, but in case | ||
it doesn't, run the following script: | |
TABLE = '../examples/datasets/university_fd.csv' | ||
|
||
algo = desbordante.HyFD() | ||
algo.set_option('table', (TABLE, ',', True)) | ||
algo.set_option('is_null_equal_null') | ||
algo.load_data() | ||
algo.execute() | ||
result = algo.get_fds() | ||
print('FDs:') | ||
for fd in result: | ||
    print(fd) | |
``` | ||
./pull_datasets.sh | ||
```text | ||
FDs: | ||
( 1 3 ) -> 0 | ||
( 1 3 ) -> 2 | ||
( 0 ) -> 2 | ||
( 0 3 ) -> 1 | ||
( 2 ) -> 0 | ||
( 2 3 ) -> 1 | ||
``` | ||
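A discovered dependency such as `( 1 3 ) -> 0` says that rows agreeing on columns 1 and 3 also agree on column 0. A brute-force check of a single exact FD can be sketched as follows; `fd_holds` is a hypothetical helper and the rows are invented, so this illustrates the definition rather than how HyFD works internally.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if rows that agree on all `lhs` columns also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # setdefault stores the first RHS value seen for this LHS key and
        # returns the stored value on later visits.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Invented four-column table; not the contents of university_fd.csv.
rows = [
    ("Math", "Smith", "MATH", 1),
    ("Math", "Jones", "MATH", 2),
    ("Physics", "Smith", "PHYS", 2),
    ("Physics", "Brown", "PHYS", 1),
]

print(fd_holds(rows, lhs=(0,), rhs=2))    # True: column 0 determines column 2
print(fd_holds(rows, lhs=(1,), rhs=0))    # False: "Smith" maps to two values
print(fd_holds(rows, lhs=(1, 3), rhs=0))  # True: every (col 1, col 3) pair is unique
```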
|
||
### Launching the binaries | ||
The script generates the following file structure in `/path/to/Desbordante/build/target`: | ||
```bash | ||
├───input_data | ||
│ └───some-sample-csv\'s.csv | ||
├───Desbordante_test | ||
├───Desbordante_run | ||
2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used. | ||
|
||
```python | ||
import desbordante | ||
|
||
TABLE = '../examples/datasets/inventory_afd.csv' | ||
ERROR = 0.1 | ||
|
||
algo = desbordante.Pyro() | ||
algo.set_option('table', (TABLE, ',', True)) | ||
algo.set_option('is_null_equal_null') | ||
algo.load_data() | ||
algo.set_option('error', ERROR) | ||
algo.set_option('threads') | ||
algo.set_option('max_lhs') | ||
algo.set_option('seed') | ||
algo.execute() | ||
result = algo.get_fds() | ||
print('AFDs:') | ||
for fd in result: | ||
    print(fd) | |
``` | ||
The `input_data` directory contains several .csv files that may be used by `Desbordante_test`. Run `Desbordante_test` to perform unit testing: | ||
```text | ||
AFDs: | ||
( 0 ) -> 1 | ||
( 0 ) -> 2 | ||
( 1 ) -> 2 | ||
``` | ||
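The text above does not say how the error of an approximate FD is measured. One common choice is the fraction of ordered row pairs that agree on the left-hand side but differ on the right-hand side (often called g1). The sketch below computes that measure; whether Desbordante's Pyro uses exactly this definition and normalisation is an assumption, and `pair_error` is a hypothetical helper.

```python
from collections import Counter


def pair_error(rows, lhs, rhs):
    """Fraction of ordered row pairs (t1 != t2) that agree on the `lhs`
    columns but differ on `rhs`. Assumed error measure, see the note above."""
    n = len(rows)
    if n < 2:
        return 0.0
    lhs_counts = Counter(tuple(row[i] for i in lhs) for row in rows)
    both_counts = Counter((tuple(row[i] for i in lhs), row[rhs]) for row in rows)
    # A group of c identical keys contributes c * (c - 1) ordered pairs.
    agree_lhs = sum(c * (c - 1) for c in lhs_counts.values())
    agree_both = sum(c * (c - 1) for c in both_counts.values())
    return (agree_lhs - agree_both) / (n * (n - 1))


# Invented two-column table for illustration.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
error = pair_error(rows, lhs=(0,), rhs=1)
print(error)
```

Under this measure an exact FD has error 0, and a threshold such as 0.1 admits dependencies violated by at most 10% of ordered row pairs.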
cd build/target | ||
./Desbordante_test | ||
|
||
3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used. | ||
|
||
```python | ||
import desbordante | ||
|
||
TABLE = '../examples/datasets/theatres_mfd.csv' | ||
METRIC = 'euclidean' | ||
LHS_INDICES = [0] | ||
RHS_INDICES = [2] | ||
PARAMETER = 5 | ||
|
||
algo = desbordante.MetricVerifier() | ||
algo.set_option('table', (TABLE, ',', True)) | ||
algo.set_option('is_null_equal_null') | ||
algo.load_data() | ||
algo.set_option('lhs_indices', LHS_INDICES) | ||
algo.set_option('metric', METRIC) | ||
algo.set_option('parameter', PARAMETER) | ||
algo.set_option('dist_from_null_is_infinity') | ||
algo.set_option('rhs_indices', RHS_INDICES) | ||
algo.execute() | ||
if algo.mfd_holds(): | |
    print('MFD holds') | |
else: | |
    print('MFD does not hold') | |
``` | ||
The tool itself is launched via the following line: | ||
```text | ||
MFD holds | ||
``` | ||
./Desbordante_run --algo=tane --data=<dataset_name>.csv | ||
4) Discover approximate functional dependencies with various error thresholds. Here, we showcase the preferred approach to configuring algorithm options. Furthermore, we are using a pandas DataFrame to load data from a CSV file. | ||
```python-repl | ||
>>> import desbordante | ||
>>> import pandas as pd | ||
>>> pyro = desbordante.Pyro() | ||
>>> df = pd.read_csv('iris.csv', sep=',', header=0) | ||
>>> pyro.load_data(df) | ||
>>> pyro.execute(error=0.0) | ||
>>> pyro.get_fds() | ||
[( 0 1 2 ) -> 4, ( 0 2 3 ) -> 4, ( 0 1 3 ) -> 4, ( 1 2 3 ) -> 4] | ||
>>> pyro.execute(error=0.1) | ||
>>> pyro.get_fds() | ||
[( 2 ) -> 0, ( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 4, ( 2 ) -> 3, ( 3 ) -> 2, ( 3 ) -> 0, ( 0 ) -> 1, ( 0 ) -> 3, ( 1 ) -> 0, ( 1 ) -> 2, ( 3 ) -> 4, ( 3 ) -> 1, ( 1 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 4] | ||
>>> pyro.execute(error=0.2) | ||
>>> pyro.get_fds() | ||
[( 2 ) -> 1, ( 2 ) -> 0, ( 2 ) -> 4, ( 0 ) -> 2, ( 2 ) -> 3, ( 0 ) -> 1, ( 3 ) -> 4, ( 3 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 0, ( 1 ) -> 2, ( 0 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 0, ( 1 ) -> 4, ( 1 ) -> 3] | ||
>>> pyro.execute(error=0.3) | ||
>>> pyro.get_fds() | ||
[( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 0, ( 3 ) -> 0, ( 2 ) -> 3, ( 1 ) -> 0, ( 2 ) -> 4, ( 3 ) -> 2, ( 0 ) -> 1, ( 1 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 4, ( 0 ) -> 3, ( 4 ) -> 2, ( 4 ) -> 1, ( 0 ) -> 4, ( 1 ) -> 3, ( 1 ) -> 4, ( 4 ) -> 3] | ||
``` | ||
|
||
The `<dataset_name>.csv`, which is a user-provided dataset, should be placed in the `/path/to/Desbordante/build/target` directory. | ||
## Web interface | ||
|
||
## Windows | ||
--- | ||
**ATTENTION** | ||
While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically at interactive tasks. Such tasks typically involve multiple steps and require substantial user input on each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way. | |
|
||
Currently Windows is not supported. The last compatible version has the tag `windows-compatible`; | |
later versions might compile, but the likelihood of that declines drastically with each commit. | |
You can use the following guide to check out, build and run this particular version. | |
You can try the deployed web version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that, due to high demand, various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate [repo](https://github.com/vs9h/Desbordante). | |
|
||
--- | ||
The following instructions were tested on Windows 10. | |
## Build instructions | ||
|
||
## Ubuntu | ||
The following instructions were tested on Ubuntu 18.04.4 LTS. | ||
### Dependencies | ||
Prior to cloning the repository and attempting to build the project, ensure that you have the following software: | ||
|
||
- Microsoft Visual Studio 2019 | ||
- GNU g++ compiler, version 10+ | ||
- CMake, version 3.13+ | ||
- Boost library, version 1.65.1+ \ | ||
The recommended way to install Boost is by using [chocolatey](https://chocolatey.org/) | ||
- Boost library, version 1.74.0+ | ||
|
||
To use test datasets you will need: | ||
- Git Large File Storage, version 3.0.2+ | ||
|
||
### Building the project (first option: with tests) | ||
Firstly, navigate to a desired directory. | ||
Then, clone the repository, cd into the project directory and launch the build script: | ||
``` | ||
git clone https://github.com/Mstrutov/Desbordante/ | ||
cd Desbordante | ||
./pull_datasets.sh | ||
./build.sh | ||
``` | ||
|
||
### Building the project | ||
Firstly, launch the command prompt and navigate to a desired directory. | ||
### Building the project (second option: without tests) | ||
|
||
Firstly, navigate to a desired directory. | ||
Then, clone the repository, cd into the project directory and launch the build script: | ||
``` | ||
git clone https://github.com/Mstrutov/Desbordante/ | ||
cd Desbordante | ||
git checkout windows-compatible | ||
build.bat | ||
./build.sh --no-tests --no-unpack | ||
``` | ||
*Note:* to compile the project, the script uses a hard-coded path to the MSVC developer command prompt, which is located | |
by default at `C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat`. | ||
You should change the path in the script if it differs from the default one. | ||
|
||
### Launching the binaries | ||
The script generates the following file structure in `\path\to\Desbordante\build\target`: | ||
The script generates the following file structure in `/path/to/Desbordante/build/target`: | ||
```bash | ||
├───input_data | ||
│ └───some-sample-csv\'s.csv | ||
├───fdtester_test.exe | ||
├───fdtester_run.exe | ||
├───Desbordante_test | ||
├───Desbordante_run | ||
``` | ||
The `inputData` directory contains several .csv files that may be used by `fdtester_test`. Run `fdtester_test` to perform unit testing: | ||
The `input_data` directory contains several .csv files that may be used by `Desbordante_test`. Run `Desbordante_test` to perform unit testing: | ||
``` | ||
cd build\target | ||
fdtester_test.exe | ||
cd build/target | ||
./Desbordante_test | ||
``` | ||
The tool itself is launched via the following line: | ||
The tool itself may be run as follows: | |
``` | ||
fdtester_run.exe --algo=tane --data=<dataset_name>.csv | ||
./Desbordante_run --algo=tane --data=<path_to_dataset> | ||
``` | ||
|
||
The `<dataset_name>.csv`, which is a user-provided dataset, should be placed in the `\path\to\Desbordante\build\target` directory. | ||
|
||
# Developers | ||
|
||
Arthur Saliou — Fdep implementation | ||
|
||
Eduard Gaisin — Dep-Miner implementation | ||
|
||
Alexandr Smirnov — DFD implementation | ||
|
||
Ilya Shchuckin — FD_Mine implementation | ||
|
||
Michael Polyntsov — FastFDs implementation | ||
|
||
Ilya Vologin — core classes | ||
|
||
Maxim Strutovsky — team lead, Pyro & TANE implementation | ||
|
||
Nikita Bobrov — product owner, consult, papers | ||
|
||
Kirill Smirnov — product owner, code quality, infrastructure, consult | ||
## Cite | ||
If you use this software for research, please cite one of our papers: | ||
1) George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023). | ||
2) George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965. (2023). | ||
3) M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469. | ||
4) A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047. | ||
|
||
George Chernishev — product owner, consult, papers | ||
# Contacts and Q&A | ||
|
||
# Cite | ||
If you have any questions regarding tool usage, you can ask them in our [Google group](https://groups.google.com/g/desbordante). To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov. | |
|
||
If you use this software for research, please cite the paper (https://fruct.org/publications/volume-29/fruct29/files/Strut.pdf, https://ieeexplore.ieee.org/document/9435469) as follows: | ||
|
||
M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469. | ||
|
||
# Contacts | ||
|
||
[Email me at [email protected]](mailto:[email protected]) |