Update README.md
chernishev authored and BUYT-1 committed Dec 2, 2023
1 parent 9eb87ec commit 0a02033
Showing 1 changed file with 198 additions and 103 deletions.
301 changes: 198 additions & 103 deletions README.md
@@ -1,163 +1,258 @@

<p>
<img src="./images/desbordante-preview.png"/>
<img src="https://github.com/Mstrutov/Desbordante/assets/88928096/d687809b-5a3b-420e-a192-a1a2b6697b2a"/>
</p>

---
# FOR EDBT REVIEWERS

If you are coming from our EDBT Industrial submission, please check out the `edbt` branch to access the following features, which are not merged into main yet:
- Discovery of association rules using ECLAT and FP-Growth algorithms adapted from Christian Borgelt’s implementations
- Discovery of conditional functional dependencies using the CTANE algorithm and its variations
# General

---
Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. The currently supported data patterns are:
* Functional dependencies, both exact and approximate (discovery and validation)
* Conditional functional dependencies (discovery)
* Metric functional dependencies (validation)
* Fuzzy algebraic constraints (discovery)
* Association rules (discovery)

# About
The discovered patterns can have many uses:
* For scientific data, especially data obtained experimentally, an interesting pattern makes it possible to formulate a hypothesis that could lead to a scientific discovery. In some cases it even allows conclusions to be drawn immediately, if there is enough data. At the very least, the found pattern can provide a direction for further study.
* For business data, it is also possible to obtain a hypothesis based on the found patterns. However, there are more down-to-earth and more in-demand applications in this case: cleaning errors in data, finding and removing inexact duplicates, performing schema matching, and many more.
* For training data used in machine learning applications, the found patterns can help in feature engineering and in choosing the direction for an ablation study.
* For database data, the found patterns can help with defining (recovering) primary and foreign keys and with setting up (checking) all kinds of integrity constraints.

Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. The currently supported data patterns are:
* Functional dependencies, both exact and approximate
* Conditional functional dependencies
* Association rules
Desbordante can be used via three interfaces:
* **Console application.** This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file.
* **Python bindings.** Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns.
* **Web application.** There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo.

It also allows running data cleaning scenarios using these algorithms. At the moment, we have implemented a typo detection scenario that uses exact and approximate functional dependency discovery algorithms.
A brief introduction to the tool and its use cases is presented [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian). Also, a list of various articles and guides can be found [here](https://desbordante.unidata-platform.ru/papers).

The algorithms of Desbordante are implemented in C++ to maximize the resulting performance. They can be run using either a console version or a web-application that features an easy-to-use web interface.
# Console

You can try the deployed version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that due to a large demand various time and memory limits are enforced (and a task is killed if it goes outside of acceptable ranges).
Usage examples:
1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used.

A brief introduction to the tool and its use cases is presented [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian).
```sh
python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
```

# Installation guide
```text
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
```
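
The printed dependencies use column indices, with the left-hand side in parentheses and the right-hand side after the arrow. If the output needs further processing, it can be turned into plain Python tuples. The following is a minimal sketch, assuming the output format is exactly the one shown above; the helper names `parse_fd` and `parse_fds` are hypothetical and are not part of Desbordante.

```python
import re

# Matches one line of the form "( 1 3 ) -> 0": space-separated column
# indices in parentheses, an arrow, then a single right-hand-side index.
FD_LINE = re.compile(r"^\(\s*((?:\d+\s+)*)\)\s*->\s*(\d+)$")


def parse_fd(line):
    """Return (lhs, rhs), where lhs is a tuple of column indices."""
    match = FD_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an FD line: {line!r}")
    lhs = tuple(int(index) for index in match.group(1).split())
    return lhs, int(match.group(2))


def parse_fds(text):
    """Parse every non-empty line of the output into (lhs, rhs) pairs."""
    return [parse_fd(line) for line in text.splitlines() if line.strip()]


# A few lines copied from the example output above.
output = """\
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
"""

for lhs, rhs in parse_fds(output):
    print(lhs, "->", rhs)
```

Keeping the left-hand side as a tuple makes the pairs hashable, so they can be stored in a set and compared across runs.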

## Ubuntu
The following instructions were tested on Ubuntu 18.04.4 LTS.
### Dependencies
Prior to cloning the repository and attempting to build the project, ensure that you have the following software:
2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.

- GNU g++ compiler, version 10+
- CMake, version 3.13+
- Boost library, version 1.72.0+
```sh
python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
```

To use test datasets you will need:
- Git Large File Storage, version 3.0.2+
```text
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
```
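
To give an intuition for what an error threshold such as 0.1 means, the sketch below computes the g1 error of a candidate dependency: the fraction of ordered pairs of distinct rows that agree on the left-hand side but differ on the right-hand side. This is an illustration only. It assumes the g1 definition normalized by `n * n - n`, which may differ from what Desbordante computes internally, and both the function name `g1_error` and the sample rows are made up for the example.

```python
from collections import Counter, defaultdict


def g1_error(rows, lhs, rhs):
    """Fraction of ordered row pairs that agree on lhs but differ on rhs."""
    n = len(rows)
    if n < 2:
        return 0.0
    # For every distinct lhs value, count how often each rhs value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        groups[key][row[rhs]] += 1
    violating = 0
    for counts in groups.values():
        size = sum(counts.values())
        # All ordered pairs in the group minus the pairs that agree on rhs
        # (self-pairs appear in both terms and cancel out).
        violating += size * size - sum(c * c for c in counts.values())
    return violating / (n * n - n)


# Hypothetical data: column 0 almost determines column 1.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
error = g1_error(rows, lhs=[0], rhs=1)
print(f"error = {error:.3f}, holds at 0.1: {error <= 0.1}")
```

A dependency is reported as approximate when its error does not exceed the chosen threshold, so an exact dependency is the special case with error 0.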

### Building the project
Firstly, navigate to a desired directory.
Then, clone the repository, cd into the project directory and launch the build script:
3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.

```sh
python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
```
git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
./build.sh

```text
True
```
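
A metric functional dependency relaxes the exact one: rows that agree on the left-hand side may differ on the right-hand side, as long as the distance between their right-hand-side values does not exceed the parameter. The brute-force check below is a minimal sketch of that definition for the Euclidean metric. It is not Desbordante's implementation, the function name `mfd_holds_brute_force` is hypothetical, and the sample rows are invented rather than taken from `theatres_mfd.csv`.

```python
import math
from collections import defaultdict
from itertools import combinations


def mfd_holds_brute_force(rows, lhs, rhs, parameter):
    """Check an MFD by comparing every pair of rows that agree on lhs."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        groups[key].append(tuple(float(row[i]) for i in rhs))
    for points in groups.values():
        for p, q in combinations(points, 2):
            # Euclidean distance between the two right-hand-side values.
            if math.dist(p, q) > parameter:
                return False
    return True


# Hypothetical rows: (title, theatre, duration in minutes).
rows = [
    ("Don Quixote", "Theatre A", 139),
    ("Don Quixote", "Theatre B", 135),
    ("Hamlet", "Theatre A", 242),
    ("Hamlet", "Theatre B", 240),
]

print(mfd_holds_brute_force(rows, lhs=[0], rhs=[2], parameter=5))
print(mfd_holds_brute_force(rows, lhs=[0], rhs=[2], parameter=3))
```

The pairwise comparison is quadratic in the size of each group, which is acceptable for a sketch but is the reason faster validation strategies exist for large tables.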

For more information, consult the documentation and help files.

### Getting the sample data
## Python bindings

It is possible to fetch additional CSV tables to try out Desbordante functionality right away.
They should be fetched automatically via Git LFS during repository cloning, but if that
does not happen, run the following script:
Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios:
1) [Typo detection](https://colab.research.google.com/drive/1h5mQAIIxSb6Sgc_Ep8AYZlgt4BGXN6A9)
2) [Data deduplication](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing)
3) [Anomaly detection](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing)

There is also an [interactive demo](https://desbordante.streamlit.app/) for all of them, and all of these Python scripts are available [here](https://github.com/Mstrutov/Desbordante/tree/main/examples). The ideas behind them are briefly discussed in this [preprint](https://arxiv.org/abs/2307.14935) (Section 3).

Simple usage examples:
1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the FD discovery algorithm HyFD is used.

```python
import desbordante

TABLE = '../examples/datasets/university_fd.csv'

algo = desbordante.HyFD()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
```
./pull_datasets.sh
```text
FDs:
( 1 3 ) -> 0
( 1 3 ) -> 2
( 0 ) -> 2
( 0 3 ) -> 1
( 2 ) -> 0
( 2 3 ) -> 1
```

### Launching the binaries
The script generates the following file structure in `/path/to/Desbordante/build/target`:
```bash
├───input_data
│ └───some-sample-csv\'s.csv
├───Desbordante_test
├───Desbordante_run
2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used.

```python
import desbordante

TABLE = '../examples/datasets/inventory_afd.csv'
ERROR = 0.1

algo = desbordante.Pyro()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.set_option('error', ERROR)
algo.set_option('threads')
algo.set_option('max_lhs')
algo.set_option('seed')
algo.execute()
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
```
The `input_data` directory contains several .csv files that may be used by `Desbordante_test`. Run `Desbordante_test` to perform unit testing:
```text
AFDs:
( 0 ) -> 1
( 0 ) -> 2
( 1 ) -> 2
```
cd build/target
./Desbordante_test

3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.

```python
import desbordante

TABLE = '../examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5

algo = desbordante.MetricVerifier()
algo.set_option('table', (TABLE, ',', True))
algo.set_option('is_null_equal_null')
algo.load_data()
algo.set_option('lhs_indices', LHS_INDICES)
algo.set_option('metric', METRIC)
algo.set_option('parameter', PARAMETER)
algo.set_option('dist_from_null_is_infinity')
algo.set_option('rhs_indices', RHS_INDICES)
algo.execute()
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
```
The tool itself is launched via the following line:
```text
MFD holds
```
./Desbordante_run --algo=tane --data=<dataset_name>.csv
4) Discover approximate functional dependencies with various error thresholds. Here, we showcase the preferred approach to configuring algorithm options. Furthermore, we are using a pandas DataFrame to load data from a CSV file.
```python-repl
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.Pyro()
>>> df = pd.read_csv('iris.csv', sep=',', header=0)
>>> pyro.load_data(df)
>>> pyro.execute(error=0.0)
>>> pyro.get_fds()
[( 0 1 2 ) -> 4, ( 0 2 3 ) -> 4, ( 0 1 3 ) -> 4, ( 1 2 3 ) -> 4]
>>> pyro.execute(error=0.1)
>>> pyro.get_fds()
[( 2 ) -> 0, ( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 4, ( 2 ) -> 3, ( 3 ) -> 2, ( 3 ) -> 0, ( 0 ) -> 1, ( 0 ) -> 3, ( 1 ) -> 0, ( 1 ) -> 2, ( 3 ) -> 4, ( 3 ) -> 1, ( 1 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 4]
>>> pyro.execute(error=0.2)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 2 ) -> 0, ( 2 ) -> 4, ( 0 ) -> 2, ( 2 ) -> 3, ( 0 ) -> 1, ( 3 ) -> 4, ( 3 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 0, ( 1 ) -> 2, ( 0 ) -> 3, ( 0 ) -> 4, ( 1 ) -> 0, ( 1 ) -> 4, ( 1 ) -> 3]
>>> pyro.execute(error=0.3)
>>> pyro.get_fds()
[( 2 ) -> 1, ( 0 ) -> 2, ( 2 ) -> 0, ( 3 ) -> 0, ( 2 ) -> 3, ( 1 ) -> 0, ( 2 ) -> 4, ( 3 ) -> 2, ( 0 ) -> 1, ( 1 ) -> 2, ( 3 ) -> 1, ( 3 ) -> 4, ( 0 ) -> 3, ( 4 ) -> 2, ( 4 ) -> 1, ( 0 ) -> 4, ( 1 ) -> 3, ( 1 ) -> 4, ( 4 ) -> 3]
```

The `<dataset_name>.csv`, which is a user-provided dataset, should be placed in the `/path/to/Desbordante/build/target` directory.
## Web interface

## Windows
---
**ATTENTION**
While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically at interactive tasks. Such tasks typically involve multiple steps and require substantial user input at each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, and employ advanced visualizations and filters, all in a convenient way.

Currently, Windows is not supported. The last compatible version is tagged `windows-compatible`;
later versions might compile, but the likelihood of that declines drastically with each commit.
You can use the following guide to check out, build and run this particular version.
You can try the deployed web version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that due to a large demand various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate [repo](https://github.com/vs9h/Desbordante).

---
The following instructions were tested on Windows 10.
## Build instructions

## Ubuntu
The following instructions were tested on Ubuntu 20.04+ LTS.
### Dependencies
Prior to cloning the repository and attempting to build the project, ensure that you have the following software:

- Microsoft Visual Studio 2019
- GNU g++ compiler, version 10+
- CMake, version 3.13+
- Boost library, version 1.65.1+ \
The recommended way to install Boost is by using [chocolatey](https://chocolatey.org/)
- Boost library, version 1.74.0+

### Building the project
Firstly, launch the command prompt and navigate to a desired directory.
To use test datasets you will need:
- Git Large File Storage, version 3.0.2+

### Building the project (first option: with tests)
Firstly, navigate to a desired directory.
Then, clone the repository, cd into the project directory and launch the build script:
```
git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
./pull_datasets.sh
./build.sh
```

### Building the project (second option: without tests)

Firstly, navigate to a desired directory.
Then, clone the repository, cd into the project directory and launch the build script:
```
git clone https://github.com/Mstrutov/Desbordante/
cd Desbordante
git checkout windows-compatible
build.bat
./build.sh --no-tests --no-unpack
```
*Note:* to compile the project, the script uses a hard-coded path to the MSVC developer command prompt, which is located
by default at `C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat`.
You should change the path in the script if it differs from the default one.

### Launching the binaries
The script generates the following file structure in `\path\to\Desbordante\build\target`:
The script generates the following file structure in `/path/to/Desbordante/build/target`:
```bash
├───input_data
│ └───some-sample-csv\'s.csv
├───fdtester_test.exe
├───fdtester_run.exe
├───Desbordante_test
├───Desbordante_run
```
The `inputData` directory contains several .csv files that may be used by `fdtester_test`. Run `fdtester_test` to perform unit testing:
The `input_data` directory contains several .csv files that may be used by `Desbordante_test`. Run `Desbordante_test` to perform unit testing:
```
cd build\target
fdtester_test.exe
cd build/target
./Desbordante_test
```
The tool itself is launched via the following line:
The tool itself can be run as follows:
```
fdtester_run.exe --algo=tane --data=<dataset_name>.csv
./Desbordante_run --algo=tane --data=<path_to_dataset>
```

The `<dataset_name>.csv`, which is a user-provided dataset, should be placed in the `\path\to\Desbordante\build\target` directory.

# Developers

Arthur Saliou &mdash; Fdep implementation

Eduard Gaisin &mdash; Dep-Miner implementation

Alexandr Smirnov &mdash; DFD implementation

Ilya Shchuckin &mdash; FD_Mine implementation

Michael Polyntsov &mdash; FastFDs implementation

Ilya Vologin &mdash; core classes

Maxim Strutovsky &mdash; team lead, Pyro & TANE implementation

Nikita Bobrov &mdash; product owner, consult, papers

Kirill Smirnov &mdash; product owner, code quality, infrastructure, consult
## Cite
If you use this software for research, please cite one of our papers:
1) George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023).
2) George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965 (2023).
3) M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.
4) A. Smirnov, A. Chizhov, I. Shchuckin, N. Bobrov and G. Chernishev, "Fast Discovery of Inclusion Dependencies with Desbordante," 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 2023, pp. 264-275, doi: 10.23919/FRUCT58615.2023.10143047.

George Chernishev &mdash; product owner, consult, papers
# Contacts and Q&A

# Cite
If you have any questions regarding the usage of the tool, you can ask them in our [Google group](https://groups.google.com/g/desbordante). To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.

If you use this software for research, please cite the paper (https://fruct.org/publications/volume-29/fruct29/files/Strut.pdf, https://ieeexplore.ieee.org/document/9435469) as follows:

M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469.

# Contacts

[Email me at [email protected]](mailto:[email protected])
