initial commit
mbodensohn committed May 7, 2023
0 parents commit ff72334
Showing 66 changed files with 10,073 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
/venv/
/.idea/
/__pycache__/
/.pytest_cache/

/models/
Empty file added LICENSE
Empty file.
100 changes: 100 additions & 0 deletions README.md
@@ -0,0 +1,100 @@
# WannaDB: Ad-hoc SQL Queries over Text Collections

![Document collection and corresponding table.](header_image.svg)

WannaDB allows users to explore unstructured text collections by automatically organizing the relevant information
nuggets in a table. It supports ad-hoc SQL queries over text collections using a novel two-phased approach. First, a
superset of information nuggets is extracted from the texts using existing extractors such as named entity recognizers.
The extractions are then interactively matched to a structured table definition as requested by the user.
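The two phases can be illustrated with a minimal sketch. This is not WannaDB's implementation: the regex-based `TOY_EXTRACTORS`, the `Nugget` class, and the label-equality `match_score` are hypothetical stand-ins for the named entity recognizers and the interactive matching described above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    """A candidate piece of information found in one document."""
    doc_id: int
    label: str  # coarse type assigned by the extractor (hypothetical labels)
    text: str


# Phase 1 stand-in: simple regexes instead of a real named entity recognizer.
# They deliberately over-generate (e.g. "number" also fires inside dates),
# mirroring the idea of extracting a superset of nuggets.
TOY_EXTRACTORS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "number": re.compile(r"\b\d+\b"),
    "name": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
}


def extract_nuggets(doc_id, text):
    """Phase 1: collect every candidate nugget from a single document."""
    nuggets = []
    for label, pattern in TOY_EXTRACTORS.items():
        for match in pattern.finditer(text):
            nuggets.append(Nugget(doc_id, label, match.group()))
    return nuggets


def match_score(attribute, nugget):
    """Phase 2 stand-in: 1.0 if the nugget label equals the attribute name.

    A real system would compare learned representations and refine the
    matching with user feedback; this toy score only compares strings.
    """
    return 1.0 if nugget.label == attribute else 0.0


def build_table(docs, attributes):
    """Fill one row per document with the best-scoring nugget per attribute."""
    rows = []
    for doc_id, text in enumerate(docs):
        nuggets = extract_nuggets(doc_id, text)
        row = {}
        for attribute in attributes:
            best = max(nuggets, key=lambda n: match_score(attribute, n), default=None)
            matched = best is not None and match_score(attribute, best) > 0
            row[attribute] = best.text if matched else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    example_docs = ["Jane Doe joined on 2021-03-01.", "No details available."]
    for row in build_table(example_docs, ["name", "date"]):
        print(row)
```

Cells for which no nugget matches stay empty (`None`), which is where interactive feedback would come in under this sketch's assumptions.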

## Usage

Run `main.py` to start the WannaDB GUI.

Auxiliary scripts are available in `scripts/` and in the experimentation repository.

## Installation

This project requires Python 3.9.

##### 1. Create a virtual environment.

```
python -m venv venv
source venv/bin/activate
export PYTHONPATH="."
```

##### 2. Install the dependencies.

```
pip install --upgrade pip
pip install --use-pep517 -r requirements.txt
pip install --use-pep517 pytest
```

You may have to install `torch` by hand if you want to use CUDA:

https://pytorch.org/get-started/locally/

##### 3. Run the tests.

```
pytest
```

## Citing WannaDB

The code in this repository accompanies several scientific publications:

```
@inproceedings{mci/Hättasch2023,
author = {Hättasch, Benjamin AND Bodensohn, Jan-Micha AND Vogel, Liane AND Urban, Matthias AND Binnig, Carsten},
title = {WannaDB: Ad-hoc SQL Queries over Text Collections},
booktitle = {BTW 2023},
year = {2023},
editor = {König-Ries, Birgitta AND Scherzinger, Stefanie AND Lehner, Wolfgang AND Vossen, Gottfried},
doi = {10.18420/BTW2023-08},
publisher = {Gesellschaft für Informatik e.V.},
address = {}
}
```

```
@inproceedings{10.1145/3514221.3520174,
author = {H\"{a}ttasch, Benjamin and Bodensohn, Jan-Micha and Binnig, Carsten},
title = {Demonstrating ASET: Ad-Hoc Structured Exploration of Text Collections},
year = {2022},
isbn = {9781450392495},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3514221.3520174},
doi = {10.1145/3514221.3520174},
abstract = {In this demo, we present ASET, a novel tool to explore the contents of unstructured data (text) by automatically transforming relevant parts into tabular form. ASET works in an ad-hoc manner without the need to curate extraction pipelines for the (unseen) text collection or to annotate large amounts of training data. The main idea is to use a new two-phased approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers. In a second step, it leverages embeddings and a novel matching strategy to match the extractions to a structured table definition as requested by the user. This demo features the ASET system with a graphical user interface that allows people without machine learning or programming expertise to explore text collections efficiently. This can be done in a self-directed and flexible manner, and ASET provides an intuitive impression of the result quality.},
booktitle = {Proceedings of the 2022 International Conference on Management of Data},
pages = {2393–2396},
numpages = {4},
keywords = {matching embeddings, text to table, interactive text exploration},
location = {Philadelphia, PA, USA},
series = {SIGMOD '22}
}
```

```
@article{Httasch2022ASETAS,
title={ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]},
author={Benjamin H{\"a}ttasch and Jan-Micha Bodensohn and Carsten Binnig},
journal={ArXiv},
year={2022},
volume={abs/2203.04663}
}
```

```
@inproceedings{Httasch2021WannaDBAS,
title={WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries},
author={Benjamin H{\"a}ttasch},
booktitle={Biennial Conference on Design of Experimental Search \& Information Retrieval Systems},
year={2021}
}
```
1,031 changes: 1,031 additions & 0 deletions header_image.svg
File content not displayed.
21 changes: 21 additions & 0 deletions main.py
@@ -0,0 +1,21 @@
import logging
import sys

from PyQt6.QtWidgets import QApplication

from wannadb.resources import ResourceManager
from wannadb_ui.main_window import MainWindow

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger()

if __name__ == "__main__":
    logger.info("Starting wannadb_ui.")

    with ResourceManager() as resource_manager:
        # set up PyQt application
        app = QApplication(sys.argv)

        window = MainWindow()

        sys.exit(app.exec())
45 changes: 45 additions & 0 deletions pyproject.toml
@@ -0,0 +1,45 @@
[build-system]
requires = ["setuptools>=42"]
build-backend = "setuptools.build_meta"

[project]
name = "wannadb"
version = "0.0.1"
authors = [
    { name = "Benjamin Hättasch" },
    { name = "Jan-Micha Bodensohn" },
    { name = "Liane Vogel" },
]
description = "WannaDB: Ad-hoc SQL Queries over Text Collections"
readme = "README.md"
license = { file = "LICENSE" }
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
]
dependencies = [
    "pymongo==3.12.1",
    "torch==1.10.0",
    "numpy==1.21.4",
    "pandas==1.3.4",
    "scipy==1.7.2",
    "stanza==1.3.0",
    "spacy==3.2.0",
    "sentence-transformers==2.1.0",
    "matplotlib==3.5.0",
    "seaborn==0.11.2",
    "scikit-learn==1.0.1",
    "transformers==4.12.5",
    "PyQt6==6.2.1",
    "sqlparse==0.4.2",
]

[project.urls]
"Homepage" = "https://github.com/DataManagementLab/wannadb"

[tool.setuptools]
packages = [
    "wannadb",
    "wannadb_parsql",
    "wannadb_ui",
]