0 parents
commit ff72334
Showing 66 changed files with 10,073 additions and 0 deletions.
@@ -0,0 +1,6 @@
/venv/
/.idea/
/__pycache__/
/.pytest_cache/

/models/
Empty file.
@@ -0,0 +1,100 @@
# WannaDB: Ad-hoc SQL Queries over Text Collections

[Figure: Document collection and corresponding table.]

WannaDB allows users to explore unstructured text collections by automatically organizing the relevant information
nuggets in a table. It supports ad-hoc SQL queries over text collections using a novel two-phased approach. First, a
superset of information nuggets is extracted from the texts using existing extractors such as named entity recognizers.
The extractions are then interactively matched to a structured table definition as requested by the user.

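The two-phased approach described above can be illustrated with a deliberately simplified sketch. It is not WannaDB's implementation: two regular expressions stand in for real extractors such as named entity recognizers, a fixed label lookup stands in for the interactive matching, and all names (`Nugget`, `extract_nuggets`, `match_nuggets`) are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Nugget:
    """A candidate piece of information found in a document (hypothetical type)."""
    text: str
    label: str  # coarse type assigned by the extractor, e.g. "NUMBER" or "NAME"


def extract_nuggets(document: str) -> List[Nugget]:
    """Phase 1: extract a superset of nuggets, independent of any query.

    Assumption: two regexes replace a real extractor such as an NER model.
    """
    nuggets = [Nugget(m.group(), "NUMBER") for m in re.finditer(r"\b\d+\b", document)]
    # Two or more consecutive capitalized words are treated as a name.
    name_pattern = r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"
    nuggets += [Nugget(m.group(), "NAME") for m in re.finditer(name_pattern, document)]
    return nuggets


def match_nuggets(nuggets: List[Nugget], schema: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Phase 2: fill one table row by matching nuggets to the requested attributes.

    Assumption: the first nugget with the wanted label wins; there is no user feedback.
    """
    row: Dict[str, Optional[str]] = {}
    for attribute, wanted_label in schema.items():
        row[attribute] = next((n.text for n in nuggets if n.label == wanted_label), None)
    return row


documents = [
    "Alice Smith joined the company in 2019.",
    "Bob Jones was appointed by the board in 2021.",
]
schema = {"person": "NAME", "year": "NUMBER"}  # the table definition requested by the user
table = [match_nuggets(extract_nuggets(doc), schema) for doc in documents]
```

Keeping extraction independent of the schema is the point of the split: phase 1 runs once per collection, while phase 2 can be repeated for every new table definition.
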
## Usage

Run `main.py` to start the WannaDB GUI.

There are also various auxiliary scripts in `scripts/` and the experimentation repository.

## Installation

This project requires Python 3.9.

##### 1. Create a virtual environment.

```
python -m venv venv
source venv/bin/activate
export PYTHONPATH="."
```

##### 2. Install the dependencies.

```
pip install --upgrade pip
pip install --use-pep517 -r requirements.txt
pip install --use-pep517 pytest
```

You may have to install `torch` by hand if you want to use CUDA:

https://pytorch.org/get-started/locally/

##### 3. Run the tests.

```
pytest
```

## Citing WannaDB

The code in this repository is the result of several scientific publications:

```
@inproceedings{mci/Hättasch2023,
author = {Hättasch, Benjamin AND Bodensohn, Jan-Micha AND Vogel, Liane AND Urban, Matthias AND Binnig, Carsten},
title = {WannaDB: Ad-hoc SQL Queries over Text Collections},
booktitle = {BTW 2023},
year = {2023},
editor = {König-Ries, Birgitta AND Scherzinger, Stefanie AND Lehner, Wolfgang AND Vossen, Gottfried} ,
doi = { 10.18420/BTW2023-08 },
publisher = {Gesellschaft für Informatik e.V.},
address = {}
}
```

```
@inproceedings{10.1145/3514221.3520174,
author = {H\"{a}ttasch, Benjamin and Bodensohn, Jan-Micha and Binnig, Carsten},
title = {Demonstrating ASET: Ad-Hoc Structured Exploration of Text Collections},
year = {2022},
isbn = {9781450392495},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3514221.3520174},
doi = {10.1145/3514221.3520174},
abstract = {In this demo, we present ASET, a novel tool to explore the contents of unstructured data (text) by automatically transforming relevant parts into tabular form. ASET works in an ad-hoc manner without the need to curate extraction pipelines for the (unseen) text collection or to annotate large amounts of training data. The main idea is to use a new two-phased approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers. In a second step, it leverages embeddings and a novel matching strategy to match the extractions to a structured table definition as requested by the user. This demo features the ASET system with a graphical user interface that allows people without machine learning or programming expertise to explore text collections efficiently. This can be done in a self-directed and flexible manner, and ASET provides an intuitive impression of the result quality.},
booktitle = {Proceedings of the 2022 International Conference on Management of Data},
pages = {2393–2396},
numpages = {4},
keywords = {matching embeddings, text to table, interactive text exploration},
location = {Philadelphia, PA, USA},
series = {SIGMOD '22}
}
```

```
@article{Httasch2022ASETAS,
title={ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]},
author={Benjamin H{\"a}ttasch and Jan-Micha Bodensohn and Carsten Binnig},
journal={ArXiv},
year={2022},
volume={abs/2203.04663}
}
```

```
@inproceedings{Httasch2021WannaDBAS,
title={WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries},
author={Benjamin H{\"a}ttasch},
booktitle={Biennial Conference on Design of Experimental Search \& Information Retrieval Systems},
year={2021}
}
```
@@ -0,0 +1,21 @@
import logging
import sys

from PyQt6.QtWidgets import QApplication

from wannadb.resources import ResourceManager
from wannadb_ui.main_window import MainWindow

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger()

if __name__ == "__main__":
    logger.info("Starting wannadb_ui.")

    with ResourceManager() as resource_manager:
        # set up PyQt application
        app = QApplication(sys.argv)

        window = MainWindow()

        sys.exit(app.exec())
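
The entry point above wraps the GUI in `with ResourceManager() as resource_manager:`, which relies on Python's context-manager protocol. The diff does not show how `wannadb.resources.ResourceManager` works internally, so the following is only a generic sketch of the pattern; the class `DemoResourceManager` and its `load` and `loaded` methods are hypothetical and not part of WannaDB.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


class DemoResourceManager:
    """Hypothetical stand-in; NOT the real wannadb.resources.ResourceManager."""

    def __init__(self) -> None:
        self._resources: Dict[str, Any] = {}

    def __enter__(self) -> "DemoResourceManager":
        logger.info("Entered the resource manager.")
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Runs on normal exit and when an exception propagates out of the block,
        # including the SystemExit raised by sys.exit().
        self._resources.clear()
        logger.info("Released all resources.")

    def load(self, name: str, factory: Callable[[], Any]) -> Any:
        """Create a resource on first use and return the cached object afterwards."""
        if name not in self._resources:
            self._resources[name] = factory()
        return self._resources[name]

    def loaded(self) -> List[str]:
        """Return the names of all currently held resources."""
        return sorted(self._resources)


with DemoResourceManager() as manager:
    model = manager.load("embedding-model", lambda: object())
    same_model = manager.load("embedding-model", lambda: object())
    loaded_inside = manager.loaded()

loaded_after = manager.loaded()  # empty again, because __exit__ cleared the cache
```

Because `__exit__` also runs when `sys.exit(...)` is called inside the block, cleanup still happens when the application's event loop ends.
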
@@ -0,0 +1,45 @@
[build-system]
requires = ["setuptools>=42"]
build-backend = "setuptools.build_meta"

[project]
name = "wannadb"
version = "0.0.1"
authors = [
    { name = "Benjamin Hättasch" },
    { name = "Jan-Micha Bodensohn" },
    { name = "Liane Vogel" },
]
description = "WannaDB: Ad-hoc SQL Queries over Text Collections"
readme = "README.md"
license = { file = "LICENSE" }
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
]
dependencies = [
    "pymongo==3.12.1",
    "torch==1.10.0",
    "numpy==1.21.4",
    "pandas==1.3.4",
    "scipy==1.7.2",
    "stanza==1.3.0",
    "spacy==3.2.0",
    "sentence-transformers==2.1.0",
    "matplotlib==3.5.0",
    "seaborn==0.11.2",
    "scikit-learn==1.0.1",
    "transformers==4.12.5",
    "PyQt6==6.2.1",
    "sqlparse==0.4.2",
]

[project.urls]
"Homepage" = "https://github.com/DataManagementLab/wannadb"

[tool.setuptools]
packages = [
    "wannadb",
    "wannadb_parsql",
    "wannadb_ui",
]
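
Since the dependencies above are pinned to exact versions, it can help to check whether an existing environment matches them. The following is a minimal sketch using only the standard library's `importlib.metadata`; the helper names (`installed_version`, `find_mismatches`) are hypothetical, and `PINS` copies just three of the pins for brevity.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# A few pins copied from the dependency list; extend as needed.
PINS = {"numpy": "1.21.4", "pandas": "1.3.4", "sqlparse": "0.4.2"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Map every unsatisfied pin to the version actually found (None if missing)."""
    mismatches: Dict[str, Optional[str]] = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = found
    return mismatches


if __name__ == "__main__":
    for package, found in find_mismatches(PINS).items():
        print(f"{package}: expected {PINS[package]}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.
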