This is the code base for our paper *DocSCAN: Unsupervised Text Classification via Learning from Neighbors*, accepted at KONVENS 2022.
After a major code refactoring, the whole repo should now be much more user-friendly!
Assuming Anaconda and Linux, the environment can be set up with the following commands:
```bash
conda create -n scan python=3.6
conda activate scan
pip install -U sentence-transformers
conda install faiss-cpu -c pytorch
pip install -r requirements.txt
```
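As a quick sanity check that the key dependencies installed correctly, the following minimal sketch embeds a sentence and builds a faiss index (the model name here is just an example, not necessarily the one used in the repo):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Embed a test sentence (any sentence-transformers model works; this one is an example)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["hello world"])

# Build a flat L2 faiss index over the embeddings and add them
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
print("faiss index size:", index.ntotal)
```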
Run DocSCAN with:

```bash
PYTHONPATH=src python src/DocSCAN.py --infile 20newsgroup_sample.txt --outpath 20newsgroup --num_classes 20
```
Here, --infile is a file with one sentence per line, --outpath is the output directory, and --num_classes is the number of clusters. The main output is a CSV file in the output directory called docscan_clusters.csv with the columns sentence, clusters and probabilities.
Other outputs are:
- prototypical_examples_by_clusters.txt (the 10 most likely sentences for each cluster)
- word clouds for each cluster in the folder outpath/wordclouds
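For example, the cluster assignments can be inspected with pandas; this is a minimal sketch based on the output description above (it assumes probabilities is a numeric column):

```python
import pandas as pd

# Load the main DocSCAN output (columns: sentence, clusters, probabilities)
df = pd.read_csv("20newsgroup/docscan_clusters.csv")

# How many sentences were assigned to each cluster?
print(df["clusters"].value_counts())

# The single most confident sentence per cluster
print(df.sort_values("probabilities", ascending=False).groupby("clusters").head(1))
```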
If no number of classes is provided, the program automatically determines the number of clusters using an adapted elbow method from yellowbrick. In that case, --min_clusters, --max_clusters and --stepsize should be provided. Note, however, that this is rather experimental!
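For reference, this is a minimal sketch of yellowbrick's standard elbow method, not the repo's adapted variant; the random embeddings and k-range are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Placeholder embeddings; DocSCAN would use sentence-transformer vectors instead
X = np.random.rand(500, 384)

# Fit k-means for each k in the given range and pick the elbow point
visualizer = KElbowVisualizer(KMeans(), k=(2, 20))
visualizer.fit(X)
print("suggested number of clusters:", visualizer.elbow_value_)
```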
The input can also be a CSV file readable as a pandas dataframe with a column "sentence". If this is more convenient, run with:

```bash
PYTHONPATH=src python src/DocSCAN.py --infile 20newsgroup_sample.csv --outpath 20newsgroup --num_classes 20 --data_format from_csv
```
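A minimal sketch of creating such a CSV with pandas (the file name and sentences are placeholders):

```python
import pandas as pd

# Any dataframe with a "sentence" column works as input
df = pd.DataFrame({"sentence": [
    "The rocket launch was delayed due to weather.",
    "The team won the championship last night.",
]})
df.to_csv("20newsgroup_sample.csv", index=False)
```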
To replicate the results from the paper, run:

```bash
PYTHONPATH=src python src/DocSCAN_paper_replication.py --path 20_newsgroup
```
The --path directory needs to contain two files, train.jsonl and test.jsonl, where each line is a JSON dictionary containing the keys "text" and "label". Have a look at scripts/setup_data.sh to see how these files were created.
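A minimal sketch of writing files in this format (the texts and labels are placeholders):

```python
import json

# Each line is one JSON dictionary with "text" and "label" keys
examples = [
    {"text": "NASA confirmed the launch window.", "label": "sci.space"},
    {"text": "The Bulls beat the Knicks 98-90.", "label": "rec.sport.basketball"},
]

with open("20_newsgroup/train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```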
To run the k-means baseline, run:

```bash
PYTHONPATH=src python src/kmeans.py --path 20_newsgroup
```
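For intuition, here is a minimal sketch of what a k-means baseline over sentence embeddings looks like (not the repo's exact implementation; the model name and sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = ["NASA confirmed the launch window.", "The Bulls beat the Knicks 98-90."]

# Embed the documents, then cluster the embeddings with k-means
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

kmeans = KMeans(n_clusters=2, random_state=42).fit(embeddings)
print(kmeans.labels_)
```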
If anything does not work or is unclear, please don't hesitate to contact the authors:
- Dominik Stammbach ([email protected])