This is a library for clustering documents in an unsupervised fashion.
Check out the 4-minute video explanation here and the paper here.
This extends the work done in doc-clustering to use visual features of documents!
conda env create --name doc-clustering --file=doc-clustering.yml
Once you've created the environment, you can activate it using:
conda activate doc-clustering
If you're using an M1 (Apple Silicon), you'll need Miniforge to use TensorFlow: https://developer.apple.com/metal/tensorflow-plugin/
Alternatively, you can also create your own, fresh environment:
conda create --name doc-clustering python=3.8
Then run the script and manually install any dependencies it reports as missing:
python clustering.py -h
Datasets are available for download here and should be stored with the following directory structure and names:
datasets/rvl-cdip/
datasets/sroie2019/
Models are available for download here and should be stored with the following directory structure and names:
finetuned_models/finetuned_related_lmv1/
finetuned_models/finetuned_unrelated_lmv2/
Prepared document embeddings and experiment results are available here and should be stored under the following directory name:
results/
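Before running experiments, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory list simply mirrors the paths named in this README.

```python
from pathlib import Path

# Directories named in this README; adjust the list if your layout differs.
EXPECTED_DIRS = [
    "datasets/rvl-cdip",
    "datasets/sroie2019",
    "finetuned_models/finetuned_related_lmv1",
    "finetuned_models/finetuned_unrelated_lmv2",
    "results",
]


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected directories are present.")
```

Run it from the repository root. It only checks that the directories exist, not that their contents are complete.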
Run one of the commands from EXPERIMENTS.md, or run python clustering.py --help for example usage.
Add the --debug flag to get interactive visualizations as well. Example commands:
mkdir -p results/sroie2019/resnet/
python clustering.py -p datasets/sroie2019/ \
-r resnet \
-o results/sroie2019/resnet/ \
--debug
mkdir -p results/sroie2019/alexnet/
python clustering.py -p datasets/sroie2019/ \
-r alexnet \
-o results/sroie2019/alexnet/ \
--debug
mkdir -p results/sroie2019/layoutlm_base/cls_token/
python clustering.py -p datasets/sroie2019/ \
-r layoutlm_base \
-s cls_token \
-o results/sroie2019/layoutlm_base/cls_token/ \
--debug
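The example commands above share one pattern: create the output directory, then call clustering.py with matching -r, -s, and -o values. Below is a minimal sketch of a wrapper that assembles such a command. It assumes the flags behave exactly as in the examples. The names `build_command` and `run_experiment`, the parameter names `r_value` and `s_value`, and the results/<dataset>/<r>/<s>/ output convention are illustrative and not part of the repository.

```python
import subprocess
from pathlib import Path


def build_command(dataset, r_value, s_value=None, debug=False):
    """Return (output_dir, argv) for one clustering.py run.

    r_value and s_value are passed to the -r and -s flags unchanged;
    their names are placeholders chosen for this sketch.
    """
    out_dir = Path("results") / dataset / r_value
    if s_value is not None:
        out_dir = out_dir / s_value

    cmd = ["python", "clustering.py",
           "-p", f"datasets/{dataset}/",
           "-r", r_value]
    if s_value is not None:
        cmd += ["-s", s_value]
    cmd += ["-o", f"{out_dir.as_posix()}/"]
    if debug:
        cmd.append("--debug")
    return out_dir, cmd


def run_experiment(dataset, r_value, s_value=None, debug=False):
    """Create the output directory (like mkdir -p) and run the command."""
    out_dir, cmd = build_command(dataset, r_value, s_value, debug)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    _, cmd = build_command("sroie2019", "layoutlm_base", "cls_token", debug=True)
    print(" ".join(cmd))
```

Calling `run_experiment("sroie2019", "resnet", debug=True)` from the repository root would then correspond to the first example command.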