Document Clustering

This is a library for clustering documents in an unsupervised fashion.

Check out the 4-minute video explanation here and paper here.

This extends the work done in doc-clustering to use visual features of documents!

Getting started

Install dependencies in a new conda environment.

conda env create --name doc-clustering --file=doc-clustering.yml

Once you've created the environment, you can activate it using: conda activate doc-clustering

If you're using an M1 (Apple Silicon) Mac, you'll need to use Miniforge in order to use TensorFlow: https://developer.apple.com/metal/tensorflow-plugin/

Alternatively, you can create your own fresh environment: conda create --name doc-clustering python=3.8

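Then activate it and install the missing dependencies manually. Running python clustering.py -h fails with an import error that names the next missing package, so repeat until the help text prints.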
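Rather than rerunning the script after every install, one option is to probe for several packages at once with `importlib`. This is only a sketch: the module names in `CANDIDATES` are assumptions (only TensorFlow is mentioned above), so replace them with whatever doc-clustering.yml actually lists.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: adjust it to match the packages in doc-clustering.yml.
CANDIDATES = ["tensorflow", "numpy"]

if __name__ == "__main__":
    for name in missing_modules(CANDIDATES):
        print(f"missing: {name}")
```

Note that a module's import name can differ from its package name, so a name reported as missing may need a differently named package to be installed.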

Download datasets

Datasets are available for download here.

They should be stored with the following directory structure and names:

datasets/rvl-cdip/
datasets/sroie2019/

Download finetuned models.

Models are available for download here.

They should be stored with the following directory structure and names:

finetuned_models/finetuned_related_lmv1/
finetuned_models/finetuned_unrelated_lmv2/

Download embeddings and results from paper.

Prepared document embeddings and experiment results are here.

They should be stored under the following directory name:

results/
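Before launching an experiment, it can help to confirm that the downloads ended up where the commands expect them. The following is a minimal sketch, not part of the repository: it assumes the script is run from the repository root, and the helper name `missing_dirs` is made up for illustration. The directory names are the ones listed above.

```python
from pathlib import Path

# Directory names taken from the layout described in this README.
EXPECTED_DIRS = [
    "datasets/rvl-cdip",
    "datasets/sroie2019",
    "finetuned_models/finetuned_related_lmv1",
    "finetuned_models/finetuned_unrelated_lmv2",
    "results",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print(f"not found: {d}")
```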

Training and running models.

Run one of the commands from EXPERIMENTS.md, or run python clustering.py --help for example usage.

Add the --debug flag to get interactive visualizations as well. Example commands:

ResNet

mkdir -p results/sroie2019/resnet/
python clustering.py -p datasets/sroie2019/ \
	-r resnet \
	-o results/sroie2019/resnet/ \
	--debug

AlexNet

mkdir -p results/sroie2019/alexnet/
python clustering.py -p datasets/sroie2019/ \
	-r alexnet \
	-o results/sroie2019/alexnet/ \
	--debug

LayoutLM Base ([CLS] Token)

mkdir -p results/sroie2019/layoutlm_base/cls_token/
python clustering.py -p datasets/sroie2019/ \
	-r layoutlm_base \
	-s cls_token \
	-o results/sroie2019/layoutlm_base/cls_token/ \
	--debug
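The three examples above share one pattern, so they can be generated from a short list instead of being typed by hand. The sketch below only prints the commands and does not run them. It uses just the flags shown above (`-p`, `-r`, `-s`, `-o`, `--debug`); the function name `build_command` and the parameter name `strategy` (for the value passed to `-s`) are made up for illustration, and the output path convention is inferred from the examples.

```python
import shlex


def build_command(dataset, representation, strategy=None, debug=False):
    """Assemble the output directory and clustering.py argument list for one run."""
    out_dir = f"results/{dataset}/{representation}/"
    if strategy:
        # `strategy` is a hypothetical name for the value given to -s.
        out_dir += f"{strategy}/"
    cmd = ["python", "clustering.py", "-p", f"datasets/{dataset}/", "-r", representation]
    if strategy:
        cmd += ["-s", strategy]
    cmd += ["-o", out_dir]
    if debug:
        cmd.append("--debug")
    return out_dir, cmd


# The three runs shown in the examples above.
RUNS = [("resnet", None), ("alexnet", None), ("layoutlm_base", "cls_token")]

if __name__ == "__main__":
    for representation, strategy in RUNS:
        out_dir, cmd = build_command("sroie2019", representation, strategy, debug=True)
        print(f"mkdir -p {out_dir}")
        print(shlex.join(cmd))
```

To execute the commands instead of printing them, the argument list could be passed to `subprocess.run(cmd, check=True)` after creating `out_dir`.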

Repository: poojasethi/visual-doc-clustering

Folders and files

Latest commit

History

Repository files navigation

Document Clustering

Getting started

Install dependencies in a new conda environment.

Download datasets

Download finetuned models.

Download embeddings and results from paper.

Training and running models.

ResNet

AlexNet

LayoutLM Base ([CLS] Token)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages