MolGrapher

This is the repository for MolGrapher: Graph-based Visual Recognition of Chemical Structures.

Citation

If you find this repository useful, please consider citing:

@InProceedings{Morin_2023_ICCV,
    author = {Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valery and Meijer, Ingmar and Staar, Peter and Yu, Fisher},
    title = {MolGrapher: Graph-based Visual Recognition of Chemical Structures},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month = {October},
    year = {2023},
    pages = {19552-19561}
}

Links: ICCV, arXiv

Installation

Create a virtual environment.

conda create -n molgrapher python=3.11
conda activate molgrapher

Install MolGrapher and MolDepictor for CPU.

pip install -e ".[cpu]"

Install MolGrapher and MolDepictor for GPU (tested on x86_64, Ubuntu 20.04, CUDA 11.7, cuDNN 8.4).

pip install -e ".[gpu]"

CUDA and cuDNN versions can be edited in setup.py.

To install and run MolGrapher using Docker, please refer to README_DOCKER.md.

Model

Models are available on Hugging Face.

wget https://huggingface.co/ds4sd/MolGrapher/resolve/main/models/graph_classifier/gc_gcn_model.ckpt -P ./data/models/graph_classifier/
wget https://huggingface.co/ds4sd/MolGrapher/resolve/main/models/graph_classifier/gc_no_stereo_model.ckpt -P ./data/models/graph_classifier/
wget https://huggingface.co/ds4sd/MolGrapher/resolve/main/models/graph_classifier/gc_stereo_model.ckpt -P ./data/models/graph_classifier/
wget https://huggingface.co/ds4sd/MolGrapher/resolve/main/models/keypoint_detector/kd_model.ckpt -P ./data/models/keypoint_detector/

After downloading, the models folder from Hugging Face should be located in ./data/. To select a model, modify the attributes of GraphRecognizer in ./molgrapher/models/graph_recognizer.py (the steps to follow are detailed in this issue).
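
Before running inference, it can help to confirm that the checkpoints are where the code expects them. The following is a minimal sketch, not part of MolGrapher: the helper name missing_checkpoints is hypothetical, and the relative paths are simply copied from the wget commands above.

```python
from pathlib import Path

# Relative paths copied from the wget commands above.
EXPECTED_CHECKPOINTS = [
    "models/graph_classifier/gc_gcn_model.ckpt",
    "models/graph_classifier/gc_no_stereo_model.ckpt",
    "models/graph_classifier/gc_stereo_model.ckpt",
    "models/keypoint_detector/kd_model.ckpt",
]


def missing_checkpoints(data_dir="./data"):
    """Return the expected checkpoint paths that are not files under data_dir.

    Hypothetical helper for illustration; it only checks that files exist,
    not that they are complete or loadable.
    """
    root = Path(data_dir)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]
```

Calling missing_checkpoints() from the repository root returns an empty list when all four files are present, and the list of missing relative paths otherwise.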

Inference

Script

Your input images can be placed in the folder: ./data/benchmarks/default/.

bash molgrapher/scripts/annotate/run.sh

Output predictions are saved in: ./data/predictions/default/.

Python

from molgrapher.models.molgrapher_model import MolgrapherModel

model = MolgrapherModel()
images_or_paths = ["./data/benchmarks/default/images/image_0.png"] 
annotations = model.predict_batch(images_or_paths)

annotations is a list of dictionaries with the following fields:

[
    {
        'smi': 'O=C(O)C1=CC=C(C2=C(...',                      # MolGrapher SMILES prediction
        'conf': 0.991,                                        # MolGrapher confidence
        'file-info': {
            'filename': '...',                                # Input image filename
            'image_nbr': 1
        },
        'abbreviations_ocr': [...],                           # Detected OCR text
        'abbreviations': [...],                               # Post-processed detected OCR text
        'annotator': {'program': 'MolGrapher', 'version': '1.0.0'},
    },
    ...
]
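
As an illustration of how these fields might be consumed, the sketch below keeps only predictions above a confidence threshold. The helper select_confident and the 0.9 threshold are assumptions chosen for the example, and the sample records are made-up placeholders that only mimic the documented fields; they are not MolGrapher outputs.

```python
def select_confident(annotations, min_conf=0.9):
    """Return (filename, SMILES) pairs whose confidence is at least min_conf.

    Hypothetical helper; entries with an empty or missing 'smi' are skipped.
    """
    return [
        (a["file-info"]["filename"], a["smi"])
        for a in annotations
        if a.get("smi") and a.get("conf", 0.0) >= min_conf
    ]


# Placeholder records with made-up values, shaped like the documented output.
example_annotations = [
    {"smi": "CCO", "conf": 0.97,
     "file-info": {"filename": "image_0.png", "image_nbr": 1}},
    {"smi": "c1ccccc1", "conf": 0.42,
     "file-info": {"filename": "image_1.png", "image_nbr": 2}},
]

for filename, smiles in select_confident(example_annotations):
    print(f"{filename}\t{smiles}")
```

In practice, the list returned by model.predict_batch would be passed in place of example_annotations, with a threshold tuned to the application.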

Docling Integration

Docling is a toolkit for extracting the content and structure of PDF documents. It recognizes page layout, reading order, table structure, code, and formulas, and classifies images. Here, we combine Docling and MolGrapher:

Docling segments and classifies chemical-structure images from document pages,
MolGrapher converts images to SMILES.

Install docling in the molgrapher environment.

pip install docling

Option 1. Convert a PDF document with docling and enrich it with MolGrapher annotations.

Example:

bash molgrapher/scripts/annotate/docling/docling_convert_and_enrich.sh ./data/pdfs/US9259003_page_4.pdf ./data/docling_documents/US9259003_page_4/
# bash [script] [pdf-path] [docling-document-directory-path]

Option 2. Enrich an existing docling document with MolGrapher annotations.

Example:

python3 molgrapher/scripts/annotate/docling/enrich_docling_document.py --docling-document-directory-path ./data/docling_documents/US9259003_page_4/  
# python3 [script] --docling-document-directory-path [docling-document-directory-path]

The docling document, enriched with SMILES predictions, will be stored in [docling-document-directory-path]. For more information, please refer to docling.

USPTO-30K Benchmark

USPTO-30K is available on Hugging Face.

USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups.
USPTO-10K-abb contains 10,000 molecules with superatom groups.
USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.

Synthetic Dataset

The synthetic dataset is available on Hugging Face. Images and graphs are generated using MolDepictor.

Training

To train the keypoint detector:

python3 ./molgrapher/scripts/train/train_keypoint_detector.py

To train the node classifier:

python3 ./molgrapher/scripts/train/train_graph_classifier.py
