Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) together with their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built with a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel prediction, our decoder's generative process is partially ordered following the CGM structure, which encourages it to learn only the main causal dependencies in a sentence while discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all state-of-the-art compositional approaches, usually by a large margin, and also improves over methods trained on much larger datasets.
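To make the idea concrete, the sketch below shows how a dependency parse can define a partial order over caption tokens, so that each token attends only to its syntactic ancestors instead of to every preceding token. This is an illustrative example only, not the COGT implementation: it uses spaCy as a stand-in dependency parser, covers textual tokens only (the full method also involves visual tokens), and the mask semantics are assumptions.

```python
# Illustrative sketch only (NOT the official COGT code): build an attention
# mask from a dependency parse so each token may attend to its syntactic
# ancestors, giving a partially-ordered generation structure.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy
import numpy as np

nlp = spacy.load("en_core_web_sm")  # any dependency parser could be used

def dependency_mask(caption: str) -> np.ndarray:
    doc = nlp(caption)
    n = len(doc)
    mask = np.zeros((n, n), dtype=bool)
    for token in doc:
        mask[token.i, token.i] = True        # a token always sees itself
        ancestor = token.head
        while ancestor != ancestor.head:     # walk up the tree to the root
            mask[token.i, ancestor.i] = True
            ancestor = ancestor.head
        mask[token.i, ancestor.i] = True     # include the root itself
    return mask  # mask[i, j] = True -> token i may attend to token j

print(dependency_mask("a black cat sits on a red chair").astype(int))
```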
We evaluate our model on five compositional benchmarks:
We additionally adopt a benchmark commonly used to evaluate the ability of open-vocabulary object detectors to discern fine-grained object properties. We repurpose it as a compositional benchmark that challenges models to recognize attributes of common objects that rarely appear in the image foreground:
conda create -y -n "cogt" python=3.9.13
conda activate cogt
pip install -r requirements.txt
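After installation, a quick sanity check can confirm the environment is usable (this assumes PyTorch is among the pinned requirements, which is typical for VLM training but not verified here):

```python
# Optional sanity check; adjust if PyTorch is not part of your requirements.
import torch
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```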
To correctly load the model weights and datasets, it is necessary to customize the PATH and TEST_PATH dictionaries in paths.py.
X-VLM:
'xvlm_weights': 'yourpath/16m_base_model_state_step_199999.th'
'config_xvlm': 'yourpath/Pretrain_XVLM_base_16m.yaml'
'config_swin_xvlm': 'yourpath/config_swinB_224.json'
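Putting these together, a paths.py skeleton might look like the following. Only the X-VLM keys shown above come from this README; the overall layout is an illustrative sketch.

```python
# paths.py (sketch): only the X-VLM keys are prescribed by this README;
# the surrounding structure is illustrative.
PATH = {
    'xvlm_weights': 'yourpath/16m_base_model_state_step_199999.th',
    'config_xvlm': 'yourpath/Pretrain_XVLM_base_16m.yaml',
    'config_swin_xvlm': 'yourpath/config_swinB_224.json',
}
```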
Dataset:
The dataset metadata includes not only the original annotations from the specific benchmark but also the dependency trees required to construct the attention mask for COGT.
To properly use the datasets, you need to download the benchmark-specific images and customize the images entry in the dictionary with the path to the folder containing the images. Similarly, the metadata entry should be updated with the path to the corresponding JSON file.
For the proposed FG-OVD benchmark, the images are sourced from the COCO val 2017 dataset.
Example:
TEST_PATH = {
"visual_genome_relation": {'images': 'yourpath/vg_relation_images',
'metadata': 'yourpath/visual_genome_relation.json'}}
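As a quick check that an entry is configured correctly, you can load its metadata file and confirm that the image folder exists. The snippet below only relies on the images and metadata keys shown above; the internal JSON schema differs per benchmark, so no specific fields are assumed.

```python
# Sketch: verify that a TEST_PATH entry resolves correctly.
import json, os
from paths import TEST_PATH

entry = TEST_PATH["visual_genome_relation"]
with open(entry["metadata"]) as f:
    metadata = json.load(f)
print("Loaded", len(metadata), "metadata entries")
print("Image folder exists:", os.path.isdir(entry["images"]))
```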
We train our models on the custom COCO split defined by NegCLIP. Use these scripts to train the models:
scripts/COGT_X_train.sh
To evaluate our model:
scripts/COGT_X_inference.sh