Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez
If you have any questions, please email [email protected]
We propose Subject-Conditional Relation Detection (SCoRD), where, conditioned on an input subject, the goal is to predict all of its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark in which the training and testing splits have a distribution shift in terms of the occurrence statistics of <subject, relation, object> triplets. To solve this problem, we propose an auto-regressive model that, given a subject, predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. In particular, we obtain a recall@3 of 83.8% for our relation-object predictions, compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging, during training, relation-object pairs obtained automatically from textual captions and for which no object-box annotations are available. In particular, for <subject, relation, object> triplets for which no object locations are available during training, we obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations.
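For intuition, below is a minimal, hypothetical sketch of how a <relation, object, object location> target could be serialized into a token sequence for an auto-regressive model. The helper names, the number of position bins, and the exact token format are illustrative assumptions and do not reflect the exact tokenization used by this codebase.

```python
# Hypothetical sketch: serializing one <relation, object, object box> target
# as a flat token sequence, assuming box coordinates are quantized into
# discrete position bins. Names and formats here are illustrative only.

def quantize_box(box, image_w, image_h, num_bins=512):
    """Map a pixel-space box (x1, y1, x2, y2) to discrete position tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return [f"<pos_{min(int(c * num_bins), num_bins - 1)}>" for c in coords]

def build_target_sequence(relation, obj, obj_box, image_w, image_h):
    """Serialize one relation-object-box triplet into a token string."""
    pos_tokens = quantize_box(obj_box, image_w, image_h)
    return f"{relation} {obj} " + " ".join(pos_tokens)

# Example: for the subject "person", one predicted triplet might be
# ("holds", "cup", box) in a 640x480 image.
print(build_target_sequence("holds", "cup", (320, 200, 400, 260), 640, 480))
# -> "holds cup <pos_256> <pos_213> <pos_320> <pos_277>"
```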
Please follow the ALBEF repository to install the required packages.
Download the training and testing splits here. To download images:
- Download Visual Genome, MS COCO, Flickr30k, and Open Images V6 images from the corresponding websites
- Download CC3M using this codebase
- Download CC12M using this codebase
Download the checkpoint for the 50%-removal experiment here.
First, run this command to generate <relation, object, object location> triplets:
# --start and --end indicate the index range of your target checkpoint(s) in the checkpoint folder. If the folder contains only one checkpoint, set --start 0 and --end 1.
# --chunk_size indicates how many batches of evaluation samples should be processed.
CUDA_VISIBLE_DEVICES=0 python results_generation.py --root your_checkpoint_folder --start 0 --end 1 --chunk 0 --num_seq 3 --num_beams 5 --chunk_size 100 --round 2
Then, run this command to get evaluation results:
python evaluate_results.py --results_folder your_checkpoint_folder/oidv6_results/ --report_unseen True --topk 3
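For reference, here is a simplified, hypothetical sketch of how recall@k over relation-object pairs could be computed per subject. The function name and data layout are assumptions for illustration and do not mirror the exact protocol implemented in evaluate_results.py.

```python
# Hypothetical sketch of recall@k for relation-object predictions:
# a ground-truth (relation, object) pair counts as recovered if it appears
# among the top-k predicted pairs for its subject.

def relation_object_recall_at_k(predictions, ground_truth, k=3):
    """predictions / ground_truth: dict mapping a subject instance to a list of
    (relation, object) pairs; predictions are ordered from most to least confident."""
    hits, total = 0, 0
    for subject, gt_pairs in ground_truth.items():
        topk = set(predictions.get(subject, [])[:k])
        for pair in gt_pairs:
            total += 1
            if pair in topk:
                hits += 1
    return hits / total if total else 0.0

# Toy usage with made-up predictions for one subject instance.
preds = {"person#1": [("holds", "cup"), ("wears", "shirt"), ("sits on", "chair")]}
gts = {"person#1": [("holds", "cup"), ("wears", "hat")]}
print(relation_object_recall_at_k(preds, gts, k=3))  # -> 0.5
```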
First, download the pre-trained checkpoint from PEVL. Then run:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=12888 --use_env run_relation_train.py --config configs/relation_grounding.yaml --output_dir your_checkpoint_folder --checkpoint pevl_pretrain.pth
We would like to thank the authors of ALBEF and PEVL; their released codebases helped a lot in this project.
If you find this work interesting, please consider citing it:
@inproceedings{yang2024scord,
title={SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data},
author={Yang, Ziyan and Kafle, Kushal and Lin, Zhe and Cohen, Scott and Ding, Zhihong and Ordonez, Vicente},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={5731--5741},
year={2024}
}