Sniffing Threatening Open-World Objects in Autonomous Driving by Open-Vocabulary Models (ACMMM 2024)
Autonomous driving (AD) is a typical application that requires effectively exploiting multimedia information.
For AD, it is critical to ensure safety by detecting unknown objects in an open world, driving the demand for open world object detection (OWOD).
However, existing OWOD methods treat generic objects beyond the known classes in the training set as unknown objects and prioritize recall in evaluation.
This encourages excessive false positives and endangers the safety of AD. To address this issue, we restrict the definition of unknown objects to threatening objects in AD, and introduce a new evaluation protocol, built upon a new metric named U-ARecall, to alleviate the biased evaluation caused by neglecting false positives.
Under the new evaluation protocol, we re-evaluate existing OWOD methods and discover that they typically perform poorly in AD.
Then, we propose a novel OWOD paradigm for AD based on fine-tuning foundational open-vocabulary models (OVMs), as they can exploit rich linguistic and visual prior knowledge for OWOD.
Following this new paradigm, we propose a brand-new OWOD solution, which effectively addresses two core challenges of fine-tuning OVMs via two novel techniques: 1) the maintenance of open-world generic knowledge by a dual-branch architecture; 2) the acquisition of scenario-specific knowledge by the visual-oriented contrastive learning scheme.
Besides, a dual-branch prediction fusion module is proposed to avoid post-processing and hand-crafted heuristics.
Extensive experiments show that our proposed method surpasses classic OWOD methods in unknown object detection by a large margin. Our main contributions are summarized as follows:
- We devise a more suitable evaluation protocol for AD-oriented OWOD, which includes restricting unknown objects to threatening objects in AD and introducing a new evaluation metric to alleviate the biased evaluation in OWOD. Based on this evaluation protocol, we re-evaluate existing OWOD methods and establish the AD-oriented OWOD benchmark.
- We propose a new OWOD paradigm for AD based on fine-tuning OVMs, exploiting rich prior knowledge, including high-level language semantic and diverse visual patterns, to distinguish between threatening objects and false positives.
- We identify and address two core challenges when fine-tuning OVMs: preserving open-world generic knowledge by dual-branch architecture and acquiring scenario-specific knowledge by visual-oriented contrastive learning.
- We propose a prediction fusion module that can integrate predictions from multiple branches without the need of post-processing and hand-crafted heuristics, serving as a generalized method applicable to transformer-based detectors.
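As a rough intuition for the last two points, the snippet below is a conceptual sketch (not the authors' implementation) of dual-branch prediction fusion for a transformer-based detector: per-query outputs from a frozen, generic-knowledge branch and a fine-tuned, scenario-specific branch are merged by concatenation and top-k scoring instead of NMS or other hand-crafted heuristics. All names and shapes here are assumptions.

```python
# Conceptual sketch only (not the authors' implementation): fusing per-query
# predictions from a frozen OVM branch and a fine-tuned branch of a
# transformer-based detector by concatenation + top-k selection,
# instead of NMS-style post-processing or hand-crafted heuristics.
import torch

def fuse_dual_branch(scores_frozen, boxes_frozen, scores_tuned, boxes_tuned, num_keep=300):
    """scores_*: (num_queries,) confidence per query; boxes_*: (num_queries, 4).
    Returns the num_keep highest-scoring predictions across both branches."""
    scores = torch.cat([scores_frozen, scores_tuned], dim=0)   # (2 * num_queries,)
    boxes = torch.cat([boxes_frozen, boxes_tuned], dim=0)      # (2 * num_queries, 4)
    keep = scores.topk(min(num_keep, scores.numel())).indices  # indices of the best queries
    return scores[keep], boxes[keep]

# Toy usage with random tensors standing in for the two branches' outputs.
if __name__ == "__main__":
    q = 900
    s_f, b_f = torch.rand(q), torch.rand(q, 4)
    s_t, b_t = torch.rand(q), torch.rand(q, 4)
    fused_scores, fused_boxes = fuse_dual_branch(s_f, b_f, s_t, b_t)
    print(fused_scores.shape, fused_boxes.shape)  # torch.Size([300]) torch.Size([300, 4])
```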
Our code is based on the dev-3.x branch of mmdetection, which provides the implementation of GroundingDINO.
Note: make sure to check out the dev-3.x branch.
conda create --name ad-owod python==3.8 -y
conda activate ad-owod
conda install pytorch torchvision -c pytorch
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
pip install -v -e .
Optional:
pip install peft  # for LoRA fine-tuning
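Optionally, you can sanity-check the environment with a short Python snippet (assumes the installation above succeeded):

```python
# Optional sanity check for the environment set up above.
import torch, mmcv, mmdet, mmengine

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmengine:", mmengine.__version__)
print("mmcv:", mmcv.__version__)    # should be >= 2.0.0
print("mmdet:", mmdet.__version__)  # dev-3.x series (3.x)
```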
Download the GroundingDINO pretrained weights from mmdetection and place them in the weights folder.
AD-OWOD/
├── configs
│   └── auto_driving_grounding_dino   (fine-tuning methods and our method)
├── mmdet
│   ├── datasets
│   │   ├── soda.py
│   │   └── bdd.py
│   ├── hooks
│   │   └── load_weight_hook.py   (loads the weights of the dual-branch network)
│   └── models
│       ├── detectors
│       │   ├── auto_driving_grounding_dino.py
│       │   ├── adapter_grounding_dino.py
│       │   ├── linear_prob_grounding_dino.py
│       │   └── lora_grounding_dino.py
│       └── dense_heads
│           └── auto_driving_grounding_dino_head.py
└── eval_tools   (data conversion tools, visualization tools, and evaluation tools)
    ├── soda_format_to_voc.py
    ├── bdd_format_to_voc.py
    ├── coda_format_to_voc.py
    ├── convert_to_pretest.py
    ├── convert_coda_to_pretest.py
    ├── soda_visualization.py
    ├── bdd_visualization.py
    ├── coda_visualization.py
    ├── output_visualization.py
    ├── eval.py   (evaluation)
    └── voc_eval_offical.py   (evaluation)
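The custom detectors, heads, datasets, and hooks listed above plug into mmdetection 3.x through its registry mechanism. The snippet below shows the generic registration pattern for orientation only; the class name is hypothetical and not taken from this repo.

```python
# Generic mmdetection 3.x registration pattern (illustrative; the class name is hypothetical).
from mmdet.registry import MODELS
from mmdet.models.detectors import GroundingDINO

@MODELS.register_module()
class MyGroundingDINOVariant(GroundingDINO):
    """Once registered, the class can be selected by name, e.g.
    model = dict(type='MyGroundingDINOVariant', ...) in a config file."""
```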
Note: If you cannot access the Hugging Face website, you need to download bert-base-uncased
from https://huggingface.co/bert-base-uncased/tree/main and change the path of lang_model_name
in the config file.
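For instance, if bert-base-uncased is downloaded to ./weights/bert-base-uncased, the override could look like the sketch below (the local path is an assumption; the exact structure depends on the config file you use):

```python
# Sketch of pointing the text encoder to a local BERT copy in a Grounding DINO config.
# './weights/bert-base-uncased' is an assumed local path; adjust it to your download location.
lang_model_name = './weights/bert-base-uncased'

model = dict(
    language_model=dict(
        type='BertModel',
        name=lang_model_name,
    ),
)
```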
AD-OWOD/
└── data/
    ├── CODA/
    │   ├── train   (train and val set of SODA)
    │   ├── val   (val set of CODA)
    │   └── annotations
    │       ├── train.json
    │       ├── val.json
    │       └── annotation.json   (CODA)
    └── BDD/
        ├── train
        ├── val
        └── annotations
            ├── bdd100k_labels_images_det_coco_train.json
            ├── bdd100k_labels_images_det_coco_val.json
            └── annotation.json   (CODA)
CODA:
- Download the train and val set of SODA from https://soda-2d.github.io/download.html.
- Download the val set of CODA from https://coda-dataset.github.io/.
- Prepare the annotation files: (1) get the train.json file by running eval_tools/merge_json.py; (2) get val.json by running eval_tools/coda_format_to_voc.py and eval_tools/convert_coda_to_coco.py. You can also directly download the annotation files from here.
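For reference, merging two COCO-format annotation files (roughly what eval_tools/merge_json.py does for the SODA train and val splits) can be sketched as follows; the input file names are assumptions, not the repo's actual paths:

```python
# Simplified sketch of merging two COCO-format annotation files into one
# (roughly what eval_tools/merge_json.py does for the SODA train/val splits).
# The input paths below are assumptions; adjust them to where the SODA annotations live.
import json

def merge_coco(path_a, path_b, out_path):
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)

    # Re-index images/annotations of the second file to avoid id collisions.
    img_offset = max(img["id"] for img in a["images"]) + 1
    ann_offset = max(ann["id"] for ann in a["annotations"]) + 1
    for img in b["images"]:
        img["id"] += img_offset
    for ann in b["annotations"]:
        ann["id"] += ann_offset
        ann["image_id"] += img_offset

    merged = {
        "images": a["images"] + b["images"],
        "annotations": a["annotations"] + b["annotations"],
        "categories": a["categories"],  # assumes both splits share the same category list
    }
    with open(out_path, "w") as f:
        json.dump(merged, f)

merge_coco("data/CODA/annotations/soda_train.json",
           "data/CODA/annotations/soda_val.json",
           "data/CODA/annotations/train.json")
```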
BDD:
- Download the train and val set of BDD100K from https://doc.bdd100k.com/download.html.
To train SGROD on a single node with 4 GPUs, run
bash tools/dist_train.sh configs/auto_driving_grounding_dino/auto_driving_grounding_dino_swin-t_16xb2_1x_soda.py 4
You can also run any of the other configurations defined in configs/auto_driving_grounding_dino.
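For quick single-GPU debugging, the same config can also be launched programmatically via mmengine's Runner (this is essentially what mmdetection's tools/train.py does under the hood); a minimal sketch:

```python
# Single-GPU training sketch via mmengine's Runner; useful for quick debugging runs.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile(
    'configs/auto_driving_grounding_dino/auto_driving_grounding_dino_swin-t_16xb2_1x_soda.py')
cfg.work_dir = './work_dirs/auto_gd_soda'  # where checkpoints and logs are written
Runner.from_cfg(cfg).train()
```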
To reproduce any of the aforementioned results, please download our pretrained weights and place them in the 'weights' directory.
- Run the demo/image_demo.py script to output a JSON file for testing.
python demo/image_demo.py ./data/CODA/val/ $config --weights $weights --texts "$text_prompts" --no-save-pred --no-save-vis --save-json-path ./val_json_output_dir/$save_json
For example:
# SODA
python demo/image_demo.py ./data/CODA/val/ configs/auto_driving_grounding_dino/auto_driving_grounding_dino_swin-t_16xb2_1x_soda.py --weights weights/auto_gd_soda.pth --texts "pedestrian . cyclist . car . truck . bus . tricycle . vehicle . roadblock . obstacle ." --no-save-pred --no-save-vis --save-json-path ./val_json_output_dir/auto_gd_soda.json
# BDD
python demo/image_demo.py ./data/CODA/val/ configs/auto_driving_grounding_dino/auto_driving_grounding_dino_swin-t_16xb2_1x_bdd.py --weights weights/auto_gd_bdd.pth --texts "person . rider . car . bus . truck . bike . motor . traffic light . traffic sign . train . vehicle . roadblock . obstacle ." --no-save-pred --no-save-vis --save-json-path ./val_json_output_dir/auto_gd_bdd.json
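The exact schema of the saved JSON is defined by the demo script, so if evaluation complains about the format, a quick generic inspection can help (a sketch, assuming the SODA output path from the example above):

```python
# Quick look at the structure of the JSON produced by demo/image_demo.py
# before feeding it to eval_tools/eval.py (the exact schema depends on the script).
import json

with open('val_json_output_dir/auto_gd_soda.json') as f:
    preds = json.load(f)

print(type(preds))
if isinstance(preds, dict):
    print("keys:", list(preds.keys())[:5])
elif isinstance(preds, list):
    print("entries:", len(preds))
    print("first entry:", preds[0])
```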
- Run the eval_tools/eval.py script to output the metrics.
python eval_tools/eval.py val_json_output_dir/$save_json val_json_output_dir/annotations.json 0. $data
For example:
# SODA
python eval_tools/eval.py val_json_output_dir/auto_gd_soda.json val_json_output_dir/annotations.json 0. soda
# BDD
python eval_tools/eval.py val_json_output_dir/auto_gd_bdd.json val_json_output_dir/annotations.json 0. bdd
Note: For more training and evaluation details, please check the mmdetection repository.
Should you have any questions, please contact: [email protected]
Acknowledgments:
This work builds on the code bases of previous works such as mmdetection, UnSniffer, and GroundingDINO. Please consider citing these works as well.