| Field | Value |
|---|---|
| title | Zero-Shot Open-Vocabulary OOD Object Detection and Grounding using Vision Language Models |
| openreview | Q2wVVeOpz8 |
| abstract | Automated driving involves complex perception tasks that require a precise understanding of diverse traffic scenarios and confident navigation. Traditional data-driven algorithms trained on closed-set data often fail to generalize to out-of-distribution (OOD) and edge cases. Recently, Large Vision Language Models (LVLMs) have shown potential in integrating the reasoning capabilities of language models to understand and reason about complex driving scenes, aiding generalization to OOD scenarios. However, grounding such OOD objects remains a challenging task. In this work, we propose zPROD, an automated framework for zero-shot, promptable, open-vocabulary OOD object detection, segmentation, and grounding in autonomous driving. We leverage LVLMs with visual grounding capabilities, eliminating the need for lengthy text communication and providing precise indications of OOD objects in the scene or on the track of the ego-centric vehicle. We evaluate our approach on OOD datasets from existing road anomaly segmentation benchmarks such as SMIYC and Fishyscapes. Our zero-shot approach shows superior performance on RoadAnomaly and RoadObstacle and comparable results on the Fishyscapes subset compared to supervised models, and acts as a baseline for future zero-shot methods based on open-vocabulary OOD detection. |
| layout | inproceedings |
| series | Proceedings of Machine Learning Research |
| publisher | PMLR |
| issn | 2640-3498 |
| id | sinhamahapatra25a |
| month | 0 |
| tex_title | Zero-Shot Open-Vocabulary {OOD} Object Detection and Grounding using Vision Language Models |
| firstpage | 230 |
| lastpage | 238 |
| page | 230-238 |
| order | 230 |
| cycles | false |
| bibtex_author | Sinhamahapatra, Poulami and Bose, Shirsha and Roscher, Karsten and G{\"u}nnemann, Stephan |
| author | |
| date | 2025-01-12 |
| address | |
| container-title | Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL) |
| volume | 265 |
| genre | inproceedings |
| issued | |
| extras | |