add dynamic position encoding to Siglip #2770

ameroyer · 2025-02-14T10:01:04Z

This PR adds interpolated (dynamic) position encodings to Siglip, similar to how it is done in Dino, although simpler because Siglip doesn't have a class token.

It roughly follows the equivalent code in HF transformers, but with a simpler interpolation.
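
To illustrate the idea, here is a minimal sketch in plain Rust (no candle types) of resizing a learned position-embedding grid to a new patch grid. The function name `interpolate_pos_embed` and the nearest-neighbour sampling are assumptions made for illustration; the description above does not say which interpolation mode the PR uses. Because there is no class token, every row of the table is a patch position and the whole table can be resized as one 2D grid.

```rust
/// Resize a position-embedding table laid out row-major as
/// (src_h * src_w) rows of `dim` values to (dst_h * dst_w) rows.
/// Hypothetical helper: uses nearest-neighbour sampling (floor mapping),
/// which is an assumption, not necessarily what the PR does.
fn interpolate_pos_embed(
    pos: &[Vec<f32>],
    src: (usize, usize),
    dst: (usize, usize),
) -> Vec<Vec<f32>> {
    let (src_h, src_w) = src;
    let (dst_h, dst_w) = dst;
    assert_eq!(pos.len(), src_h * src_w, "table does not match source grid");

    let mut out = Vec::with_capacity(dst_h * dst_w);
    for y in 0..dst_h {
        // Map the target row back to a source row.
        let sy = y * src_h / dst_h;
        for x in 0..dst_w {
            // Map the target column back to a source column.
            let sx = x * src_w / dst_w;
            out.push(pos[sy * src_w + sx].clone());
        }
    }
    out
}
```

For example, calling it with a 196-row table, `src = (14, 14)` and `dst = (28, 28)` returns 784 rows, one per patch token of the larger image.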

Example

RUST_BACKTRACE=1 NVCC_CCBIN=/usr/bin/gcc cargo run --features cuda --example siglip -- --image-size 448

before:
(when trying to add position encodings to tokens)

Error: shape mismatch in broadcast_add, lhs: [2, 784, 768], rhs: [196, 768]
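
The two shapes in this error can be checked by hand. The sketch below assumes a patch size of 16 and a 224-pixel pretraining resolution; neither is stated in this PR, but both are consistent with the 196 and 784 in the message. The helper names are hypothetical.

```rust
/// Patches per side for a square image, assuming non-overlapping
/// patches with no padding (integer division drops the remainder).
fn grid_side(image_size: usize, patch_size: usize) -> usize {
    image_size / patch_size
}

/// Total number of patch tokens; no class token is added.
fn num_patch_tokens(image_size: usize, patch_size: usize) -> usize {
    let side = grid_side(image_size, patch_size);
    side * side
}
```

Under these assumptions, `num_patch_tokens(448, 16)` is 784 (a 28 x 28 grid) while the stored table has `num_patch_tokens(224, 16)` = 196 rows (14 x 14), so the addition cannot broadcast until the table is resized.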

after:

Interpolating position embeddings to (28, 28)
softmax_image_vec: [1.9005219e-14, 1.0336038e-13, 1.0, 0.9999999, 6.7562546e-8, 1.344231e-11]


Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg

Probability: 0.0000% Text: a cycling race 
Probability: 0.0000% Text: a photo of two cats 
Probability: 100.0000% Text: a robot holding a candle 


Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg

Probability: 100.0000% Text: a cycling race 
Probability: 0.0000% Text: a photo of two cats 
Probability: 0.0000% Text: a robot holding a candle

for smaller images:

RUST_BACKTRACE=1 NVCC_CCBIN=/usr/bin/gcc cargo run --features cuda --example siglip -- --image-size 127

softmax_image_vec: [0.09331067, 0.90490746, 0.0017818566, 0.092874415, 0.9038708, 0.0032547913]

Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg

Probability: 7.1601% Text: a cycling race 
Probability: 2.5398% Text: a photo of two cats 
Probability: 90.3001% Text: a robot holding a candle 


Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg

Probability: 84.6978% Text: a cycling race 
Probability: 15.2972% Text: a photo of two cats 
Probability: 0.0050% Text: a robot holding a candle

LaurentMazare

Great, thanks for the PR!

ameroyer added 2 commits February 14, 2025 10:52

add dynamic position encoding

9579726

remove debug messages

f8d0a7e

LaurentMazare approved these changes Feb 14, 2025


LaurentMazare merged commit 2423d63 into huggingface:main Feb 14, 2025
10 checks passed
