add dynamic position encoding to Siglip #2770

Merged 2 commits on Feb 14, 2025

ameroyer (Contributor) commented on Feb 14, 2025

Adds interpolated (dynamic) position encodings to Siglip, similar to how it is done in Dino, although simpler because Siglip doesn't have a class token.

Roughly follows the equivalent in HF transformers, but with a simpler interpolation.
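
To illustrate the general idea of resizing a learned grid of position embeddings, here is a minimal, dependency-free sketch. It is not the code from this PR: the function name `interpolate_pos_embed`, the flat row-major `Vec<f32>` layout, and the choice of nearest-neighbor sampling are all assumptions made for illustration, and the description above does not say which interpolation mode the PR uses. Because there is no class token, the whole table can be treated as a 2D grid and resized directly.

```rust
/// Resize a row-major grid of position embeddings from (old_h, old_w) to
/// (new_h, new_w) with nearest-neighbor sampling.
/// `pos` holds old_h * old_w embeddings of `dim` floats each.
/// Hypothetical helper for illustration; not the PR's implementation.
fn interpolate_pos_embed(
    pos: &[f32],
    (old_h, old_w): (usize, usize),
    (new_h, new_w): (usize, usize),
    dim: usize,
) -> Vec<f32> {
    assert_eq!(pos.len(), old_h * old_w * dim, "unexpected table size");
    let mut out = Vec::with_capacity(new_h * new_w * dim);
    for y in 0..new_h {
        // Floor mapping from the target row to the source row it falls in.
        let src_y = y * old_h / new_h;
        for x in 0..new_w {
            let src_x = x * old_w / new_w;
            let start = (src_y * old_w + src_x) * dim;
            // Copy the whole embedding vector of the chosen source cell.
            out.extend_from_slice(&pos[start..start + dim]);
        }
    }
    out
}

fn main() {
    let dim = 2;
    // Toy 2x2 grid where each embedding is simply [row, col].
    let pos: Vec<f32> = vec![0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0];
    let resized = interpolate_pos_embed(&pos, (2, 2), (4, 4), dim);
    // One embedding per cell of the new 4x4 grid.
    assert_eq!(resized.len(), 4 * 4 * dim);
    println!("resized to {} embeddings", resized.len() / dim);
}
```

The same function also covers the downscaling case (for example a 14x14 table reduced to 7x7), since the floor mapping then just skips source cells.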

Example

RUST_BACKTRACE=1 NVCC_CCBIN=/usr/bin/gcc cargo run --features cuda --example siglip -- --image-size 448

before:
(when trying to add position encodings to tokens)

Error: shape mismatch in broadcast_add, lhs: [2, 784, 768], rhs: [196, 768] 
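
The numbers in this error can be reproduced with a little arithmetic. The sketch below assumes square images, non-overlapping patches, and a patch size of 16; the patch size is inferred from the (28, 28) grid reported for `--image-size 448`, and the 224 training resolution is an assumption derived from the 196-entry table (14 x 14 patches of 16 pixels). The helper `grid_side` is hypothetical.

```rust
/// Patches per side for a square image with non-overlapping patches.
/// Hypothetical helper for illustration.
fn grid_side(image_size: usize, patch_size: usize) -> usize {
    image_size / patch_size
}

fn main() {
    // Assumption: 448 / 28 = 16, inferred from the (28, 28) grid in the log.
    let patch_size = 16;
    // Assumption: the 196 stored positions form a 14x14 grid, i.e. 224 px.
    let trained = grid_side(224, patch_size);
    let requested = grid_side(448, patch_size);

    // 196 stored position embeddings versus 784 image tokens:
    // the two shapes cannot be broadcast together.
    assert_eq!(trained * trained, 196);
    assert_eq!(requested * requested, 784);
    println!(
        "position table: {} entries, tokens: {}",
        trained * trained,
        requested * requested
    );
}
```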

after:

Interpolating position embeddings to (28, 28)
softmax_image_vec: [1.9005219e-14, 1.0336038e-13, 1.0, 0.9999999, 6.7562546e-8, 1.344231e-11]


Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg

Probability: 0.0000% Text: a cycling race 
Probability: 0.0000% Text: a photo of two cats 
Probability: 100.0000% Text: a robot holding a candle 


Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg

Probability: 100.0000% Text: a cycling race 
Probability: 0.0000% Text: a photo of two cats 
Probability: 0.0000% Text: a robot holding a candle 

for smaller images:

RUST_BACKTRACE=1 NVCC_CCBIN=/usr/bin/gcc cargo run --features cuda --example siglip -- --image-size 127
softmax_image_vec: [0.09331067, 0.90490746, 0.0017818566, 0.092874415, 0.9038708, 0.0032547913]

Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg

Probability: 7.1601% Text: a cycling race 
Probability: 2.5398% Text: a photo of two cats 
Probability: 90.3001% Text: a robot holding a candle 


Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg

Probability: 84.6978% Text: a cycling race 
Probability: 15.2972% Text: a photo of two cats 
Probability: 0.0050% Text: a robot holding a candle

LaurentMazare (Collaborator) left a comment

Great, thanks for the PR!

LaurentMazare merged commit 2423d63 into huggingface:main on Feb 14, 2025
10 checks passed