🚀 Feature
Motivation
CLIP's text encoder is limited to 77 tokens, which is insufficient whenever detailed descriptions matter. This limitation hurts performance in scenarios such as fine-grained image retrieval and detailed text-to-image generation. Long-CLIP overcomes it by supporting longer text inputs while maintaining, or even improving, CLIP's zero-shot generalizability.
Pitch
Long-CLIP proposes an efficient fine-tuning solution to extend CLIP's text encoding capacity while preserving its original strengths. This is achieved through two techniques:
Knowledge-Preserved Stretching of Positional Embeddings: expanding the positional embeddings in a way that retains the model's learned representations (a rough sketch follows below).
Primary Component Matching of CLIP Features: ensuring alignment with CLIP's latent space so the model stays compatible with existing pipelines.
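The stretching step can be pictured in a few lines of PyTorch. The following is a minimal sketch only, assuming the setup reported in the Long-CLIP paper (keep the first 20 of CLIP's 77 positional embeddings fixed and linearly interpolate the remaining 57 up to a 248-token context); `stretch_positional_embedding` is an illustrative helper, not code from the Long-CLIP repository.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 target_len: int = 248) -> torch.Tensor:
    """Knowledge-preserved stretching (illustrative sketch).

    Keeps the first `keep` positions of CLIP's (77, d) text positional
    embedding unchanged and linearly interpolates the remaining positions
    so that the total length becomes `target_len`.
    """
    head = pos_emb[:keep]                  # well-trained positions, kept as-is
    tail = pos_emb[keep:]                  # (77 - keep, d) positions to stretch
    new_tail_len = target_len - keep       # e.g. 248 - 20 = 228
    # F.interpolate expects (N, C, L), so move the sequence axis last
    tail = tail.T.unsqueeze(0)             # (1, d, 57)
    tail = F.interpolate(tail, size=new_tail_len, mode="linear", align_corners=True)
    tail = tail.squeeze(0).T               # (228, d)
    return torch.cat([head, tail], dim=0)  # (248, d)

# Example with a dummy embedding at CLIP ViT-L/14's text width (768):
old_pos_emb = torch.randn(77, 768)
new_pos_emb = stretch_positional_embedding(old_pos_emb)
print(new_pos_emb.shape)                   # torch.Size([248, 768])
```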
By leveraging just one million additional long text-image pairs, Long-CLIP achieves a 20% improvement in long-caption text-image retrieval and a 6% gain in traditional retrieval tasks (COCO, Flickr30k). It also enhances text-to-image generation by allowing detailed descriptions without requiring modifications to existing frameworks.
Alternatives
An alternative approach would be replacing CLIP's text encoder with a transformer model capable of handling longer sequences. However, this would require pretraining on vast datasets, incurring high computational costs and potential loss of alignment with CLIP's latent space.
Additional Context
The implementation could build on existing checkpoints such as LongCLIP-L-Diffusers or LongCLIP-GmP-ViT-L-14. A sample usage is sketched below:
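Since no snippet made it into the issue, here is a minimal sketch of what loading LongCLIP-GmP-ViT-L-14 through Hugging Face transformers could look like. The repository id and the config override to a 248-token text context are assumptions based on that checkpoint's documentation and may need adjusting.

```python
import torch
from PIL import Image
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"  # assumed Hugging Face repo id

# Assumption: the checkpoint uses a 248-token text context, so the config is
# overridden before loading to avoid a positional-embedding size mismatch.
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248

model = CLIPModel.from_pretrained(model_id, config=config)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
long_caption = (
    "A long, detailed description of the scene that would normally be "
    "truncated by CLIP's 77-token limit ..."
)

# Tokenize to the long 248-token context and preprocess the image
text_inputs = processor.tokenizer(
    [long_caption],
    return_tensors="pt",
    padding="max_length",
    max_length=248,
    truncation=True,
)
pixel_values = processor.image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(input_ids=text_inputs.input_ids,
                    attention_mask=text_inputs.attention_mask,
                    pixel_values=pixel_values)

# Cosine-similarity logits between the image and the long caption
print(outputs.logits_per_image)
```

For LongCLIP-L-Diffusers the pattern would be similar on the diffusers side, swapping the pipeline's CLIP text encoder for the Long-CLIP one; the exact wiring depends on how the host framework exposes its text encoder.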
For more details and implementation, visit the official repository: Long-CLIP.