add LongClip to CLIPScore #2935

Open · MostHumble opened this issue Feb 2, 2025 · 4 comments
Labels: enhancement (New feature or request)

Comments

MostHumble (Contributor) commented Feb 2, 2025

🚀 Feature

Motivation

CLIP's text encoding is limited to 77 tokens, which is inherently insufficient for long sequences where detailed descriptions are crucial. This limitation affects performance in scenarios such as fine-grained image retrieval and detailed text-to-image generation. To overcome this, Long-CLIP extends CLIP's capabilities by supporting longer text inputs while maintaining or even improving zero-shot generalizability.

Pitch

Long-CLIP proposes an efficient fine-tuning solution to extend CLIP's text encoding capacity while preserving its original strengths. This is achieved through:

  1. Knowledge-Preserved Stretching of Positional Embeddings: Expanding the positional embeddings in a way that retains the model's learned representations (see the sketch below).
  2. Primary Component Matching of CLIP Features: Ensuring alignment with CLIP's latent space so the model remains compatible with existing pipelines.

By leveraging just one million additional long text-image pairs, Long-CLIP achieves a 20% improvement in long-caption text-image retrieval and a 6% gain in traditional retrieval tasks (COCO, Flickr30k). It also enhances text-to-image generation by allowing detailed descriptions without requiring modifications to existing frameworks.
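For intuition, here is a minimal sketch of what knowledge-preserved stretching could look like in PyTorch. The constants are assumptions taken from the paper's description (the first 20 well-trained positions are kept fixed and the remainder is interpolated out to 248 positions); this is not the official Long-CLIP implementation.

import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb, keep=20, target_len=248):
    """Expand a (77, d) positional-embedding table to (target_len, d)."""
    kept = pos_emb[:keep]                 # leave the well-trained leading positions untouched
    rest = pos_emb[keep:].T.unsqueeze(0)  # (1, d, 57) for 1D interpolation along the sequence axis
    rest = F.interpolate(rest, size=target_len - keep, mode="linear", align_corners=True)
    rest = rest.squeeze(0).T              # (target_len - keep, d)
    return torch.cat([kept, rest], dim=0)

# Example: stretch a dummy 77-position table to 248 positions.
print(stretch_positional_embeddings(torch.randn(77, 512)).shape)  # torch.Size([248, 512])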

Alternatives

An alternative approach would be replacing CLIP's text encoder with a transformer model capable of handling longer sequences. However, this would require pretraining on vast datasets, incurring high computational costs and potential loss of alignment with CLIP's latent space.

Additional Context

An implementation could build on LongCLIP-L-Diffusers or LongCLIP-GmP-ViT-L-14. Sample usage:

import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "zer0int/LongCLIP-L-Diffusers"
config = CLIPConfig.from_pretrained(model_id)
maxtokens = 248  # Long-CLIP extends CLIP's 77-token limit to 248 tokens
config.text_config.max_position_embeddings = maxtokens

clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config).to(device)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=maxtokens, return_tensors="pt", truncation=True)

# `long_description`, `short_description`, and `final_image` (a PIL image) are user-provided placeholders.
inputs = clip_processor(text=[long_description, short_description], images=final_image, return_tensors="pt", padding=True).to(device)

with torch.inference_mode():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # label probabilities
    print("probs", probs)
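Since Long-CLIP's features stay aligned with CLIP's latent space, a CLIPScore-style value (max(100 · cos(E_I, E_C), 0), as in Hessel et al.) could be computed from the same model. A rough sketch, reusing clip_model, clip_processor, maxtokens, device, and the placeholder inputs from the snippet above:

with torch.inference_mode():
    image_inputs = clip_processor(images=final_image, return_tensors="pt").to(device)
    text_inputs = clip_processor(text=[long_description], padding="max_length",
                                 max_length=maxtokens, truncation=True,
                                 return_tensors="pt").to(device)
    img_emb = clip_model.get_image_features(**image_inputs)
    txt_emb = clip_model.get_text_features(**text_inputs)
    # Normalize and take the cosine similarity, clamped at zero as in CLIPScore.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    score = torch.clamp(100 * (img_emb * txt_emb).sum(dim=-1), min=0)
    print("CLIPScore-style value:", score.item())

This mirrors the computation CLIPScore performs, so supporting a LongCLIP checkpoint would mainly be a matter of allowing the longer max_length when tokenizing.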

For more details and implementation, visit the official repository: Long-CLIP.

MostHumble added the enhancement label on Feb 2, 2025

github-actions bot commented Feb 2, 2025

Hi! Thanks for your contribution, great first issue!

rittik9 (Contributor) commented Feb 2, 2025

@Borda if this looks good to you, I would be interested in working on this.

rittik9 (Contributor) commented Feb 2, 2025

#2906 talks about the same issue with a different solution.

arijit-hub commented

I would prefer having Jina CLIP v2, as it allows for 8k tokens.
