add LongClip to CLIPScore #2935

Open · MostHumble opened this issue Feb 2, 2025 · 4 comments
Labels: enhancement (New feature or request)

Comments

MostHumble (Contributor) commented Feb 2, 2025

🚀 Feature

Motivation

CLIP's text encoding is limited to 77 tokens, which is inherently insufficient for long sequences where detailed descriptions are crucial. This limitation affects performance in scenarios such as fine-grained image retrieval and detailed text-to-image generation. To overcome this, Long-CLIP extends CLIP's capabilities by supporting longer text inputs while maintaining or even improving zero-shot generalizability.

Pitch

Long-CLIP proposes an efficient fine-tuning solution to extend CLIP's text encoding capacity while preserving its original strengths. This is achieved through:

  1. Knowledge-Preserved Stretching of Positional Embeddings: Expanding the positional embeddings in a way that retains the model's learned representations (see the sketch below).
  2. Primary Component Matching of CLIP Features: Ensuring alignment with CLIP's latent space so the model remains compatible with existing pipelines.

By leveraging just one million additional long text-image pairs, Long-CLIP achieves a 20% improvement in long-caption text-image retrieval and a 6% gain in traditional retrieval tasks (COCO, Flickr30k). It also enhances text-to-image generation by allowing detailed descriptions without requiring modifications to existing frameworks.
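For intuition, here is a minimal sketch of what knowledge-preserved stretching could look like in PyTorch. The constants are assumptions taken from the paper's description (the first 20 well-trained positions are kept fixed and the remainder is interpolated out to 248 positions); this is not the official Long-CLIP implementation.

import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb, keep=20, target_len=248):
    """Expand a (77, d) positional-embedding table to (target_len, d)."""
    kept = pos_emb[:keep]                 # leave the well-trained leading positions untouched
    rest = pos_emb[keep:].T.unsqueeze(0)  # (1, d, 57) for 1D interpolation along the sequence axis
    rest = F.interpolate(rest, size=target_len - keep, mode="linear", align_corners=True)
    rest = rest.squeeze(0).T              # (target_len - keep, d)
    return torch.cat([kept, rest], dim=0)

# Example: stretch a dummy 77-position table to 248 positions.
print(stretch_positional_embeddings(torch.randn(77, 512)).shape)  # torch.Size([248, 512])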

Alternatives

An alternative approach would be replacing CLIP's text encoder with a transformer model capable of handling longer sequences. However, this would require pretraining on vast datasets, incurring high computational costs and potential loss of alignment with CLIP's latent space.

Additional Context

An implementation could build on LongCLIP-L-Diffusers or LongCLIP-GmP-ViT-L-14. Sample usage:

import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "zer0int/LongCLIP-L-Diffusers"
config = CLIPConfig.from_pretrained(model_id)
maxtokens = 248  # Long-CLIP extends CLIP's 77-token limit to 248 tokens
config.text_config.max_position_embeddings = maxtokens

clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config).to(device)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=maxtokens, return_tensors="pt", truncation=True)

# `long_description`, `short_description`, and `final_image` (a PIL image) are user-provided placeholders.
inputs = clip_processor(text=[long_description, short_description], images=final_image, return_tensors="pt", padding=True).to(device)

with torch.inference_mode():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # label probabilities
    print("probs", probs)
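Since Long-CLIP's features stay aligned with CLIP's latent space, a CLIPScore-style value (max(100 · cos(E_I, E_C), 0), as in Hessel et al.) could be computed from the same model. A rough sketch, reusing clip_model, clip_processor, maxtokens, device, and the placeholder inputs from the snippet above:

with torch.inference_mode():
    image_inputs = clip_processor(images=final_image, return_tensors="pt").to(device)
    text_inputs = clip_processor(text=[long_description], padding="max_length",
                                 max_length=maxtokens, truncation=True,
                                 return_tensors="pt").to(device)
    img_emb = clip_model.get_image_features(**image_inputs)
    txt_emb = clip_model.get_text_features(**text_inputs)
    # Normalize and take the cosine similarity, clamped at zero as in CLIPScore.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    score = torch.clamp(100 * (img_emb * txt_emb).sum(dim=-1), min=0)
    print("CLIPScore-style value:", score.item())

This mirrors the computation CLIPScore performs, so supporting a LongCLIP checkpoint would mainly be a matter of allowing the longer max_length when tokenizing.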

For more details and implementation, visit the official repository: Long-CLIP.

MostHumble added the enhancement label on Feb 2, 2025

github-actions bot commented Feb 2, 2025

Hi! Thanks for your contribution, great first issue!

rittik9 (Contributor) commented Feb 2, 2025

@Borda if this looks good to you, I would be interested in working on this.

rittik9 (Contributor) commented Feb 2, 2025

#2906 talks about the same issue with a different solution.

arijit-hub commented

I would prefer having Jina CLIP v2, as it allows for 8k tokens.
