DALR

The official implementation of our multi-modal sentence embedding paper.
Overview

We propose DALR (Dual-level Alignment Learning for multimodal sentence Representation learning). To achieve fine-grained cross-modal alignment, we propose a cross-modal alignment method that mitigates the cross-modal misalignment bias (CMB) issue. To alleviate the intra-modal semantic divergence (ISD) issue, we integrate ranking distillation with global alignment learning to effectively align intra-modal representations. The figure below illustrates our model.
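As a rough illustration of the cross-modal alignment idea, the sketch below implements a generic symmetric InfoNCE-style contrastive loss over paired text/image embeddings. This is a minimal sketch with random features, not the paper's exact DALR loss; the function name and temperature value are our own assumptions.

```python
import numpy as np

def info_nce(text_emb, image_emb, temperature=0.05):
    """Symmetric InfoNCE-style alignment loss over paired embeddings.

    Generic contrastive alignment sketch (not the paper's exact objective):
    matched (text, image) pairs are pulled together, mismatched pairs apart.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # (batch, batch) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matched pair) as the positive class
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average of text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
loss_aligned = info_nce(text, text)      # perfectly matched pairs -> small loss
loss_mismatched = info_nce(text, -text)  # maximally mismatched pairs -> large loss
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is low when each text embedding is closest to its own image embedding.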

Getting Started

Download Datasets

Run pip install -r requirements.txt to prepare the environment.

First, download the Flickr and MSCOCO datasets from their official websites and arrange them in the following layout:

REPO ROOT
|
|--data    
|  |--Flickr  
|  |--MSCOCO
|  |--wiki1m_for_simcse.txt
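The layout above can be created up front; a minimal sketch (the dataset folders start empty and should then be filled with the downloaded files):

```shell
# Create the expected data layout from the repository root
mkdir -p data/Flickr data/MSCOCO
# wiki1m_for_simcse.txt goes directly under data/
ls data
```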

Wiki1M

wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt

Use the script from the SimCSE repo to download the datasets for SentEval evaluation:

cd SentEval/data/downstream/
bash download_dataset.sh

You can download a pretrained text encoder (SimCSE, DiffCSE, etc.) from Hugging Face and put it in the Model folder.

Access Our Model from Google Drive

The flickr-bert-base and coco-bert-base model checkpoints are both available on Google Drive.

Use Our Model

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load our model from the local Model/DALR directory (place the Google Drive checkpoint there first)
tokenizer = AutoTokenizer.from_pretrained("Model/DALR")
model = AutoModel.from_pretrained("Model/DALR")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
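The `1 - cosine(...)` calls above compute the standard cosine similarity; for reference, a dependency-free equivalent (our own helper, for illustration only):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors, in [-1, 1]; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```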

Evaluation

Run Evaluation with SentEval

python eval_senteval.py \
    --model_name_or_path Model/OurModel \
    --task_set sts \
    --mode test

Train Your Own Models

In the following section, we describe how to train a DALR model by using our code.

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

If you instead use CUDA < 11 or a CPU-only machine, install PyTorch with the following command:

pip install torch==1.8.1

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

For the unsupervised mixed training settings of wiki+flickr and wiki+coco, run the following commands to train your own models, adjusting the hyperparameters in the scripts as you like:

bash scripts/run_wiki_flickr.sh

bash scripts/run_wiki_coco.sh

Acknowledgements
