We distribute models pretrained on Conceptual Captions. We share ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as the weights for ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup. For the latter, we distribute the weights that achieved the highest average downstream performance when fine-tuned once.

Model | VQAv2 | RefCOCO+ | NLVR2 | Flickr30k IR | Flickr30k TR |
---|---|---|---|---|---|
ViLBERT | 66.68 | 70.49 | 74.26 | 58.90 | 75.50 |
LXMERT | 67.98 | 71.58 | - | - | - |
VL-BERT | 67.44 | 71.00 | - | - | - |
ViLBERT (CTRL) | 68.97 | 70.53 | 72.24 | 60.34 | 78.80 |
LXMERT (CTRL) | 67.52 | 70.49 | 71.09 | 58.62 | 74.90 |
VL-BERT (CTRL) | 68.23 | 71.23 | 73.22 | 57.62 | 70.90 |
VisualBERT (CTRL) | 69.03 | 70.02 | 72.70 | 61.48 | 75.20 |
UNITER (CTRL) | 68.67 | 71.45 | 73.73 | 60.54 | 76.40 |

Model | Source |
---|---|
LXMERT (Original) | airsplay/lxmert |

Models are defined in configuration files (see config/ for some examples). Rather than specifying whole Transformer layers, we specify attention and feed-forward sub-layers for each modality, which makes it easy to extend the proposed architectures. In particular, the following sub-layers are defined (a sketch follows the list):

- `tt_attn_sublayers`: text-text attention sub-layers
- `tv_attn_sublayers`: text-vision attention sub-layers (text used as query, vision as context)
- `vt_attn_sublayers`: vision-text attention sub-layers (vision used as query, text as context)
- `vv_attn_sublayers`: vision-vision attention sub-layers
- `t_ff_sublayers`: feed-forward sub-layers for the text modality
- `v_ff_sublayers`: feed-forward sub-layers for the vision modality
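
For concreteness, here is a minimal sketch of how these options might look, written as a Python dict that mirrors the JSON files under config/. The keys come from the list above, but all index values are hypothetical and not taken from any shipped configuration:

```python
# A minimal, hypothetical sketch of the sub-layer options described above.
# The keys mirror the JSON files under config/, but every index value is
# illustrative and does not correspond to an actual shipped configuration.
sublayer_config = {
    "tt_attn_sublayers": [0, 1],     # text-text attention at depths 0-1
    "tv_attn_sublayers": [2, 3],     # text queries attend over vision context
    "vt_attn_sublayers": [2, 3],     # vision queries attend over text context
    "vv_attn_sublayers": [0, 1],     # vision-vision attention at depths 0-1
    "t_ff_sublayers": [0, 1, 2, 3],  # text feed-forward at each depth
    "v_ff_sublayers": [0, 1, 2, 3],  # vision feed-forward at each depth
}
```
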
In addition, the following parameters control parameter sharing across modalities (see the sketch after this list):

- `shared_sublayers`: sub-layers that share parameters between modalities
- `single_ln_sublayers`: sub-layers in which the text and vision tensors are concatenated and fed into a single LayerNorm layer
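
Continuing the hypothetical sketch above, parameter sharing could then be configured as follows (values again illustrative only):

```python
# Hypothetical continuation of the sketch above; all values are illustrative.
sharing_config = {
    "shared_sublayers": [2, 3],  # tie text and vision parameters at these depths
    "single_ln_sublayers": [],   # depths where text/vision tensors are concatenated
                                 # and passed through a single LayerNorm
}
```
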
Finally, `bert_layer2attn_sublayer` and `bert_layer2ff_sublayer` are used to load the layers of a text-only BERT model into the corresponding VOLTA sub-layers; a sketch of this mapping follows.
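
As a rough illustration of this mapping (the dictionary structure and the values below are assumptions for illustration, not taken from an actual config file), one might write:

```python
# Assumed structure, for illustration only: map each text-only BERT layer
# index to the VOLTA sub-layer index it should initialise.
bert_init_config = {
    "bert_layer2attn_sublayer": {0: 0, 1: 1},  # BERT layer i -> attention sub-layer
    "bert_layer2ff_sublayer": {0: 0, 1: 1},    # BERT layer i -> feed-forward sub-layer
}
```
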
The following figure shows how these sub-layers are used to construct ViLBERT: