Emu: An Open Multimodal Generalist

Quan Sun1*, Qiying Yu2,1*, Yufeng Cui1*, Fan Zhang1*, Xiaosong Zhang1*, Yueze Wang1, Hongcheng Gao1,
Jingjing Liu2, Tiejun Huang1,3, Xinlong Wang1

1 BAAI, 2 THU, 3 PKU
* Equal Contribution

| Paper | Demo |

Emu is a multimodal generalist that can seamlessly generate images and text in a multimodal context. Emu is trained with a unified autoregressive objective, i.e., predict-the-next-element, where the next element may be a visual embedding or a textual token. Trained under this objective, Emu can serve as a generalist interface for both image-to-text and text-to-image tasks.
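
The objective can be pictured with a toy loss. Below is a minimal sketch (not the released training code), assuming text positions are supervised with cross-entropy and visual positions with a regression loss on the next visual embedding; all tensor names, shapes, and the helper signature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(hidden, is_visual, text_labels,
                                visual_targets, lm_head, regress_head):
    """Toy version of predict-the-next-element.

    hidden:         (B, T, D) LM outputs, already shifted so position t predicts element t+1
    is_visual:      (B, T) bool mask, True where the next element is a visual embedding
    text_labels:    (B, T) token ids, set to -100 at visual positions
    visual_targets: (B, T, D) ground-truth visual embeddings (valid where is_visual is True)
    """
    # Classification loss on textual tokens
    logits = lm_head(hidden)                                   # (B, T, vocab)
    text_loss = F.cross_entropy(logits.flatten(0, 1),
                                text_labels.flatten(),
                                ignore_index=-100)

    # Regression loss on visual embeddings
    pred_visual = regress_head(hidden)                         # (B, T, D)
    visual_loss = F.mse_loss(pred_visual[is_visual], visual_targets[is_visual])

    return text_loss + visual_loss
```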

Generalist Interface

Emu serves as a generalist interface capable of diverse multimodal tasks, such as image captioning, image/video question answering, and text-to-image generation, together with new abilities like in-context text and image generation, and image blending.

Setup

Clone this repository and install required packages:

git clone https://github.com/baaivision/Emu
cd Emu

pip install -r requirements.txt

Model Weights

We release the pretrained and instruction-tuned weights of Emu. Our weights are subject to LLaMA-1's license.

Model name Weight
Emu w/ Decoder 🤗 HF link (34GB)
Emu-I 🤗 HF link (27GB)

Inference

At present, we provide inference code that takes interleaved image-text and video as input and outputs text and images.

For the instruction-tuned model, we provide examples for image captioning, visual question answering, and interleaved multi-image understanding:

python inference.py --instruct --ckpt-path ckpts/Emu-instruct.pt

For the pretrained model, we provide an example for in-context learning:

python inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}/multimodal_encoder/pytorch_model.bin

For image generation, we provide examples for image blending, text-to-image generation, and in-context generation:

python image_inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}

Schedule

We are committed to open-sourcing all Emu-related materials, including:

  • The weights of Emu and Emu-I
  • Inference example for interleaved image-text as input, text as output
  • Video inference example
  • Weights of image decoder & image generation/blending example
  • YT-Storyboard-1B pretraining data
  • Pretraining code
  • Instruction tuning code
  • Evaluation code

We hope to foster the growth of our community through open-sourcing and promoting collaboration👬. Let's step towards multimodal intelligence together🍻.

Acknowledgement

We thank the great work from LLaMA, BLIP-2, Stable Diffusion, and FastChat.

Citation

If you find Emu useful for your research and applications, please consider starring this repository and citing:

@article{Emu,
  title={Generative Pretraining in Multimodality},
  author={Sun, Quan and Yu, Qiying and Cui, Yufeng and Zhang, Fan and Zhang, Xiaosong and Wang, Yueze and Gao, Hongcheng and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  publisher={arXiv preprint arXiv:2307.05222},
  year={2023},
}


FSDP Implementation

  • Model in torch.float16
  • Inference takes about 31 GB on a V100; with FSDP across 8 GPUs (WORLD_SIZE=8), it takes about 8 GB per 3090.
  • Implementation details
    Manually wrap submodules in FSDP and move the remaining parameters to device_id (a code sketch follows the wrapping structure below).

    Why manually wrap?
    - All parameters within an FSDP wrapper must have the same requires_grad,
        and we have a mix of frozen and unfrozen parameters.
    - model.vision_encoder.visual needs to be individually wrapped, or encode_vision_x errors out.
        See: https://github.com/pytorch/pytorch/issues/82461#issuecomment-1269136344

    The rough wrapping structure is:
    - EMU (total about 24.8 GB in torch.float16)
        - visual (total about 1.8 GB in torch.float16)
        - ln_visual
        - cformer
        - decoder (total about 24 GB in torch.float16; about 3.043 GB after FSDP with world_size 8)
            - lm
                - base_model
                    - model
                        - layers
                            - FSDP(nn.Module) * 40
                                - buffers (not in parameters)
                                    - self_attn.rotary_emb.cos_cached
                                    - self_attn.rotary_emb.sin_cached
                        - FSDP(lm_head)
                        - FSDP(stu_regress_head)
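
    The sketch below illustrates this wrapping strategy. It is not the repository's actual code: the attribute paths (model.visual, model.decoder.lm.base_model.model.layers, lm_head, stu_regress_head) are assumptions taken from the structure above, the model is assumed to already be in torch.float16, and torch.distributed is assumed to be initialized.

    ```python
    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def move_unwrapped_to_device(module, device_id):
        # Recursively move parameters/buffers of non-FSDP submodules,
        # stopping the descent at FSDP-wrapped children so their shards
        # are left untouched.
        for child in module.children():
            if isinstance(child, FSDP):
                continue
            move_unwrapped_to_device(child, device_id)
        for p in module.parameters(recurse=False):
            p.data = p.data.to(device_id)
        for b in module.buffers(recurse=False):
            b.data = b.data.to(device_id)

    def wrap_emu_for_fsdp(model, device_id):
        lm = model.decoder.lm.base_model.model

        # Wrap each of the 40 decoder layers separately so their gathered
        # parameters can be freed layer by layer during forward/backward.
        lm.layers = torch.nn.ModuleList(
            FSDP(layer, device_id=device_id) for layer in lm.layers
        )
        lm.lm_head = FSDP(lm.lm_head, device_id=device_id)
        lm.stu_regress_head = FSDP(lm.stu_regress_head, device_id=device_id)

        # The vision encoder is wrapped on its own; otherwise encode_vision_x
        # errors out (see the linked PyTorch issue above).
        model.visual = FSDP(model.visual, device_id=device_id)

        # Everything not wrapped above (ln_visual, cformer, embeddings, ...)
        # is simply moved to the local device.
        move_unwrapped_to_device(model, device_id)

        # Outer wrap ("double wrap", see the FAQ below): the inner FSDP
        # modules are then no longer FSDP roots, so their gathered
        # parameters are freed after forward/backward.
        return FSDP(model, device_id=device_id)
    ```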


    Known issues:
    - Our FSDP strategy is not compatible with tied embeddings. If the LM embeddings are tied,
        train with DDP or set the --freeze_lm_embeddings flag to true.
    - With FSDP + gradient checkpointing, one can increase the batch size with seemingly no upper bound.
        Although the training curves look okay, we found that downstream performance dramatically
        degrades if the batch size is unreasonably large (e.g., 100 MMC4 batch size for OPT-125M).

    FAQs about our FSDP wrapping strategy:
    Why double wrap?
    As of torch==2.0.1, FSDP's _post_forward_hook and _post_backward_hook
    only free gathered parameters if the module is NOT FSDP root.

    Why unfreeze the decoder_layers?
    See https://github.com/pytorch/pytorch/issues/95805
    As of torch==2.0.1, FSDP's _post_backward_hook is only registered if the flat param
    requires_grad=True. We need the post-backward hook to fire to avoid OOM.
    To effectively freeze the decoder layers, we exclude them from the optimizer.
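
    A minimal sketch of this "freeze by optimizer exclusion" trick: the decoder layers keep requires_grad=True so FSDP's post-backward hook fires, but they never receive updates because their parameters are not handed to the optimizer. The attribute path and learning rate here are assumptions, not the repository's actual code.

    ```python
    import torch

    def build_optimizer(fsdp_model, lr=1e-4):
        # Parameters of the decoder layers: left requires_grad=True for FSDP,
        # but excluded from the optimizer so they are effectively frozen.
        frozen_ids = {
            id(p)
            for p in fsdp_model.decoder.lm.base_model.model.layers.parameters()
        }
        trainable = [
            p for p in fsdp_model.parameters()
            if p.requires_grad and id(p) not in frozen_ids
        ]
        return torch.optim.AdamW(trainable, lr=lr)
    ```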

    What is assumed to be frozen v. unfrozen?
    We assume that the model is being trained under normal Flamingo settings
    with these lines being called in factory.py:
        ```
        # Freeze all parameters
        model.requires_grad_(False)
        assert sum(p.numel() for p in model.parameters() if p.requires_grad) == 0

        # Unfreeze perceiver, gated_cross_attn_layers, and LM input embeddings
        model.perceiver.requires_grad_(True)
        model.lang_encoder.gated_cross_attn_layers.requires_grad_(True)
        # [optional] model.lang_encoder.get_input_embeddings().requires_grad_(True)
        ```
