MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Models (EMNLP2023 Findings)
TL;DR: We propose a framework that leverages Large Language Models (LLMs) to solve multi-modal open-domain question answering (MOQA) in a zero-shot manner. Unlike the previous supervised paradigm, it requires no domain-specific annotations or modules, making it a general paradigm for MOQA tasks.

The framework tackles each modality separately and fuses the per-modality results with LLMs. This divide-and-conquer design circumvents the obstacle of complex multi-modal retrieval and ranking, and is flexible enough to incorporate new modalities and stronger models for better performance.
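Conceptually, the flow looks like the toy sketch below (the function signatures are placeholders for illustration, not the repository's API):

```python
# Toy illustration of the divide-and-conquer flow (not the repo's actual code):
# retrieve evidence per modality, answer per modality, add a direct LLM answer,
# then let the LLM fuse all candidates into the final answer.
from typing import Callable, Dict, List

def moqa_pipeline(
    question: str,
    corpora: Dict[str, List[str]],                    # modality -> evidence pool
    retrieve: Callable[[str, List[str]], List[str]],  # per-modality retriever
    answer: Callable[[str, List[str], str], str],     # per-modality QA model
    llm: Callable[[str], str],                        # large language model
) -> str:
    candidates = {m: answer(question, retrieve(question, pool), m) for m, pool in corpora.items()}
    candidates["direct"] = llm(question)              # answer from parametric knowledge alone
    fusion_prompt = f"Question: {question}\nCandidates: {candidates}\nFinal short answer:"
    return llm(fusion_prompt)
```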
```bash
pip install -r requirements.txt
```
| Dataset | URL |
|---|---|
MMCoQA | https://github.com/liyongqi67/MMCoQA?tab=readme-ov-file |
MultiModalQA | https://github.com/allenai/multimodalqa |
- Download the datasets and update `DATASET_DIR` in `utils/config.yaml`:

  ```yaml
  DATASET_DIR: /path/to/your/dataset
  ```
- Download the precomputed features from https://huggingface.co/datasets/le723z/MOQAGPT_FEAT 🤗
- Alternatively, generate the features yourself with Hugging Face transformers models such as `Alibaba-NLP/gte-large-en-v1.5` or `Alibaba-NLP/gte-Qwen2-1.5B-instruct`. Refer to the scripts in the `extract/` directory:

  ```bash
  # Extract image and query embeddings using the CLIP model
  python extract/extract_image_features.py
  # Extract table and query embeddings
  python extract/extract_table_features.py --dataset mmcoqa --model [Huggingface Model]
  # Extract passage and query embeddings
  python extract/extract_text_features.py --passages_file_path [] --query_file_path [] --model [Huggingface Model]
  ```
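  For reference, CLIP-based embedding extraction boils down to something like the sketch below (the checkpoint and file paths are illustrative assumptions; the actual logic lives in `extract/extract_image_features.py`):

  ```python
  # Minimal sketch of CLIP embedding extraction; checkpoint and paths are
  # illustrative, see extract/extract_image_features.py for the real script.
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  questions = ["What color is the team's jersey?"]
  images = [Image.open("example_image.jpg").convert("RGB")]

  with torch.no_grad():
      text_inputs = processor(text=questions, return_tensors="pt", padding=True, truncation=True)
      question_emb = model.get_text_features(**text_inputs)
      image_inputs = processor(images=images, return_tensors="pt")
      reference_emb = model.get_image_features(**image_inputs)

  torch.save(question_emb, "QuestionEmbedding.pt")    # query embeddings
  torch.save(reference_emb, "ReferenceEmbedding.pt")  # image (reference) embeddings
  ```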
- Ensure `FEATURES_CONFIG` in `utils/config.yaml` is updated with the correct paths:

  ```yaml
  FEATURES_CONFIG:
    MMCOQA:
      image:
        clip:
          QuestionEmbedding: stored_features/mmcoqa/image/clip_features/QuestionEmbedding.pt
          ReferenceEmbedding: stored_features/mmcoqa/image/clip_features/ReferenceEmbedding.pt
  ```
```python
from pipeline.retriever import MultimodalRetriever

features_model_dict = {"image": "clip", "table": "ada", "text": "gte-base-en-v1.5"}
retriever = MultimodalRetriever("MMCOQA", features_model_dict)

question_ids = ["C_226_0", "C_226_1"]
retrieved_results, retrieved_reference = retriever.retrieve(question_ids)
# retrieved_results:   {qid: {modality: [ref1_id: score, ref2_id: score]}}
# retrieved_reference: {qid: {modality: [ref1_content, ref2_content]}}
```
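Under the hood, retrieval for each modality amounts to a nearest-neighbor search over the stored embeddings. A minimal sketch of the idea (not the repository's actual implementation), assuming the `.pt` files from `FEATURES_CONFIG` hold `[num_items, dim]` tensors:

```python
# Illustrative dense retrieval over stored features; MultimodalRetriever's
# real implementation may differ (this is not the repository's code).
import torch
import torch.nn.functional as F

question_emb = torch.load("stored_features/mmcoqa/image/clip_features/QuestionEmbedding.pt")
reference_emb = torch.load("stored_features/mmcoqa/image/clip_features/ReferenceEmbedding.pt")

# Cosine similarity between every question and every image reference.
scores = F.normalize(question_emb, dim=-1) @ F.normalize(reference_emb, dim=-1).T
top_scores, top_idx = scores.topk(k=5, dim=-1)  # top-5 references per question
```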
- Generate `direct_QA` answers using LLMs: run the `direct_qa.py` script to generate answers directly with LLMs (`['openchat', 'gpt-4', 'llama2chat']`) on a dataset (`['mmqa', 'mmcoqa']`). The output will be saved in the format `MOQAGPT/output/$DATA/direct_chatgpt.json`.

  ```bash
  python pipeline/direct_qa.py --dataset $DATA --direct_qa_model $LLM
  ```
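  For intuition, direct QA is plain zero-shot prompting of an LLM with the question alone. A minimal sketch using the OpenAI client (the prompt wording and model choice are assumptions, not the exact configuration in `direct_qa.py`):

  ```python
  # Minimal sketch of direct (retrieval-free) QA with an LLM; prompt and model
  # are illustrative, not the exact configuration used by direct_qa.py.
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def direct_answer(question: str) -> str:
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[
              {"role": "system", "content": "Answer the question as briefly as possible."},
              {"role": "user", "content": question},
          ],
          temperature=0,
      )
      return response.choices[0].message.content.strip()

  print(direct_answer("Which country hosted the 2014 FIFA World Cup?"))
  ```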
- Generate multi-modality query answers: use the `answerer.py` script to generate answers for queries referencing multiple modalities (e.g., text, table, or visual). The outputs are saved in the same location, `MOQAGPT/output/$DATA/` (e.g., `Tllama2chatTabllama2chat.json` for text/table answers and `Iblip2.json` for visual answers, used in the next step).

  ```bash
  python pipeline/answerer.py --dataset $DATA --text_qa $LLM --table_qa $LLM
  ```
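  As an illustration of the visual branch (the `Iblip2.json` output name indicates BLIP-2 is used for image QA), here is a minimal VQA sketch with Hugging Face transformers; the checkpoint and prompt format are assumptions:

  ```python
  # Minimal sketch of answering a question over a retrieved image with BLIP-2.
  # Checkpoint and prompt format are illustrative assumptions.
  import torch
  from PIL import Image
  from transformers import Blip2Processor, Blip2ForConditionalGeneration

  processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
  model = Blip2ForConditionalGeneration.from_pretrained(
      "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
  ).to("cuda")

  image = Image.open("retrieved_image.jpg").convert("RGB")
  prompt = "Question: What sport is shown in the image? Answer:"
  inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
  generated = model.generate(**inputs, max_new_tokens=20)
  print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
  ```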
- Ensemble answers and reason the final response: combine answers from multiple sources (text, table, visual, and direct answers) using the `strategy.py` script.

  ```bash
  python pipeline/strategy.py --reasoner $LLM \
      --textual_qa output/$DATA/Tllama2chatTabllama2chat.json \
      --visual_qa output/$DATA/Iblip2.json \
      --direct_qa output/$DATA/direct_chatgpt.json
  # The Reasoning_File_Result will be saved at:
  # output/$DATA/candidates/chatgpt/Iblip2_Tllama2chatTabllama2chat_direct_chatgpt.json
  ```
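  Conceptually, the reasoner is prompted with the question plus all candidate answers and asked to produce the final answer. A minimal sketch of such a fusion prompt (wording is an assumption, not the exact prompt in `strategy.py`):

  ```python
  # Illustrative fusion prompt; wording is an assumption, not the repo's exact prompt.
  def build_fusion_prompt(question: str, candidates: dict) -> str:
      lines = [f"Question: {question}", "Candidate answers from different sources:"]
      for source, answer in candidates.items():  # e.g. text, table, image, direct
          lines.append(f"- {source}: {answer}")
      lines.append("Considering all candidates, give the single most likely short answer.")
      return "\n".join(lines)

  prompt = build_fusion_prompt(
      "Who painted the work shown in the image?",
      {"image": "Claude Monet", "text": "Monet", "direct": "Claude Monet"},
  )
  print(prompt)  # feed this prompt to the chosen --reasoner LLM
  ```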
- Evaluate the final results: run the `evaluation.py` script to evaluate the final ensemble results against the target file.

  ```bash
  python pipeline/evaluation.py --target_file $Reasoning_File_Result
  ```
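  The reported metrics are EM and F1. For reference, the standard SQuAD-style token-level computation looks roughly like the following (normalization details in `evaluation.py` may differ):

  ```python
  # Standard SQuAD-style EM/F1 over normalized tokens; evaluation.py may differ in details.
  import re
  import string
  from collections import Counter

  def normalize(text: str) -> str:
      text = text.lower()
      text = "".join(ch for ch in text if ch not in string.punctuation)
      text = re.sub(r"\b(a|an|the)\b", " ", text)
      return " ".join(text.split())

  def exact_match(prediction: str, gold: str) -> int:
      return int(normalize(prediction) == normalize(gold))

  def f1_score(prediction: str, gold: str) -> float:
      pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
      overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
      if overlap == 0:
          return 0.0
      precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
      return 2 * precision * recall / (precision + recall)

  print(exact_match("the Eiffel Tower", "Eiffel Tower"), f1_score("Eiffel Tower in Paris", "Eiffel Tower"))
  ```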
```bibtex
@inproceedings{zhang-etal-2023-moqagpt,
title = "{M}oqa{GPT} : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model",
author = "Zhang, Le and
Wu, Yihong and
Mo, Fengran and
Nie, Jian-Yun and
Agrawal, Aishwarya",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.85",
doi = "10.18653/v1/2023.findings-emnlp.85",
pages = "1195--1210",
abstract = "Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.",
}
```