Official implementation and checkpoints for the paper "CEPrompt: Cross-Modal Emotion-Aware Prompting for Facial Expression Recognition" (accepted to IEEE TCSVT 2024).
Facial expression recognition (FER) remains a challenging task due to the ambiguity and subtlety of expressions. Current FER methods predominantly prioritize visual cues while inadvertently neglecting the insights that other modalities can offer. Recently, vision-language pre-training (VLP) models have integrated textual cues as guidance, yielding a powerful multi-modal paradigm that has proven effective for a range of computer vision tasks. In this paper, we propose a Cross-Modal Emotion-Aware Prompting (CEPrompt) framework for FER based on VLP models. To make VLP models sensitive to expression-relevant visual discrepancies, we devise an Emotion Conception-guided Visual Adapter (EVA) that captures category-specific appearance representations under emotion conception guidance. Moreover, knowledge distillation is employed to prevent the model from forgetting the pre-trained category-invariant knowledge. In addition, we design a Conception-Appearance Tuner (CAT) that facilitates multi-modal interaction by cooperatively tuning emotion conception and appearance prompts. In this way, semantic information from emotion text conceptions is infused directly into the facial appearance images, enabling a comprehensive and precise understanding of expression-related facial details. Quantitative and qualitative experiments show that CEPrompt outperforms state-of-the-art approaches on three real-world FER datasets.
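For intuition, below is a minimal PyTorch-style sketch of the cross-modal prompting idea, i.e., letting learnable appearance prompts interact with emotion-conception (text) embeddings. It is only a conceptual illustration, not the actual CEPrompt implementation: the module name, the single cross-attention layer, and all hyper-parameters are assumptions chosen for brevity.

import torch
import torch.nn as nn

class ConceptionAppearanceSketch(nn.Module):
    """Conceptual sketch only: learnable appearance prompts attend to frozen
    emotion-conception (text) embeddings so that semantic cues are infused
    into the visual prompt tokens before they join the ViT patch tokens."""
    def __init__(self, dim=512, num_prompts=8, num_heads=8):
        super().__init__()
        # Learnable visual "appearance" prompt tokens.
        self.appearance_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Cross-attention: appearance prompts (queries) over text conceptions (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, conception_emb):
        # conception_emb: (B, C, dim) text embeddings of the C emotion classes.
        q = self.appearance_prompts.unsqueeze(0).expand(conception_emb.size(0), -1, -1)
        fused, _ = self.cross_attn(q, conception_emb, conception_emb)
        return self.norm(q + fused)  # (B, P, dim) conception-aware appearance prompts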
- Install the package requirements:
pip install -r requirements.txt
- Download the pretrained VLP (ViT-B/16) model from OpenAI CLIP.
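A minimal sketch of fetching the weights with the official openai/CLIP package; the cache directory ./pretrained is just an example:

import clip
import torch

# Downloads and caches the ViT-B/16 checkpoint released with OpenAI CLIP.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device, download_root="./pretrained")
print(model.visual.conv1.weight.shape)  # patch-embedding weights of the ViT-B/16 image encoder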
- The downloaded RAF-DB dataset is reorganized as follows:
data/
├─ RAF-DB/
│ ├─ basic/
│ │ ├─ EmoLabel/
│ │ │ ├─ images.txt
│ │ │ ├─ image_class_labels.txt
│ │ │ ├─ train_test_split.txt
│ │ ├─ Image/
│ │ │ ├─ aligned/
│ │ │ ├─ aligned_224/ # re-aligned by MTCNN
For the aligned images, we use the re-aligned images provided by APVIT.
- The downloaded AffectNet dataset is reorganized as follows:
data/
├─ AffectNet/
│ ├─ affectnet_info/
│ │ ├─ images.txt
│ │ ├─ image_class_labels.txt
│ │ ├─ train_test_split.txt
│ ├─ Manually_Annotated_Images/
│ │ ├─ 1/
│ │ │ ├─ images
│ │ │ ├─ ...
│ │ ├─ 2/
│ │ ├─ ...
- The three data-loading and data-split txt files are structured as follows (a parsing sketch is given after the examples):
% (1) images.txt:
idx | imagename
1 train_00001.jpg
2 train_00002.jpg
.
15339 test_3068.jpg
% (2) image_class_labels.txt:
idx | label
1 5
2 5
.
15339 7
% (3) train_test_split.txt:
idx | train(1) or test(0)
1 1
2 1
.
15339 0
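Below is a minimal parsing sketch for these three files. The two-column "idx value" layout follows the examples above, while the helper name load_split and the commented-out paths are only illustrative:

import os

def load_split(anno_dir, image_root):
    """Join images.txt, image_class_labels.txt and train_test_split.txt
    (all 'idx value' pairs) into (path, label) lists for train and test."""
    def read_pairs(name):
        with open(os.path.join(anno_dir, name)) as f:
            return dict(line.strip().split() for line in f if line.strip())

    names = read_pairs("images.txt")                 # idx -> image name
    labels = read_pairs("image_class_labels.txt")    # idx -> class label
    is_train = read_pairs("train_test_split.txt")    # idx -> 1 (train) / 0 (test)

    train, test = [], []
    for idx, name in names.items():
        sample = (os.path.join(image_root, name), int(labels[idx]))
        (train if is_train[idx] == "1" else test).append(sample)
    return train, test

# Example with the RAF-DB layout shown above:
# train_set, test_set = load_split("data/RAF-DB/basic/EmoLabel",
#                                  "data/RAF-DB/basic/Image/aligned_224")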
- Download model checkpoints from Google Drive.
- Train the first stage:
python3 train_fer_first_stage.py \
    --dataset ${DATASET} \
    --data-path ${DATAPATH}
- Train the second stage with the first-stage checkpoint:
python3 train_fer_second_stage.py \
    --dataset ${DATASET} \
    --data-path ${DATAPATH} \
    --ckpt-path ${CKPTPATH}
- Alternatively, run the provided shell scripts:
bash stage1.sh
bash stage2.sh
- Evaluate with the trained checkpoints:
# DATASET: dataset name; DATAPATH: path to the dataset;
# CKPTPATH: first-stage checkpoint; EVACKPTPATH: second-stage checkpoint.
python3 train_fer_second_stage.py \
    --eval \
    --dataset ${DATASET} \
    --data-path ${DATAPATH} \
    --ckpt-path ${CKPTPATH} \
    --eval-ckpt ${EVACKPTPATH}
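For example, an end-to-end run on RAF-DB could look like the following; the --dataset value, data path, and checkpoint filenames are placeholders that should be adapted to your setup:

# Placeholder values: adjust the dataset name, data path, and checkpoint paths as needed.
python3 train_fer_first_stage.py --dataset RAF-DB --data-path data/RAF-DB
python3 train_fer_second_stage.py --dataset RAF-DB --data-path data/RAF-DB \
    --ckpt-path checkpoints/first_stage.pth
python3 train_fer_second_stage.py --eval --dataset RAF-DB --data-path data/RAF-DB \
    --ckpt-path checkpoints/first_stage.pth --eval-ckpt checkpoints/second_stage.pth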
If you find our work helpful, please cite our paper:
@ARTICLE{Zhou2024CEPrompt,
author={Zhou, Haoliang and Huang, Shucheng and Zhang, Feifei and Xu, Changsheng},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
title={CEPrompt: Cross-Modal Emotion-Aware Prompting for Facial Expression Recognition},
year={2024},
volume={34},
number={11},
pages={11886-11899},
doi={10.1109/TCSVT.2024.3424777}
}
For any questions, feel free to open an issue or email [email protected].