## image_based_facial_emotion_estimation module

The *image_based_facial_emotion_estimation* module contains the *FacialEmotionLearner* class, which inherits from the abstract class *Learner*.

### Class FacialEmotionLearner
Bases: `engine.learners.Learner`

The *FacialEmotionLearner* class is an implementation of the state-of-the-art method ESR [[1]](#1) for efficient facial feature learning with wide ensemble-based convolutional neural networks.
An ESR consists of two building blocks.
(1) The base of the network is an array of convolutional layers for low- and middle-level feature learning.
(2) These informative features are then shared with independent convolutional branches that constitute the ensemble.
From this point, each branch can learn distinctive features while competing for a common resource: the shared layers.
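The sketch below illustrates this shared-base/ensemble-branch structure in PyTorch; the layer sizes, branch depth, and number of classes are illustrative assumptions rather than the actual ESR-9 architecture.

```python
# Minimal PyTorch sketch of the ESR idea: one shared convolutional base,
# several independent branches. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn


class ToyESR(nn.Module):
    def __init__(self, ensemble_size=9, num_classes=8):
        super().__init__()
        # (1) Shared base: low- and middle-level feature learning, computed once.
        self.base = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # (2) Independent convolutional branches that constitute the ensemble.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 6 * 6, num_classes),
            )
            for _ in range(ensemble_size)
        ])

    def forward(self, x):
        shared = self.base(x)  # the shared layers: the common resource
        return [branch(shared) for branch in self.branches]  # one prediction per branch


logits_per_branch = ToyESR()(torch.randn(1, 3, 96, 96))  # 9 tensors of shape (1, 8)
```
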
The [FacialEmotionLearner](/src/opendr/perception/facial_expression_recognition/image_based_facial_emotion_estimation/facial_emotion_learner.py) class has the following public methods:


#### `FacialEmotionLearner` constructor
```python
FacialEmotionLearner(self, lr, batch_size, temp_path, device, device_ind, validation_interval,
                     max_training_epoch, momentum, ensemble_size, base_path_experiment, name_experiment,
                     dimensional_finetune, categorical_train, base_path_to_dataset, max_tuning_epoch, diversify)
```

Constructor parameters:

- **lr**: *float, default=0.1*\
  Specifies the initial learning rate to be used during training.
- **batch_size**: *int, default=32*\
  Specifies the number of samples to be bundled up in a batch during training.
  This heavily affects memory usage; adjust it according to your system.
- **temp_path**: *str, default='temp'*\
  Specifies a path where the algorithm saves the checkpoints and the ONNX-optimized model (if needed).
- **device**: *{'cpu', 'cuda'}, default='cuda'*\
  Specifies the device to be used.
- **device_ind**: *list, default=[0]*\
  List of GPU indices to be used if the device is 'cuda'.
- **validation_interval**: *int, default=1*\
  Specifies the validation interval.
- **max_training_epoch**: *int, default=2*\
  Specifies the maximum number of epochs the training should run for.
- **momentum**: *float, default=0.9*\
  Specifies the momentum value used by the optimizer.
- **ensemble_size**: *int, default=9*\
  Specifies the number of ensemble branches in the model.
- **base_path_experiment**: *str, default='./experiments/'*\
  Specifies the path in which the experimental results will be saved.
- **name_experiment**: *str, default='esr_9'*\
  String name used for saving checkpoints.
- **dimensional_finetune**: *bool, default=True*\
  Specifies whether the model should be fine-tuned on dimensional data.
- **categorical_train**: *bool, default=False*\
  Specifies whether the model should be trained on categorical data.
- **base_path_to_dataset**: *str, default='./data/AffectNet'*\
  Specifies the dataset path.
- **max_tuning_epoch**: *int, default=1*\
  Specifies the maximum number of epochs for which the model should be fine-tuned on dimensional data.
- **diversify**: *bool, default=False*\
  Specifies whether the learner diversifies the features of the different branches.

#### `FacialEmotionLearner.fit`
```python
FacialEmotionLearner.fit(self)
```

This method is used for training the algorithm on a training dataset while validating on a validation dataset.


#### `FacialEmotionLearner.eval`
```python
FacialEmotionLearner.eval(self, eval_type, current_branch_on_training)
```

This method is used to evaluate a trained model on an evaluation dataset.
It returns a dictionary containing evaluation statistics.

Parameters:

- **eval_type**: *str, default='categorical'*\
  Specifies the type of data on which the model is evaluated.
  It can be either categorical or dimensional data.
- **current_branch_on_training**: *int, default=0*\
  Specifies the index of the trained branch that should be evaluated on the validation data.

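A minimal usage sketch, assuming `learner` is a *FacialEmotionLearner* that has already been trained or loaded with pretrained weights:

```python
# Hypothetical usage: `learner` is assumed to be already trained or loaded.
stats = learner.eval(eval_type='categorical', current_branch_on_training=0)
print(stats)  # dictionary of evaluation statistics
```
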
#### `FacialEmotionLearner.init_model`
```python
FacialEmotionLearner.init_model(self, num_branches)
```

This method is used to initialize the model.

Parameters:

- **num_branches**: *int*\
  Specifies the number of ensemble branches in the model. The ESR-9 model is built with 9 branches by default.

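For example, to build a smaller ensemble than the default nine branches (a sketch; `learner` is assumed to be an existing *FacialEmotionLearner*):

```python
# Hypothetical: initialize the underlying model with 3 ensemble branches.
learner.init_model(num_branches=3)
```
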
#### `FacialEmotionLearner.infer`
```python
FacialEmotionLearner.infer(self, input_batch)
```

This method is used to perform inference on an image or a batch of images.
It returns the dimensional emotion results as well as the categorical emotion results as an object of `engine.target.Category`, provided that a proper `engine.data.Image` input object is given.

Parameters:

- **input_batch**: *object*\
  Object of type `engine.data.Image`. It can also be a list of Image objects, or a Torch tensor, which will be converted to an Image object.

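A minimal single-image sketch; the image path is a placeholder and `learner` is assumed to hold trained or pretrained weights:

```python
# Hypothetical single-image inference; 'face.jpg' is a placeholder path and
# `learner` is assumed to hold trained or pretrained weights.
from opendr.engine.data import Image

img = Image.open('face.jpg')
emotion_results, dimension_results = learner.infer(img)
```
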
#### `FacialEmotionLearner.save`
```python
FacialEmotionLearner.save(self, state_dicts, base_path_to_save_model)
```
This method is used to save a trained model.
Provided with the path (absolute or relative), it creates the "path" directory if it does not already exist.
Inside this folder, the model is saved as "model_name.pt" and the metadata file as "model_name.json". If the directory already exists, the "model_name.pt" and "model_name.json" files are overwritten.

If [`self.optimize`](#FacialEmotionLearner.optimize) was run previously, it saves the optimized ONNX model in a similar fashion with an ".onnx" extension, by copying it from `self.temp_path`, where it was saved during conversion.

Parameters:

- **state_dicts**: *object*\
  A Python dictionary containing the trained model weights.
- **base_path_to_save_model**: *str*\
  Specifies the path in which the model will be saved.

#### `FacialEmotionLearner.load`
```python
FacialEmotionLearner.load(self, ensemble_size, path_to_saved_network, file_name_base_network,
                          file_name_conv_branch, fix_backbone)
```

Loads the model from the directory at the provided path, using the metadata .json file included.

Parameters:

- **ensemble_size**: *int, default=9*\
  Specifies the number of ensemble branches in the model for which the pretrained weights should be loaded.
- **path_to_saved_network**: *str, default="./trained_models/esr_9"*\
  Path of the model to be loaded.
- **file_name_base_network**: *str, default="Net-Base-Shared_Representations.pt"*\
  The file name of the base network to be loaded.
- **file_name_conv_branch**: *str, default="Net-Branch_{}.pt"*\
  The file name of the ensemble branch network to be loaded.
- **fix_backbone**: *bool*\
  If True, all model weights except those of the classifier are frozen, so that only the last layers are fine-tuned on dimensional data.
  Otherwise, all model weights are trained from scratch.


#### `FacialEmotionLearner.optimize`
```python
FacialEmotionLearner.optimize(self, do_constant_folding)
```

This method is used to optimize a trained model by converting it to ONNX format, which can then be used for inference.

Parameters:

- **do_constant_folding**: *bool, default=False*\
  ONNX format optimization.
  If True, the constant-folding optimization is applied to the model during export.


#### `FacialEmotionLearner.download`
```python
FacialEmotionLearner.download(self, path, mode, url)
```

Downloads data and saves it in the provided path.

Parameters:

- **path**: *str, default=None*\
  Local path to save the files; defaults to `self.temp_path` if None.
- **mode**: *str, default="data"*\
  What file to download; can be "data".
- **url**: *str, default=OpenDR FTP URL*\
  URL of the FTP server.


#### Data preparation
Download the [AffectNet](http://mohammadmahoor.com/affectnet/) [[2]](#2) dataset, and organize it in the following structure:
```
AffectNet/
    Training_Labeled/
        0/
        1/
        ...
        n/
    Training_Unlabeled/
        0/
        1/
        ...
        n/
    Validation/
        0/
        1/
        ...
        n/
```
To do so, run the following function:
```python
from opendr.perception.facial_expression_recognition.image_based_facial_emotion_estimation.algorithm.utils import datasets
datasets.pre_process_affect_net(base_path_to_images, base_path_to_annotations, base_destination_path, set_index)
```
This pre-processes the AffectNet dataset by cropping and resizing the images to 96 x 96 pixels, and organizing them in folders containing 500 images each.
Each image is renamed to follow the pattern "[id]_[emotion_idx]_[valence times 1000]_[arousal times 1000].jpg".

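Under this naming pattern, the annotations can be recovered from a file name alone. A small sketch (the helper below is illustrative, not part of the module):

```python
# Illustrative helper (not part of the module): recover the annotations
# encoded in a pre-processed file name of the form
# "[id]_[emotion_idx]_[valence*1000]_[arousal*1000].jpg".
import os

def parse_affectnet_name(filename):
    stem = os.path.splitext(os.path.basename(filename))[0]
    sample_id, emotion_idx, valence, arousal = stem.split('_')
    return int(sample_id), int(emotion_idx), int(valence) / 1000.0, int(arousal) / 1000.0

print(parse_affectnet_name('12_3_500_-250.jpg'))  # (12, 3, 0.5, -0.25)
```
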
#### Pre-trained models

The models pretrained on the AffectNet Categorical dataset are provided by [[1]](#1) and can be found [here](https://github.com/siqueira-hc/Efficient-Facial-Feature-Learning-with-Wide-Ensemble-based-Convolutional-Neural-Networks/tree/master/model/ml/trained_models/esr_9).
**Please note that the pretrained weights cannot be used for commercial purposes.**

#### Examples

* **Train the ensemble model on the AffectNet Categorical dataset and then fine-tune it on the AffectNet dimensional dataset**
  The training and evaluation datasets should be present in the path provided.
  The `batch_size` argument should be adjusted according to the available memory.

  ```python
  from opendr.perception.facial_expression_recognition import FacialEmotionLearner

  learner = FacialEmotionLearner(device="cpu", temp_path='./tmp',
                                 batch_size=2, max_training_epoch=1, ensemble_size=1,
                                 name_experiment='esr_9', base_path_experiment='./experiments/',
                                 lr=1e-1, categorical_train=True, dimensional_finetune=True,
                                 base_path_to_dataset='./data', max_tuning_epoch=1)
  learner.fit()
  learner.save(state_dicts=learner.model.to_state_dict(),
               base_path_to_save_model=learner.base_path_experiment)
  ```

* **Inference on a batch of images**
  ```python
  from opendr.perception.facial_expression_recognition import FacialEmotionLearner
  from opendr.perception.facial_expression_recognition.image_based_facial_emotion_estimation.algorithm.utils import datasets
  from torch.utils.data import DataLoader

  learner = FacialEmotionLearner(device="cpu", temp_path='./tmp',
                                 batch_size=2, max_training_epoch=1, ensemble_size=1,
                                 name_experiment='esr_9', base_path_experiment='./experiments/',
                                 lr=1e-1, categorical_train=True, dimensional_finetune=True,
                                 base_path_to_dataset='./data', max_tuning_epoch=1)

  # Download the validation data
  dataset_path = learner.download(mode='data')
  val_data = datasets.AffectNetCategorical(idx_set=2,
                                           max_loaded_images_per_label=2,
                                           transforms=None,
                                           is_norm_by_mean_std=False,
                                           base_path_to_affectnet=dataset_path)

  val_loader = DataLoader(val_data, batch_size=32, shuffle=False, num_workers=8)
  batch = next(iter(val_loader))[0]
  learner.load(learner.ensemble_size, path_to_saved_network=learner.base_path_experiment, fix_backbone=True)
  ensemble_emotion_results, ensemble_dimension_results = learner.infer(batch)
  ```

* **Optimization example for a previously trained model**
  Inference can be run with the trained model after running `optimize()`.
  ```python
  from opendr.perception.facial_expression_recognition import FacialEmotionLearner

  learner = FacialEmotionLearner(device="cpu", temp_path='./tmp',
                                 batch_size=2, max_training_epoch=1, ensemble_size=1,
                                 name_experiment='esr_9', base_path_experiment='./experiments/',
                                 lr=1e-1, categorical_train=True, dimensional_finetune=True,
                                 base_path_to_dataset='./data', max_tuning_epoch=1)

  learner.load(learner.ensemble_size, path_to_saved_network=learner.base_path_experiment, fix_backbone=True)
  learner.optimize(do_constant_folding=True)
  learner.save(state_dicts=learner.model.to_state_dict(),
               base_path_to_save_model='./parent_dir/optimized_model')
  ```


#### Performance Evaluation

The tests were conducted on the following computational devices:
- Intel(R) Xeon(R) Gold 6230R CPU on server
- Nvidia Jetson TX2
- Nvidia Jetson Xavier AGX
- Nvidia RTX 2080 Ti GPU on server with Intel Xeon Gold processors


Inference time is measured as the time taken to transfer the input to the model (e.g., from CPU to GPU), run inference using the algorithm, and return the results to the CPU.
Both ESR and its extension diversified_ESR (denoted as ESR*), which learns diversified feature representations to improve model generalisation, are implemented in *FacialEmotionLearner*.
ESR-n and ESR*-n denote the ESR and diversified-ESR models with n ensemble branches, respectively.

The model can receive either single images or a video, e.g. captured by a webcam, as input, and perform the prediction frame by frame.

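A frame-by-frame webcam sketch using OpenCV; wrapping each raw frame via `Image(frame)` and the availability of trained or pretrained weights in `learner` are assumptions, and face detection/cropping is omitted for brevity:

```python
# Hedged webcam sketch: reads frames with OpenCV and runs frame-by-frame
# inference. Assumes `learner` already holds trained or pretrained weights.
import cv2
from opendr.engine.data import Image

cap = cv2.VideoCapture(0)  # default webcam
try:
    while True:
        ok, frame = cap.read()  # BGR uint8 frame
        if not ok:
            break
        # Wrapping the raw frame in an OpenDR Image is assumed to be sufficient here.
        emotion, dimensions = learner.infer(Image(frame))
        print(emotion)  # categorical prediction for this frame
finally:
    cap.release()
```
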
We report the speed (single sample per inference) as the mean of 100 runs, and the energy (Joules) on the embedded devices.
The memory reported is the maximum memory allocated on the GPU during inference.

| Method  | Acc. (%) | Params (M) | Mem. (MB) |
|---------|----------|------------|-----------|
| ESR-9   | 87.17    | 20.35      | 402.99    |
| ESR-15  | 88.59    | 33.67      | 455.61    |
| ESR*-9  | 89.15    | 20.83      | 406.83    |
| ESR*-15 | 89.34    | 34.47      | 460.73    |

The inference speed (evaluations/second) of both learners on various computational devices is as follows:

| Method  | CPU   | Jetson TX2 | Jetson Xavier | RTX 2080 Ti |
|---------|-------|------------|---------------|-------------|
| ESR-9   | 22.23 | 27.08      | 28.79         | 117.91      |
| ESR-15  | 13.86 | 17.76      | 18.17         | 91.78       |
| ESR*-9  | 5.24  | 6.60       | 12.45         | 33.40       |
| ESR*-15 | 3.38  | 4.18       | 8.47          | 20.57       |

The energy (Joules) consumed by both learners' inference on the embedded devices is shown in the following table:

| Method  | Jetson TX2 | Jetson Xavier |
|---------|------------|---------------|
| ESR-9   | 0.96       | 0.67          |
| ESR-15  | 1.16       | 0.93          |
| ESR*-9  | 3.38       | 1.41          |
| ESR*-15 | 6.26       | 2.51          |


## References

<a id="1">[1]</a>
[Siqueira, Henrique, Sven Magg, and Stefan Wermter. "Efficient facial feature learning with wide ensemble-based convolutional neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.](https://ojs.aaai.org/index.php/AAAI/article/view/6037)

<a id="2">[2]</a>
[Mollahosseini, Ali, Behzad Hasani, and Mohammad H. Mahoor. "AffectNet: A database for facial expression, valence, and arousal computing in the wild." IEEE Transactions on Affective Computing 10.1 (2017): 18-31.](https://ieeexplore.ieee.org/abstract/document/8013713)