In this project, I use the code from Show, Attend and Tell, an implementation based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention".
The goal is to predict the label of a food image and, eventually, to generate the ingredients of the food it shows, so I think image caption generation is a better fit than plain image classification. Because of the limits of the dataset, for now I can only use the image label as the caption. Overall, though, the results are not bad.
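Using a label as a caption mostly means treating the label as a very short sentence. The sketch below shows one way this could look, assuming underscore- or space-separated labels. The special tokens, the helper names (`tokenize`, `build_vocab`, `encode`), and the example labels are all hypothetical and not taken from the project code.

```python
# Hypothetical special tokens; the actual project may use different ones.
PAD, START, END = "<NULL>", "<START>", "<END>"


def tokenize(label):
    """Split a label such as 'apple_pie' into lowercase words."""
    return label.lower().replace("_", " ").split()


def build_vocab(labels):
    """Assign an integer id to every word seen in the labels."""
    vocab = {PAD: 0, START: 1, END: 2}
    for label in labels:
        for word in tokenize(label):
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(label, vocab, max_len):
    """Turn a label into a fixed-length caption: START, words, END, padding."""
    ids = [vocab[START]] + [vocab[w] for w in tokenize(label)] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError(f"label {label!r} needs more than {max_len} tokens")
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Made-up example labels, for illustration only.
labels = ["apple_pie", "fried rice", "apple_pie"]
vocab = build_vocab(labels)
encoded = [encode(label, vocab, max_len=5) for label in labels]
print(encoded)
```

With this encoding, a longer caption such as an ingredient list could later replace the label without changing the model's input format.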
- Extract image features with a CNN (VGG-19).
- Use an attention-based LSTM to generate the caption of the image, which in this case is the image label.
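The second step can be illustrated with a single soft-attention step: the decoder scores every spatial location of the CNN feature map against its current hidden state, normalizes the scores with a softmax, and feeds the weighted average of the features to the LSTM. This is a minimal sketch in plain Python, not the project's implementation. The weight names (`W_f`, `W_h`, `w`), the tiny dimensions, and the random values are assumptions chosen for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, hidden, W_f, W_h, w):
    """One attention step.

    features: L location vectors of size D (the flattened CNN feature map)
    hidden:   decoder hidden state of size H
    W_f (K x D), W_h (K x H), w (K): hypothetical learned parameters
    Returns the attention weights (length L) and the context vector (size D).
    """
    h_proj = matvec(W_h, hidden)
    scores = []
    for feat in features:
        f_proj = matvec(W_f, feat)
        scores.append(dot(w, [math.tanh(f + h) for f, h in zip(f_proj, h_proj)]))
    alpha = softmax(scores)
    dim = len(features[0])
    context = [sum(a * feat[d] for a, feat in zip(alpha, features)) for d in range(dim)]
    return alpha, context


# Tiny random example: L=4 locations, D=3 feature channels, H=2, K=2.
rng = random.Random(0)
L, D, H, K = 4, 3, 2, 2
features = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(L)]
hidden = [rng.uniform(-1, 1) for _ in range(H)]
W_f = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(K)]
W_h = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(K)]
w = [rng.uniform(-1, 1) for _ in range(K)]

alpha, context = soft_attention(features, hidden, W_f, W_h, w)
print("attention weights:", alpha)
print("context vector:", context)
```

In a real model the feature map has far more locations and channels, and the parameters are learned jointly with the LSTM. The weights `alpha` are what an attention visualization displays.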
Note: The white regions show the attention area that the model focuses on when predicting the caption/label.
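A visualization like this can be produced by upsampling the coarse attention map to the image size and brightening the image where the weights are high. The sketch below assumes a grayscale image with values in [0, 1] and uses nearest-neighbour upsampling. The function names and the `strength` parameter are hypothetical, and the project may use a smoother interpolation.

```python
def upsample_nearest(attn, out_h, out_w):
    """Stretch a small 2-D attention map to out_h x out_w by nearest neighbour."""
    in_h, in_w = len(attn), len(attn[0])
    return [
        [attn[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def overlay(gray, attn_map, strength=0.7):
    """Blend each pixel toward white in proportion to its attention weight."""
    peak = max(max(row) for row in attn_map) or 1.0  # avoid dividing by zero
    return [
        [g + strength * (a / peak) * (1.0 - g) for g, a in zip(g_row, a_row)]
        for g_row, a_row in zip(gray, attn_map)
    ]


# Toy example: a 2x2 attention map laid over a 4x4 black image.
attn = [[0.1, 0.2], [0.3, 0.4]]
gray = [[0.0] * 4 for _ in range(4)]
upsampled = upsample_nearest(attn, 4, 4)
highlighted = overlay(gray, upsampled, strength=1.0)
for row in highlighted:
    print([round(v, 2) for v in row])
```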