Vision Transformer and CNN on GTZAN Dataset

Overview

This research project evaluates traditional machine learning and deep learning models for music genre classification on the GTZAN dataset. Each audio clip is converted to a mel spectrogram, a time-frequency representation whose frequency axis follows the mel scale and therefore approximates human pitch perception.
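To make the link between the mel scale and auditory perception concrete, here is a minimal sketch of the Hz-to-mel mapping. It assumes the common HTK-style formula; the function names, the 8-band count, and the 11025 Hz upper limit (half of a 22050 Hz sample rate) are illustrative assumptions, not values taken from the project's notebooks.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Return n_mels center frequencies (Hz) spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings: 8 bands between 0 Hz and 11025 Hz.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=11025.0)
print([round(c) for c in centers])
```

The centers are packed closely at low frequencies and spread out at high frequencies, which is the property that lets a mel spectrogram devote more resolution to the range where human hearing discriminates pitch most finely.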

Methodology

The study covers both traditional machine learning approaches (Random Forest, K-Nearest Neighbors, and Naive Bayes) and deep learning techniques. A custom Convolutional Neural Network (CNN) designed for mel spectrogram analysis is introduced and compared against the established VGG16 and ResNet152 models in transfer learning scenarios. A Vision Transformer, adapted for audio data, is also evaluated.
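One way to picture how a Vision Transformer can be adapted to audio is to treat the mel spectrogram as a single-channel image and cut it into fixed-size patches, each of which becomes one input token. The sketch below shows only that patching step in plain Python. The function name `split_into_patches` and the toy dimensions are hypothetical, and the project's notebooks may prepare patches differently.

```python
def split_into_patches(spectrogram, patch_h, patch_w):
    """Split a 2D spectrogram (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, as a ViT would typically
    see them before the linear embedding layer.
    """
    n_rows = len(spectrogram)
    n_cols = len(spectrogram[0])
    if n_rows % patch_h or n_cols % patch_w:
        raise ValueError("spectrogram size must be divisible by the patch size")

    patches = []
    for top in range(0, n_rows, patch_h):
        for left in range(0, n_cols, patch_w):
            # Flatten one patch_h x patch_w block into a single token vector.
            patch = [
                spectrogram[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 6 "spectrogram" (mel bands x time frames) with 2 x 3 patches.
toy = [[r * 6 + c for c in range(6)] for r in range(4)]
tokens = split_into_patches(toy, patch_h=2, patch_w=3)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of length 6
```

In a full model, each flattened patch would be linearly projected, combined with a position embedding, and passed through the transformer encoder; those stages are omitted here.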

Key Findings

Random Forest was the most accurate of the traditional models, a result attributed to its ensemble learning strategy.
The custom CNN notably outperformed the pretrained VGG16 and ResNet152 models used in the transfer learning scenarios.
The Vision Transformer achieved significantly higher accuracy than both the custom CNN and the traditional approaches.
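The ensemble strategy credited in the first finding can be illustrated with majority voting, the classic way a random forest combines its trees for classification. This is a simplified sketch with hypothetical function names and made-up votes, not the project's code; library implementations may instead average per-class probabilities across trees.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among one sample's tree votes."""
    if not labels:
        raise ValueError("at least one vote is required")
    return Counter(labels).most_common(1)[0][0]


def forest_predict(per_tree_predictions):
    """Combine predictions where per_tree_predictions[t][i] is tree t's label for sample i."""
    return [majority_vote(list(votes)) for votes in zip(*per_tree_predictions)]


# Three hypothetical trees voting on three clips.
votes = [
    ["rock", "jazz", "blues"],
    ["rock", "pop", "blues"],
    ["metal", "jazz", "blues"],
]
print(forest_predict(votes))  # ['rock', 'jazz', 'blues']
```

Because each tree errs on different samples, the combined vote can be correct even when individual trees are wrong, which is the usual explanation for an ensemble outperforming single-model baselines.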

Conclusion

The results from this study underscore the critical role of model selection in music genre classification. They highlight the effectiveness of using mel spectrograms for enhanced audio data analysis and demonstrate the potential of advanced deep learning models, especially the Vision Transformer, in this field.

Repository files:
GTZAN_Vision_Transformer_Visualization.ipynb
README.md
Vision Transformer and CNN on GTZAN.ipynb

aquib1011/Audio-Classification-using-ViT-and-CNN
