This research project evaluates machine learning and deep learning models for music genre classification on the GTZAN dataset. Each audio clip is converted into a mel spectrogram, a time-frequency representation whose frequency scale aligns with human auditory perception.
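For concreteness, a minimal sketch of the mel spectrogram conversion using librosa is shown below. The parameter values (sample rate, 128 mel bands, FFT and hop sizes) and the file path are illustrative assumptions, not the exact settings used in the study.

```python
# Sketch: convert one GTZAN audio clip to a log-scaled mel spectrogram.
# Assumes librosa is installed; parameter values and paths are illustrative.
import librosa
import numpy as np

def clip_to_mel(path, sr=22050, n_mels=128, n_fft=2048, hop_length=512):
    """Load an audio file and return a log-scaled (dB) mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert power to decibels so the dynamic range better matches perception.
    return librosa.power_to_db(mel, ref=np.max)

# Example (hypothetical path):
# spec = clip_to_mel("genres/blues/blues.00000.wav")
# spec.shape -> (128, ~1300) for a 30-second clip
```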
The study compares traditional machine learning approaches (Random Forest, K-Nearest Neighbors, and Naive Bayes) against deep learning techniques: a custom Convolutional Neural Network (CNN) designed for mel spectrogram input, VGG16 and ResNet152 applied in transfer learning scenarios, and a Vision Transformer adapted for audio data.
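For illustration, a minimal PyTorch sketch of a small CNN operating on single-channel mel spectrograms is shown below; the layer sizes, channel counts, and class count (10 GTZAN genres) are assumptions rather than the exact architecture used in the study.

```python
# Sketch of a small CNN for single-channel mel spectrogram input
# (e.g. 128 mel bands x ~1300 frames); layer sizes are illustrative.
import torch
import torch.nn as nn

class MelCNN(nn.Module):
    def __init__(self, n_genres=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a fixed-size feature vector
        )
        self.classifier = nn.Linear(64, n_genres)

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        return self.classifier(self.features(x).flatten(1))

# Example forward pass with a dummy batch:
# logits = MelCNN()(torch.randn(8, 1, 128, 1292))   # -> (8, 10)
```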
- Random Forest emerged as the most accurate among traditional models, owing to its ensemble learning strategy.
- The custom CNN designed for mel spectrogram analysis notably outperformed VGG16 and ResNet152 used in transfer learning scenarios.
- The Vision Transformer, adapted for audio data, achieved the highest accuracy, exceeding both the custom CNN and the traditional approaches (a minimal adaptation sketch follows this list).
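To illustrate how an image-pretrained Vision Transformer can be adapted to spectrogram input, the sketch below uses torchvision's `vit_b_16`, resizes the spectrogram to the model's expected 224x224 input, and replicates the single channel to three. This is an assumed adaptation strategy, not necessarily the one used in the study.

```python
# Sketch: adapting a pretrained Vision Transformer to mel spectrograms.
# Assumes torchvision >= 0.13; the study's exact adaptation may differ.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

def build_audio_vit(n_genres=10):
    model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    # Replace the classification head for the 10 GTZAN genres.
    model.heads.head = nn.Linear(model.heads.head.in_features, n_genres)
    return model

def prepare_spectrogram(spec):
    """spec: (batch, 1, n_mels, time) -> (batch, 3, 224, 224) for the ViT."""
    spec = torch.nn.functional.interpolate(spec, size=(224, 224), mode="bilinear")
    return spec.repeat(1, 3, 1, 1)  # replicate the single channel to RGB

# Example:
# vit = build_audio_vit()
# logits = vit(prepare_spectrogram(torch.randn(4, 1, 128, 1292)))   # -> (4, 10)
```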
The results from this study underscore the critical role of model selection in music genre classification. They highlight the effectiveness of mel spectrograms as an input representation for audio analysis and demonstrate the potential of advanced deep learning models, especially the Vision Transformer, in this field.