A Vision Transformer adapted for audio data significantly exceeded the performance of both the custom CNN and traditional approaches in accuracy. These results underscore the critical role of model selection in music genre classification and the effectiveness of mel spectrograms for enhanced audio data analysis.

aquib1011/Audio-Classification-using-ViT-and-CNN

Vision Transformer and CNN on GTZAN Dataset

Overview

This research project evaluates deep learning models for music genre classification on the GTZAN dataset. Each audio clip is converted to a mel spectrogram, a time-frequency representation whose frequency axis follows the mel scale and so approximates human pitch perception.
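To illustrate what the mel scale does, the sketch below converts between hertz and mels using the common HTK-style formula and places band centres evenly along the mel axis. The function names, the band count, and the 11,025 Hz upper limit are assumptions chosen for illustration; the preprocessing settings actually used in this project are not stated here.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centres(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

# Hypothetical settings: 8 bands between 0 Hz and 11,025 Hz.
centres = mel_band_centres(n_bands=8, f_min=0.0, f_max=11025.0)

# Spacing between neighbouring centres, in Hz.
gaps = [b - a for a, b in zip(centres, centres[1:])]
```

Because the bands are evenly spaced in mels, their spacing in hertz widens as frequency rises. Low frequencies, where human hearing discriminates pitch more finely, therefore get more resolution than high ones.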

Methodology

The study encompasses a range of models, including traditional machine learning approaches such as Random Forest, K-Nearest Neighbors, and Naive Bayes, as well as advanced deep learning techniques. A custom Convolutional Neural Network (CNN) specifically designed for mel spectrogram analysis is introduced and compared against established models like VGG16 and ResNet152 in transfer learning scenarios. Additionally, a Vision Transformer, adapted for audio data, is evaluated for its effectiveness in this domain.
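A Vision Transformer consumes its input as a sequence of fixed-size patches, so adapting one to audio usually starts by cutting the mel spectrogram into a grid of patches. The sketch below shows only that step, in plain Python. The function name, the patch size, and the toy spectrogram are assumptions for illustration, and the adaptation used in this project may differ.

```python
def patchify(spec, patch_h, patch_w):
    """Split a 2D spectrogram (rows = mel bands, columns = time frames)
    into flattened, non-overlapping patches in row-major order."""
    n_rows, n_cols = len(spec), len(spec[0])
    if n_rows % patch_h or n_cols % patch_w:
        raise ValueError("spectrogram size must be divisible by the patch size")
    patches = []
    for top in range(0, n_rows, patch_h):
        for left in range(0, n_cols, patch_w):
            # Flatten one patch_h x patch_w block into a single token vector.
            patch = [spec[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            patches.append(patch)
    return patches

# Toy 4 x 6 "spectrogram" whose values encode their own position.
toy_spec = [[row * 6 + col for col in range(6)] for row in range(4)]

# Hypothetical 2 x 3 patches give a sequence of 4 tokens of length 6.
tokens = patchify(toy_spec, patch_h=2, patch_w=3)
```

In a full model, each flattened patch would then be linearly projected to an embedding and combined with a position embedding before entering the Transformer encoder.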

Key Findings

  • Random Forest emerged as the most accurate among traditional models, owing to its ensemble learning strategy.
  • The custom CNN designed for mel spectrogram analysis notably outperformed established models like VGG16 and ResNet152 in transfer learning scenarios.
  • The Vision Transformer, adapted for audio data, significantly exceeded the performance of both the custom CNN and traditional approaches in terms of accuracy.

Conclusion

The results of this study underscore the critical role of model selection in music genre classification. They show that mel spectrograms are an effective input representation for audio analysis and demonstrate the potential of advanced deep learning models, especially the Vision Transformer, in this field.