This repository contains the code and dataset accompanying the paper "Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model" by Dr. Jaeyong Kang, Prof. Soujanya Poria, and Prof. Dorien Herremans.
🔥 Live demo available on HuggingFace and Replicate.
We propose a novel AI-powered multimodal music generation framework called Video2Music. This framework uniquely uses video features as conditioning input to generate matching music using a Transformer architecture. Our system aims to provide video creators with a seamless and efficient way to generate tailor-made background music.
- 2023-11-28: Added a new input method (YouTube URL) to the HuggingFace demo
Generate music from video:
```python
import IPython
from video2music import Video2music

# Path to the input video
input_video = "input.mp4"

# Chord primer and key that seed the generated music
input_primer = "C Am F G"
input_key = "C major"

video2music = Video2music()

# Returns the path to the output video with generated background music
output_filename = video2music.generate(input_video, input_primer, input_key)

IPython.display.Video(output_filename)
```
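The HuggingFace demo also accepts a YouTube URL directly. For local use, one option is to download the video first, for example with the `yt-dlp` Python package (not a dependency of this repo; the URL below is a placeholder), and then feed the resulting file to the quickstart above:

```python
# Hypothetical helper: fetch a YouTube video locally before running Video2Music on it.
# Requires `pip install yt-dlp`; yt-dlp is not part of this repo's requirements.
from yt_dlp import YoutubeDL

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL
options = {"format": "mp4", "outtmpl": "input.mp4"}  # save as input.mp4, matching the quickstart

with YoutubeDL(options) as ydl:
    ydl.download([url])

# Now pass "input.mp4" to Video2music.generate() as shown above.
```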
This repo was developed using Python 3.8.
```bash
apt-get update
apt-get install ffmpeg
apt-get install fluidsynth

git clone https://github.com/AMAAI-Lab/Video2Music
cd Video2Music
pip install -r requirements.txt
```
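A quick sanity check (a minimal sketch, not part of the repo) that the external tools installed above are discoverable on your PATH:

```python
# Quick check that ffmpeg and fluidsynth are reachable from the command line.
import shutil

for tool in ("ffmpeg", "fluidsynth"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path else 'NOT FOUND -- install it before running the pipeline'}")
```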
- Download the processed training data `AMT.zip` from HERE, extract the zip file, and put the two extracted files directly under the `saved_models/AMT/` folder.
- Download the soundfont file `default_sound_font.sf2` from HERE and put it directly under the `soundfonts/` folder.
- Our code is built on PyTorch 1.12.1 (`torch==1.12.1` in `requirements.txt`), but you may need to choose the `torch` build that matches your CUDA version (see the version check after this list).
- Obtain the dataset:
  - MuVi-Sync (Link)
- Put all directories starting with `vevo` from the dataset directly under the `dataset/` folder.
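A minimal check (a sketch, not part of the repo) that the installed `torch` build matches your CUDA setup:

```python
# Verify that torch is installed and can see your GPU.
import torch

print("torch version :", torch.__version__)         # requirements.txt pins 1.12.1
print("built for CUDA:", torch.version.cuda)         # should be compatible with your driver/toolkit
print("CUDA available:", torch.cuda.is_available())
```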
- `saved_models/`: saved model files
- `utilities/`
  - `run_model_vevo.py`: code for running the model (AMT)
  - `run_model_regression.py`: code for running the model (bi-GRU)
- `model/`
  - `video_music_transformer.py`: Affective Multimodal Transformer (AMT) model
  - `video_regression.py`: bi-GRU regression model used for predicting note density/loudness
  - `positional_encoding.py`: code for positional encoding (see the sketch after this list)
  - `rpr.py`: code for RPR (Relative Positional Representation)
- `dataset/`
  - `vevo_dataset.py`: dataset loader
- `script/`: code for extracting video/music features (semantic, motion, emotion, scene offset, loudness, and note density)
- `train.py`: training script (AMT)
- `train_regression.py`: training script (bi-GRU)
- `evaluate.py`: evaluation script
- `generate.py`: inference script
- `video2music.py`: Video2Music module that outputs a video with generated background music from an input video
- `demo.ipynb`: Jupyter notebook for the Quickstart Guide
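For reference, `positional_encoding.py` implements the positional encoding used by the Transformer. Below is a minimal sketch of the standard sinusoidal formulation (Vaswani et al., 2017), assuming an even `d_model`; it is an illustration, not necessarily the exact code in the repository:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (assumes d_model is even)."""
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)         # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                      # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # (max_len, d_model)
```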
Train the AMT model:
```bash
python train.py
```
Generate music (inference):
```bash
python generate.py
```
| Model | Overall Music Quality ↑ | Music-Video Correspondence ↑ | Harmonic Matching ↑ | Rhythmic Matching ↑ | Loudness Matching ↑ |
|---|---|---|---|---|---|
| Music Transformer | 3.4905 | 2.7476 | 2.6333 | 2.8476 | 3.1286 |
| Video2Music | 4.2095 | 3.6667 | 3.4143 | 3.8714 | 3.8143 |
- Add other instruments (e.g., drums) to the live demo
If you find this resource useful, please cite the original work:
```bibtex
@article{KANG2024123640,
  title   = {Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model},
  author  = {Jaeyong Kang and Soujanya Poria and Dorien Herremans},
  journal = {Expert Systems with Applications},
  pages   = {123640},
  year    = {2024},
  issn    = {0957-4174},
  doi     = {https://doi.org/10.1016/j.eswa.2024.123640},
}
```
Kang, J., Poria, S., & Herremans, D. (2024). Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2024.123640
Our code is based on Music Transformer.