A large-scale dataset of music sheet images designed for visual question answering (VQA) in music understanding. [arXiv](https://arxiv.org/abs/2506.23009) | Hugging Face
MusiXQA is a multimodal dataset for evaluating and training music sheet understanding systems. Each data sample is composed of:
- A music sheet image (`.png`) rendered by MusiXTeX
- Its corresponding MIDI file (`.mid`) 🎵
- A structured annotation (from `metadata.json`)
- Question–Answer (QA) pairs targeting musical structure, semantics, and optical music recognition (OMR)
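As a hedged sketch, loading the dataset from the Hugging Face Hub might look like the following; the repo id below is a placeholder, not the official one.

```python
# A minimal sketch, assuming a standard Hugging Face datasets layout.
from datasets import load_dataset

# Placeholder repo id -- replace with the official MusiXQA dataset id.
ds = load_dataset("your-org/MusiXQA", split="train")

sample = ds[0]
print(sample.keys())  # e.g. the sheet image, structured annotation, and QA pairs
```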
This code is written in Python 3.10.

```bash
conda create -n MusiXQA python=3.10
conda activate MusiXQA
pip install -r requirements.txt
```
To compile the LaTeX files, please install the following TeX Live packages:

```bash
sudo apt install texlive-latex-base
sudo apt install texlive-music
sudo apt-get install texlive-lang-all
sudo apt-get install texlive-fonts-recommended texlive-fonts-extra
```
The code also includes a MIDI-to-audio function, which requires the FluidSynth software; converting to MP3 additionally requires FFmpeg. On macOS, install both via Homebrew (on Linux, use your system package manager instead):

```bash
brew install fluidsynth
brew install ffmpeg
```
To generate audio, a soundfont in `.sf2` format is required. For higher quality, download a larger soundfont online.
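As a minimal sketch of how these tools fit together (the file names below are placeholders), rendering a MIDI file to MP3 could look like:

```python
import subprocess

SOUNDFONT = "soundfont.sf2"  # placeholder .sf2 path
MIDI_IN = "example.mid"      # placeholder input MIDI
WAV_OUT = "example.wav"
MP3_OUT = "example.mp3"

# Render the MIDI file to WAV with FluidSynth
# (-ni: non-interactive, no MIDI input; -F: render to file; -r: sample rate).
subprocess.run(
    ["fluidsynth", "-ni", SOUNDFONT, MIDI_IN, "-F", WAV_OUT, "-r", "44100"],
    check=True,
)

# Transcode the WAV to MP3 with FFmpeg (-y: overwrite existing output).
subprocess.run(["ffmpeg", "-y", "-i", WAV_OUT, MP3_OUT], check=True)
```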
Phi-3-MusiX is a LoRA adapter for microsoft/Phi-3-vision-128k-instruct that equips the model for symbolic music reasoning: answering questions about scanned music sheets, MIDI files, and structured annotations.
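For inference, the adapter should load on top of the base model with PEFT along the lines of the sketch below; the adapter path is a placeholder.

```python
# A minimal sketch, assuming the adapter is a standard PEFT LoRA checkpoint.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

BASE_ID = "microsoft/Phi-3-vision-128k-instruct"

base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, trust_remote_code=True, torch_dtype="auto"
)
# Placeholder path/repo id for the Phi-3-MusiX LoRA weights.
model = PeftModel.from_pretrained(base, "path/to/Phi-3-MusiX")

processor = AutoProcessor.from_pretrained(BASE_ID, trust_remote_code=True)
```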
You can use the `loraft_phi3v.py` script to fine-tune the Phi-3-vision model. Replace "xxxxxxxx" with your Hugging Face and Weights & Biases tokens:

```bash
deepspeed --num_gpus=8 loraft_phi3v.py --deepspeed ds_config.json --hf_token xxxxxxxx --wandb_token xxxxxxxx
```
To generate new music sheets, run the `generate_musicsheet.py` script. It saves the music sheet as a PDF (compiled by MusiXTeX) along with the config file in the specified directory. The output is written to the `./data` folder and includes:
- config file (`config.yaml`)
- ground-truth music (`.json`)
- PDF document (`.pdf`)
- page images (`.png`)
- MIDI (`.mid`)
- audio (`.mp3`)
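As a quick sanity check on the generated output, a sample could be inspected with a short script like this (the file names are hypothetical and depend on the generation settings):

```python
import json
from pathlib import Path

from PIL import Image

data_dir = Path("./data")

# Hypothetical file names -- the actual names depend on the generation config.
annotation = json.loads((data_dir / "sheet_0.json").read_text())
image = Image.open(data_dir / "sheet_0.png")

print("annotation keys:", sorted(annotation))
print("page image size:", image.size)
```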
If you use this dataset in your work, please cite it using the following reference:
```bibtex
@article{chen2025musixqa,
  title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
  author={Chen, Jian and Ma, Wenye and Liu, Penghang and Wang, Wei and Song, Tengwei and Li, Ming and Wang, Chenguang and Zhang, Ruiyi and Chen, Changyou},
  journal={arXiv preprint arXiv:2506.23009},
  year={2025}
}
```