NLP Analysis on Chinese Songs with THULAC Toolkit (Python)

argowang/songAnalysis

songAnalysis

NLP Analysis on Chinese Songs with THULAC Toolkit

This project uses data analysis and NLP tools to uncover the "culture" and patterns behind Chinese hip-hop music. The most frequently used words could be regarded as the heart of hip-hop culture and as a reflection of East Asian cultural values.

Another interesting analysis is to find out which rhyme pairs are most popular in hip-hop music. Rhyme is considered a valuable linguistic property in Chinese. Because hip-hop lyrics are colloquial and close to everyday conversation, the frequency of rhyme pairs in them may shed light on problems in Chinese linguistics.

A further analysis is to classify a hip-hop song as positive or negative based on the pre-annotated adjectives and nouns that appear in its lyrics.
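The classification idea can be sketched roughly as follows. This is only an illustration, not the project's implementation (classification is still on the TODO list): the word lists POSITIVE and NEGATIVE, the function name classify_song, the "neutral" result for ties, and the assumption that adjectives and nouns are tagged "a" and "n" are all hypothetical.

```python
# Hypothetical pre-annotated word lists; the real annotations are not part of this README.
POSITIVE = {"快乐", "梦想", "自由"}
NEGATIVE = {"痛苦", "孤独", "黑暗"}


def classify_song(tagged_words, positive=POSITIVE, negative=NEGATIVE):
    """Label a song from (word, tag) pairs by counting annotated adjectives and nouns.

    Assumes adjectives are tagged "a" and nouns "n"; other tags are ignored.
    """
    score = 0
    for word, tag in tagged_words:
        if tag not in ("a", "n"):
            continue  # only adjectives and nouns take part in the vote
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # tie or no annotated words (an added assumption)


lyrics = [("自由", "n"), ("快乐", "a"), ("孤独", "a"), ("跑", "v")]
print(classify_song(lyrics))
```

A simple vote like this ignores negation and context, so it should be read as a starting point only.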

THULAC is an NLP toolkit that is easy to use while maintaining a high level of accuracy in parsing. You can find the toolkit here: http://thulac.thunlp.org/ However, the toolkit limits its functionality to parsing (or, put differently, it concentrates all its effort on parsing). As a result, it needs some setup before use. Much of that setup is redundant, and there are some traps if you are a beginner or unfamiliar with the toolkit. For example, if you are using Python 2, Chinese characters are very likely to print as unreadable garbled text. I ran into this problem during my own work, so I hope my code can be of some help.

Modules needed:

THULAC

Pyecharts (for visualization)

requests

bs4

lxml

pypinyin

=======================Aug 30th update

Implement the scraping module

Given a singer's ID on music.163.com, the web scraping module gathers all of that singer's songs and lyrics. This greatly increases the efficiency of data collection and improves the analysis results.

=======================Aug 31st update

Add a function to remove unwanted words. The original parsing results include many unwanted English words and meaningless words. The function helps clean up the results.
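A minimal sketch of this kind of cleanup is shown below. The function name remove_unwanted and the STOP_WORDS list are hypothetical, and treating "unwanted" as "contains no Chinese character, or is in a stop-word list" is an assumption rather than the project's exact rule.

```python
import re

# Hypothetical list of "meaningless" words; the project's own list may differ.
STOP_WORDS = {"的", "了", "是", "我", "你"}

# A token is kept only if it contains at least one CJK unified ideograph.
_HAS_CHINESE = re.compile(r"[\u4e00-\u9fff]")


def remove_unwanted(words, stop_words=STOP_WORDS):
    """Drop English words, digits, punctuation, and stop words from a token list."""
    cleaned = []
    for word in words:
        word = word.strip()
        if not word or not _HAS_CHINESE.search(word) or word in stop_words:
            continue
        cleaned.append(word)
    return cleaned


tokens = ["我", "love", "自由", "的", "生活", "yeah", "123", "，"]
print(remove_unwanted(tokens))  # ['自由', '生活']
```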

Modify the visualization part so that the visualization HTML file is generated in the same directory as the code.

=======================Sept 4th update

Use regular expressions to format the lyrics: one sentence per line, with redundant newlines removed.

Add the transcribeToPinyin module:

pull out the last word of each sentence (the rhyme word)

use pypinyin to transcribe Chinese characters into pinyin

use consecutive end-of-line words to form rhyme pairs
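The steps above can be sketched as follows. This is not the transcribeToPinyin module itself: the function names are hypothetical, splitting sentences at punctuation is an assumption, and comparing pinyin finals (rather than full syllables) is one possible reading of "rhyme". The pypinyin import is done lazily so that the demo at the bottom, which uses a small hand-written lookup, runs without it.

```python
import re


def format_lyrics(raw):
    """Return the lyrics as a list with one sentence per entry and no blank lines."""
    # Assumption: sentence punctuation marks a sentence boundary.
    text = re.sub(r"[，。！？；,.!?;]+", "\n", raw)
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line]


def last_chars(lines):
    """Pull out the final Chinese character of each line (the rhyme word)."""
    result = []
    for line in lines:
        chars = re.findall(r"[\u4e00-\u9fff]", line)
        if chars:
            result.append(chars[-1])
    return result


def pinyin_final(char):
    """Transcribe one Chinese character to its pinyin final with pypinyin."""
    from pypinyin import lazy_pinyin, Style  # third-party, imported only when used
    return lazy_pinyin(char, style=Style.FINALS)[0]


def rhyme_pairs(lines, to_final=pinyin_final):
    """Pair the rhyme of each line with the rhyme of the line that follows it."""
    finals = [to_final(char) for char in last_chars(lines)]
    return list(zip(finals, finals[1:]))


raw = "我站在街头\n\n\n看城市的灯火，心里有把火\n  \n"
lines = format_lyrics(raw)
# Hand-written lookup standing in for pypinyin in this demo.
demo_finals = {"头": "ou", "火": "uo"}
print(lines)
print(rhyme_pairs(lines, to_final=demo_finals.get))
```

Leaving out the to_final argument makes rhyme_pairs use pypinyin for the transcription.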

=======================Sept 5th update

Complete the rhyme pair frequency count. The most frequent rhyme pairs can now be seen.
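One way to count rhyme pairs is with collections.Counter, as in the sketch below. The function name count_rhyme_pairs is hypothetical, and treating ("ou", "uo") and ("uo", "ou") as the same pair by default is an assumption; pass ordered=True to keep them apart.

```python
from collections import Counter


def count_rhyme_pairs(pairs, ordered=False):
    """Count how often each rhyme pair occurs.

    With ordered=False, (a, b) and (b, a) are counted as the same pair.
    """
    counter = Counter()
    for first, second in pairs:
        key = (first, second) if ordered else tuple(sorted((first, second)))
        counter[key] += 1
    return counter


pairs = [("ou", "uo"), ("uo", "uo"), ("uo", "ou"), ("ang", "ang")]
counts = count_rhyme_pairs(pairs)
print(counts.most_common(1))  # [(('ou', 'uo'), 2)]
```

The resulting counts can then be passed to whatever charting code is used for the visualization.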

Visualize rhyme pair frequency

TODO:

Find words that rhyme with a given input

Classify songs based on adjectives and nouns
