Chinese Datasets

Chinese Language Corpora for Sentiment Analysis

Microblogs

Open Weiboscope

This dataset comes from researchers at the Journalism and Media Studies Centre (JMSC) of the University of Hong Kong.

226 million posts from Sina Weibo (a Twitter-like microblogging service)
Zipped CSV format
Collected in 2012 from the feeds of users with more than 1,000 followers
Not tagged for sentiment
Released for public use, citation required, no specific licensing terms
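
Because the dump is distributed as zipped CSV, it can be read without unpacking the archives to disk first. The following is a minimal sketch, not the maintainers' tooling: the helper name `iter_zipped_csv_rows` is hypothetical, and it assumes each CSV member is UTF-8 encoded and starts with a header row, which should be checked against the actual files.

```python
import csv
import io
import zipfile


def iter_zipped_csv_rows(zip_path, encoding="utf-8"):
    """Yield one dict per CSV row from every .csv member of a zip archive.

    Assumption: each member has a header row and uses the given encoding.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip readme files and other non-CSV members
            with archive.open(name) as raw:
                # newline="" lets the csv module handle line breaks
                # that occur inside quoted post text.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                yield from csv.DictReader(text)
```

Rows are produced lazily, so a loop such as `for row in iter_zipped_csv_rows(path): ...` (with `path` pointing at a downloaded archive) holds one row in memory at a time, which matters for a corpus of this size.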

NLPIR Weibo Content (zh)

From China's NLP and Information Retrieval sharing platform (run by the Big Data Search and Mining Lab at the Beijing Institute of Technology).

230,000 posts from Sina Weibo (2011)
XML format
Metadata: user ID, time of posting, etc.
Released for public non-commercial use, citation required, no specific licensing terms
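
An XML dump of this size can be processed record by record instead of being loaded whole. The sketch below is illustrative only: the function name `iter_xml_records` is hypothetical, and the tag that wraps each post is passed in as a parameter because the actual element names in the NLPIR files are not documented here and must be taken from the data.

```python
import xml.etree.ElementTree as ET


def iter_xml_records(source, record_tag):
    """Yield {child_tag: text} for each <record_tag> element, streaming.

    `source` is a file path or a binary file object. `record_tag` is the
    (assumed) name of the element that wraps a single post.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            elem.clear()  # drop the record's children once it is emitted
```

Each yielded dict maps a child element name to its text, so metadata fields such as the user ID and posting time appear as keys named after whatever tags the file uses.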

Microblog PCU

From researchers at Xi'an Jiaotong University, shared through UC Irvine's Machine Learning Repository.

About 50,000 posts from Sina Weibo.
Has more user metadata, apparently including full following-follower information.
Subject to the UCI Machine Learning Repository's usage/citation guidelines [https://archive.ics.uci.edu/ml/citation_policy.html]

NLPIR 5 million Weibo (zh)

From researchers at the Beijing Institute of Technology (BIT).

5 million Sina Weibo posts
SQL format
Use limited to research and teaching; commercial usage prohibited.
Slow connection to server; I have not yet successfully completed a download.

Medium-length documents

Publicly available corpora of medium-length Chinese texts that aren't just news articles or other formal written genres are surprisingly hard to find. Corpora of product reviews and short documents are cited in the literature, but accessing them has proved difficult.

Ren-CECps

Small corpus of blog posts with annotations of emotion and sentiment at document, paragraph, and sentence levels. Constructed by Changqin Quan (Hefei University of Technology) and Fuji Ren (Tokushima University).

1,500 blog posts (11k paragraphs, 35k sentences)
Annotated for 3-way polarity and real-valued scores on 8 emotion categories
Has been publicly released [http://a1-www.is.tokushima-u.ac.jp/member/ren/Ren-CECps1.0/Ren-CECps1.0.html]. The link was inaccessible for a time but has since been repaired. Fuji Ren can be contacted ([email protected]) for a license agreement.

ChnSentiCorp

Small corpus of product reviews, maintained by Tan Songbo (Chinese Academy of Sciences, [email protected]).

6,000 reviews of hotels, computers, and books.
Includes ratings as sentiment polarity labels
Not currently accessible
Limited to academic use

Mandarin Chinese News Text (LDC)

Corpus of 250 million Chinese characters (hundreds of thousands of documents)
News text from People's Daily, Xinhua newswire, China Radio International
$500.00 for non-members

GALE Phase 1 Chinese Blog Parallel Text (LDC)

277 Chinese blog posts with English translations.
$1500.00 for non-members

Sogou News (zh)

Tens of thousands of news documents from the Sogou news page.
Licensed for free non-commercial use (http://www.sogou.com/labs/dl/license_en.html [en])

Anacode Chinese NLP API Web Data

Articles and user-generated content scraped from major Chinese domains, including text and relevant metadata (date, author, source, etc.). Maintained and regularly updated by Anacode GmbH.

More than 10 industries (automotive, health, cosmetics, etc.)
Data in JSON format.
Free access for most of the datasets
Additional semantic information on datasets based on Anacode's NLP analysis.
