This repository has been archived by the owner on Aug 19, 2020. It is now read-only.

Chinese Datasets

colorcatliu edited this page Sep 19, 2017 · 7 revisions

Chinese Language Corpora for Sentiment Analysis

Microblogs

This dataset comes from researchers at the Journalism and Media Center of the University of Hong Kong.

  • 226 million posts on Sina Weibo (Twitter-like microblogging service)
  • (zipped) CSV format
  • Collected in 2012 from feeds of users having > 1000 followers
  • Not tagged for sentiment
  • Released for public use, citation required, no specific licensing terms
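
Because the archive is large and distributed as zipped CSV, it can be convenient to stream rows straight out of the zip file rather than extracting everything first. The following is a minimal sketch using only the Python standard library. The function name `iter_zipped_csv_rows`, the UTF-8 default encoding, and the assumption that archive members end in `.csv` are illustrative guesses, not details confirmed by the dataset's documentation.

```python
import csv
import io
import zipfile
from typing import Iterator, List


def iter_zipped_csv_rows(zip_path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield rows from every .csv member of a zip archive, one at a time.

    Assumption: members are plain CSV files in the given encoding.
    Adjust `encoding` if the real files use something else.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not a CSV file (readme files, etc.).
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Decode the binary stream lazily; newline="" lets the csv
                # module handle line endings inside quoted fields itself.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                yield from csv.reader(text)
```

Iterating with `for row in iter_zipped_csv_rows("some_archive.zip"):` (the file name here is a placeholder) keeps only one row in memory at a time, which matters for a corpus of this size.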

From China's NLP and Information Retrieval sharing platform (run by the Big Data Search and Mining Lab at the Beijing Institute of Technology).

  • 230,000 posts from Sina Weibo (2011)
  • XML format
  • Metadata includes user ID, time of posting, etc.
  • Released for public non-commercial use, citation required, no specific licensing terms
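
For an XML corpus like this one, posts can be read incrementally instead of loading the whole file at once. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library. The element names are not documented here, so the record tag is passed in as a parameter, and the helper name `iter_xml_records` is hypothetical. It also assumes each record stores its fields as simple child elements.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def iter_xml_records(source, record_tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <record_tag> element, mapping child tag to text.

    `source` is a file path or a binary file object. `record_tag` is an
    assumption the caller must supply after inspecting the real file.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # Collect the text of each direct child (user ID, time, ...).
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the element's children to free memory once processed.
            elem.clear()
```

Inspecting the first few lines of the downloaded file should reveal the actual tag to pass as `record_tag`.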

From researchers at Xi'an Jiaotong University, shared with UC Irvine's Machine Learning Repository.

  • About 50,000 posts from Sina Weibo.
  • Has more user metadata, apparently including full following-follower information.
  • Subject to the UCI Machine Learning Repository's usage/citation guidelines [https://archive.ics.uci.edu/ml/citation_policy.html]

From researchers at BIT.

  • 5 million Sina Weibo posts
  • SQL format
  • Use limited to research and teaching; commercial usage prohibited.
  • Slow connection to server; I have not yet successfully completed a download.

Medium-length documents

Surprisingly, it's harder to find publicly available corpora of medium-length texts in Chinese that aren't just news articles or other formal written genres. There are citations for corpora of product reviews and short documents, but accessing them has proved difficult.

Ren-CECps

Small corpus of blog posts with annotations of emotion and sentiment at document, paragraph, and sentence levels. Constructed by Changqin Quan (Hefei University of Technology) and Fuji Ren (Tokushima University).

  • 1,500 blog posts (11k paragraphs, 35k sentences)
  • Annotated for 3-way polarity and real-valued scores on 8 emotion categories
  • Publicly released [http://a1-www.is.tokushima-u.ac.jp/member/ren/Ren-CECps1.0/Ren-CECps1.0.html]. The link was inaccessible for a time but has since been repaired. Fuji Ren can be contacted ([email protected]) for a license agreement.

ChnSentiCorp

Small corpus of product reviews, maintained by Tan Songbo (Chinese Academy of Sciences, [email protected]).

  • 6,000 reviews of hotels, computers, and books.
  • Includes ratings as sentiment polarity labels
  • Not currently accessible
  • Limited to academic use

  • 250 million Chinese character corpus (hundreds of thousands of documents)
  • News text from People's Daily, Xinhua newswire, China Radio International
  • $500.00 for non-members

  • 277 blog posts in Chinese and translated to English.
  • $1500.00 for non-members

Articles and user-generated content scraped from major Chinese domains, including texts and relevant metadata (date, author, source, etc.). Maintained and regularly updated by Anacode GmbH.

  • More than 10 industries (automotive, health, cosmetics etc.)
  • Data in JSON format.
  • Free access for most of the datasets
  • Additional semantic information on datasets based on Anacode's NLP analysis.