Wikipedia is currently the most complete Mandarin Chinese corpus available on the Internet, but it lacks timely content such as news. This crawler was built to fill that gap: at a specified time each day, it automatically fetches the Chinese news headlines collected by Google News.
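The "specified time each day" behaviour could be sketched roughly as below. This is only an illustration using the standard library, not the crawler's actual code: the names seconds_until and run_daily are hypothetical, and the job passed in stands for whatever function does the fetching.

```python
import time
from datetime import datetime, timedelta


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute (local time)."""
    if now is None:
        now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is exactly now), wait for tomorrow's.
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(job, hour, minute, max_runs=None, sleep=time.sleep):
    """Call `job()` once a day at hour:minute.

    `job` is a placeholder for the crawl function (hypothetical here).
    `max_runs` and `sleep` exist only so the loop can be bounded and tested.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(seconds_until(hour, minute))
        job()
        runs += 1
```

For example, run_daily(crawl, 8, 0) would call a crawl function (assumed to exist) every day at 08:00. A system scheduler such as cron is an equally reasonable choice; the loop above just keeps everything in one script.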
I do not own the file "dict.txt.big". It is the work of the jieba team ("fxsjy" et al.).
Requires Python 2.7 with "requests", "BeautifulSoup", "jieba" and "progressbar" installed.