
Commit

Merge branch 'develop' with tm project.
Jian-Kai Wang committed Jun 23, 2017
2 parents 78d8740 + 70da4aa commit a6a7204
Showing 54 changed files with 19,447 additions and 2 deletions.
38 changes: 38 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Example code in package build process
*-Ex.R

# Output files from R CMD build
/*.tar.gz

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf

# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth

# knitr and R markdown default cache directories
/*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

# CSS Style

# manual test should not be tracked
manual_test.r
16 changes: 14 additions & 2 deletions readme.md
@@ -1,4 +1,16 @@
# Text Mining in Sophia Project
# Data Mining in Sophia Project

Data mining refers to extracting, or "mining", information (that is, knowledge) from large amounts of data. Some regard data mining as "knowledge mining", or treat it as a synonym for the popular term KDD (Knowledge Discovery from Data), although those notions differ subtly from data mining proper — for instance, by leaving out the emphasis on large volumes of data. "Data mining" is also considered something of a misnomer, because the goal is the extraction of patterns and knowledge from huge amounts of data, not the mining of the data itself.

In the [sophia.dm](https://github.com/jiankaiwang/sophia.dm) project, we try to provide you with the foundations of each field related to data mining. Most fields in the project are built on R, Shiny, and Plotly. Please enjoy it and explore the world of data mining.

### Field : Text Mining
---

* The Foundation of Text Mining
    * Count-based Text Mining Infrastructure in R [[demo](http://jkw.cloudapp.net:3838/tm/)]
    * topics
        * find terms by frequency
        * term associations
        * document listing
    * main reference
        * Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. *Journal of Statistical Software*, 25(5).
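The topics above (frequent terms, term associations) follow the count-based approach of the cited tm package. A minimal sketch of that workflow — the sample corpus and thresholds here are illustrative only, not taken from the project:

```r
# Minimal count-based text-mining sketch with the tm package.
# The sample documents and thresholds are illustrative.
library(tm)

docs <- c("data mining extracts knowledge from data",
          "text mining applies data mining to text")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# term-document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)

# terms appearing at least twice across the corpus
findFreqTerms(tdm, lowfreq = 2)

# terms correlated with "mining" at or above the given correlation limit
findAssocs(tdm, "mining", corlimit = 0.5)
```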
21 changes: 21 additions & 0 deletions runShinyApp.r
@@ -0,0 +1,21 @@
#
# author  : Jian-Kai Wang (http://jiankaiwang.no-ip.biz)
# project : sophia.dm
# platform :
# |- R : 3.4.0
#

# adjust this path to your local checkout of the repository
setwd("D:/code/shiny/github/sophia.dm")

#install.packages("shiny")
#install.packages("plotly")

library(shiny)
library(plotly)

# start a text-mining example
runApp("tm")




11 changes: 11 additions & 0 deletions tm/data/exampleData/chinese.txt
@@ -0,0 +1,11 @@
這高職生好會拍!莊敬高職 影視科 二年級 學生 魯佳寶 在 法國 攝影大賽 大放異彩,就連 莊敬高職 也將他的 獲獎作品 放在 學校 的 官方網站 上。

據相關報導,這其實是 他 首次 參加 國際級 的 法國 「巴黎Px3攝影大賽 (The Prix de la Photographie Paris)」,魯佳寶 的 2張 參賽作品 在 80多個國家、數千件作品中 競逐獎項,均獲得相當於佳作的Honorable mentioned榮譽獎,相當不容易。

根據 魯佳寶 在個人臉書的動態,他獲獎的作品其中一張是在2016年12月29日拍下的,他寫道,在拍照的時候,樓下突然傳來一聲:嘿 ! 樓上的小帥哥,你在拍照嗎?當時他緊張地回答:呃…對阿,心中則想著:糟糕!難道捷運裡不能拍照?

結果那位捷運保全回答他: 記得把我拍帥一點!於是保全就維持這個動作有一分多鐘之久。魯佳寶 當時也寫下:感謝帥保全讓我拍到心目中的畫面、辛苦了!他也向《三立新聞網》表示,這張作品是他使用Canon EOS 70D機身,搭配TOKINA 11-16mm鏡頭,焦段則是11mm、光圈F11、快門4秒。

另一張作品則是2016年8月23日他參加國家地理攝影之旅所拍下的,地點是美國黃石國家公園。魯佳寶在臉書寫著,Yellowstone National Park × Upper Falls;能看到野生氂牛和糜鹿群在草原上,還好有帶大砲不然真的超遠!至於拍下這張照片的器材,同樣使用Canon EOS 70D,鏡頭則是Tamron150-600mm,焦段為150mm、光圈F8 、快門1/250秒。

獲獎後,他也在臉書寫下心情,「很開心,第一次參加 法國Px3 2017 攝影大賽 (P×3 - The Prix de la Photographie Paris),2張作品都能拿到 Honorable mentioned 榮譽獎;PX3 2017 Honorable Mention - Nature/Sunsets;PX3 2017 Honorable Mention - Press/People/Personality」。魯佳寶也說,感謝一路支持他的人,他會繼續努力、好還要更好!他的臉友也紛紛留言祝福他,會用畫面說故事真的好厲害!加油啦!期待更多作品。
30 changes: 30 additions & 0 deletions tm/data/exampleData/english.txt
@@ -0,0 +1,30 @@
Dams could 'permanently damage Amazon'

The Amazon basin could suffer significant and irreversible damage if an extensive dam building programme goes ahead, scientists say.

Currently, 428 hydroelectric dams are planned, with 140 already built or under construction.

Researchers warn that this could affect the dynamics of the complex river system and put thousands of unique species at risk.
The study is published in the journal Nature.

"The world is going to lose the most diverse wetland on the planet," said lead author Prof Edgardo Latrubesse, from the University of Texas at Austin, US.

The Amazon basin covers more than 6.1 million sq km, and is the largest and most complex river system on the planet.
It has become a key area for hydroelectric dam construction.

But this study suggests that the push for renewable energy along the Amazon's waterways could lead to profound problems.
The international team of researchers who carried out the research is particularly concerned about any disruption to the natural movement of sediment in the rivers.

This sediment provides a vital source of nutrients for wildlife in the Amazon's wetlands. It also affects the way the waterways meander and flow.

"[The sediment is] how the rivers work, how they move, how they regenerate new land, and how they keep refreshing the ecosystems," said Prof Latrubesse.

The Texas researcher said that at present environmental assessments were being carried out for each dam in isolation, looking at their impact on the local area. But he argued a wider approach was needed for the Amazon.

"The problem is nobody is assessing the whole package: the cascade of effects the dams produce on the whole system."

The researchers have highlighted the Madeira, Maranon and Ucayali rivers - all tributaries of the Amazon River - as areas of great concern.
These rivers are home to many unique species, and the scientists say these would be under threat if even a fraction of the planned dams go ahead.
Prof Latrubesse said: "All of these rivers hold huge diversity, with many species that are endemic. Thousands of species could be affected, maybe even go extinct."
The researchers warn that any damage could be irreversible, and they say any risks must be considered before the dams are allowed to go ahead.
1 change: 1 addition & 0 deletions tm/data/stopwords/chinese.txt
@@ -0,0 +1 @@
一,一下,一些,一切,一则,一天,一定,一方面,一旦,一时,一来,一样,一次,一片,一直,一致,一般,一起,一边,一面,万一,上下,上升,上去,上来,上述,上面,下列,下去,下来,下面,不一,不久,不仅,不会,不但,不光,不单,不变,不只,不可,不同,不够,不如,不得,不怕,不惟,不成,不拘,不敢,不断,不是,不比,不然,不特,不独,不管,不能,不要,不论,不足,不过,不问,与,与其,与否,与此同时,专门,且,两者,严格,严重,个,个人,个别,中小,中间,丰富,临,为,为主,为了,为什么,为什麽,为何,为着,主张,主要,举行,乃,乃至,么,之,之一,之前,之后,之後,之所以,之类,乌乎,乎,乘,也,也好,也是,也罢,了,了解,争取,于,于是,于是乎,云云,互相,产生,人们,人家,什么,什么样,什麽,今后,今天,今年,今後,仍然,从,从事,从而,他,他人,他们,他的,代替,以,以上,以下,以为,以便,以免,以前,以及,以后,以外,以後,以来,以至,以至于,以致,们,任,任何,任凭,任务,企图,伟大,似乎,似的,但,但是,何,何况,何处,何时,作为,你,你们,你的,使得,使用,例如,依,依照,依靠,促进,保持,俺,俺们,倘,倘使,倘或,倘然,倘若,假使,假如,假若,做到,像,允许,充分,先后,先後,先生,全部,全面,兮,共同,关于,其,其一,其中,其二,其他,其余,其它,其实,其次,具体,具体地说,具体说来,具有,再者,再说,冒,冲,决定,况且,准备,几,几乎,几时,凭,凭借,出去,出来,出现,分别,则,别,别的,别说,到,前后,前者,前进,前面,加之,加以,加入,加强,十分,即,即令,即使,即便,即或,即若,却不,原来,又,及,及其,及时,及至,双方,反之,反应,反映,反过来,反过来说,取得,受到,变成,另,另一方面,另外,只是,只有,只要,只限,叫,叫做,召开,叮咚,可,可以,可是,可能,可见,各,各个,各人,各位,各地,各种,各级,各自,合理,同,同一,同时,同样,后来,后面,向,向着,吓,吗,否则,吧,吧哒,吱,呀,呃,呕,呗,呜,呜呼,呢,周围,呵,呸,呼哧,咋,和,咚,咦,咱,咱们,咳,哇,哈,哈哈,哉,哎,哎呀,哎哟,哗,哟,哦,哩,哪,哪个,哪些,哪儿,哪天,哪年,哪怕,哪样,哪边,哪里,哼,哼唷,唉,啊,啐,啥,啦,啪达,喂,喏,喔唷,嗡嗡,嗬,嗯,嗳,嘎,嘎登,嘘,嘛,嘻,嘿,因,因为,因此,因而,固然,在,在下,地,坚决,坚持,基本,处理,复杂,多,多少,多数,多次,大力,大多数,大大,大家,大批,大约,大量,失去,她,她们,她的,好的,好象,如,如上所述,如下,如何,如其,如果,如此,如若,存在,宁,宁可,宁愿,宁肯,它,它们,它们的,它的,安全,完全,完成,实现,实际,宣布,容易,密切,对,对于,对应,将,少数,尔后,尚且,尤其,就,就是,就是说,尽,尽管,属于,岂但,左右,巨大,巩固,己,已经,帮助,常常,并,并不,并不是,并且,并没有,广大,广泛,应当,应用,应该,开外,开始,开展,引起,强烈,强调,归,当,当前,当时,当然,当着,形成,彻底,彼,彼此,往,往往,待,後来,後面,得,得出,得到,心里,必然,必要,必须,怎,怎么,怎么办,怎么样,怎样,怎麽,总之,总是,总的来看,总的来说,总的说来,总结,总而言之,恰恰相反,您,意思,愿意,慢说,成为,我,我们,我的,或,或是,或者,战斗,所,所以,所有,所谓,打,扩大,把,抑或,拿,按,按照,换句话说,换言之,据,掌握,接着,接著,故,故此,整个,方便,方面,旁人,无宁,无法,无论,既,既是,既然,时候,明显,明确,是,是否,是的,显然,显著,普通,普遍,更加,曾经,替,最后,最大,最好,最後,最近,最高,有,有些,有关,有利,有力,有所,有效,有时,有点,有的,有着,有著,望,朝,朝着,本,本着,来,来着,极了,构成,果然,果真,某,某个,某些,根据,根本,欢迎,正在,正如,正常,此,此外,此时,此间,毋宁,每,每个,每天,每年,每当,比,比如,比方,比较,毫不,没有,沿,沿着,注意,深入,清楚,满足,漫说,焉,然则,然后,然後,然而,照,照着,特别是,特殊,特点,现代,现在,甚么,甚而,甚至,用,由,由于,由此可见,的,的话,目前,直到,直接,相似,相信,相反,相同,相对,相对而言,相应,相当,相等,省得,看出,看到,看来,看看,看见,真是,真正,着,着呢,矣,知道,确定,离,积极,移动,突出,突然,立即,第,等,等等,管,紧接着,纵,纵令,纵使,纵然,练习,组成,经,经常,经过,结合,结果,给,绝对,继续,继而,维持,综上所述,罢了,考虑,者,而,而且,而况,而外,而已,而是,而言,联系,能,能否,能够,腾,自,自个儿,自从,自各儿,自家,自己,
自身,至,至于,良好,若,若是,若非,范围,莫若,获得,虽,虽则,虽然,虽说,行为,行动,表明,表示,被,要,要不,要不是,要不然,要么,要是,要求,规定,觉得,认为,认真,认识,让,许多,论,设使,设若,该,说明,诸位,谁,谁知,赶,起,起来,起见,趁,趁着,越是,跟,转动,转变,转贴,较,较之,边,达到,迅速,过,过去,过来,运用,还是,还有,这,这个,这么,这么些,这么样,这么点儿,这些,这会儿,这儿,这就是说,这时,这样,这点,这种,这边,这里,这麽,进入,进步,进而,进行,连,连同,适应,适当,适用,逐步,逐渐,通常,通过,造成,遇到,遭到,避免,那,那个,那么,那么些,那么样,那些,那会儿,那儿,那时,那样,那边,那里,那麽,部分,鄙人,采取,里面,重大,重新,重要,鉴于,问题,防止,阿,附近,限制,除,除了,除此之外,除非,随,随着,随著,集中,需要,非但,非常,非徒,靠,顺,顺着,首先,高兴,是不是,说说
55 changes: 55 additions & 0 deletions tm/global.r
@@ -0,0 +1,55 @@
#
# desc : any necessary library or package should be loaded in global.r, not in server.R or ui.r
#

#install.packages("shiny")
#install.packages("plotly")
#install.packages("ggplot2")
#install.packages("DT")
#install.packages("RCurl")

library(shiny)
library(plotly)
library(ggplot2)
library(DT)
library(RCurl)

# try the path used when launching from the repository root first, then
# fall back to the app directory; note source() on a missing file raises
# an error, not a warning, so the handler must catch errors
tryCatch({
  source("tm/sophia.r")
}, error = function(e) {
  source("sophia.r")
})


# plain-text example
# read a text file line by line and concatenate the lines into one string
processFile = function(filepath) {
  allData <- ""
  con = file(filepath, "r")

  while ( TRUE ) {
    line = readLines(con, n = 1, encoding = "UTF-8")
    if ( length(line) == 0 ) {
      # end of file reached
      break
    }
    allData <- paste(allData, line, sep = "\n")
  }

  close(con)
  return(allData)
}

plainDataImportEN <- processFile("data/exampleData/english.txt")
plainDataImportCN <- processFile("data/exampleData/chinese.txt")

webUrlImportEN <- "http://www.bbc.co.uk/news/resources/idt-d60acebe-2076-4bab-90b4-0e9a5f62ab12"
webUrlImportCNOri <- c(
"https://udn.com/news/story/6897/2528524",
"http://www.ithome.com.tw/news/114828",
"http://technews.tw/2017/06/19/ai-doctor/"
)
webUrlImportCN <- webUrlImportCNOri[3]





145 changes: 145 additions & 0 deletions tm/server.R
@@ -0,0 +1,145 @@

#
# desc : map the numeric import option to its type ("text" or "url")
#
getImportType <- function(opt) {
  dataImport <- "text"
  if (as.numeric(opt) == 1) {
    dataImport <- "text"
  } else if (as.numeric(opt) == 2) {
    dataImport <- "url"
  }
  return(dataImport)
}

#
# desc : split a comma-separated string into a character vector
#
splitStrByComma <- function(stringTerms) {
  return(as.vector(strsplit(stringTerms, ',')[[1]]))
}

# shiny server main function
shinyServer(function(input, output, session) {

  #################
  # run every session
  #################

  # most frequent terms
  output$mostFreqTermsTable <- renderDataTable({

    # data import type
    dataImport <- getImportType(input$dataImport)

    # data / web content
    dataImportField <- input$dataImportField

    # data language
    dataLang <- input$dataLang

    # data transformation
    # might be NA
    dataTrans <- splitStrByComma(input$dataTrans)

    # data filter
    # might be NA
    dataFilter <- splitStrByComma(input$dataFilter)

    # word length in tdm
    wordLength <- input$wordLength

    # weighting method
    weightMethod <- c(
      as.numeric(input$weightMethod),
      input$smartTerm,
      input$smartDoc,
      input$smartNormalization
    )

    # data analysis
    findFreqTerms <- input$findFreqTerms

    # get all terms
    findTermsByFreq(dataImport, dataImportField, dataLang, dataTrans, dataFilter, wordLength, weightMethod, findFreqTerms, "//p")
  })

  # term associations
  output$assocTermsTable <- renderDataTable({

    # data import type
    dataImport <- getImportType(input$dataImport)

    # data / web content
    dataImportField <- input$dataImportField

    # data language
    dataLang <- input$dataLang

    # data transformation
    # might be NA
    dataTrans <- splitStrByComma(input$dataTrans)

    # data filter
    # might be NA
    dataFilter <- splitStrByComma(input$dataFilter)

    # word length in tdm
    wordLength <- input$wordLength

    # weighting method
    weightMethod <- c(
      as.numeric(input$weightMethod),
      input$smartTerm,
      input$smartDoc,
      input$smartNormalization
    )

    # association terms
    termAssoc <- splitStrByComma(input$termAssoc)

    # association ratio
    findAssocsRatio <- as.numeric(input$findAssocsRatio)

    # get association terms
    findAssocsByTerms(dataImport, dataImportField, dataLang, dataTrans, dataFilter, wordLength, weightMethod, "//p", termAssoc, findAssocsRatio)

  })

  # dictionary listing
  output$dictListing <- renderDataTable({

    # data import type
    dataImport <- getImportType(input$dataImport)

    # data / web content
    dataImportField <- input$dataImportField

    # data language
    dataLang <- input$dataLang

    # data transformation
    # might be NA
    dataTrans <- splitStrByComma(input$dataTrans)

    # data filter
    # might be NA
    dataFilter <- splitStrByComma(input$dataFilter)

    # word length in tdm
    wordLength <- input$wordLength

    # terms for listing dictionaries
    dictList <- splitStrByComma(input$dictList)

    # list the terms found in the given dictionary
    listDictByTerms(dataImport, dataImportField, dataLang, dataTrans, dataFilter, wordLength, "//p", dictList)

  })
})







