Data Curation And Indexing with ElasticSearch

Solution

In this assignment, I use scalaj-http for handling HTTP, scala-xml for handling XML, and JSON4S for handling JSON.

Run ‘spark-submit’ with ‘--packages "org.scalaj:scalaj-http_2.11:2.4.2","org.json4s:json4s-native_2.11:3.5.3"’

First, load all files into an RDD with the ‘wholeTextFiles’ method.

Second, convert the content of every file from a string to an XML object.

This gives an array of key-value pairs whose values are the XML objects.
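The first two steps can be sketched as below. This is a minimal sketch rather than the code in this repository: the object name `LoadCases`, the function name `loadCases` and the input directory `cases_test` are assumptions, and it relies on Spark and scala-xml being on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.xml.{Elem, XML}

// Hypothetical helper object; not taken from the repository.
object LoadCases {
  // wholeTextFiles yields one (path, content) pair per file;
  // each content string is then parsed into an XML element.
  def loadCases(sc: SparkContext, dir: String): RDD[(String, Elem)] =
    sc.wholeTextFiles(dir).mapValues(content => XML.loadString(content))

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("load-cases"))
    // "cases_test" is an assumed input directory.
    val cases: Array[(String, Elem)] = loadCases(sc, "cases_test").collect()
    cases.foreach { case (path, xml) => println(s"$path -> <${xml.label}>") }
    sc.stop()
  }
}
```

Keeping the file path as the key means the filename is still available when the document is built later.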

Third, analyse each XML object with the analyseXML method. This method splits the XML object into its XML elements and sends each of them to the NLP server for named entity recognition (NER).

This produces several Map objects containing the filename, sentences and NERs, which are then sent to the Elasticsearch server.
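As an illustration of how the NER results could be folded into such a Map, here is a sketch that uses only the Scala standard library. It assumes the NLP server's answer has already been reduced to (word, label) pairs with the labels PERSON, LOCATION and ORGANIZATION. That token format and the names `NerGrouping`, `groupEntities` and `toDocument` are hypothetical and may not match analyseXML.

```scala
// Hypothetical helper object; the token format is an assumption.
object NerGrouping {
  // Labels kept in the index; every other label (for example "O") is ignored.
  val entityLabels = Set("PERSON", "LOCATION", "ORGANIZATION")

  // Merge runs of consecutive tokens that share a label into one entity string,
  // then collect the distinct entities per lower-cased label.
  def groupEntities(tokens: Seq[(String, String)]): Map[String, List[String]] = {
    val runs = tokens.foldLeft(List.empty[(String, List[String])]) {
      case ((label, words) :: rest, (word, l)) if l == label =>
        (label, word :: words) :: rest
      case (acc, (word, l)) =>
        (l, List(word)) :: acc
    }
    runs.reverse
      .collect { case (label, words) if entityLabels(label) =>
        label.toLowerCase -> words.reverse.mkString(" ")
      }
      .groupBy(_._1)
      .map { case (label, pairs) => label -> pairs.map(_._2).distinct }
  }

  // Assemble one document whose keys follow the index field names.
  def toDocument(filename: String,
                 sentences: List[String],
                 tokens: Seq[(String, String)]): Map[String, Any] = {
    val entities = groupEntities(tokens)
    Map(
      "filename" -> filename,
      "sentences" -> sentences,
      "person" -> entities.getOrElse("person", Nil),
      "location" -> entities.getOrElse("location", Nil),
      "organization" -> entities.getOrElse("organization", Nil)
    )
  }
}
```

Merging consecutive tokens keeps a multi-word entity such as a full name as a single list entry instead of one entry per word.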

Finally, each Map object is converted to a JSON string in updateDocument and sent to the Elasticsearch server.
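A possible shape for this last step, using the two libraries named above, is sketched here. The object name, the `toJson` helper, the default URL (built from the index and type used in the example queries) and the choice of a POST that lets Elasticsearch assign the document id are all assumptions. The actual updateDocument may differ.

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.native.Serialization.write
import scalaj.http.Http

// Hypothetical sketch; not the repository's implementation.
object UpdateDocument {
  implicit val formats: Formats = DefaultFormats

  // Serialize the document Map to a compact JSON string.
  def toJson(doc: Map[String, Any]): String = write(doc)

  // POST the JSON to the index and return the HTTP status code.
  // The default URL is an assumption based on the example queries.
  def updateDocument(doc: Map[String, Any],
                     url: String = "http://localhost:9200/legal_idx/cases"): Int =
    Http(url)
      .postData(toJson(doc))
      .header("Content-Type", "application/json")
      .asString
      .code
}
```

Returning the status code lets the caller log or retry documents that Elasticsearch did not accept.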

Index design

Field	Type	Description
filename	Text	Name of the file the document came from
name	Text	Name of the case
AustLII	Text	URL of the case
catchphrases	Text(List)	Summary of the case, stored as a list of text
sentences	Text(List)	Sentences contained in the legal case report, stored as a list of text
person	Text(List)	Person entities found by NER in the XML
location	Text(List)	Location entities found by NER in the XML
organization	Text(List)	Organization entities found by NER in the XML
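The design above could be expressed as an explicit mapping. The sketch below builds the request body as a string. It assumes an Elasticsearch version that still uses mapping types (the example queries address a ‘cases’ type), and the names `IndexMapping` and `mappingJson` are hypothetical. Elasticsearch has no separate list type, so the Text(List) fields are declared as plain text and simply receive arrays of strings.

```scala
// Hypothetical sketch; field names are taken from the index design.
object IndexMapping {
  val fields = List(
    "filename", "name", "AustLII", "catchphrases",
    "sentences", "person", "location", "organization"
  )

  // Build a mapping body that declares every field as text.
  // The docType default is an assumption based on the example queries.
  def mappingJson(docType: String = "cases"): String = {
    val props = fields.map(f => s""""$f":{"type":"text"}""").mkString(",")
    s"""{"mappings":{"$docType":{"properties":{$props}}}}"""
  }
}
```

The resulting string would be sent as the body of the request that creates the index, before any documents are added.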

Example queries

Query based on general terms:

curl -X GET \
"http://localhost:9200/legal_idx/cases/_search?pretty&q=(criminal%20AND%20law)"

Queries based on entity type:

curl -X GET \
"http://localhost:9200/legal_idx/cases/_search?pretty&q=location:%22New%20South%20Wales%22"

curl -X GET \
"http://localhost:9200/legal_idx/cases/_search?pretty&q=person:John"

curl -X GET \
"http://localhost:9200/legal_idx/cases/_search?pretty&q=organization:Arts"

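The same search URLs can be built programmatically. The sketch below uses only the Java standard library for percent-encoding. The names `SearchUrl`, `general` and `byEntity` are hypothetical. The entity helper wraps the value in double quotes so that a multi-word value is matched as a phrase against the named field, rather than only its first word.

```scala
import java.net.URLEncoder

// Hypothetical helper object; the base URL comes from the curl examples.
object SearchUrl {
  val base = "http://localhost:9200/legal_idx/cases/_search"

  // URLEncoder emits '+' for spaces, so swap in %20 to match the curl examples.
  def encode(q: String): String =
    URLEncoder.encode(q, "UTF-8").replace("+", "%20")

  // All terms must match: (term1 AND term2 ...).
  def general(terms: String*): String = {
    val query = terms.mkString("(", " AND ", ")")
    s"$base?pretty&q=${encode(query)}"
  }

  // Phrase query against one entity field, for example person or location.
  def byEntity(field: String, value: String): String = {
    val phrase = field + ":\"" + value + "\""
    s"$base?pretty&q=${encode(phrase)}"
  }
}
```

For example, `SearchUrl.byEntity("location", "New South Wales")` yields a URL whose query part is `q=location%3A%22New%20South%20Wales%22`.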