GitHub - zentiment/soda: Solr Dictionary Annotator (Microservice for Spark)

##Solr Dictionary Annotator

###Introduction

The Solr Dictionary Annotator (SoDA) is a Dictionary-based Annotator (or Gazetteer) that supports exact as well as fuzzy lookups across multiple lexicons.

SoDA is backed by a Solr index that stores each entity's names (and synonyms) together with the identifier for that entity. Fast, FST-based span lookup is done using the SolrTextTagger project. Additional fuzzy lookup features are supported using OpenNLP and a mix of normalization strategies.
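To make the shape of an index entry more concrete, here is a minimal Python sketch of one entity record being flattened into a document for indexing. The class name `Entity`, the helper `to_index_doc`, and the field names `id`, `lexicon`, and `names` are all hypothetical and chosen only for illustration; the actual Solr schema used by SoDA may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """One dictionary entry: an identifier plus its names and synonyms.

    Field names are assumptions for this sketch, not SoDA's schema.
    """
    id: str
    lexicon: str
    names: List[str] = field(default_factory=list)


def to_index_doc(entity: Entity) -> Dict[str, object]:
    """Flatten an entity into a dict with a multi-valued names field."""
    return {
        "id": entity.id,
        "lexicon": entity.lexicon,
        "names": list(entity.names),
    }


aspirin = Entity("e1", "drugs", ["aspirin", "acetylsalicylic acid"])
doc = to_index_doc(aspirin)
```

Keeping the identifier on the same document as the names means a matched span can be mapped straight back to its entity.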

###Usage

SoDA provides a JSON-over-HTTP interface. Requests are submitted to SoDA as JSON documents over HTTP POST, and SoDA responds with JSON documents. This form of API allows us to be language-agnostic and cross-platform. SoDA can be accessed from individual clients, Spark standalone applications, and the Databricks Notebook environment using Python and Scala. Details of SoDA's REST API can be found here.
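As a rough illustration of this request/response pattern, the Python sketch below builds a JSON POST request using only the standard library. The URL, the helper names `build_request` and `send`, and the payload fields (`text`, `lexicon`, `matching`) are placeholders assumed for this example; they are not taken from SoDA's REST API documentation, which should be consulted for the real endpoint and schema.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL from SoDA's REST API docs.
SODA_URL = "http://localhost:8080/soda/annotate"


def build_request(text, lexicon, matching="exact", url=SODA_URL):
    """Build (but do not send) a JSON-over-HTTP POST request.

    The payload field names are assumptions made for illustration.
    """
    payload = {"text": text, "lexicon": lexicon, "matching": matching}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_request("aspirin relieves headache", "drugs"))` would perform the round trip against a running service. Building the request separately from sending it makes the payload easy to inspect without a server.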

###Architecture

In terms of architecture, the SoDA system looks something like this.

Callers invoke the annotate (TBD) function with the necessary parameters, which results in a JSON/HTTP call to the SoDA webapp. Some calls, such as exact and lowercase lookups, are passed directly to the SolrTextTagger. Other calls, such as punctuation-normalized, unordered, or fuzzy lookups, need the input string to be tokenized and the appropriate query made to Solr instead. For example, punctuation-normalized lookups require sentence normalization to ensure we don't match across sentence boundaries, and unordered or fuzzy lookups require extracting phrases and matching them.
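The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mode labels (`exact`, `lower`, `punct`, `unordered`, `fuzzy`), the function names, and the regex-based sentence splitter are all hypothetical stand-ins. The real system uses SolrTextTagger and OpenNLP, neither of which is reproduced here.

```python
import re

# Mode labels are assumptions; the text names the lookup kinds only.
TAGGER_MODES = {"exact", "lower"}
QUERY_MODES = {"punct", "unordered", "fuzzy"}


def route(mode):
    """Decide which path a lookup takes, per the description above."""
    if mode in TAGGER_MODES:
        return "tagger"
    if mode in QUERY_MODES:
        return "tokenize_then_query"
    raise ValueError("unknown matching mode: %r" % (mode,))


def split_sentences(text):
    """Naive regex splitter; a stand-in for a real sentence detector."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def normalize_punctuation(sentence):
    """Replace punctuation with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", sentence).split())


def prepare_punctuation_lookup(text):
    """Normalize each sentence separately so no match spans two sentences."""
    return [normalize_punctuation(s) for s in split_sentences(text)]
```

Processing each sentence on its own is what keeps a normalized phrase from matching across a sentence boundary once the punctuation that marked the boundary has been removed.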

###More Information

Citing

If you need to cite SoDA in your work, please use the following citation:

Pal, Sujit (2015). Solr Dictionary Annotator [Computer Software]; https://github.com/elsevierlabs-os/soda

