-
Notifications
You must be signed in to change notification settings - Fork 0
swissbib/content2SearchDocs
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
content2SearchDocs is used by swissbib for the transformation of content into SearchDocs as input for so called Search-Server Fist initial design (in 2011) was inspired by the FAST document processing model which uses various pipes for different content resources. These pipes are highly customizable because of stages (generally Python plugins) to perform special transformations Another 'role-model' for this component is the Hydra Framework (http://findwise.github.io/Hydra/) created and maintained by http://www.findwise.com/ Main characteristics: - pipes are formed by chaining XSLT templates in any order - an engine (XML2SearchDocEngine) starts the process, keeps the chained templates together and provides plugins as XSL extensions. Plugins may execute any kind of service. For swissbib this is at the moment: -- content enrichement with TOC / Abstracts. Documents are fetched online from content repositories (mostly ILS) and parsed with TIKA. Once content is parsed it is stored which makes a later process for the same document much faster. -- content enrichement with GND data -- use of VIAF for content enrichement -- special tasks like removing duplicate terms for a better relevance ranking - at the moment we can easily produce SearchDocs for SOLR as well as for ElasticSearch - because the main transformation is done with xsl templates no special knowledge of programming languages is necessary to write at least the transformation rules for library related content possible topics for further development: - at the moment only XML documents are supported as input - ....
About
swissbib component for the processing of SearchDocs
Resources
Stars
Watchers
Forks
Packages 0
No packages published