swissbib/content2SearchDocs: swissbib component for the processing of SearchDocs

Top-level folders and files:
- ant
- config
- gradle
- shellscripts
- src
- xslt
- xsltExamples
- xsltskipRecords
- .gitignore
- README
- build.gradle
- gradlew
- gradlew.bat
- settings.gradle


content2SearchDocs is used by swissbib to transform content into SearchDocs, which serve as input for so-called search servers


The initial design (in 2011) was inspired by the FAST document processing model, which uses separate pipes for different content resources.
These pipes are highly customizable because they are built from stages (generally Python plugins) that perform special transformations


Another role model for this component is the Hydra Framework (http://findwise.github.io/Hydra/), created and maintained by Findwise (http://www.findwise.com/)

Main characteristics:
- pipes are formed by chaining XSLT templates in any order
- an engine (XML2SearchDocEngine) starts the process, holds the chained templates together and provides plugins as XSL extensions.
Plugins may execute any kind of service. For swissbib these are currently:
-- content enrichment with TOC / abstracts.
    Documents are fetched online from content repositories (mostly ILS) and parsed with TIKA.
    Parsed content is stored, which makes later processing of the same document much faster.
-- content enrichment with GND data
-- use of VIAF for content enrichment
-- special tasks like removing duplicate terms for a better relevance ranking
- at the moment we can easily produce SearchDocs for SOLR as well as for ElasticSearch
- because the main transformation is done with XSL templates, no special knowledge of programming languages is necessary to write at least the transformation rules
 for library-related content
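
The idea of forming a pipe by chaining XSLT templates can be sketched with the standard JAXP API. This is a minimal illustration only, not the project's XML2SearchDocEngine: the class XsltChainSketch, the method runChain, the two inline stylesheets and the record/doc/field element names are all assumptions made for the example, and no plugins (XSL extensions) are involved.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: NOT the project's XML2SearchDocEngine.
class XsltChainSketch {

    // Assumed stage 1: maps an invented <record> format to an invented <doc> format.
    static final String TO_DOC =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='/record'>"
      + "<doc><field name='title'><xsl:value-of select='title'/></field></doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Assumed stage 2: copies everything and appends one extra field to <doc>.
    static final String ADD_STAGE =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='doc'>"
      + "<doc><xsl:apply-templates select='field'/><field name='stage'>second</field></doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheets in list order; the output of one stage
    // becomes the input of the next stage.
    static String runChain(String inputXml, List<String> stylesheets) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            String current = inputXml;
            for (String xslt : stylesheets) {
                Transformer stage =
                    factory.newTransformer(new StreamSource(new StringReader(xslt)));
                StringWriter out = new StringWriter();
                stage.transform(new StreamSource(new StringReader(current)),
                                new StreamResult(out));
                current = out.toString();
            }
            return current;
        } catch (TransformerException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("XSLT stage failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<record><title>Faust</title></record>";
        System.out.println(runChain(input, List.of(TO_DOC, ADD_STAGE)));
    }
}
```

Because each stage only sees the output of the previous one, reordering the list changes the result: with ADD_STAGE first, the extra field is never added, since no doc element exists yet at that point.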


Possible topics for further development:
- at the moment only XML documents are supported as input
- ....
