qEndpoint Full Text Indexing

Antoine Willerval edited this page Jul 19, 2023 · 9 revisions

In qEndpoint, you can configure the repo_model.ttl file to build indexes for full-text or GeoSPARQL search.

Several example models are available here; this page describes how to add a simple node to handle indexing.

Prefixes

The prefixes used in this page are:

@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .
@prefix search: <http://www.openrdf.org/contrib/lucenesail#> .

Simple text indexing

You can describe a simple node to do text indexing like this one:

# Specify the main node
mdlc:main mdlc:node _:mainNode .
_:mainNode mdlc:type mdlc:luceneNode ;
            # Describe the location of the lucene directory, you can use mdlc:parsedString for template strings
            mdlc:dirLocation "${locationNative}lucene"^^mdlc:parsedString ;
            # Define the language(s) indexed by the Lucene index, here "fr" (French) and "es" (Spanish) (uncomment to add)
            # mdlc:luceneLang "es", "fr" ;
            # Define the node's ID, this parameter is important if you are using multiple indexes, if this ID is set, you need to add
            # the ID in the query
            # search:indexid my:luceneIndex ;
            # Define the reindex query for the lucene sail, the query should be ordered by ?s
            mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
            # Describe the evaluation mode of the queries, for native or endpointStore storage, use NATIVE
            mdlc:luceneEvalMode "NATIVE"^^mdlc:parsedString.

For the location on disk, you can use predefined options such as locationNative; all the predefined options are listed here.
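As a sketch, the same node with the two commented options enabled might look like the following. It only reuses the values shown in the comments above (the languages "es" and "fr", and the placeholder ID my:luceneIndex); once search:indexid is set, queries also have to pass the same ID.

```turtle
@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .
@prefix search: <http://www.openrdf.org/contrib/lucenesail#> .

mdlc:main mdlc:node _:mainNode .

_:mainNode mdlc:type mdlc:luceneNode ;
    mdlc:dirLocation "${locationNative}lucene"^^mdlc:parsedString ;
    # Only index Spanish and French literals
    mdlc:luceneLang "es", "fr" ;
    # Placeholder index ID, to be repeated in the search:matches pattern
    search:indexid my:luceneIndex ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
    mdlc:luceneEvalMode "NATIVE"^^mdlc:parsedString .
```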

You can then search with the search virtual properties in your SPARQL queries:

PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>

?subj search:matches [
    search:query "search terms..." ;
    search:property my:property ;
    # specify the index ID of the node, mandatory if it was specified in the model.ttl file.
    # search:indexid my:luceneIndex ;
    search:score ?score ;
    search:snippet ?snippet ] .

Multiple indexes with flux filtering

You can find demo models here

With qEndpoint, you can configure multiple full-text indexes, each with its own rules for what it indexes and searches.

In your model.ttl, you can create a Lucene node as explained in the Simple text indexing part, and you can also add filters so that the node only handles certain triples. The filters only affect the sails during add/remove/select operations, not during dataset indexing, so you need to specify a custom mdlc:luceneReindexQuery query to compute the index at indexing time.

Filters

Start by creating a filter node. Here we call it _:filterNode, and it filters the node _:luceneNode:

_:filterNode mdlc:type mdlc:filterNode ;
             mdlc:paramFilter [
                  mdlc:type mdlc:typeFilterLuceneExp
             ] ;
             mdlc:paramLink _:luceneNode .

You describe the type of the filter with the mdlc:type predicate. Multiple types are available:

  • mdlc:typeFilterLuceneExp : Will only pass the SPARQL queries with a Lucene search:matches query.

    Example:

    _:filterNode mdlc:type mdlc:filterNode ;
                 mdlc:paramFilter [
                      mdlc:type mdlc:typeFilterLuceneExp
                 ] ;
                 mdlc:paramLink _:luceneNode .
  • mdlc:typeFilterLuceneGeoExp : Will only pass the SPARQL queries with a Lucene GeoSPARQL query.

    Example:

    _:filterNode mdlc:type mdlc:filterNode ;
                 mdlc:paramFilter [
                      mdlc:type mdlc:typeFilterLuceneGeoExp
                 ] ;
                 mdlc:paramLink _:luceneNode .
  • mdlc:predicateFilter : Will only pass during add/remove/get the triples with the described predicate(s)

    • Required param: mdlc:typeFilterPredicate <predicates>

    Example, here my:text1, my:text2 and my:text3 are the filtered predicates, but you can also specify only 1 or more than 3:

    _:filterNode mdlc:type mdlc:filterNode ;
                 mdlc:paramFilter [
                       mdlc:type mdlc:predicateFilter ;
                       # The filtered predicates
                       mdlc:typeFilterPredicate my:text1, my:text2, my:text3
                 ] ;
                 mdlc:paramLink _:luceneNode .
  • mdlc:languageFilter : Will only pass during add/remove/get the triples with a literal of a particular language. The mdlc:luceneLang parameter is faster for Lucene nodes; this filter is mentioned here for custom implementations.

    • Required param: mdlc:languageFilterLang "langs": set the filtered languages
    • Optional param: mdlc:acceptNoLanguageLiterals []: allows literals without a language to pass

    Example, here "es", "fr" and "it" are the filtered languages, but you can also specify only 1 or more than 3:

    _:filterNode mdlc:type mdlc:filterNode ;
                 mdlc:paramFilter [
                       mdlc:type mdlc:languageFilter ;
                       # The filtered languages
                       mdlc:languageFilterLang "es", "fr", "it" ;
                       # Do we accept literals without any language
                       # mdlc:acceptNoLanguageLiterals []
                 ] ;
                 mdlc:paramLink _:luceneNode .
  • mdlc:typeFilter : Will only pass during add/remove/get the triples with a subject of a particular type. The mdlc:multiFilterNode node is faster and better suited to multiple type checks.

    • Required param: mdlc:typeFilterPredicate <is_of_type>: describe the type predicate to define the type of a subject
    • Required param: mdlc:typeFilterObject <types>: the filtered types

    Example, here my:type1 and my:type2 are the filtered types, but you can also specify only 1 or more than 2:

    _:filterNode mdlc:type mdlc:filterNode ;
                 mdlc:paramFilter [
                       mdlc:type mdlc:typeFilter ;
                       # The predicate describing the type for a subject
                       mdlc:typeFilterPredicate my:oftype ;
                       # The filtered types
                       mdlc:typeFilterObject my:type1, my:type2
                 ] ;
                 mdlc:paramLink _:luceneNode .
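Because the filters do not apply during dataset indexing, a filtered index usually needs a reindex query that selects the same triples. The following is a minimal sketch under some assumptions: that the main node can point directly at the filter node, and that a FILTER on ?p is an acceptable way to restrict the reindex query to my:description. Neither is confirmed by this page.

```turtle
@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .

# Assumption: the main node points at the filter, which forwards to the index
mdlc:main mdlc:node _:filterNode .

_:filterNode mdlc:type mdlc:filterNode ;
    mdlc:paramFilter [
        mdlc:type mdlc:predicateFilter ;
        mdlc:typeFilterPredicate my:description
    ] ;
    mdlc:paramLink _:luceneNode .

_:luceneNode mdlc:type mdlc:luceneNode ;
    mdlc:dirLocation "${locationNative}lucene"^^mdlc:parsedString ;
    # Reindex only the triples that the filter would let through
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o . FILTER(?p = <http://example.org/#description>)} order by ?s" ;
    mdlc:luceneEvalMode "NATIVE" .
```

The reindex query still binds ?s, ?p and ?o and stays ordered by ?s, as in the unfiltered examples.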

Filters boolean operations

Now we can filter our streams, but what if we want to use multiple filters? qEndpoint also has a syntax for that. It is done by using the mdlc:paramFilterAnd and mdlc:paramFilterOr predicates in the mdlc:paramFilter.

Example 1

_:filterNode mdlc:type mdlc:filterNode ;
             mdlc:paramFilter [
                  mdlc:type mdlc:typeFilterLuceneGeoExp ;
                  mdlc:paramFilterOr [
                      mdlc:type mdlc:typeFilterLuceneExp
                  ]
             ] ;
             mdlc:paramLink _:luceneNode .

Here the filter passes only the expressions that contain a GeoSPARQL query or a full-text search query. The mdlc:paramFilterOr can contain multiple filters; its predicates are the same as those of the mdlc:paramFilter objects.

Example 2

_:filterNode mdlc:type mdlc:filterNode ;
             mdlc:paramFilter [
                  mdlc:type mdlc:typeFilterLuceneExp ;
                  mdlc:paramFilterAnd [
                      mdlc:type mdlc:predicateFilter ;
                      mdlc:typeFilterPredicate my:description ;
                  ]
             ] ;
             mdlc:paramLink _:luceneNode .

In this example, only the expressions with a full-text search and the triples with a my:description predicate pass the filter; everything else is filtered out. It can be used, for example, to index all the descriptions.

The boolean operator priorities are the same as in most programming languages.

[] mdlc:paramFilter [
    mdlc:type <FILTER_A> ;
    mdlc:paramFilterAnd [
        mdlc:type <FILTER_B>
    ] ;
    mdlc:paramFilterOr [
        mdlc:type <FILTER_C>
    ]
] .

This little example can be translated to this expression:

(FILTER_A and FILTER_B) or FILTER_C
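As a concrete sketch of this precedence, the filter types described earlier can be combined as follows. It is written in plain Turtle with ; between predicates, and the my:description predicate is only an illustrative placeholder.

```turtle
@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .

_:filterNode mdlc:type mdlc:filterNode ;
    mdlc:paramFilter [
        # FILTER_A: queries with a full-text search
        mdlc:type mdlc:typeFilterLuceneExp ;
        # FILTER_B: triples with the my:description predicate
        mdlc:paramFilterAnd [
            mdlc:type mdlc:predicateFilter ;
            mdlc:typeFilterPredicate my:description
        ] ;
        # FILTER_C: queries with a GeoSPARQL search
        mdlc:paramFilterOr [
            mdlc:type mdlc:typeFilterLuceneGeoExp
        ]
    ] ;
    mdlc:paramLink _:luceneNode .
```

Following the rule above, this should read as (typeFilterLuceneExp and predicateFilter) or typeFilterLuceneGeoExp.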

Multitype filters

Type filtering is important, but mdlc:typeFilter is not optimized for multiple type checks in the same flux. For that, you need to use a mdlc:multiFilterNode node.

Example

_:multiTypeFilter mdlc:type mdlc:multiFilterNode ;
                  mdlc:typeFilterPredicate my:typeof ;
                  mdlc:node [
                      mdlc:typeFilterObject my:type1;
                      mdlc:node my:luceneNode1
                  ] , [
                      mdlc:typeFilterObject my:type2;
                      mdlc:node my:luceneNode2
                  ] .

In this example, the my:typeof predicate is used as the type predicate and 2 types are selected: my:type1, linked with the my:luceneNode1 node, and my:type2, linked with the my:luceneNode2 node (you can specify more than 2 types).

It can be used, for example, to have one luceneNode indexing the clients of a company and another one indexing its products; these 2 sets can be big and are obviously not overlapping.
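The client/product case could be sketched as below. The types my:Client and my:Product, the node names my:companyFilter, my:clientIndex and my:productIndex, and the use of rdf:type as the type predicate are all hypothetical choices for illustration.

```turtle
@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix search: <http://www.openrdf.org/contrib/lucenesail#> .

mdlc:main mdlc:node my:companyFilter .

# Route each subject to one index according to its rdf:type
my:companyFilter mdlc:type mdlc:multiFilterNode ;
    mdlc:typeFilterPredicate rdf:type ;
    mdlc:node [
        mdlc:typeFilterObject my:Client ;
        mdlc:node my:clientIndex
    ] , [
        mdlc:typeFilterObject my:Product ;
        mdlc:node my:productIndex
    ] .

my:clientIndex mdlc:type mdlc:luceneNode ;
    search:indexid my:clientIndex ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://example.org/#Client>} order by ?s" ;
    mdlc:dirLocation "${locationNative}clientIndex"^^mdlc:parsedString ;
    mdlc:luceneEvalMode "NATIVE" .

my:productIndex mdlc:type mdlc:luceneNode ;
    search:indexid my:productIndex ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://example.org/#Product>} order by ?s" ;
    mdlc:dirLocation "${locationNative}productIndex"^^mdlc:parsedString ;
    mdlc:luceneEvalMode "NATIVE" .
```

Each index gets its own directory and its own search:indexid, so a query can target clients or products explicitly.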

Node chains

Node chains are used if you want to combine 2 nodes together. The type is mdlc:linkedSailNode.

Example

_:lucenechain1 mdlc:type mdlc:linkedSailNode ;
               mdlc:node _:lucenesail_fr , 
                         _:lucenesail_de , 
                         _:lucenesail_es .

In this example we chain 3 lucene nodes. We can imagine one is only indexing French ("fr"), the 2nd German ("de") and the 3rd Spanish ("es") literals.
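One possible way to define the three chained nodes is sketched below. It assumes mdlc:luceneLang is used to restrict each index to one language, as in the simple text indexing node; the index IDs and directory names are hypothetical.

```turtle
@prefix mdlc: <http://the-qa-company.com/modelcompiler/> .
@prefix my: <http://example.org/#> .
@prefix search: <http://www.openrdf.org/contrib/lucenesail#> .

# Chain the three language-specific indexes
_:lucenechain1 mdlc:type mdlc:linkedSailNode ;
    mdlc:node _:lucenesail_fr ,
              _:lucenesail_de ,
              _:lucenesail_es .

_:lucenesail_fr mdlc:type mdlc:luceneNode ;
    search:indexid my:lucenesail_fr ;
    mdlc:luceneLang "fr" ;
    mdlc:dirLocation "${locationNative}lucene_fr"^^mdlc:parsedString ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
    mdlc:luceneEvalMode "NATIVE" .

_:lucenesail_de mdlc:type mdlc:luceneNode ;
    search:indexid my:lucenesail_de ;
    mdlc:luceneLang "de" ;
    mdlc:dirLocation "${locationNative}lucene_de"^^mdlc:parsedString ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
    mdlc:luceneEvalMode "NATIVE" .

_:lucenesail_es mdlc:type mdlc:luceneNode ;
    search:indexid my:lucenesail_es ;
    mdlc:luceneLang "es" ;
    mdlc:dirLocation "${locationNative}lucene_es"^^mdlc:parsedString ;
    mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
    mdlc:luceneEvalMode "NATIVE" .
```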

Example

In this part we are using the Cocktails dataset.

Using one index

We will first use one index, then split this index by type, and finally by language.

First, create a simple Lucene index using the following model (don't forget to reindex the dataset in the control menu if you're using an already indexed dataset):

# Define main node
mdlc:main mdlc:node _:mainNode .

# Create full text search Lucene index
_:mainNode mdlc:type mdlc:luceneNode ;
            # Describe the location of the lucene directory, you can use mdlc:parsedString for template strings
            mdlc:dirLocation "${locationNative}lucene"^^mdlc:parsedString ;
            # Define the reindex query for the lucene sail, the query should be ordered by ?s
            mdlc:luceneReindexQuery "SELECT * {?s ?p ?o} order by ?s" ;
            # Describe the evaluation mode of the queries, for native or endpointStore storage, use NATIVE
            mdlc:luceneEvalMode "NATIVE" .

This will create an index that will parse all the text literals and allow us to find them.

You can then run your queries, for example this one to find cocktails containing "Margarita" in their labels:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX cocktail: <http://vocabulary.semantic-web.at/cocktail-ontology/>

SELECT * WHERE {
    ?subj search:matches [
            search:query "margarita" ;
            search:property rdfs:label ;
          ] .
    ?subj a cocktail:Cocktail.
    ?subj rdfs:label ?name .
    FILTER (LANG(?name) = "en")
} LIMIT 100

Notice that we first run the full-text search, then remove everything that isn't a cocktail, and then everything that isn't an English literal. In the next sections we will see how to use multiple indexes to avoid doing this at query time.

Using multiple indexes

Using a mdlc:multiFilterNode, we can split our index into 3 indexes, one for each of the 3 types cocktail:Cocktail, cocktail:Ingredients and cocktail:Beverages.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix my: <http://example.org/#> .
@prefix cocktail: <http://vocabulary.semantic-web.at/cocktail-ontology/> .

mdlc:main mdlc:node my:multiTypeFilter .

my:multiTypeFilter mdlc:type mdlc:multiFilterNode ;
                   mdlc:typeFilterPredicate rdf:type ;
                   mdlc:node [
                       mdlc:typeFilterObject cocktail:Cocktail ;
                       mdlc:node my:fulltextindexCocktail
                   ] , [
                       mdlc:typeFilterObject cocktail:Ingredients ;
                       mdlc:node my:fulltextindexIngredients
                   ] , [
                       mdlc:typeFilterObject cocktail:Beverages ;
                       mdlc:node my:fulltextindexBeverages
                   ] .

my:fulltextindexCocktail mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexCocktail ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Cocktail>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexCocktail"^^mdlc:parsedString ;
                       mdlc:luceneEvalMode "NATIVE".

my:fulltextindexIngredients mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexIngredients ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Ingredients>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexIngredients"^^mdlc:parsedString ;
                       mdlc:luceneEvalMode "NATIVE".
                       
my:fulltextindexBeverages mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexBeverages ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Beverages>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexBeverages"^^mdlc:parsedString ;
                       mdlc:luceneEvalMode "NATIVE".

We can then run our query again:

PREFIX my: <http://example.org/#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX cocktail: <http://vocabulary.semantic-web.at/cocktail-ontology/>

SELECT * WHERE {
    ?subj search:matches [
            search:query "margarita" ;
            search:indexid my:fulltextindexCocktail ;
            search:property rdfs:label ;
          ] .
    # ?subj a cocktail:Cocktail.
    ?subj rdfs:label ?name .
    FILTER (LANG(?name) = "en")
} LIMIT 100

The triple pattern that checks the type of the subject is no longer required, because we are using the my:fulltextindexCocktail index, which only contains cocktails.

Using multiple indexes with language splitting

With the type splitting done, we can now split by literal language.

To do this, we use the mdlc:languageFilterLang property of the Lucene index, which is lighter than using one filter per language. We also need to link the indexes, which is done with the mdlc:linkedSailNode node.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix my: <http://example.org/#> .
@prefix cocktail: <http://vocabulary.semantic-web.at/cocktail-ontology/> .

mdlc:main mdlc:node my:multiTypeFilter .

my:multiTypeFilter mdlc:type mdlc:multiFilterNode ;
                   mdlc:typeFilterPredicate rdf:type ;
                   mdlc:node [
                       mdlc:typeFilterObject cocktail:Cocktail ;
                       mdlc:node [
                            mdlc:type mdlc:linkedSailNode ;
                            mdlc:node my:fulltextindexCocktailFr ,
                                      my:fulltextindexCocktailEn ,
                                      my:fulltextindexCocktailIt 
                       ]
                   ] , [
                       mdlc:typeFilterObject cocktail:Ingredients ;
                       mdlc:node [
                            mdlc:type mdlc:linkedSailNode ;
                            mdlc:node my:fulltextindexIngredientsFr ,
                                      my:fulltextindexIngredientsEn ,
                                      my:fulltextindexIngredientsIt 
                       ]
                   ] , [
                       mdlc:typeFilterObject cocktail:Beverages ;
                       mdlc:node [
                            mdlc:type mdlc:linkedSailNode ;
                            mdlc:node my:fulltextindexBeveragesFr ,
                                      my:fulltextindexBeveragesEn ,
                                      my:fulltextindexBeveragesIt 
                       ]
                   ] .

### Cocktail indexes

my:fulltextindexCocktailFr mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexCocktailFr ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Cocktail>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexCocktailFr"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "fr" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexCocktailEn mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexCocktailEn ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Cocktail>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexCocktailEn"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "en" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexCocktailIt mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexCocktailIt ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Cocktail>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexCocktailIt"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "it" ;
                       mdlc:luceneEvalMode "NATIVE".

### Ingredients indexes

my:fulltextindexIngredientsFr mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexIngredientsFr ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Ingredients>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexIngredientsFr"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "fr" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexIngredientsEn mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexIngredientsEn ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Ingredients>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexIngredientsEn"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "en" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexIngredientsIt mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexIngredientsIt ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Ingredients>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexIngredientsIt"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "it" ;
                       mdlc:luceneEvalMode "NATIVE".

### Beverages indexes

my:fulltextindexBeveragesFr mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexBeveragesFr ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Beverages>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexBeveragesFr"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "fr" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexBeveragesEn mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexBeveragesEn ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Beverages>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexBeveragesEn"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "en" ;
                       mdlc:luceneEvalMode "NATIVE".
my:fulltextindexBeveragesIt mdlc:type mdlc:luceneNode ;
                       search:indexid my:fulltextindexBeveragesIt ;
                       mdlc:luceneReindexQuery "SELECT * {?s ?p ?o ; a <http://vocabulary.semantic-web.at/cocktail-ontology/Beverages>} order by ?s" ;
                       mdlc:dirLocation "${locationNative}fulltextindexBeveragesIt"^^mdlc:parsedString ;
                       mdlc:languageFilterLang "it" ;
                       mdlc:luceneEvalMode "NATIVE".

We can then run our query again, this time without the language filtering, using the English index:

PREFIX my: <http://example.org/#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX cocktail: <http://vocabulary.semantic-web.at/cocktail-ontology/>

SELECT * WHERE {
    ?subj search:matches [
            search:query "margarita" ;
            search:indexid my:fulltextindexCocktailEn ;
            search:property rdfs:label ;
          ] .
    # ?subj a cocktail:Cocktail.
    ?subj rdfs:label ?name .
    # FILTER (LANG(?name) = "en")
} LIMIT 100

Here we don't need to search for the language anymore.