Smarti uses Redlink NLP for the analysis of Conversations. In addition it provides several specific analysis components that are implemented as extensions to Redlink NLP.
This documentation focus is on those extensions
Processors are components that plug directly into Redlink NLP.
Interesting Terms is a kind of Keyword Extraction that uses tf-idf
over a document corpus to detect the most relevant terms within a conversation. Implementation wise Solr is used to manage the text corpus and Solr MLT requests are used to retrieve relevant terms.
It consists of two analysis components:
-
Interesting Term Extractor that marks interesting terms in the conversation. As the text in Conversations is typically to short to calculate the importance based on the content external information like the
tf-idf
value of words are used to calculate the importance of words. -
Interesting Phrase Collector that collects interesting terms marked form possible multiple Interesting Term Extractor and creates Tokens for them. This component can optionally use phrases as annotated by phrase extraction and/or dependency parsing to mark whole phrases that contain words marked as interesting terms.
Tokens created by this component have the type Keyword
and the Hint interestingTerm
For more information on the configuration of Interesting Term Extractors see the according section in the Smarti Configuration
Analysis Component that applies all registered Token Filters. If a single filter wants to filter a Token the token is removed from the conversation.
Stopword Token Filter
This Analysis component allows to filter extracted Tokens based on their value. Filtering is done based on stopword lists. One list is used for all languages where additional language specific lists can also be configured.
See the Stoword Token Filter configuration section of the Smarti Configuration for more information.
Analysis Component that allows to process extracted Tokens by defined Rules
Regex Token Processing Ruleset
Tokens in the original texts are replaced with <{Token.Type}>
. This allows to wirte regex patters against the type of the token instead of the actual value.
Typically those rules are used to add Hint
s to token extracted from a conversation. The following example showns Regex pattern that assigns the from
and to
hints to Place
type tokens extracted from a text
public MyGermanTokenMatcher(){ super("de", EnumSet.of(MessageTopic.Reiseplanung), EnumSet.of(Type.Date, Type.Place)); addRule(String.format("vo(?:n|m) (<%s>)", Type.Place), Hint.from); addRule(String.format("nach (<%s>)", Type.Place), Hint.to); }
NOTE: Currently it is not possible to configure Rules. To add Rulesets code needs to be written.
Adds the hint negated
to extracted Tokens if mention of the Token is negated (e.g. "no Pizzaia", "nicht über München")
Allows to create Tokens for Words with specific Part-of-Speech (POS) tags. By default this component is configured to create Tokens with the type Attribute
for words that are classified as adjectives.
This analysis components also supports a stopword list with words for those no tokens are created.
In German Location mentions often include words that are classified as Adjectives. A typical example would be a reference to the Munich Main Station as Münchner Hauptbahnhof
. While Named Entity Recognition (NER) typically marks the noun Hauptbahnhof
as place if often fails to also mark the adjective Müncher
.
This processor looks for cases like this and changes the Named Entity to also include the adjective.
Uses a Regex pattern to find URLs mentioned in conversations and makrs them as Named Entities with tag url
and type misc
. Those Named Entities are later converted to Tokens by the Named Entity Collector.
Templates are higher level models over extracted Token
. Templates defines Slots. Those Slots do have specific Roles and may have [0..*] extracted tokens assigned.
LATCH is a model for Information Retrieval. The acronym stands for the different search dimensions that are represented as Roles in this template:
-
Location: Spatial (Geo) based search. In Smarti Tokens with the
Place
type are assigned this role -
Alphabet: Refers to the full text component of searches. In Smarti the "surface form" of extracted Keywords and Tokens that do not fall into one of the other dimensions are used.
-
Time: Refers to the temporal location of the search. In Samrti Tokens with the Time type are assigned this role.
-
Category: Categorizations are also an important search dimension. In Samrti Tokens with the Category type are assigned this role
-
Hierarchy: Refers to hierarchical searches like Rating Systems ([1..5] Stars, [0..10] Points …) but could also be used for price ranges and similar things. Currently this dimension is not used by Smarti.
All roles in this template are optional. The only requirement is that at least a single Token
is assigned to any of the defined roles.