Natural_Language_Processing
This version of program-y introduces significant enhancements in the area of Natural Language Processing. Building on the translation and sentiment analysis introduced in v3.9, v3.10 introduces the following NLP features:
- Stop Words
  - Pre Processor
  - Dynamic Set
  - Post Question Processor
- Lemmatization
  - Dynamic Map
  - Pre Processor
  - Post Question Processor
- Stemming
  - Dynamic Map
  - Post Question Processor
- Synsets
  - Dynamic Set
  - Extension
- Part of Speech (POS) Tagging
  - Post Question Processor
- NGrams
  - Post Question Processor
- Wordnet
  - Extension
For more details of Stemming and Lemmatization see the following link
The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words (data) are referred to as stop words.
The full list of stop words which will be removed is the following
{"ourselves", "hers", "between", "yourself", "but", "again", "there", "about", "once", "during", "out",
"very", "having", "with", "they", "own", "an", "be", "some", "for", "do", "its", "yours", "such", "into",
"of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as", "from", "him", "each", "the",
"themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were",
"her", "more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up", "to",
"ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have",
"in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can",
"did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself",
"which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by",
"doing", "it", "how", "further", "was", "here", "than"}
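As a rough illustration of what stop word removal does to an input sentence, here is a minimal sketch in Python. It is not program-y's implementation: the function name `remove_stop_words` and the abbreviated word set are assumptions made for this example.

```python
# Abbreviated subset of the stop word list above, for illustration only.
STOP_WORDS = {"is", "a", "the", "of", "to", "what", "in", "and"}


def remove_stop_words(sentence: str) -> str:
    """Return the sentence with stop words dropped (case-insensitive)."""
    kept = [word for word in sentence.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("What is the capital of France"))  # prints "capital France"
```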
There is now a pre processor you can use to remove stop words from your input text.
To use this processor, add the following line to preprocessing.conf
programy.processors.pre.stopwords.StopWordsPreProcessor
To use the post question processor, add the following line to postquestionprocessing.conf
programy.processors.postquestion.stopwords.StopWordsPostQuestionProcessor
You can also include a new dynamic set in your grammars which will tell you if a word is a stop word. A typical category pattern would be as follows.
<category>
<pattern>
IS <set name="stopwords">*</set> A STOP WORD
</pattern>
<template>
Yes <star /> is a stop word
</template>
</category>
In the above example, the pattern matches if the single word matched by '*' is a stop word.
Lemmatization reduces a word to its base form, turning plurals etc. into singulars, such as octopi to octopus or mice to mouse.
The dynamic map will convert the word passed to it into its singular form, as follows
<category>
<pattern>
WHAT IS THE SINGULAR NAME OF A *
</pattern>
<template>
A singular <star /> is called a <map name="lemmatizer" ><star /></map>
</template>
</category>
This will only match if the lemmatizer finds a valid lemma, otherwise no match will occur.
The pre processor will attempt to lemmatize every word in the sentence passed in, reducing all plural terms to their singular version. To use this processor, add the following line to preprocessing.conf
programy.processors.pre.lemmatize.LemmatizePreProcessor
The post question processor is called (if configured) when a sentence fails to match. The sentence is lemmatized and the question asked again. To use this processor, add the following line to postquestionprocessing.conf
programy.processors.postquestion.lemmatize.LemmatizePostQuestionProcessor
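The "ask, and on failure transform and ask again" flow described above can be sketched in a few lines of Python. This only illustrates the control flow and is not program-y code: `ask_with_fallback`, `ask`, `lemmatize`, and the two small dictionaries are all hypothetical stand-ins for the bot's grammar and lemmatizer.

```python
from typing import Callable, Optional


def ask_with_fallback(
    question: str,
    ask: Callable[[str], Optional[str]],
    transform: Callable[[str], str],
) -> Optional[str]:
    """Ask once; if nothing matches, transform the sentence and ask again."""
    answer = ask(question)
    if answer is not None:
        return answer
    return ask(transform(question))


# Hypothetical stand-ins for the bot's grammar and lemmatizer.
ANSWERS = {"WHAT DO MOUSE EAT": "Mostly seeds and grains."}
LEMMAS = {"mice": "mouse"}


def ask(question: str) -> Optional[str]:
    """Return a canned answer, or None when no pattern matches."""
    return ANSWERS.get(question.upper())


def lemmatize(question: str) -> str:
    """Replace each word that has a known lemma; leave other words alone."""
    return " ".join(LEMMAS.get(word.lower(), word) for word in question.split())


print(ask_with_fallback("What do mice eat", ask, lemmatize))
```

The first attempt fails because no pattern contains "MICE"; the lemmatized retry matches.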
Stemming reduces all variants of a specific word down to the base word, e.g. troubles, troubled and troubling are all reduced to troubl. Note the 'e' is missing, as 'troubl' is considered the base term in NLP.
<category>
<pattern>
The base term of * is
</pattern>
<template>
<map name="stemmer" ><star /></map>
</template>
</category>
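The reduction of troubles, troubled and troubling to troubl can be illustrated with a deliberately naive suffix stripper. This is not the stemmer program-y uses, and real stemmers apply many more rules; the function name `naive_stem` and the suffix list are assumptions made for this sketch.

```python
# Suffixes to strip, checked in this order (an assumption for this toy example).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for variant in ("troubles", "troubled", "troubling"):
    print(variant, "->", naive_stem(variant))  # each variant maps to "troubl"
```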
The pre processor will apply stemming rules to all words of the sentence before it is parsed. To use this processor, add the following line to preprocessing.conf
programy.processors.pre.stemming.StemmingPreProcessor
The post question processor will stem the sentence after it failed to match a response, and then ask the question again with stemming applied. To use this processor, add the following lines to postquestionprocessing.conf
programy.processors.postquestion.stemming.StemmingPostQuestionProcessor
Synsets are words which are similar to the original word, e.g. the synsets of 'red' are the words 'red' and 'crimson'; likewise the synsets of 'hack' are 'hack', 'machine_politician', 'cab' and 'chop'.
You can use a dynamic set to check if a word is similar in your pattern match. This provides a greater degree of flexibility in matching clauses.
<category>
<pattern>
I WOULD LIKE A <set name="synsets" similar="dog">*</set>
</pattern>
<template>
Me too, although I would like a cat too
</template>
</category>
This pattern will now match 'I WOULD LIKE A DOG', 'I WOULD LIKE A POOCH' and 'I WOULD LIKE A PUPPY'.
You can use the extension provided to check if 2 words are similar as follows
<category>
<pattern>
SYNSETS SIMILAR * *
</pattern>
<template>
<extension path="programy.nlp.synsets.extension.SynsetsExtension">
SIMILAR <star index="1"/> <star index="2"/>
</extension>
</template>
</category>
Part of Speech Tagging, or POS Tagging, carries out textual analysis on the text string and adds an identifier for each word. Each identifier corresponds to a part of speech such as noun, verb, adjective etc.
For example the sentence 'Python is a high-level, general-purpose programming language.' will be converted into the sentence 'Python NNP is VBZ a DT high-level JJ general-purpose JJ programming NN language NN'.
Note none of the meaning is lost, just additional identifiers are added to the text to help with parsing.
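The tagged form is simply each word followed by its tag. Assuming a tagger has already produced (word, tag) pairs, the interleaving can be sketched as below. The pairs are copied from the example sentence rather than computed, and the function name `interleave_tags` is hypothetical, not part of program-y.

```python
from typing import Iterable, Tuple


def interleave_tags(tagged_words: Iterable[Tuple[str, str]]) -> str:
    """Join (word, tag) pairs into a single 'word TAG word TAG ...' string."""
    parts = []
    for word, tag in tagged_words:
        parts.extend([word, tag])
    return " ".join(parts)


# Pairs taken from the example sentence above, not produced by a real tagger.
tagged = [
    ("Python", "NNP"), ("is", "VBZ"), ("a", "DT"), ("high-level", "JJ"),
    ("general-purpose", "JJ"), ("programming", "NN"), ("language", "NN"),
]
print(interleave_tags(tagged))
```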
You can now construct a pattern grammar which looks for specific POS terms and matches the words as '*'s.
The POS pre processor will convert the sentence into one in which all the words have had a POS tag associated with them. To use this processor, add the following lines to preprocessing.conf
programy.processors.pre.wordtagger.WordTaggerPreProcessor
NGrams are smaller sub sentences created from the original sentence. For example the sentence 'Now is better than never.' would produce the ngrams 'Now is better', 'is better than' and 'better than never'.
The default size is 3 words, but a future version of the NGrammer will allow this to be configured programmatically.
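A sliding window over the words is enough to reproduce the example. The sketch below is an illustration only, not the program-y NGrammer; the function name `ngrams` and its `size` parameter are assumptions.

```python
from typing import List


def ngrams(sentence: str, size: int = 3) -> List[str]:
    """Return every run of `size` consecutive words in the sentence."""
    words = sentence.rstrip(".!?").split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(ngrams("Now is better than never."))
# prints ['Now is better', 'is better than', 'better than never']
```

A sentence shorter than `size` words yields an empty list.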
The post question ngram processor will split the sentence into 3 word ngrams and ask each of them as a question. If it gets a result then that response will be returned. To use this processor, add the following line to postquestionprocessing.conf
programy.processors.postquestion.ngrams.NGramsPostQuestionProcessor
WordNet is a database of thousands of word definitions. You can query the database and have it return the definition in your template grammar.
You can use the WordNet functionality as an extension as follows. This example will display the WordNet definition of any word that is matched to the '*'
<category>
<pattern>
WORDNET *
</pattern>
<template>
<extension path="programy.nlp.wordnet.extension.WordNetExtension">
<star />
</extension>
</template>
</category>
Building on top of the NLP capabilities, a future version will introduce Natural Language Understanding (NLU) enhancements in the form of intent analysis. This will allow direct sentence to canonical form mapping of a category. The NLU engine can be used in multiple places, either
- Pre Processor
- Sentence Repeat Processor
It converts your input sentence into intent, object and descriptors, e.g. any combination of the following
I want to book a flight from edinburgh to san francisco next wednesday at 3:00pm
maps directly to
BOOK FLIGHT * FROM * TO * DATE *
Email: [email protected] | Twitter: @keiffster | Facebook: keith.sterling | LinkedIn: keithsterling | My Blog