Natural Language Processing (NLP) Basics
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding. The sections below describe a few NLP techniques that were used over the course of the project.
Input | Output |
---|---|
Melvin and Joe enjoyed working together the past 6 months. | {Melvin, and, Joe, enjoyed, working, together, the, past, 6, months, .} |
我爱我的电脑 | {我, 爱, 我的, 电脑} |
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
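As a rough illustration of the English row in the table above, here is a minimal tokenizer sketch in plain Python. It is not how spaCy tokenizes; the regular expression and the `tokenize` helper are assumptions made for this example only.

```python
import re
from typing import List

# Assumed pattern: a run of word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Melvin and Joe enjoyed working together the past 6 months."
print(tokenize(sentence))
```

This keeps the final full stop as its own token, as in the table. A pattern like this only works for languages that separate words with spaces: a Chinese sentence such as 我爱我的电脑 comes back as a single token, which is why Chinese needs a dedicated word segmenter.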
The part of speech explains how a word is used in a sentence. There are eight main parts of speech - nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. Part-Of-Speech (PoS) Tagging simply means labeling words with their appropriate PoS. Below is an example:
This is done to support other NLP techniques and to capture each word's meaning and syntactic role by assigning linguistic information to the words.
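To make the idea of labeling concrete, here is a toy lookup-based tagger in plain Python. It is only a sketch: the `LEXICON` contents, the fallback rules and the tag names (borrowed from the Universal Dependencies tag set) are assumptions for illustration, and a real tagger such as the one in spaCy is a statistical model that also looks at context.

```python
from typing import Dict, List, Tuple

# Hypothetical hand-written lexicon; a real tagger learns this from data.
LEXICON: Dict[str, str] = {
    "and": "CCONJ",
    "enjoyed": "VERB",
    "working": "VERB",
    "together": "ADV",
    "the": "DET",
    "past": "ADJ",
    "months": "NOUN",
    ".": "PUNCT",
}


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with a PoS label from the lexicon or a fallback rule."""
    tagged = []
    for token in tokens:
        if token.lower() in LEXICON:
            label = LEXICON[token.lower()]
        elif token.isdigit():
            label = "NUM"  # fallback: digits are numerals
        elif token[:1].isupper():
            label = "PROPN"  # fallback: unknown capitalised word is a name
        else:
            label = "NOUN"  # fallback: unknown words default to noun
        tagged.append((token, label))
    return tagged


tokens = ["Melvin", "and", "Joe", "enjoyed", "working", "together",
          "the", "past", "6", "months", "."]
for token, label in tag(tokens):
    print(token, label)
```

A lookup like this cannot handle words whose part of speech depends on the sentence, which is the main reason practical taggers use context.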
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
Stemming
Input | Output |
---|---|
Is | Is |
Ponies | Pony |
Having | Hav |
Cats | Cat |
Stemming is a fast and simple (‘brute force’) method compared to lemmatization. It is pattern-based (rule-based), e.g. stripping the end of a word that ends with ‘es’, ‘s’, ‘ss’ or ‘ing’.
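The rule-based idea can be sketched in a few lines of Python. The rule table below is an assumption chosen so that the four rows of the stemming table come out as shown (including an extra ‘ies’ to ‘y’ rule); it is not the Porter algorithm or any library's stemmer.

```python
# Hypothetical rule table: (suffix, replacement), longest suffix first.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", ""), ("s", "")]

# Assumed guard so that very short words such as "Is" are left alone.
MIN_STEM_LENGTH = 2


def stem(word: str) -> str:
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if lowered.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


for word in ["Is", "Ponies", "Having", "Cats"]:
    print(word, "->", stem(word))
```

Because the rules only look at the spelling, the result is not always a real word ("Having" becomes "Hav"), which is the trade-off for speed.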
Lemmatization
Input | Output |
---|---|
Is | Be |
Ponies | Pony |
Having | Have |
Cats | Cat |
Lemmatization is more thorough and requires additional steps. It relies on PoS tagging and a vocabulary (dictionary), and aims to remove only the inflectional endings and return the base form of the word.
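The role of the vocabulary and the PoS tag can be illustrated with a small dictionary lookup. This is a sketch only: `LEMMA_TABLE`, its entries and the coarse tag names are hypothetical, and a real lemmatizer uses a far larger vocabulary plus rules for unseen words.

```python
from typing import Dict, Tuple

# Hypothetical miniature vocabulary keyed by (lowercased word, PoS tag).
LEMMA_TABLE: Dict[Tuple[str, str], str] = {
    ("is", "VERB"): "be",
    ("having", "VERB"): "have",
    ("ponies", "NOUN"): "pony",
    ("cats", "NOUN"): "cat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the base form for a word and its PoS tag.

    Words missing from the table are returned lowercased and unchanged.
    """
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("Is", "VERB"))
print(lemmatize("Having", "VERB"))
print(lemmatize("meeting", "VERB"), lemmatize("meeting", "NOUN"))
```

The two "meeting" entries show why the PoS tag is needed: the same spelling maps to a different base form depending on whether it is used as a verb or as a noun.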
Stop word removal filters out words that are common in a particular language and contribute little to the meaning of a document.
Examples of English stop words: a, i, the, this, there
Examples of Chinese stop words: 的, 不, 在, 有, 是
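Stop word removal amounts to filtering a token list against a set. The sketch below uses only the example words listed above as its stop word sets; a practical list is much longer, and the `remove_stop_words` helper is a name made up for this example.

```python
from typing import Collection, Iterable, List

# Stop word sets built only from the examples above; real lists are longer.
ENGLISH_STOP_WORDS = frozenset({"a", "i", "the", "this", "there"})
CHINESE_STOP_WORDS = frozenset({"的", "不", "在", "有", "是"})


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Collection[str] = ENGLISH_STOP_WORDS,
) -> List[str]:
    """Keep only the tokens whose lowercased form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["Melvin", "and", "Joe", "enjoyed", "working", "together",
          "the", "past", "6", "months", "."]
print(remove_stop_words(tokens))
```

The comparison is done on whole tokens, so the text has to be tokenized first; for Chinese, pass `CHINESE_STOP_WORDS` as the second argument after segmenting the sentence into words.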
These NLP techniques were implemented using the spaCy library.
Completed by Melvin and Joe