Natural Language Processing (NLP) Basics

What is NLP?

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding. The sections below cover a few NLP techniques that were used over the course of the project.

NLP Techniques

Tokenization

Input	Output
Melvin and Joe enjoyed working together the past 6 months.	{Melvin, and, Joe, enjoyed, working, together, the, past, 6, months, .}
我爱我的电脑	{我, 爱, 我的, 电脑}

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
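As a rough illustration of the idea, the sketch below splits English text into word and punctuation tokens with a regular expression. It is a standalone example using only the Python standard library, not the spaCy pipeline used in the project, and the function name `tokenize` is an assumption made for this example. A pattern like this does not segment Chinese text, which has no spaces between words and needs a dedicated word segmenter.

```python
import re

# Assumed pattern for this sketch: a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Melvin and Joe enjoyed working together the past 6 months.")
print(tokens)
```

Here punctuation is kept as separate tokens, matching the table above. Dropping it instead would only require keeping the `\w+` part of the pattern.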

Part-of-Speech (PoS) Tagging

The part of speech explains how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. Part-of-speech (PoS) tagging simply means labeling words with their appropriate PoS.

PoS tagging supports other NLP techniques, such as lemmatization. By assigning linguistic information to each word, it also indicates the word's meaning and its syntactic role in the sentence.
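To make the labeling step concrete, here is a minimal sketch of a lookup-based tagger. The lexicon, the tag names and the function `pos_tag` are all hypothetical and exist only for this example. A real tagger, such as the one in spaCy, uses a trained model and takes the surrounding context into account, which a plain lookup cannot do.

```python
# Hypothetical mini-lexicon mapping lowercase words to a part of speech.
# Real taggers learn this from annotated text and use sentence context.
LEXICON = {
    "melvin": "noun",
    "joe": "noun",
    "and": "conjunction",
    "enjoyed": "verb",
    "working": "verb",
    "together": "adverb",
}

def pos_tag(tokens, default="unknown"):
    """Pair each token with its PoS from the lexicon, or a default tag."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]

for token, tag in pos_tag(["Melvin", "and", "Joe", "enjoyed", "working", "together"]):
    print(token, tag)
```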

Stemming and Lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

Stemming

Input	Output
Is	Is
Ponies	Pony
Having	Hav
Cats	Cat

Stemming is a fast and simple (‘brute force’) method compared to lemmatization. It is pattern-based or rule-based, e.g. stripping the end of a word that ends with ‘es’, ‘s’, ‘ss’ or ‘ing’.
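The rule-based approach can be sketched as a short list of suffix rules tried in order. The rule set below is an assumption chosen to reproduce the stemming table above. It is not a standard algorithm such as the Porter stemmer and not what spaCy does. The minimum stem length of two characters is also an assumption, added so that very short words like "Is" are left alone.

```python
# Assumed suffix rules for this sketch, tried from top to bottom.
# Each rule is (suffix to strip, replacement to append).
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # Ponies  -> Pony
    ("ing", ""),     # Having  -> Hav
    ("ss", "ss"),    # class   -> class (left unchanged)
    ("es", ""),      # organizes -> organiz
    ("s", ""),       # Cats    -> Cat
]

MIN_STEM_LENGTH = 2  # assumed guard so that "Is" is not reduced to "I"

def stem(word):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix):
            base = word[: len(word) - len(suffix)]
            if len(base) >= MIN_STEM_LENGTH:
                return base + replacement
    return word

for word in ["Is", "Ponies", "Having", "Cats"]:
    print(word, stem(word))
```

Because the rules only look at the spelling, the result is not always a real word, as "Hav" shows.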

Lemmatization

Input	Output
Is	Be
Ponies	Pony
Having	Have
Cats	Cat

Lemmatization is more thorough than stemming and requires additional steps. It relies on PoS tagging and a vocabulary (dictionary), and aims to remove only the inflectional endings and return the base form of a word, known as the lemma.
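A minimal sketch of dictionary-based lemmatization follows. The lookup table is keyed by the word and its part of speech, which shows why PoS tagging is needed: the same surface form can have different lemmas depending on its role. The table contents, the tag names and the function `lemmatize` are hypothetical and only cover the words in this example. A real lemmatizer, such as the one in spaCy, uses a much larger vocabulary plus rules.

```python
# Hypothetical lemma dictionary keyed by (lowercase word, part of speech).
LEMMA_TABLE = {
    ("is", "verb"): "be",
    ("having", "verb"): "have",
    ("ponies", "noun"): "pony",
    ("cats", "noun"): "cat",
    ("meeting", "verb"): "meet",     # "We are meeting today"
    ("meeting", "noun"): "meeting",  # "The meeting was long"
}

def lemmatize(word, pos):
    """Return the dictionary lemma for (word, pos), or the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("Is", "verb"))
print(lemmatize("Having", "verb"))
print(lemmatize("meeting", "verb"), lemmatize("meeting", "noun"))
```

Unlike suffix stripping, the lookup can handle irregular forms such as "is" and always returns a real word, at the cost of needing the dictionary and the PoS tag.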

Stop Words Removal

Stop word removal filters out stop words: common words of a particular language that contribute little to the meaning of a document.

Examples of English stop words

a
i
the
this
there

Examples of Chinese stop words

的
不
在
有
是
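Stop word removal can be sketched as a simple filter over a list of tokens. The stop word set below contains only the five English examples listed above. Real stop word lists, including the ones shipped with spaCy, are much longer. The function name `remove_stop_words` is an assumption made for this example.

```python
# Small stop word set taken from the English examples above.
ENGLISH_STOP_WORDS = {"a", "i", "the", "this", "there"}

def remove_stop_words(tokens, stop_words=ENGLISH_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["This", "is", "the", "cat", "I", "like"]
print(remove_stop_words(tokens))
```

The same function works for Chinese once the text has been segmented into tokens, by passing a set such as `{"的", "不", "在", "有", "是"}` as `stop_words`.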

These NLP techniques were implemented using the spaCy library.

Completed by Melvin and Joe
