Natural Language Processing (NLP) Basics

Mehvin edited this page Jul 31, 2018 · 36 revisions

What is NLP?

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding. The table below shows a few NLP techniques that were used over the course of the project.

NLP Techniques

Tokenization

Input | Output
Melvin and Joe enjoyed working together the past 6 months. | {Melvin, and, Joe, enjoyed, working, together, the, past, 6, months, .}
我爱我的电脑 ("I love my computer") | {我, 爱, 我的, 电脑}

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
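
As a rough illustration of the definition above, the sketch below tokenizes the English example with a regular expression from Python's standard library. It is not the spaCy tokenizer used in the project; the names `TOKEN_PATTERN` and `tokenize` and the `keep_punctuation` flag are assumptions made for this example.

```python
import re

# Hypothetical pattern: a run of word characters, or a single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text, keep_punctuation=True):
    """Split text into word and punctuation tokens."""
    tokens = TOKEN_PATTERN.findall(text)
    if keep_punctuation:
        return tokens
    # Keep only tokens that contain at least one letter or digit.
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


sentence = "Melvin and Joe enjoyed working together the past 6 months."
print(tokenize(sentence))
print(tokenize(sentence, keep_punctuation=False))
```

A pattern like this relies on spaces and punctuation to find word boundaries, so it would return 我爱我的电脑 as a single token. Segmenting Chinese text needs a dictionary-based or statistical segmenter instead.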

Part-of-Speech (PoS) Tagging

The part of speech explains how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. Part-of-speech (PoS) tagging simply means labeling each word with its appropriate PoS. Below is an example:

[Image: example of a PoS-tagged sentence]

PoS tagging assigns linguistic information to each word, which indicates the word's meaning and its syntactic role in the sentence. Other NLP techniques, such as lemmatization, depend on this information.
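
To make the idea of labeling words concrete, here is a minimal lookup-based sketch. It is not how spaCy or any statistical tagger works: the `LEXICON` dictionary and the `tag` function are hypothetical, the lexicon is hand-written for one sentence, and the tag names are borrowed from the Universal POS tag set.

```python
# Hypothetical hand-written lexicon using Universal POS tag names.
# A real tagger learns these decisions from annotated text and
# uses the surrounding context to resolve ambiguous words.
LEXICON = {
    "melvin": "PROPN",
    "and": "CCONJ",
    "joe": "PROPN",
    "enjoyed": "VERB",
    "working": "VERB",
    "together": "ADV",
    "the": "DET",
    "past": "ADJ",
    "months": "NOUN",
}


def tag(tokens):
    """Pair each token with a PoS tag from the toy lexicon."""
    tagged = []
    for token in tokens:
        if token.isdigit():
            pos = "NUM"
        elif not any(ch.isalnum() for ch in token):
            pos = "PUNCT"
        else:
            # "X" is the Universal POS tag for other or unknown words.
            pos = LEXICON.get(token.lower(), "X")
        tagged.append((token, pos))
    return tagged


tokens = ["Melvin", "and", "Joe", "enjoyed", "working",
          "together", "the", "past", "6", "months", "."]
for token, pos in tag(tokens):
    print(token, pos)
```

A lookup table cannot tell apart words that change role with context (for example "past" as an adjective or a noun), which is the main reason practical taggers are trained models.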

Stemming and Lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

Stemming

Input | Output
Is | Is
Ponies | Pony
Having | Hav
Cats | Cat

Stemming is a fast and simple (‘brute force’) method compared to lemmatization. It is pattern-based or rule-based, e.g. it strips the end of a word that ends with ‘es’, ‘s’, ‘ss’ or ‘ing’, so the result is not always a real word.
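
The rule-based idea can be sketched as a small suffix stripper. This is a toy, not the Porter stemmer or any library's algorithm: `SUFFIX_RULES` and `stem` are hypothetical names, and the rules (including the extra "ies" to "y" rule and the choice to leave very short words alone) were picked only so that the output reproduces the stemming table above.

```python
# Hypothetical ordered rules: (suffix, replacement). The first match wins.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("es", ""),
    ("s", ""),
]


def stem(word):
    """Strip one suffix from a lowercased word using the toy rules."""
    word = word.lower()
    # Assumption: words of one or two letters are left untouched.
    if len(word) <= 2:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["Is", "Ponies", "Having", "Cats"]:
    print(w, "->", stem(w))
```

Because the rules look only at the spelling, they also damage words they should leave alone: `stem("sing")` returns `"s"`. This kind of over-stemming is the price paid for speed and simplicity.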

Lemmatization

Input | Output
Is | Be
Ponies | Pony
Having | Have
Cats | Cat

Lemmatization is more principled than stemming and requires additional steps. It relies on PoS tagging and a vocabulary (dictionary), and it aims to remove only the inflectional endings and return the base (dictionary) form of the word, known as the lemma.
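
A minimal sketch of why lemmatization needs both a dictionary and the PoS tag is shown below. The `LEMMA_TABLE` entries and the `lemmatize` function are hypothetical and cover only a handful of words; a real lemmatizer, such as the one in spaCy, uses a far larger vocabulary and additional rules.

```python
# Hypothetical mini-dictionary keyed by (lowercased word, PoS tag).
LEMMA_TABLE = {
    ("is", "AUX"): "be",
    ("ponies", "NOUN"): "pony",
    ("having", "VERB"): "have",
    ("cats", "NOUN"): "cat",
    # The same spelling can have different lemmas depending on its PoS.
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Look up the lemma for a word and PoS tag.

    Falls back to the lowercased word when the pair is unknown.
    """
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("Is", "AUX"))
print(lemmatize("Having", "VERB"))
print(lemmatize("meeting", "VERB"), lemmatize("meeting", "NOUN"))
```

The two "meeting" entries show the dependency on PoS tagging: without the tag there is no way to choose between "meet" and "meeting".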

Stop Word Removal

Stop word removal filters out stop words, which are common words of a particular language that contribute little to the meaning of the document.

Examples of English stop words

  • a
  • i
  • the
  • this
  • there
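
Filtering can be sketched as a simple set lookup. The stop list here contains only the five English examples above, and `STOP_WORDS` and `remove_stop_words` are hypothetical names; libraries such as spaCy ship much longer, language-specific lists.

```python
# Toy stop list built from the five examples above.
STOP_WORDS = {"a", "i", "the", "this", "there"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["I", "left", "the", "book", "there"]
print(remove_stop_words(tokens))
```

Comparing lowercased tokens means "I" and "The" at the start of a sentence are removed as well, while the surviving tokens keep their original casing.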

Examples of Chinese stop words

These NLP techniques were implemented using the spaCy library.