Natural Language Processing (NLP) Basics

Mehvin edited this page Jul 31, 2018 · 36 revisions

What is NLP?

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding. Below are a few of the NLP techniques that were used over the course of the project.

NLP Techniques

Tokenization

Input: Melvin and Joe enjoyed working together the past 6 months.
Output: {Melvin, and, Joe, enjoyed, working, together, the, past, 6, months, .}

Input: 我爱我的电脑
Output: {我, 爱, 我的, 电脑}

Given a character sequence and a defined document unit, tokenization is the task of chopping the sequence up into pieces, called tokens, possibly discarding certain characters, such as punctuation, at the same time.
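As a rough illustration of the definition above, the following sketch splits English text into word and punctuation tokens with a regular expression. The function name `tokenize`, the `keep_punctuation` flag and the pattern are assumptions made for this example, not the project's code. It does not segment Chinese text such as the second example, which needs a dedicated word segmenter because words are not separated by spaces.

```python
import re

# Assumed pattern for this sketch: a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text, keep_punctuation=True):
    """Split text into word tokens and, optionally, punctuation tokens."""
    tokens = TOKEN_PATTERN.findall(text)
    if not keep_punctuation:
        # Keep only tokens that start with a word character.
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens


if __name__ == "__main__":
    sentence = "Melvin and Joe enjoyed working together the past 6 months."
    print(tokenize(sentence))
    print(tokenize(sentence, keep_punctuation=False))
```

Setting `keep_punctuation=False` corresponds to the "throwing away certain characters" part of the definition; whether punctuation should be kept depends on the downstream task.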

Part-of-Speech (PoS) Tagging

A word's part of speech describes how the word is used in a sentence. English has eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. Part-of-Speech (PoS) tagging means labeling each word with its appropriate part of speech. Below is an example:

Image

PoS tagging assigns linguistic information to each word, indicating its meaning and its syntactic role in the sentence, and it is a prerequisite for other NLP techniques. In this project, PoS tagging is done with the spaCy library.
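To make the labeling step concrete, here is a minimal sketch that tags tokens by looking them up in a small hand-written table. It is a toy stand-in, not spaCy and not the project's code: the lexicon `TOY_LEXICON`, the tag names and the function `tag_tokens` are all assumptions made for illustration.

```python
# Hypothetical hand-written lexicon mapping lowercase words to a part of speech.
TOY_LEXICON = {
    "melvin": "NOUN",
    "joe": "NOUN",
    "and": "CONJUNCTION",
    "enjoyed": "VERB",
    "working": "VERB",
    "together": "ADVERB",
    "the": "ADJECTIVE",  # articles are grouped with adjectives in the eight-class scheme
    "months": "NOUN",
}


def tag_tokens(tokens):
    """Pair each token with its tag from the lexicon, or "UNKNOWN" if absent."""
    return [(token, TOY_LEXICON.get(token.lower(), "UNKNOWN")) for token in tokens]


if __name__ == "__main__":
    for token, tag in tag_tokens(["Melvin", "and", "Joe", "enjoyed", "working", "together"]):
        print(f"{token}\t{tag}")
```

A plain lookup cannot handle words whose part of speech depends on context (for example, "work" as a noun or a verb), which is one reason trained taggers are used in practice. With spaCy, the comparable label is available as `token.pos_` on each token once a loaded pipeline that includes a tagger has processed the text.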

Stemming

Lemmatization

Stop words Removal