A Few NLP Terms

Tokenizing:

  • Tokenizing is grouping: splitting text into pieces (tokens) according to some constraint is called tokenizing.
  • There are two basic types of tokenizers (see the sketch below):
    • Word Tokenizer
    • Sentence Tokenizer
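
A minimal sketch of both tokenizer types, assuming NLTK is installed and the punkt models have been fetched with nltk.download("punkt"):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Tokenizers split text into pieces."

print(sent_tokenize(text))  # ['NLP is fun.', 'Tokenizers split text into pieces.']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'Tokenizers', 'split', ...]
```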

Corpora:

  • A body of text, e.g. presidential speeches, medical documents, etc. (see the sketch below).
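
NLTK ships several ready-made corpora; a minimal sketch using its inaugural corpus of U.S. presidential addresses, assuming it has been fetched with nltk.download("inaugural"):

```python
from nltk.corpus import inaugural

print(inaugural.fileids()[:3])                    # e.g. ['1789-Washington.txt', ...]
print(inaugural.words("1789-Washington.txt")[:8]) # first words of the speech
```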

Lexicons:

  • Words and their meanings.
  • It is not just the plain English meaning; the meaning changes based on context. For example, "bull" is an investment term in finance and an animal in other contexts (see the sketch below).
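
One way to see such context-dependent meanings is WordNet, the lexical database bundled with NLTK; a minimal sketch, assuming nltk.download("wordnet") has been run:

```python
from nltk.corpus import wordnet

# "bull" has many senses: the animal, the market optimist, etc.
for synset in wordnet.synsets("bull")[:4]:
    print(synset.name(), "-", synset.definition())
```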

Stop Words:

  • Words you don't care about for your task; even words people use sarcastically can be treated as stop words if they carry no signal.
  • They are fillers in the sentence. Basically you want to remove them.
  • Sample stop words for an English corpus are is, and, whom, your, etc.
  • These words make a lot of sense to us, but for data science they add little and just get in the way.
  • Even if you remove these words, the meaning of the sentence remains mostly the same (see the sketch below).
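
A minimal sketch of filtering stop words with NLTK's built-in list, assuming nltk.download("stopwords") and nltk.download("punkt") have been run:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
words = word_tokenize("This is a sample sentence, showing off stop word filtration.")

filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['sample', 'sentence', ',', 'showing', 'stop', 'word', 'filtration', '.']
```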

Stemming:

  • Finding the root of a word is called stemming, e.g. reading -> read, riding -> ride, etc. (see the sketch below).
  • This becomes necessary because in English we can use many variants of a word in different sentences with the same meaning.
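
A minimal sketch using NLTK's PorterStemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["reading", "riding", "rides", "rider"]:
    print(word, "->", stemmer.stem(word))
# reading -> read, riding -> ride, rides -> ride, rider -> rider
```

Note that a stemmer just chops suffixes by rule, so the result is not always a dictionary word.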

Parts of Speech:

  • This labels the different parts of speech in a given paragraph, like nouns, verbs, adjectives, etc. (see the sketch below).
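
A minimal sketch of POS tagging with NLTK, assuming nltk.download("averaged_perceptron_tagger") has been run:

```python
from nltk import pos_tag, word_tokenize

tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(tagged)  # [('The', 'DT'), ('quick', 'JJ'), ...] -- (word, tag) pairs
```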

Chunking:

  • Grouping tagged words into "chunks" (like noun phrases) based on their part-of-speech tags is called chunking (a combined sketch with chinking follows the next term).

Chinking:

  • Excluding something from existing chunks is called chinking (see the sketch below).
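
A minimal sketch covering both chunking and chinking with NLTK's RegexpParser; the grammar here is just an illustrative assumption:

```python
from nltk import RegexpParser, pos_tag, word_tokenize

grammar = r"""
  NP:
    {<DT>?<JJ>*<NN.*>+}   # chunk: optional determiner, adjectives, then nouns
    }<JJ>{                # chink: pull adjectives back out of those chunks
"""
parser = RegexpParser(grammar)
tree = parser.parse(pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog")))
print(tree)  # NP subtrees mark the chunks that survive the chink rule
```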

Named Entity:

  • It is nothing but chunking parts of speech into some sort of named entity, which helps you chunk and understand the sentence better.
  • A few examples of named entities are: NAME, LOCATION, TIME, DATE, MONEY, PERCENT, etc. (see the sketch below).
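
A minimal sketch with nltk.ne_chunk, assuming the "maxent_ne_chunker" and "words" resources have been downloaded:

```python
from nltk import ne_chunk, pos_tag, word_tokenize

sentence = "Barack Obama visited Paris in 2015."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)  # entities such as PERSON and GPE show up as labeled subtrees
```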

Lemmatizing:

  • It is similar to stemming, but here the word is replaced by its lemma, an actual dictionary word (sometimes a synonym), instead of a chopped-off stem (see the sketch below).
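
A minimal sketch with NLTK's WordNetLemmatizer, assuming nltk.download("wordnet") has been run:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))             # cat
print(lemmatizer.lemmatize("ridden", pos="v"))  # ride
print(lemmatizer.lemmatize("better", pos="a"))  # good (a synonym-like jump)
```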

Word Embedding:

  • One-hot encoding: this kind of word embedding is simply hardcoded and very high-dimensional (one dimension per vocabulary word). Moreover, it is very sparse.
  • Learned word embeddings: this kind of word embedding is trainable, i.e. it can be learned from the given data/corpus. This makes them dense and low-dimensional (see the sketch below).
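
A minimal sketch contrasting the two, with plain NumPy one-hot vectors and a (commented, assumed) Keras Embedding layer standing in for a learned embedding:

```python
import numpy as np

vocab = ["cat", "dog", "bull"]
one_hot = np.eye(len(vocab))  # one sparse, high-dimensional row per word
print(dict(zip(vocab, one_hot.tolist())))
# {'cat': [1.0, 0.0, 0.0], 'dog': [0.0, 1.0, 0.0], 'bull': [0.0, 0.0, 1.0]}

# A learned embedding instead maps word indices to dense, low-dimensional
# vectors whose values are trained from data, e.g. with Keras:
# from tensorflow.keras.layers import Embedding
# embedding = Embedding(input_dim=len(vocab), output_dim=8)  # 8-dim dense vectors
```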