Chapter 12: NLP Track

Mikiko Bazeley edited this page Dec 19, 2019 · 1 revision

The Natural Language Processing Track

Overview

Data scientists use natural language processing (NLP), which combines ideas from machine learning, computer science, and linguistics, to extract data and insights from text. In this unit, you'll start with several common Python-based tools and approaches to NLP, along with advanced data wrangling techniques specific to text data. Next, you'll learn a suite of NLP techniques: some are based on basic machine learning algorithms, and others are more recent neural network-based techniques, such as word2vec. As you work through the NLP track, you'll develop ideas for your second capstone project and narrow them down to one proposal at the end of the unit.

Note: Keep in mind that the new techniques introduced in this unit may apply to your second capstone project.

Unit Plan (What you’ll learn, Words to know, What will help)

Work to Submit:

  • Three ideas for Capstone Project 2
  • Capstone Project 2 proposal

What You’ll Learn: Learning Objectives

  • Brainstorm three ideas for Capstone Project 2.
  • Gain an overview of natural language processing fundamentals, techniques, and applications used in Python, including:
      • Advanced data wrangling techniques focused on text and language data
      • Common text analysis techniques, e.g. text classification and topic modeling
      • Python libraries for NLP, e.g. NLTK and spaCy
      • Text analysis and NLP using scikit-learn
      • An introduction to deep learning techniques for NLP
      • Word2vec: representation of text data using deep learning
      • Implementing word2vec in practice using Python libraries such as Keras and TensorFlow
  • Practice applying some basic NLP processes in two interactive exercises.

Words to Know: Key Terms & Concepts

  • Text classification: The use of machine learning to classify text documents into categories, e.g. spam/non-spam or real/fake news.
  • Sentiment analysis: A set of techniques used to analyze a document or text to determine whether it reflects a positive or negative emotion.
  • Topic modeling: A set of techniques used to extract salient topics or themes from a document.
  • Word vectors: Representation of a word (and, by combining word vectors, a text document) as a numeric vector.
  • Deep Neural Networks: A family of machine learning models built from many stacked layers that can automatically detect complex patterns in unstructured data.
  • Recurrent Neural Networks (RNNs): A type of Deep Neural Network designed for analyzing sequential data, e.g. speech or time series.
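To make the idea of representing text as a numeric vector concrete, here is a minimal sketch that uses only the Python standard library. It builds simple bag-of-words count vectors, which is an assumption chosen for illustration; it is not word2vec and not the method used in the course materials. The helper names (`tokenize`, `bag_of_words`, `cosine_similarity`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, vocabulary):
    """Return a count vector for `text`, with one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, used only for illustration.
documents = [
    "Win a free prize now",
    "Claim your free prize today",
    "Meeting notes for the project review",
]

# The vocabulary is every distinct token seen across the documents.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

# Documents that share words end up with more similar vectors.
similar = cosine_similarity(vectors[0], vectors[1])
different = cosine_similarity(vectors[0], vectors[2])
```

Once documents are vectors, techniques such as text classification and topic modeling can operate on numbers rather than raw strings. Learned word vectors such as word2vec go further by placing words with related meanings close together.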

What Will Help

  • Keep in mind the basics of supervised and unsupervised learning.
  • Keep in mind the lessons learned from the first capstone project. The second project is an opportunity to strengthen the skills you have already practiced and to apply new ones.

Chapter 12.2 Fundamentals of NLP

This section introduces various libraries that you’ll use to perform NLP in Python, including NLTK, scikit-learn, and spaCy. It covers “traditional” techniques rather than the newer deep learning-based techniques you’ll learn in the next section. Although they are older, these techniques are well proven and are often used to produce quick, practical results.
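As one illustration of a "traditional" technique, the sketch below implements a tiny multinomial Naive Bayes text classifier with add-one smoothing in plain Python. It is a hand-rolled stand-in written under stated assumptions, not the API of NLTK, scikit-learn, or spaCy, which provide tested implementations you would normally use instead. The function names and the four training examples are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_docs):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Start from the log prior: the share of training docs with this label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, far too small for real use.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("notes from the project meeting", "ham"),
]
model = train_naive_bayes(training_data)
```

Calling `predict("claim your free prize", model)` scores the message against each label and returns the more probable one. A real project would need far more data, a held-out test set, and a library implementation.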

  • Video: Patrick Harrison | Modern NLP in Python (with accompanying Jupyter Notebook)
    Beyond the basic text analysis that we worked on while studying Bayesian inference, Python has a vast toolset for natural language processing. While you don’t need to be an expert in computational linguistics, a basic awareness of NLP techniques is important now that so much data is unstructured text. This PyData talk by Patrick Harrison covers the commonly used NLP techniques and tools in Python.

  • Topic Modeling with Gensim (with Jupyter Notebook)

  • Audio: Data Wrangling with Python
    Listen to this episode of Talk Python To Me with Katharine Jarmul about the book she co-authored, "Data Wrangling with Python", and her PyCon UK presentation, "How to Automate your Data Cleanup with Python". Links to the book and the talk are available on the episode page, along with links to a wide array of the tools she discusses.

Note: The tools discussed in this episode are highly relevant to cleaning up and working with messy text data.
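As a rough idea of what cleaning messy text can involve, here is a minimal sketch using only the Python standard library. The specific steps and the function name `clean_text` are assumptions chosen for illustration; they are not taken from the podcast or the book, and real pipelines usually need steps tailored to the data source.

```python
import html
import re
import unicodedata


def clean_text(raw):
    """Apply a few common normalization steps to a raw text string."""
    # Decode HTML entities first, e.g. "&amp;" becomes "&".
    text = html.unescape(raw)
    # Replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalize Unicode so visually equivalent characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase so "WORLD" and "world" count as the same token.
    text = text.lower()
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical messy input, as it might arrive from a scraped web page.
cleaned = clean_text("  <p>Hello&nbsp;&amp;   WORLD!</p>\n")
```

The order of the steps matters: decoding entities before stripping tags means that escaped markup in the input is also removed, which may or may not be what you want for a given dataset.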

Chapter 12.3 Basics of Deep Learning

Deep learning has revolutionized machine learning and data science in general, and NLP is no exception. In this section, you’ll learn the basics of deep learning as it applies to NLP and work through some practical tools and applications.

Building and Deploying Deep Learning Applications with TensorFlow

What you’ll learn:

  • What's TensorFlow?
  • Hardware, software, and language requirements
  • Creating a TensorFlow model
  • Training a deep learning model with TensorFlow
  • Visualizing the computational graph
  • Adding custom visualizations to TensorBoard
  • Exporting models for use with Google Cloud
