12 Chapter: NLP Track
The Natural Language Processing Track
Overview
Data scientists use natural language processing (NLP), which combines ideas from machine learning, computer science, and linguistics, to extract data and insights from text. In this unit, you'll start with several common Python-based tools and approaches to NLP, along with advanced data wrangling techniques specific to text data. Next, you’ll learn a suite of NLP techniques: some are based on basic machine learning algorithms, and others are deep learning-based techniques such as word2vec. As you work through the NLP track, you’ll develop ideas for your second capstone project and settle on one proposal at the end of the unit.
Note: Keep in mind that the new techniques introduced in this unit may apply to your second capstone project.
Unit Plan (What you’ll learn, Words to know, What will help)
Work to Submit:
- Three ideas for your capstone project 2
- Capstone project 2 proposal
What You’ll Learn: Learning Objectives
- Brainstorm three ideas for Capstone Project 2
- Gain an overview of natural language processing fundamentals, techniques and applications used in Python, including:
- Advanced data wrangling techniques focused on text and language data
- Common text analysis techniques, e.g. text classification and topic modeling
- Using Python libraries, e.g. NLTK and spaCy
- Text analysis and Natural Language Processing (NLP) using scikit-learn.
- Introduction to deep learning (DL) techniques for NLP
- Word2vec: Representation of text data using deep learning
- Implementing word2vec in practice using Python libraries such as Keras and TensorFlow
- Practice applying some basic NLP processes in two interactive exercises.
Words to Know: Key Terms & Concepts
- Text classification: The use of ML to classify text documents into categories, e.g. spam/non-spam or real/fake news.
- Sentiment analysis: A set of techniques used to analyze a document or text to determine whether it reflects a positive or negative emotion.
- Topic modeling: A set of techniques used to extract salient topics or themes from a document.
- Word vectors: Representation of words (and, by extension, text documents) as numeric vectors.
- Deep Neural Networks: A set of machine learning algorithms that automatically detect complex patterns in unstructured data.
- Recurrent Neural Networks (RNNs): A type of Deep Neural Network designed for analyzing sequential data, e.g. speech or time series.
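To make the first key term concrete, here is a minimal sketch of text classification: a multinomial Naive Bayes classifier with add-one smoothing, written with only the Python standard library. The class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the toy training sentences are all illustrative assumptions, not part of the course materials. In practice you would more likely reach for scikit-learn or NLTK, as covered later in this unit.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        # Return the label with the highest log posterior score.
        return max(scores, key=scores.get)


# Toy, made-up training data for illustration only.
train_docs = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction_spam = classifier.predict("claim your free prize")
prediction_ham = classifier.predict("monday project meeting")
print(prediction_spam, prediction_ham)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps unseen words (such as "claim") from zeroing out a class.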
What Will Help
- Keep in mind the basics of supervised and unsupervised learning.
- Keep in mind lessons learned from the first capstone project. The second project is an opportunity to improve on all the previous skills and apply new ones.
This section introduces various libraries that you’ll use to perform NLP in Python, including NLTK, scikit-learn, and spaCy. It covers “traditional” techniques rather than newer deep learning-based techniques you’ll learn in the next section. Although they are older, these techniques are proven and often implemented to produce quick, practical results.
- Video: Patrick Harrison | Modern NLP in Python ==> Jupyter Notebook. Beyond the basic text analysis that we worked on while studying Bayesian Inference, Python has a vast toolset for natural language processing. While you don’t need to be an expert in computational linguistics, a basic awareness of NLP techniques is critical in this age of unstructured data. This PyData talk by Patrick Harrison covers the commonly used NLP techniques and tools in Python.
- Audio: Data Wrangling with Python. Listen to this episode of Talk Python To Me with Katharine Jarmul about the book she co-authored, Data Wrangling with Python, and her PyCon UK presentation, How to Automate your Data Cleanup with Python. Links to the book and the talk are available on the page, along with links to a vast array of the tools she discusses.
Note: The discussed tools are enormously relevant to cleaning up and using messy text data.
- Article: A practitioner's guide to NLP Part 1 - Processing and understanding text - https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72 In this highly detailed and practical article, Springboard mentor and NLP expert DJ Sarkar explains the basics of text processing, including text wrangling, parsing, and entity recognition. Highly recommended for both the writing and the exercises.
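As a rough illustration of the kind of text wrangling these resources discuss, the sketch below strips HTML tags, removes accents, lowercases, drops punctuation and stopwords, and counts the remaining terms. It uses only the standard library. The `normalize` and `tokenize` helpers, the tiny `STOPWORDS` set, and the sample sentence are assumptions made for this example and are not taken from the article or the talk; NLTK and spaCy provide far more complete tokenizers and stopword lists.

```python
import re
import unicodedata
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def normalize(text):
    """Strip HTML tags and accents, lowercase, and drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = unicodedata.normalize("NFKD", text)      # split accents from letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace


def tokenize(text):
    """Split normalized text on whitespace and filter out stopwords."""
    return [tok for tok in normalize(text).split() if tok not in STOPWORDS]


# Made-up sample input for illustration only.
raw = "<p>The café is in the CENTER of the city, and the city is BIG!</p>"
tokens = tokenize(raw)
term_counts = Counter(tokens)
print(tokens)
print(term_counts.most_common(2))
```

The resulting term counts are the raw material for the bag-of-words representations that many traditional NLP techniques build on.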
Deep Learning has revolutionized machine learning and data science in general, and NLP is no exception. In this unit, you’ll learn the basics of deep learning as it applies to NLP and go through some practical tools and applications.
What you’ll learn
- What's TensorFlow?
- Hardware, software, and language requirements
- Creating a TensorFlow model
- Training a deep learning model with TensorFlow
- Visualizing the computational graph
- Adding custom visualizations to TensorBoard
- Exporting models for use with Google Cloud