a

wjbmattingly · Sep 13, 2021 · 1918862
1 parent 3734fa0
commit 1918862
Showing 20 changed files with 3,800 additions and 171 deletions.
diff --git a/...1_install_and_containers-checkpoint.ipynb → ...1_install_and_containers-checkpoint.ipynb b/...1_install_and_containers-checkpoint.ipynb → ...1_install_and_containers-checkpoint.ipynb
@@ -2,31 +2,43 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "english-validity",
+   "id": "composite-japanese",
    "metadata": {},
    "source": [
-    "# The Basics of spaCy"
+    "# <center>The Basics of spaCy</center>"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "standard-directory",
+   "id": "clear-subsection",
+   "metadata": {},
+   "source": [
+    "<center>Dr. W.J.B. Mattingly</center>\n",
+    "\n",
+    "<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>\n",
+    "\n",
+    "<center>August 2021</center>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dominican-swiss",
    "metadata": {},
    "source": [
     "In this notebook, we will work with spaCy in concept rather than in code. This entire JupyterBook is designed around approaching spaCy top-down. By this I mean first looking at what spaCy does and can do, and then exploring how to implement that in code. I think this is necessary so that as you explore the smaller components of spaCy, such as the Lemmatizer, you will understand how each fits into the larger architecture of the spaCy framework."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "pharmaceutical-sacramento",
+   "id": "eastern-living",
    "metadata": {},
    "source": [
     "## What is spaCy?"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "anticipated-pursuit",
+   "id": "turkish-cooking",
    "metadata": {},
    "source": [
     "A good way to begin is by exploring the question, \"What is spaCy?\" spaCy (yes, spelled with a lowercase \"s\" and uppercase \"C\") is a natural language processing framework. **Natural language processing**, or NLP, is a branch of linguistics that seeks to parse human language in a computer system. This field is generally referred to as computational linguistics, though it has far-reaching applications beyond academic linguistic research.\n",
@@ -40,15 +52,15 @@
   },
   {
    "cell_type": "markdown",
-   "id": "commercial-simon",
+   "id": "general-movement",
    "metadata": {},
    "source": [
     "## How to Install spaCy"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "accessible-wallace",
+   "id": "suburban-psychology",
    "metadata": {},
    "source": [
     "To install spaCy, I recommend visiting their website: https://spacy.io/usage . It has a nice, user-friendly interface. Input your device settings, e.g. Mac, Windows, or Linux, and your language, e.g. English, French, or German. The web app will automatically populate the commands that you need to execute to get started. Since this is a JupyterBook, we can run these commands in a cell by prefixing each with a \"!\", which indicates that we want to run a terminal command. I will be installing spaCy and the small English model, en_core_web_sm."
@@ -57,7 +69,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "molecular-parker",
+   "id": "utility-argument",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -67,7 +79,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "alpha-bouquet",
+   "id": "explicit-university",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -76,7 +88,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "comprehensive-center",
+   "id": "molecular-rating",
    "metadata": {},
    "source": [
     "Now that we've installed spaCy let's import it to make sure we installed it correctly."
@@ -85,7 +97,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "indian-benjamin",
+   "id": "criminal-objective",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -94,7 +106,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aggressive-logging",
+   "id": "operating-limit",
    "metadata": {},
    "source": [
     "Great! Now, let's make sure we downloaded the model successfully with the command below."
@@ -103,7 +115,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "obvious-trash",
+   "id": "molecular-watson",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -112,31 +124,31 @@
   },
   {
    "cell_type": "markdown",
-   "id": "active-invitation",
+   "id": "turkish-certification",
    "metadata": {},
    "source": [
     "Excellent! spaCy is now installed correctly and we have successfully downloaded the small English model. We will pick up here with the code in the next notebook. For now, I want to focus on big-picture items, specifically spaCy \"containers\"."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "stylish-anthropology",
+   "id": "cellular-tender",
    "metadata": {},
    "source": [
     "## Containers"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "secret-circuit",
+   "id": "visible-movie",
    "metadata": {},
    "source": [
     "Containers are spaCy objects that contain a large quantity of data about a text. When we analyze texts with the spaCy framework, we create different container objects to do that. Here is a full list of all spaCy containers. We will be focusing on the three shown in bold: Doc, Span, and Token."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "extra-program",
+   "id": "robust-rehabilitation",
    "metadata": {},
    "source": [
     "* <b>Doc</b>\n",
@@ -151,7 +163,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "identified-creek",
+   "id": "covered-bacon",
    "metadata": {},
    "source": [
     "I created the image below to show how I visualize spaCy containers in my mind. At the top, we have a Doc container. This is the basis of everything in spaCy. It is the main object that we create. Within the Doc container are many different attributes and subcontainers. One attribute is Doc.sents, which yields all the sentences in the Doc container. The Doc container (and each sentence it yields) is made up of a sequence of Token containers. These are things like words, punctuation, etc.\n",
@@ -163,7 +175,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "current-custom",
+   "id": "victorian-uncle",
    "metadata": {},
    "source": [
     "```{image} ./images/spacy_containers.png\n",
@@ -176,7 +188,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "herbal-setup",
+   "id": "incomplete-location",
    "metadata": {},
    "source": [
     "If you do not fully understand this dynamic, do not worry. You will get a much better sense of this pyramid as we move forward through this book. For now, I recommend keeping this image handy so you can refer back to it as we progress through Part 1 of this book, in which we explore the basics of spaCy. In the next chapter, we will start applying these concepts in code by creating a Doc object and learning about the different attributes containers have, as well as how to find the linguistic annotations."
@@ -185,7 +197,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "swedish-manhattan",
+   "id": "female-grass",
    "metadata": {},
    "outputs": [],
    "source": []

diff --git a/02_linguistic_annotations.ipynb → ...2_linguistic_annotations-checkpoint.ipynb b/02_linguistic_annotations.ipynb → ...2_linguistic_annotations-checkpoint.ipynb
@@ -4,7 +4,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Getting Started with spaCy and its Linguistic Annotations"
+    "# <center>Getting Started with spaCy and its Linguistic Annotations</center>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<center>Dr. W.J.B. Mattingly</center>\n",
+    "\n",
+    "<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>\n",
+    "\n",
+    "<center>August 2021</center>"
    ]
   },
   {