first draft
jpiaskowski committed Jan 28, 2020
0 parents commit eb4ff77
Showing 25 changed files with 13,104 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
rejects/


349 changes: 349 additions & 0 deletions Example/Familydollar_location_scrape-all-states.ipynb


7,933 changes: 7,933 additions & 0 deletions Example/family_dollar_locations.csv


Binary file added Extra_Resources/Krotov2018.pdf
11 changes: 11 additions & 0 deletions README.md
@@ -0,0 +1,11 @@

Presentation for the Pycascades Conference held Feb 7-10, 2020 in Portland, OR. A brief introduction into using beautiful soup for web scraping, using address information of Family Dollar stores to demonstrate.


The presentation was made in RStudio using `reticulate` to call python.






Binary file added images/adventures_in_babysitting.gif
Binary file added images/captain_marvel_binary.jpg
Binary file added images/family_dollar_cascade_cropped.png
Binary file added images/familydollar1.png
Binary file added images/familydollar2.png
Binary file added images/familydollar3.png
Binary file added images/heman_power.gif
Binary file added images/luke_brushesoff_dust.gif
Binary file added images/luke_lightsaber_throwaway.gif
Binary file added images/ren_throws_fit.gif
Binary file added images/rey_repeat.gif
Binary file added images/walgreens1.png
Binary file added images/walgreens2.png
Binary file added images/webscraping_book.png
Binary file added images/yoda_lightsaber.gif
345 changes: 345 additions & 0 deletions py2020.Rmd
@@ -0,0 +1,345 @@
---
title: "Adventures in Babysitting: Introduction to Web Scraping in Python"
author: "Julia Piaskowski"
date: "2020/02/08"
output: ioslides_presentation
---

<style type="text/css">
body p {
color: #282828;
}

ul {
color: #282828;
}

code {
color: #0033cc;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
```

```{r echo=F}
library(reticulate)
virtualenv_create("pycas2020")
# a few things you might need to install
# (json is in the standard library; DataFrame and Tag are not packages)
# py_install(c("requests", "bs4", "pandas"))
use_virtualenv("pycas2020")
```

## Good Way to Learn Python:

```{r, echo=FALSE, out.height='40%'}
knitr::include_graphics("images/webscraping_book.png")
```

## But, Who Actually Reads These A to Z?
(spoiler: not me)

```{r, echo=FALSE, fig.cap="me and my programming books", out.width='100%'}
knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
```

## What we really need to know:
* the tools available in `BeautifulSoup` and `requests`
* what to look for in html code
* parsing json objects with `json`
* rudimentary `pandas` skills
* `<idea> All you need to know about html is how tags work </idea>`
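
The `json` piece, for instance, is a single call. A minimal sketch with an invented record (the field names mirror the `ld+json` blocks scraped later; the values are made up):

```{python}
import json

# hypothetical record, shaped like the embedded store data
raw = '{"@type": "PostalAddress", "addressLocality": "Arco", "addressRegion": "ID"}'
record = json.loads(raw)  # str -> dict

print(record['addressLocality'])  # Arco
```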

## What to Look for in a Scraping Project:

Structured data with a regular repeatable format.

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/rey_repeat.gif")
```

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

## Ethics in Scraping

Accessing vast troves of information can be intoxicating:

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/heman_power.gif")
```

Just because we can doesn't mean we should...

## Legal Considerations
(note: I have zero legal training)

* Are you scraping copyrighted material?
* Will your scraping activity compromise individual privacy?
* Are you making a large number of requests that may overload or damage a server?
* Is it possible the scraping will expose intellectual property you do not own?
* Are there terms of service governing use of the website and are you following those?
* Will your scraping activities diminish the value of the original data?


## Dollar Stores are Taking Over the World!

```{r, echo=FALSE, fig.cap="Store in Cascade, Idaho", out.width='60%'}
knitr::include_graphics("images/family_dollar_cascade_cropped.png")
```

Goal: Extract all addresses for all Family Dollar stores in Idaho.

## The Starting Point:

https://locations.familydollar.com/id/
```{r, echo=FALSE,out.width='80%'}
knitr::include_graphics("images/familydollar1.png")
```

## Step 1: Load those libraries

```{python}
import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # data organization
```

## Step 2: Grab Some Data from Target Web Address

```{python}
page = requests.get("https://locations.familydollar.com/id/")
soup = BeautifulSoup(page.text, 'html.parser')
```

Beautiful Soup will take html or xml content and transform it into a complex tree of objects. Here are several common types:

* `BeautifulSoup` - the soup (the parsed content)
* `Tag` - main type of bs4 element you will encounter
* `NavigableString` - string within a tag
* `Comment` - special type of NavigableString
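
A quick way to see these types, sketched on a tiny invented HTML snippet rather than the live page:

```{python}
from bs4 import BeautifulSoup

# invented snippet, just to poke at the bs4 object types
snippet = '<div class="itemlist"><a href="/id/arco/">Arco</a></div>'
mini_soup = BeautifulSoup(snippet, 'html.parser')

print(type(mini_soup).__name__)                   # BeautifulSoup
print(type(mini_soup.find('a')).__name__)         # Tag
print(type(mini_soup.find('a').string).__name__)  # NavigableString
```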


## Step 3: Determine How to Extract Relevant Content from BS4 Soup

This can be frustrating

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/ren_throws_fit.gif")
```


## Step 3: Finding Content...

* Start with one representative example and then scale up
* Viewing the page's html source code is essential
* Run at your own risk:
```{python, eval=F, echo=T}
print(soup.prettify())
```

* It is usually easiest to browse via "View Page Source":
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar2.png")
```

## Step 3: Finding Content by Searching

Searching on `href` does not work (it is an attribute, not a tag):
```{python}
dollar_tree_list = soup.find_all('href')
dollar_tree_list
```

But searching on a specific class is often successful:
```{python}
dollar_tree_list = soup.find_all(class_ = 'itemlist')
for i in dollar_tree_list[:2]:
    print(i)
```

## Step 3: Finding Content by Using `contents`

What kind of content do we have and how much is there?
```{python, collapse=TRUE}
type(dollar_tree_list)
len(dollar_tree_list)
```

Now that we have drilled down to a BeautifulSoup "ResultSet", we can try extracting the contents.

```{python}
example = dollar_tree_list[2] # Arco, ID (single representative example)
example_content = example.contents
print(example_content)
```

## Step 3: Finding Content in Attributes

Find out what attributes are present in the contents:

*Note: `contents` usually returns a list of exactly one item, so the first step is to index that item.*
```{python}
example_content = example.contents[0]
example_content.attrs
```

Extract the relevant attribute:
```{python}
example_href = example_content['href']
print(example_href)
```

## Step 4: Extract the Relevant Content

```{python}
city_hrefs = [] # initialise empty list
for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

# check to be sure all went well
for i in city_hrefs[:2]:
    print(i)
```
We now have a list of URLs for Family Dollar stores in Idaho to scrape.

## Repeat Steps 1-4 for the City URLs

```{python}
page2 = requests.get(city_hrefs[2]) # representative example
soup2 = BeautifulSoup(page2.text, 'html.parser')
```

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar3.png")
```

## Extract Address Information

* from `type="application/ld+json"`

```{python}
arco = soup2.find_all(type="application/ld+json")
print(arco[1])
```
(address information is in the second list member)

## Use `contents` to Find Address Information

Extract the contents (from the second list item) and index the first (and only) list item:
```{python}
arco_contents = arco[1].contents[0]
arco_contents
```

Next, convert to a json object:
*(these are way easier to work with)*
```{python}
arco_json = json.loads(arco_contents)
```

## Extract Content from a json Object

This is actually a dictionary:
```{python}
type(arco_json)
print(arco_json)
```

```{python}
arco_address = arco_json['address']
arco_address
```

## Step 5: Put It All Together

* Iterate over the list of store URLs in Idaho

```{python}
locs_dict = [] # initialise empty list
for link in city_hrefs:
    locpage = requests.get(link) # request page info
    locsoup = BeautifulSoup(locpage.text, 'html.parser')
    # parse the page's content
    locinfo = locsoup.find_all(type="application/ld+json")
    # extract specific element
    loccont = locinfo[1].contents[0]
    # get contents from the bs4 element set
    locjson = json.loads(loccont) # convert to json
    locaddr = locjson['address'] # get address
    locs_dict.append(locaddr) # add address to list
```

## Step 6: Finalise Data

```{python}
locs_df = df.from_records(locs_dict)
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
locs_df.head(n = 5)
```

## Results!!

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/adventures_in_babysitting.gif")
```

```{python, eval=F, echo=T}
locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
```
## A Few Words on Selenium

"Inspect Element" - provides the code for what we actually see in a browser

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/walgreens1.png")
```

## A Few Words on Selenium

"View Page Source" - provides the code for what `requests` will obtain

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/walgreens2.png")
```

JavaScript is modifying the source code, so the content must be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium

* Requires a webdriver to retrieve the content
* It actually opens a web browser, and this is what you scrape
* Selenium is powerful - it can interact with loaded content in many ways
* Then continue to use `BeautifulSoup` to parse the page source as before

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/luke_brushesoff_dust.gif")
```

## What I Found Out

about dollar stores

(map of dollar stores)


## The Last Slide {.columns-2 .smaller}

**Read the Manuals**

* https://beautiful-soup-4.readthedocs.io/en/latest/
* https://selenium.dev/

This talk available at:
[need link]

```{r, echo=FALSE, fig.cap="Persevere", out.width='80%'}
knitr::include_graphics("images/yoda_lightsaber.gif")
```
