Commit de78a9d ("draft 2")
jpiaskowski committed Jan 31, 2020
1 parent eb4ff77
Showing 7 changed files with 282 additions and 421 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
.Rhistory
.RData
.Ruserdata
rejects/
.ipynb_checkpoints

349 changes: 0 additions & 349 deletions Example/Familydollar_location_scrape-all-states.ipynb

This file was deleted.

106 changes: 64 additions & 42 deletions py2020.Rmd → docs/py2020.Rmd
@@ -1,8 +1,10 @@
---
title: "Adventures in Babysitting: Introduction to Web Scraping in Python"
author: "Julia Piaskowski"
date: "2020/02/09"
output:
  ioslides_presentation:
    widescreen: true
---

<style type="text/css">
@@ -23,6 +25,22 @@ code {
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
```


```{r wrap-hook, echo=FALSE}
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```

```{r echo=F}
library(reticulate)
virtualenv_create("pycas2020")
@@ -35,14 +53,14 @@ use_virtualenv("pycas2020")
## Good Way to Learn Python:

```{r, echo=FALSE, out.height='40%'}
knitr::include_graphics("../images/webscraping_book.png")
```

## But, Who Actually Reads These A to Z?
(spoiler: not me)

```{r, echo=FALSE, fig.cap="me and my programming books", out.width='100%'}
knitr::include_graphics("../images/luke_lightsaber_throwaway.gif")
```

## What we really need to know:
Expand All @@ -57,20 +75,22 @@ knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
Structured data with a regular repeatable format.

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("../images/rey_repeat.gif")
```

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

## Ethics in Scraping {.columns-2 .smaller}

```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("../images/heman_power.gif")
```

Accessing vast troves of information can be intoxicating.

(Just because we can doesn't mean we should)

## Legal Considerations
(note: I have zero legal training)
@@ -86,19 +106,19 @@
## Dollar Stores are Taking Over the World!

```{r, echo=FALSE, fig.cap="Store in Cascade, Idaho", out.width='60%'}
knitr::include_graphics("../images/family_dollar_cascade_cropped.png")
```

<center>**Goal:** Extract all addresses for all Family Dollar stores in Idaho.</center>

## The Starting Point:

https://locations.familydollar.com/id/
```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("../images/familydollar1.png")
```

## Step 1: Load the libraries

```{python}
import requests # for making standard html requests
@@ -122,27 +142,26 @@ Beautiful Soup will take html or xml content and transform it into a complex tree of objects
* `Comment` - special type of NavigableString
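A minimal sketch of these four object types in action (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<p id='x'>Hello <!-- a hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")        # a Tag
text = p.contents[0]      # a NavigableString ("Hello ")
note = p.contents[1]      # a Comment

print(type(soup).__name__, type(p).__name__,
      type(text).__name__, type(note).__name__)
# BeautifulSoup Tag NavigableString Comment
```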


## Step 3: Determine How to Extract Relevant Content from bs4 Soup

This can be frustrating

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("../images/ren_throws_fit.gif")
```


## Step 3: Finding Content...

* Start with one representative example and then scale up
* Viewing the page's html source code is essential
* Run at your own risk:
```{python, eval=F, echo=T}
print(soup.prettify())
```

* It is usually easiest to browse via "View Page Source":
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/familydollar2.png")
```

## Step 3: Finding Content by Searching
@@ -216,7 +235,7 @@ soup2 = BeautifulSoup(page2.text, 'html.parser')
```
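The searching step collapsed in this diff can be sketched with `find_all` on a made-up snippet (the class name and links below are invented for illustration, not Family Dollar's actual markup):

```python
from bs4 import BeautifulSoup

html = """
<div class="store-list">
  <a class="store-link" href="/id/boise/">Boise</a>
  <a class="store-link" href="/id/cascade/">Cascade</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's `class` keyword
links = soup.find_all(class_="store-link")
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/id/boise/', '/id/cascade/']
```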

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/familydollar3.png")
```

## Extract Address Information
@@ -246,12 +265,12 @@ arco_json = json.loads(arco_contents)
## Extract Content from a json Object

This is actually a dictionary:
```{python, linewidth=85}
type(arco_json)
print(arco_json)
```

```{python, linewidth=70}
arco_address = arco_json['address']
arco_address
```
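Since the parsed object is a plain `dict`, standard key access applies. A self-contained sketch with an invented JSON string (the real page's embedded JSON is much larger):

```python
import json

# invented stand-in for the JSON embedded in a store page
arco_contents = '{"address": {"streetAddress": "123 Main St", "addressLocality": "Boise"}}'
arco_json = json.loads(arco_contents)

print(type(arco_json))                          # <class 'dict'>
print(arco_json['address']['addressLocality'])  # Boise
```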
@@ -287,7 +306,7 @@ locs_df.head(n = 5)
## Results!!

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("../images/adventures_in_babysitting.gif")
```

@@ -298,48 +317,51 @@ locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
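The export step can be sketched as follows (the store records are invented for illustration); note that `to_csv` is a method of the DataFrame itself, with the destination as its first argument:

```python
import io

import pandas as pd

# invented per-store records standing in for the scraped addresses
locs = [
    {"street": "123 Main St", "city": "Boise", "state": "ID"},
    {"street": "456 Oak Ave", "city": "Cascade", "state": "ID"},
]
locs_df = pd.DataFrame(locs)

# write to an in-memory buffer here; passing a filename works the same way
buffer = io.StringIO()
locs_df.to_csv(buffer, sep=",", index=False)
print(buffer.getvalue().splitlines()[0])  # street,city,state
```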
"Inspect Element" - provides the code for what we actually see in a browser

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/walgreens1.png")
```

## A Few Words on Selenium

"View Page Source" - provides the code for what `requests` will obtain

```{r, echo=FALSE, out.width='75%'}
knitr::include_graphics("../images/walgreens2.png")
```

There is JavaScript modifying the source code, so the source code needs to be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium

* Requires a webdriver to retrieve the content
* It actually opens a web browser, and this info is collected
* Selenium is powerful - it can interact with loaded content in many ways
* After getting data, continue to use `BeautifulSoup` as before

```{python, eval=F, echo=T}
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"
driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
driver.get(url)
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')
```

## What I Found Out

about dollar stores

(map of dollar stores)


## The Last Slide {.columns-2 .smaller}

**Read the Manuals**

* https://beautiful-soup-4.readthedocs.io/en/latest/
* https://selenium.dev/

This talk is available at:

* https://github.com/jpiaskowski/pycas2020_web_scraping

```{r, echo=FALSE, fig.cap="Persevere", out.width='80%'}
knitr::include_graphics("../images/yoda_lightsaber.gif")
```

## After We Become Web Scraping Masters:

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/luke_brushesoff_dust.gif")
```
76 changes: 47 additions & 29 deletions py2020.html → docs/py2020.html

Large diffs are not rendered by default.

