Commit de78a9d ("draft 2")
jpiaskowski committed Jan 31, 2020
1 parent eb4ff77
Showing 7 changed files with 282 additions and 421 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
.Rhistory
.RData
.Ruserdata
rejects/
.ipynb_checkpoints

349 changes: 0 additions & 349 deletions Example/Familydollar_location_scrape-all-states.ipynb

This file was deleted.

106 changes: 64 additions & 42 deletions py2020.Rmd → docs/py2020.Rmd
@@ -1,8 +1,10 @@
---
title: "Adventures in Babysitting: Introduction to Web Scraping in Python"
author: "Julia Piaskowski"
date: "2020/02/09"
output:
  ioslides_presentation:
    widescreen: true
---

<style type="text/css">
@@ -23,6 +25,22 @@ code {
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
```


```{r wrap-hook, echo=FALSE}
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```

```{r echo=F}
library(reticulate)
virtualenv_create("pycas2020")
@@ -35,14 +53,14 @@ use_virtualenv("pycas2020")
## Good Way to Learn Python:

```{r, echo=FALSE, out.height='40%'}
knitr::include_graphics("../images/webscraping_book.png")
```

## But, Who Actually Reads These A to Z?
(spoiler: not me)

```{r, echo=FALSE, fig.cap="me and my programming books", out.width='100%'}
knitr::include_graphics("../images/luke_lightsaber_throwaway.gif")
```

## What we really need to know:
Expand All @@ -57,20 +75,22 @@ knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
Structured data with a regular repeatable format.

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("../images/rey_repeat.gif")
```

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

## Ethics in Scraping {.columns-2 .smaller}

```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("../images/heman_power.gif")
```

Accessing vast troves of information can be intoxicating.

(Just because we can doesn't mean we should)

## Legal Considerations
(note: I have zero legal training)
@@ -86,19 +106,19 @@
## Dollar Stores are Taking Over the World!

```{r, echo=FALSE, fig.cap="Store in Cascade, Idaho", out.width='60%'}
knitr::include_graphics("../images/family_dollar_cascade_cropped.png")
```

<center>**Goal:** Extract all addresses for all Family Dollar stores in Idaho.</center>

## The Starting Point:

https://locations.familydollar.com/id/
```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("../images/familydollar1.png")
```

## Step 1: Load the libraries

```{python}
import requests # for making standard html requests
@@ -122,27 +142,26 @@ Beautiful Soup will take html or xml content and transform it into a complex tree of objects
* `Comment` - special type of NavigableString
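A minimal sketch of these four object types in action (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<p id='x'>Hello <!-- a hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")        # a Tag
text = p.contents[0]      # a NavigableString ("Hello ")
note = p.contents[1]      # a Comment

print(type(soup).__name__, type(p).__name__,
      type(text).__name__, type(note).__name__)
# BeautifulSoup Tag NavigableString Comment
```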


## Step 3: Determine How to Extract Relevant Content from bs4 Soup

This can be frustrating

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("../images/ren_throws_fit.gif")
```


## Step 3: Finding Content...

* Start with one representative example and then scale up
* Viewing the page's html source code is essential
* Run at your own risk:
```{python, eval=F, echo=T}
print(soup.prettify())
```

* It is usually easiest to browse via "View Page Source":
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/familydollar2.png")
```

## Step 3: Finding Content by Searching
@@ -216,7 +235,7 @@ soup2 = BeautifulSoup(page2.text, 'html.parser')
```
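The searching step collapsed in this diff can be sketched with `find_all` on a made-up snippet (the class name and links below are invented for illustration, not Family Dollar's actual markup):

```python
from bs4 import BeautifulSoup

html = """
<div class="store-list">
  <a class="store-link" href="/id/boise/">Boise</a>
  <a class="store-link" href="/id/cascade/">Cascade</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's `class` keyword
links = soup.find_all(class_="store-link")
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/id/boise/', '/id/cascade/']
```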

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/familydollar3.png")
```

## Extract Address Information
@@ -246,12 +265,12 @@ arco_json = json.loads(arco_contents)
## Extract Content from a json Object

This is actually a dictionary:
```{python, linewidth=85}
type(arco_json)
print(arco_json)
```

```{python, linewidth=70}
arco_address = arco_json['address']
arco_address
```
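Since the parsed object is a plain `dict`, standard key access applies. A self-contained sketch with an invented JSON string (the real page's embedded JSON is much larger):

```python
import json

# invented stand-in for the JSON embedded in a store page
arco_contents = '{"address": {"streetAddress": "123 Main St", "addressLocality": "Boise"}}'
arco_json = json.loads(arco_contents)

print(type(arco_json))                          # <class 'dict'>
print(arco_json['address']['addressLocality'])  # Boise
```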
@@ -287,7 +306,7 @@ locs_df.head(n = 5)
## Results!!

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("../images/adventures_in_babysitting.gif")
```

@@ -298,48 +317,51 @@ locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
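The export step can be sketched as follows (the store records are invented for illustration); note that `to_csv` is a method of the DataFrame itself, with the destination as its first argument:

```python
import io

import pandas as pd

# invented per-store records standing in for the scraped addresses
locs = [
    {"street": "123 Main St", "city": "Boise", "state": "ID"},
    {"street": "456 Oak Ave", "city": "Cascade", "state": "ID"},
]
locs_df = pd.DataFrame(locs)

# write to an in-memory buffer here; passing a filename works the same way
buffer = io.StringIO()
locs_df.to_csv(buffer, sep=",", index=False)
print(buffer.getvalue().splitlines()[0])  # street,city,state
```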
"Inspect Element" - provides the code for what we actually see in a browser

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/walgreens1.png")
```

## A Few Words on Selenium

"View Page Source" - provides the code for what `requests` will obtain

```{r, echo=FALSE, out.width='75%'}
knitr::include_graphics("../images/walgreens2.png")
```

There is JavaScript modifying the source code, so the source code needs to be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium

* Requires a webdriver to retrieve the content
* It actually opens a web browser, and this info is collected
* Selenium is powerful - it can interact with loaded content in many ways
* After getting data, continue to use `BeautifulSoup` as before

```{python, eval=F, echo=T}
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"
driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
driver.get(url)
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')
```

## What I Found Out

about dollar stores

(map of dollar stores)


## The Last Slide {.columns-2 .smaller}

**Read the Manuals**

* https://beautiful-soup-4.readthedocs.io/en/latest/
* https://selenium.dev/

This talk is available at:

* https://github.com/jpiaskowski/pycas2020_web_scraping

```{r, echo=FALSE, fig.cap="Persevere", out.width='80%'}
knitr::include_graphics("../images/yoda_lightsaber.gif")
```

## After We Become Web Scraping Masters:

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("../images/luke_brushesoff_dust.gif")
```
76 changes: 47 additions & 29 deletions py2020.html → docs/py2020.html

Large diffs are not rendered by default.

