first draft
jpiaskowski committed Jan 28, 2020
0 parents commit eb4ff77
Showing 25 changed files with 13,104 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
rejects/


349 changes: 349 additions & 0 deletions Example/Familydollar_location_scrape-all-states.ipynb


7,933 changes: 7,933 additions & 0 deletions Example/family_dollar_locations.csv


Binary file added Extra_Resources/Krotov2018.pdf
11 changes: 11 additions & 0 deletions README.md
@@ -0,0 +1,11 @@

Presentation for the Pycascades Conference held Feb 7-10, 2020 in Portland, OR. A brief introduction into using beautiful soup for web scraping, using address information of Family Dollar stores to demonstrate.


The presentation was made in RStudio using `reticulate` to call python.






Binary file added images/adventures_in_babysitting.gif
Binary file added images/captain_marvel_binary.jpg
Binary file added images/family_dollar_cascade_cropped.png
Binary file added images/familydollar1.png
Binary file added images/familydollar2.png
Binary file added images/familydollar3.png
Binary file added images/heman_power.gif
Binary file added images/luke_brushesoff_dust.gif
Binary file added images/luke_lightsaber_throwaway.gif
Binary file added images/ren_throws_fit.gif
Binary file added images/rey_repeat.gif
Binary file added images/walgreens1.png
Binary file added images/walgreens2.png
Binary file added images/webscraping_book.png
Binary file added images/yoda_lightsaber.gif
345 changes: 345 additions & 0 deletions py2020.Rmd
@@ -0,0 +1,345 @@
---
title: "Adventures in Babysitting: Introduction to Web Scraping in Python"
author: "Julia Piaskowski"
date: "2020/02/08"
output: ioslides_presentation
---

<style type="text/css">
body p {
color: #282828;
}

ul {
color: #282828;
}

code {
color: #0033cc;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
```

```{r echo=F}
library(reticulate)
virtualenv_create("pycas2020")
# a few things you might need to install
# (json is in the standard library; DataFrame and Tag are not packages)
# py_install(c("requests", "bs4", "pandas"))
use_virtualenv("pycas2020")
```

## Good Way to Learn Python:

```{r, echo=FALSE, out.height='40%'}
knitr::include_graphics("images/webscraping_book.png")
```

## But, Who Actually Reads These A to Z?
(spoiler: not me)

```{r, echo=FALSE, fig.cap="me and my programming books", out.width='100%'}
knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
```

## What we really need to know:
* the tools available in `BeautifulSoup` and `requests`
* what to look for in html code
* parsing json objects with `json`
* rudimentary `pandas` skills
* `<idea> All you need to know about html is how tags work </idea>`
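
The `json` piece, for instance, is a single call. A minimal sketch with an invented record (the field names mirror the `ld+json` blocks scraped later; the values are made up):

```{python}
import json

# hypothetical record, shaped like the embedded store data
raw = '{"@type": "PostalAddress", "addressLocality": "Arco", "addressRegion": "ID"}'
record = json.loads(raw)  # str -> dict

print(record['addressLocality'])  # Arco
```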

## What to Look for in a Scraping Project:

Structured data with a regular repeatable format.

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/rey_repeat.gif")
```

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

## Ethics in Scraping

Accessing vast troves of information can be intoxicating:

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/heman_power.gif")
```

Just because we can doesn't mean we should...

## Legal Considerations
(note: I have zero legal training)

* Are you scraping copyrighted material?
* Will your scraping activity compromise individual privacy?
* Are you making a large number of requests that may overload or damage a server?
* Is it possible the scraping will expose intellectual property you do not own?
* Are there terms of service governing use of the website and are you following those?
* Will your scraping activities diminish the value of the original data?


## Dollar Stores are Taking Over the World!

```{r, echo=FALSE, fig.cap="Store in Cascade, Idaho", out.width='60%'}
knitr::include_graphics("images/family_dollar_cascade_cropped.png")
```

Goal: Extract all addresses for all Family Dollar stores in Idaho.

## The Starting Point:

https://locations.familydollar.com/id/
```{r, echo=FALSE,out.width='80%'}
knitr::include_graphics("images/familydollar1.png")
```

## Step 1: Load those libraries

```{python}
import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # data organization
```

## Step 2: Grab Some Data from Target Web Address

```{python}
page = requests.get("https://locations.familydollar.com/id/")
soup = BeautifulSoup(page.text, 'html.parser')
```

Beautiful Soup will take html or xml content and transform it into a complex tree of objects. Here are several common types:

* `BeautifulSoup` - the soup (the parsed content)
* `Tag` - main type of bs4 element you will encounter
* `NavigableString` - string within a tag
* `Comment` - special type of NavigableString
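
A quick way to see these types, sketched on a tiny invented HTML snippet rather than the live page:

```{python}
from bs4 import BeautifulSoup

# invented snippet, just to poke at the bs4 object types
snippet = '<div class="itemlist"><a href="/id/arco/">Arco</a></div>'
mini_soup = BeautifulSoup(snippet, 'html.parser')

print(type(mini_soup).__name__)                   # BeautifulSoup
print(type(mini_soup.find('a')).__name__)         # Tag
print(type(mini_soup.find('a').string).__name__)  # NavigableString
```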


## Step 3: Determine How to Extract Relevant Content from BS4 Soup

This can be frustrating

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/ren_throws_fit.gif")
```


## Step 3: Finding Content...

* Start with one representative example and then scale up
* Viewing the page's html source code is essential
* Run at your own risk:
```{python, eval=F, echo=T}
print(soup.prettify())
```

* It is usually easiest to browse via "View Page Source":
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar2.png")
```

## Step 3: Finding Content by Searching

Searching on `href` does not work (it is an attribute, not a tag):
```{python}
dollar_tree_list = soup.find_all('href')
dollar_tree_list
```

But searching on a specific class is often successful:
```{python}
dollar_tree_list = soup.find_all(class_ = 'itemlist')
for i in dollar_tree_list[:2]:
    print(i)
```

## Step 3: Finding Content by Using `contents`

What kind of content do we have and how much is there?
```{python, collapse=TRUE}
type(dollar_tree_list)
len(dollar_tree_list)
```

Now that we have drilled down to a BeautifulSoup "ResultSet", we can try extracting the contents.

```{python}
example = dollar_tree_list[2] # Arco, ID (single representative example)
example_content = example.contents
print(example_content)
```

## Step 3: Finding Content in Attributes

Find out what attributes are present in the contents:

*Note: `contents` usually returns a list of exactly one item, so the first step is to index that item.*
```{python}
example_content = example.contents[0]
example_content.attrs
```

Extract the relevant attribute:
```{python}
example_href = example_content['href']
print(example_href)
```

## Step 4: Extract the Relevant Content

```{python}
city_hrefs = [] # initialise empty list
for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

# check to be sure all went well
for i in city_hrefs[:2]:
    print(i)
```
We now have a list of URLs for Family Dollar stores in Idaho to scrape.

## Repeat Steps 1-4 for the City URLs

```{python}
page2 = requests.get(city_hrefs[2]) # representative example
soup2 = BeautifulSoup(page2.text, 'html.parser')
```

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar3.png")
```

## Extract Address Information

* from `type="application/ld+json"`

```{python}
arco = soup2.find_all(type="application/ld+json")
print(arco[1])
```
(address information is in the second list member)

## Use `contents` to Find Address Information

Extract the contents (from the second list item) and index the first (and only) list item:
```{python}
arco_contents = arco[1].contents[0]
arco_contents
```

Next, convert to a json object:
*(these are way easier to work with)*
```{python}
arco_json = json.loads(arco_contents)
```

## Extract Content from a json Object

This is actually a dictionary:
```{python}
type(arco_json)
print(arco_json)
```

```{python}
arco_address = arco_json['address']
arco_address
```

## Step 5: Put It All Together

* Iterate over the list of store URLs in Idaho

```{python}
locs_dict = [] # initialise empty list
for link in city_hrefs:
    locpage = requests.get(link) # request page info
    locsoup = BeautifulSoup(locpage.text, 'html.parser')
    # parse the page's content
    locinfo = locsoup.find_all(type="application/ld+json")
    # extract specific element
    loccont = locinfo[1].contents[0]
    # get contents from the bs4 element set
    locjson = json.loads(loccont) # convert to json
    locaddr = locjson['address'] # get address
    locs_dict.append(locaddr) # add address to list
```

## Step 6: Finalise Data

```{python}
locs_df = df.from_records(locs_dict)
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
locs_df.head(n = 5)
```

## Results!!

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/adventures_in_babysitting.gif")
```

```{python, eval=F, echo=T}
locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
```
## A Few Words on Selenium

"Inspect Element" - provides the code for what we actually see in a browser

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/walgreens1.png")
```

## A Few Words on Selenium

"View Page Source" - provides the code for what `requests` will obtain

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/walgreens2.png")
```

JavaScript is modifying the source code, so the content must be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium

* Requires a webdriver to retrieve the content
* It actually opens a web browser, and this is what you scrape
* Selenium is powerful - it can interact with loaded content in many ways
* Then continue to use `BeautifulSoup` to parse the page source as before

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/luke_brushesoff_dust.gif")
```

## What I Found Out

about dollar stores

(map of dollar stores)


## The Last Slide {.columns-2 .smaller}

**Read the Manuals**

* https://beautiful-soup-4.readthedocs.io/en/latest/
* https://selenium.dev/

This talk available at:
[need link]

```{r, echo=FALSE, fig.cap="Persevere", out.width='80%'}
knitr::include_graphics("images/yoda_lightsaber.gif")
```
