Commit eb4ff77 (initial commit, 0 parents)
Showing 25 changed files with 13,104 additions and 0 deletions.
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
rejects/
349 changes: 349 additions & 0 deletions
.ipynb_checkpoints/Familydollar_location_scrape-all-states-checkpoint.ipynb
Large diffs are not rendered by default.
576 changes: 576 additions & 0 deletions
.ipynb_checkpoints/dollar_tree_location_scrape-all-states-checkpoint.ipynb
Large diffs are not rendered by default.
Two additional large diffs are not rendered by default; one binary file is not shown.
@@ -0,0 +1,11 @@
Presentation for the PyCascades Conference held Feb 7-10, 2020 in Portland, OR. A brief introduction to using Beautiful Soup for web scraping, using address information from Family Dollar stores as a demonstration.

The presentation was made in RStudio using `reticulate` to call Python.
@@ -0,0 +1,345 @@
---
title: "Adventures in Babysitting: Introduction to Web Scraping in Python"
author: "Julia Piaskowski"
date: "2020/02/08"
output: ioslides_presentation
---

<style type="text/css"> | ||
body p { | ||
color: #282828; | ||
} | ||
|
||
ul { | ||
color: #282828; | ||
} | ||
|
||
code { | ||
color: ##0033cc; | ||
} | ||
</style> | ||
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
```

```{r echo=F}
library(reticulate)
virtualenv_create("pycas2020")
# a few things you might need to install
# (json is part of the Python standard library; DataFrame and Tag are
#  classes provided by pandas and bs4, not separate packages)
# py_install(c("requests", "pandas", "bs4"))
use_virtualenv("pycas2020")
```

## Good Way to Learn Python:

```{r, echo=FALSE, out.height='40%'}
knitr::include_graphics("images/webscraping_book.png")
```

## But, Who Actually Reads These A to Z?

(spoiler: not me)

```{r, echo=FALSE, fig.cap="me and my programming books", out.width='100%'}
knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
```

## What we really need to know:

* the tools available in `BeautifulSoup` and `requests`
* what to look for in html code
* parsing json objects with `json`
* rudimentary `pandas` skills
* `<idea> All you need to know about html is how tags work </idea>`

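As a tiny illustration of how tags work, here is a sketch using a made-up snippet (not markup from any real site):

```{python, eval=F}
from bs4 import BeautifulSoup

snippet = '<a class="itemlist" href="https://example.com">link text</a>'
tag = BeautifulSoup(snippet, 'html.parser').a
print(tag.name)    # 'a' - the tag name
print(tag.attrs)   # {'class': ['itemlist'], 'href': 'https://example.com'}
print(tag.string)  # 'link text' - the contents
```
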
## What to Look for in a scraping project:

Structured data with a regular, repeatable format.

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/rey_repeat.gif")
```

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

## Ethics in Scraping

Accessing vast troves of information can be intoxicating:

```{r, echo=FALSE, out.width = '60%'}
knitr::include_graphics("images/heman_power.gif")
```

Just because we can doesn't mean we should...

## Legal Considerations

(note: I have zero legal training)

* Are you scraping copyrighted material?
* Will your scraping activity compromise individual privacy?
* Are you making a large number of requests that may overload or damage a server?
* Is it possible the scraping will expose intellectual property you do not own?
* Are there terms of service governing use of the website, and are you following them?
* Will your scraping activities diminish the value of the original data?

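One low-effort safeguard for the server-load question is to pause between requests. A minimal sketch, assuming a fixed one-second delay is acceptable (the function name here is ours, not from any library):

```{python, eval=F}
import time
import requests

def polite_get(url, delay=1.0):
    # sleep first so repeated calls are spaced out
    time.sleep(delay)
    return requests.get(url)
```
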
## Dollar Stores are Taking Over the World!

```{r, echo=FALSE, fig.cap="Store in Cascade, Idaho", out.width='60%'}
knitr::include_graphics("images/family_dollar_cascade_cropped.png")
```

Goal: Extract all addresses for all Family Dollar stores in Idaho.

## The Starting Point:

https://locations.familydollar.com/id/

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/familydollar1.png")
```

## Step 1: Load those libraries

```{python}
import requests  # for making standard html requests
from bs4 import BeautifulSoup  # magical tool for parsing html data
import json  # for parsing data
from pandas import DataFrame as df  # data organization
```

## Step 2: Grab Some Data from the Target Web Address

```{python}
page = requests.get("https://locations.familydollar.com/id/")
soup = BeautifulSoup(page.text, 'html.parser')
```

Beautiful Soup will take html or xml content and transform it into a complex tree of objects. Here are several common types:

* `BeautifulSoup` - the soup (the parsed content)
* `Tag` - the main type of bs4 element you will encounter
* `NavigableString` - a string within a tag
* `Comment` - a special type of NavigableString

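A quick sketch of how these types show up in practice (assuming the `soup` object created above; the choice of an `<a>` tag is just for illustration):

```{python, eval=F}
first_link = soup.find('a')     # a Tag, or None if there is no match
print(type(soup))               # <class 'bs4.BeautifulSoup'>
print(type(first_link))         # <class 'bs4.element.Tag'>
print(type(first_link.string))  # NavigableString (None if the tag has child tags)
```
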
## Step 3: Determine How to Extract Relevant Content from BS4 Soup

This can be frustrating.

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/ren_throws_fit.gif")
```

## Step 3: Finding Content...

* Start with one representative example and then scale up
* Viewing the page's html source code is essential
* Run at your own risk:

```{python, eval=F, echo=T}
print(soup.prettify())
```

* It is usually easiest to browse via "View Page Source":

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar2.png")
```

## Step 3: Finding Content by Searching

Searching for `href` does not work (it is an attribute, not a tag):

```{python}
dollar_tree_list = soup.find_all('href')
dollar_tree_list
```

But searching on a specific class is often successful:

```{python}
dollar_tree_list = soup.find_all(class_ = 'itemlist')
for i in dollar_tree_list[:2]:
    print(i)
```

## Step 3: Finding Content by Using 'contents'

What kind of content do we have, and how much is there?

```{python, collapse=TRUE}
type(dollar_tree_list)
len(dollar_tree_list)
```

Now that we have drilled down to a BeautifulSoup "ResultSet", we can try extracting the contents.

```{python}
example = dollar_tree_list[2]  # Arco, ID (single representative example)
example_content = example.contents
print(example_content)
```

## Step 3: Finding Content in Attributes

Find out what attributes are present in the contents:

*Note: `contents` usually returns a list of exactly one item, so the first step is to index that item.*

```{python}
example_content = example.contents[0]
example_content.attrs
```

Extract the relevant attribute:

```{python}
example_href = example_content['href']
print(example_href)
```

## Step 4: Extract the Relevant Content

```{python}
city_hrefs = []  # initialise empty list

for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

# check to be sure all went well
for i in city_hrefs[:2]:
    print(i)
```

We now have a list of URLs for Family Dollar stores in Idaho to scrape.

## Repeat Steps 1-4 for the City URLs

```{python}
page2 = requests.get(city_hrefs[2])  # representative example
soup2 = BeautifulSoup(page2.text, 'html.parser')
```

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/familydollar3.png")
```

## Extract Address Information

* from `type="application/ld+json"`

```{python}
arco = soup2.find_all(type="application/ld+json")
print(arco[1])
```

(the address information is in the second list member)

## Use "contents" to Find Address Information | ||
|
||
Extract the contents (from the second list item) and index the first (and only) list item: | ||
```{python} | ||
arco_contents = arco[1].contents[0] | ||
arco_contents | ||
``` | ||
|
||
Next, convert to a json object: | ||
*(these are way easier to work with)* | ||
```{python} | ||
arco_json = json.loads(arco_contents) | ||
``` | ||
|
||
## Extract Content from a json Object

This is actually a dictionary:

```{python}
type(arco_json)
print(arco_json)
```

```{python}
arco_address = arco_json['address']
arco_address
```

## Step 5: Put It All Together

* Iterate over the list of store URLs in Idaho

```{python}
locs_dict = []  # initialise empty list

for link in city_hrefs:
    locpage = requests.get(link)  # request page info
    locsoup = BeautifulSoup(locpage.text, 'html.parser')  # parse the page's content
    locinfo = locsoup.find_all(type="application/ld+json")  # extract specific element
    loccont = locinfo[1].contents[0]  # get contents from the bs4 element set
    locjson = json.loads(loccont)  # convert to json
    locaddr = locjson['address']  # get address
    locs_dict.append(locaddr)  # add address to list
```

## Step 6: Finalise Data

```{python}
locs_df = df.from_records(locs_dict)
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
locs_df.head(n = 5)
```

## Results!!

```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/adventures_in_babysitting.gif")
```

```{python, eval=FALSE}
locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
```

## A Few Words on Selenium

"Inspect Element" - provides the code for what we actually see in a browser

```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics("images/walgreens1.png")
```

## A Few Words on Selenium

"View Page Source" - provides the code for what `requests` will obtain

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/walgreens2.png")
```

There is javascript modifying the source code. The source code needs to be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium

* Requires a webdriver to retrieve the content
* It actually opens a web browser, and this is what you scrape
* Selenium is powerful - it can interact with loaded content in many ways
* Then continue to use `requests` and `BeautifulSoup` as before (see the sketch below)

```{r, echo=FALSE, out.width='80%'}
knitr::include_graphics("images/luke_brushesoff_dust.gif")
```

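A minimal sketch of that workflow, assuming the `selenium` package is installed and a chromedriver is on your PATH (the URL is a placeholder):

```{python, eval=F}
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()             # opens a real browser window
driver.get("https://www.example.com/")  # placeholder URL
# page_source holds the DOM *after* javascript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
```
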
## What I Found Out

about dollar stores

(map of dollar stores)

## The Last Slide {.columns-2 .smaller}

**Read the Manuals**

* https://beautiful-soup-4.readthedocs.io/en/latest/
* https://selenium.dev/

This talk available at:
[need link]

```{r, echo=FALSE, fig.cap="Persevere", out.width='80%'}
knitr::include_graphics("images/yoda_lightsaber.gif")
```