readme update
jpiaskowski committed Feb 9, 2020
1 parent d161b1c commit 1bb5a20
Showing 3 changed files with 46 additions and 44 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -4,7 +4,7 @@ Presentation for the Pycascades Conference held Feb 7-10, 2020 in Portland, OR.

The presentation was made in R using `rmarkdown` to generate the slides and `reticulate` to call Python.

The presentation can be viewed [here](https://jpiaskowski.github.io/pycas2020_web_scraping/)
The presentation can be downloaded [here](https://www.dropbox.com/s/h3p4ra5f9m0asx8/pycas2020_web_scraping.html?dl=0)



46 changes: 22 additions & 24 deletions docs/pycas2020_web_scraping.html

Large diffs are not rendered by default.

42 changes: 23 additions & 19 deletions pycas2020_web_scraping.Rmd
@@ -19,7 +19,12 @@ ul {
code {
color: ##0033cc;
}

.forceBreak { -webkit-column-break-after: always; break-after: column; }

</style>


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE, tidy=TRUE)
@@ -52,7 +57,7 @@ use_virtualenv("pycas2020")

##

https://github.com/jpiaskowski/pycas2020_web_scraping
### https://github.com/jpiaskowski/pycas2020_web_scraping

## Good Way to Learn Python!

@@ -68,27 +73,28 @@ knitr::include_graphics("images/luke_lightsaber_throwaway.gif")
```

## The Main Things to Know in a Web Scraping Project:
* Is it worth the troube?
* Is it worth the trouble?
* Is it ethical?
* Using `BeautifulSoup` and `requests`
* what to look for in html code
* parsing json objects with <code>json</code>
* rudimentary `pandas` skills
* `<pro_tip> All you need to know about html is how tags work </protip>`
* Tools available in `BeautifulSoup` and `requests`
* What to look for in html code
* Parsing json objects with <code>json</code>
* Rudimentary `pandas` skills
* `<pro-tip> All you need to know about html is how tags work </pro-tip>`
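
The pro-tip above can be made concrete with a small sketch. It uses Python's standard-library `html.parser` rather than `BeautifulSoup`, purely so that it runs without extra packages; the `TagLogger` class and the HTML snippet are hypothetical and not part of the presentation.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each opening tag with its attributes, and each text node."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # keep only non-empty text between tags
        if data.strip():
            self.text.append(data.strip())


# Made-up snippet for illustration only.
snippet = '<div class="store"><a href="/id/boise">Boise</a><span>ID</span></div>'

parser = TagLogger()
parser.feed(snippet)
parser.close()
```

After this runs, `parser.tags` holds each tag name with its attributes and `parser.text` holds the visible strings, which is the kind of information to look for when reading a page's source.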

## What to Look for in a Scraping Project: {.columns-2}

```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("images/spidermen.jpg")
```

* A sizeable amount of structured data with a regular repeatable format.
* A sizeable amount of structured data with a regular repeatable format.
* Identical formating is not required, but the more edge cases present, the more complicated the scraping will be.


* Identical formating is not required, but the more edge cases present, the more complicated the scraping will be.
```{r, echo=FALSE, out.width = '80%'}
knitr::include_graphics("images/spidermen.jpg")
```

## Ethics in Scraping {.columns-2}

```{r, echo=FALSE, fig.cap = "Captain Marvel", out.width = '80%'}
```{r, echo=FALSE, out.width = '80%'}
knitr::include_graphics("images/captain_marvel_binary.jpg")
```

@@ -148,7 +154,7 @@ Beautiful Soup will take html or xml content and transform it into a complex tre

## Step 3: Determine How to Extract Relevant Content from bs4 Soup

This process can be frustrating.
*This process can be frustrating.*
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/ren_throws_fit.gif")
```
@@ -317,7 +323,7 @@ locs_df.head(n = 5)
knitr::include_graphics("images/adventures_in_babysitting.gif")
```

```
```{python, eval=F, echo=T}
df.to_csv(locs_df, "family_dollar_ID_locations.csv", sep = ",", index = False)
```
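
Note on the chunk above: it is marked `eval=F`, so it is not evaluated when the slides are knit. As written, `df` is not defined anywhere in the visible diff, and in pandas `to_csv` is a method of the data frame that takes the output path as its first argument. A minimal sketch of the likely intended call, assuming `locs_df` is a pandas DataFrame (the columns and the temporary output directory below are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the locs_df built earlier in the slides.
locs_df = pd.DataFrame({"city": ["Boise", "Nampa"], "state": ["ID", "ID"]})

# to_csv is called on the data frame; the first argument is the output path.
out_path = os.path.join(tempfile.mkdtemp(), "family_dollar_ID_locations.csv")
locs_df.to_csv(out_path, sep=",", index=False)

# Read the file back to confirm the round trip.
round_trip = pd.read_csv(out_path)
```
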
## A Few Words on Selenium
@@ -333,7 +339,8 @@ knitr::include_graphics("images/walgreens1.png")
```{r, echo=FALSE, out.width='70%'}
knitr::include_graphics("images/walgreens2.png")
```
There are plugins modifying the source code. The source code needs to be accessed *after* the page has loaded in a browser.

There are plugins modifying the source code - so, it should be accessed *after* the page has loaded in a browser.

## A Few Words on Selenium
* Requires a webdriver to retrieve the content
@@ -378,6 +385,3 @@ knitr::include_graphics("images/luke_brushesoff_dust.gif")
knitr::include_graphics("images/family_dollar_locations.png")
```



