Commit

university of Idaho version
jpiaskowski committed Feb 28, 2020
1 parent 1bb5a20 commit 8a3814e
Showing 7 changed files with 4,846 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@
rejects/
.ipynb_checkpoints
py2020_files/
geckodriver.log
pycas2020_web_scraping_files
28 changes: 27 additions & 1 deletion pycas2020_web_scraping.Rmd
@@ -86,6 +86,7 @@ knitr::include_graphics("images/luke_lightsaber_throwaway.gif")

* A sizeable amount of structured data with a regular repeatable format.
* Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.
* No API is available.


```{r, echo=FALSE, out.width = '80%'}
@@ -151,6 +152,31 @@ Beautiful Soup will take html or xml content and transform it into a complex tre
* `NavigableString` - string within a tag
* `Comment` - special type of NavigableString

## More on 'requests.get' output:

Different output types

* `page.text` for text
* `page.content` for byte-by-byte output
* `page.json()` for json objects (a method call, not an attribute)
* `page.raw` for the raw socket response (no thank you)


The encoding for text can be set:

* `page.encoding = 'ISO-8859-1'`
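
As a rough illustration of how these outputs relate, the sketch below builds a `requests.Response` by hand so it runs without a network connection. Setting `_content` directly is a testing shortcut assumed here for the demo, and the JSON payload is made up; in a real scrape `page` would come from `requests.get(url)`.

```python
import requests

# Hand-built Response so the example runs offline (assumption for the demo).
# In real code: page = requests.get(url)
page = requests.Response()
page.status_code = 200
page._content = '{"title": "Café"}'.encode("ISO-8859-1")  # made-up payload

# Tell requests how to decode the bytes before reading .text
page.encoding = "ISO-8859-1"

raw_bytes = page.content   # bytes, exactly as received
decoded = page.text        # str, decoded using page.encoding
parsed = page.json()       # Python dict parsed from the JSON body

print(type(raw_bytes).__name__, type(decoded).__name__, type(parsed).__name__)
```

If the encoding is set incorrectly, `page.content` is unchanged but `page.text` may contain replacement characters, so it is worth checking `page.encoding` whenever accented characters look wrong.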




## More on Tags

* The bs4 element `Tag` corresponds to an html tag
* it has both a name and attributes (accessed like a dictionary)
* if a tag has multiple attributes with the same name, only one instance is accessible (which one depends on the parser)
* a tag's direct children are accessed via `[tag].contents`
* all tag descendants can be accessed with `[tag].descendants`
* you can always convert the full contents to a string and search it with a regular expression, e.g. `re.compile("your_string")`, instead of navigating the html tree
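
The points above can be sketched on a small snippet. The HTML, the variable names, and the choice of `html.parser` are assumptions made for illustration only, not part of the original slides.

```python
import re
from bs4 import BeautifulSoup, Tag

# Hypothetical HTML, invented for this example
html = """
<div id="crew" class="card wide">
  <p>Pilot: <b>Han</b></p>
  <p>Co-pilot: <b>Chewie</b></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.div

tag_name = tag.name          # the tag's name
tag_id = tag["id"]           # attributes are accessed like a dictionary
tag_classes = tag["class"]   # multi-valued attributes come back as a list

# .contents holds direct children only (whitespace strings included);
# .descendants walks every nested node in document order.
child_tags = [c.name for c in tag.contents if isinstance(c, Tag)]
descendant_tags = [d.name for d in tag.descendants if isinstance(d, Tag)]

# Alternative to tree navigation: run a regex over the tag as a string
pattern = re.compile(r"<b>(.*?)</b>")
names = pattern.findall(str(tag))

print(tag_name, tag_id, tag_classes, child_tags, descendant_tags, names)
```

The regex route is quick for simple, regular markup, but it is brittle when tags nest or attributes vary, which is where navigating the tree tends to pay off.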

## Step 3: Determine How to Extract Relevant Content from bs4 Soup

@@ -167,7 +193,7 @@ knitr::include_graphics("images/ren_throws_fit.gif")
```{python, eval=F, echo=T}
print(soup.prettify())
```


## Step 3: Finding Content...
