Lecture on web scraping and (perhaps) text #87

jlperla · 2018-10-26T19:05:12Z

The lecture is planned for November 14th

My goal is primarily to help people realize that scraping the web and doing text analysis are not scary! I don't want fear to be the reason they are unwilling to get creative in building new sources of data.

You guys can play around with the directory https://github.com/ubcecon/computing_and_datascience/tree/master/R_sandbox etc.

JasmineHao · 2018-10-29T16:49:42Z

Useful gadget for analyzing HTML
https://selectorgadget.com/

jlperla · 2018-10-29T17:45:24Z

Also see https://www.datacamp.com/community/tutorials/r-web-scraping-rvest and https://ropensci.org/tutorials/rselenium_tutorial/

JasmineHao · 2018-10-29T17:50:13Z

Here is something that I think could be common to a whole class of websites.
I cannot scrape https://www.ratebeer.com/ using rvest, perhaps because of cookies. We could leave this for later research.

jlperla · 2018-10-29T17:56:52Z

Yeah, I think that there are a large number of sites where you really need to run Selenium... it both handles cookies and runs the JavaScript (at which point the page can then be scraped by the other tools). It would be great if we could show a very minimal example of RSelenium, if it is relatively easy to show.

Do you guys want to grab the R scraping textbooks from my office?

chiyahn · 2018-10-30T04:54:39Z

Relevant repos I have found on GitHub so far:

Useful R packages for data cleaning:

PSID: https://github.com/floswald/psidR
The World Wealth and Income Database: https://github.com/WIDworld/wid-r-tool
The Survey of Professional Forecasters: https://github.com/joergrieger/Survey
Democracy indices: https://github.com/xmarquez/democracyData (I highly recommend his book [Non-Democratic Politics] as well!)
Uber trip data: https://github.com/fivethirtyeight/uber-tlc-foil-response
Airbnb listings: https://github.com/tomslee/airbnb-data-collection

Fun examples (not necessarily economics):

Spotify music data analysis: https://github.com/AsTimeGoesBy111/Spotify-Music-Data-Analysis
Chinese poetry analysis: https://github.com/chinese-poetry/chinese-poetry
South Park script analysis: https://github.com/pdrhlik/southparkr

chiyahn · 2018-10-30T19:11:30Z

Jasmine and I discussed how the lecture could be delivered:

In the first half of the lecture, I would teach how data wrangling can be done seamlessly with the tidyverse, assuming we already have a clean dataset, by:
- Exploring some GitHub repos with data that can be used for economic analysis.
- Learning piping/ggplot/install_github/reshaping (for panel data) by following my tutorial, which explores the relationship between democracy and income (long studied in the economics literature) using the democracyData package by Marquez at https://github.com/xmarquez/democracyData.
- Some preliminary text analysis using tidytext?
In the second half of the lecture, Jasmine and I would teach how to actually obtain a clean dataset through text mining and web scraping by:
- Replicating Hoberg and Phillips (2016)? We will look at more datasets and papers over the next week to settle this.
- Web scraping using plain text files
- A brief introduction to rvest?

jlperla · 2018-10-30T19:49:08Z

I think I really want to emphasize the web scraping rather than tidyverse transformations.

The goal should be to build people's confidence that they can (1) scrape numerical data from the web and (2) work with text as data.
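As a rough illustration of point (2), here is a minimal sketch of treating text as data by counting word frequencies. It uses Python's standard library rather than the R tools discussed in this thread (tidytext, rvest), purely to stay self-contained. The sample sentence, the tiny stopword list, and the function names `tokenize` and `word_counts` are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real analyses use longer lists).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stopwords=STOPWORDS):
    """Count how often each non-stopword token appears."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Made-up sample text standing in for scraped content.
sample = "The price of beer and the price of wine: the beer is cheaper."
counts = word_counts(sample)
```

Once text is reduced to counts like these, it can be merged with other variables in the same way as any numerical column.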

It is more important for me to show the tools than anything else.

jlperla · 2018-10-30T19:51:04Z

If all we did was give a 1.25 hour presentation on how to scrape a couple of websites, I would be happy.

To be clear, we do not need an economic application for the data we get; we should just be scraping data that could be applied to economic problems. For example, you could even take a World Bank (or whatever) page that has a "download" button and say "let's pretend it didn't have that button; I will show you how you could have gotten the data anyway."
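A minimal sketch of the "pretend there is no download button" idea: pull the rows out of an HTML table and turn them into records. It uses Python's standard-library `html.parser` instead of rvest so that it runs without extra packages. The HTML fragment, its column names, and its values are placeholders invented for illustration (not real data), and `TableParser` and `scrape_table` are hypothetical helper names. A real page would first have to be downloaded.

```python
from html.parser import HTMLParser

# Placeholder fragment with invented values; a real page would be downloaded first.
SAMPLE_HTML = """
<table id="indicators">
  <tr><th>Country</th><th>Year</th><th>Value</th></tr>
  <tr><td>Country A</td><td>2017</td><td>100</td></tr>
  <tr><td>Country B</td><td>2017</td><td>200</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
```

Every value comes back as a string, so numeric columns still need converting before analysis.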

arnavs · 2018-10-30T20:47:03Z

Just so it's not forgotten, I wanted to link to the notes Jasmine Yang produced on this from a few months ago. If the issue has evolved since then, please feel free to disregard.

https://github.com/ubcecon/computing_and_datascience/blob/master/python_sandbox/Web-Scraping.md

chiyahn · 2018-10-31T21:25:22Z

Simple tutorial on using rvest I wrote yesterday: https://github.com/chiyahn/notes/blob/master/programming/data-mining/rvest/text-mining-with-rvest.md

schrimpf · 2018-10-31T22:32:18Z

Relatedly, I have attached code that scrapes the AER website to look at programming language usage. Patrick independently did the same thing, and his results are going to be part of the next AER annual report. His code is perhaps a bit nicer https://github.com/pbaylis/econ-program-usage

jlperla · 2018-10-31T22:41:00Z

@pbaylis Can you give these guys your code to prepare as an Rmd file? I think it would be a nice code example to give people.

That said, I want to stress in class an example with data that is not "inside economics" so they don't think of this stuff as just a novelty.

pbaylis · 2018-11-01T00:49:58Z

I don't think it's actually all that clean but sure. Here's the repo. One note - I keep this code in a private repo because I don't want to be seen as encouraging people all over the internet to hammer the AER website (although thanks to Paul, it's considerably more gentle than it could have been). So it's important to talk about being a good scraping citizen when you do this sort of thing: test on a small subset until you know it works, don't parallelize downloading code, and include sleep time when downloading lots of large files or a bunch of websites (which, honestly, my code should do more of).
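The "good scraping citizen" rules above (test on a small subset, download sequentially, sleep between requests) can be sketched roughly as follows. This is not the code from the attached repo. It is a standalone Python illustration, and the names `fetch` and `polite_download`, the two-second default delay, and the User-Agent string are all assumptions chosen for the example.

```python
import time
from urllib.request import Request, urlopen


def fetch(url, timeout=30):
    """Download one URL, identifying the scraper via a (placeholder) User-Agent."""
    request = Request(url, headers={"User-Agent": "lecture-demo-scraper (you@example.com)"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_download(urls, fetch_fn=fetch, delay_seconds=2.0, limit=None, sleep_fn=time.sleep):
    """Download URLs one at a time (no parallelism), pausing between requests.

    limit: only take the first `limit` URLs, for testing on a small subset;
           None means download everything.
    """
    selected = list(urls)[:limit]
    pages = {}
    for index, url in enumerate(selected):
        if index > 0:
            sleep_fn(delay_seconds)  # wait before every request after the first
        pages[url] = fetch_fn(url)
    return pages


# Example usage (commented out so that nothing is downloaded on import):
# pages = polite_download(["https://example.com/a", "https://example.com/b"], limit=1)
```

Starting with a small `limit` keeps a buggy loop from requesting thousands of pages before anyone notices.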

econ-program-usage-master.zip

jlperla · 2018-11-01T17:04:25Z

@pbaylis Alright Debbie Downer. You environmental economists spend too much time thinking about ethics and the tragedy-of-the-commons. The optimal non-cooperative strategy here is slash-and-burn webscraping.

But we will pass on your bleeding heart messages of being good scraping citizens along with the code!

JasmineHao · 2018-11-04T09:09:18Z

rsDriver seems to have a connection issue, so to deal with cookies it looks like we need to install Docker to run RSelenium:
https://stackoverflow.com/questions/45395849/cant-execute-rsdriver-connection-refused

jlperla · 2018-11-04T16:00:59Z

We have given the students basic Docker instruction, so we could conceivably pass the RSelenium example on to them...

But I don't think we should use that in the core demo in class (just supplementary links if they want to go further). Let's keep things simple. Also, it is more important to me that we show clean, simple examples than fancy stuff, if that stuff is tricky to set up.

Also @chiyahn and @jasminefish000 I want to make sure you guys are talking and planning things out together. If you are both off doing your own things for this lecture, there might be a lot of replication of effort.

schrimpf · 2018-11-04T17:15:50Z

For what it's worth, I've had no problem using RSelenium without Docker on Linux.

jlperla assigned chiyahn and JasmineHao Oct 26, 2018

arnavs added the web-scraping label Oct 30, 2018
