Lecture on web scraping and (perhaps) text #87

Open
jlperla opened this issue Oct 26, 2018 · 17 comments
@jlperla
Contributor

jlperla commented Oct 26, 2018

The lecture is planned for November 14th

My goal is primarily to help people realize that scraping the web and doing text analysis is not scary! I don't want fear of it to keep them from getting creative in building new sources of data.

You guys can play around with the directory https://github.com/ubcecon/computing_and_datascience/tree/master/R_sandbox etc.

@JasmineHao
Collaborator

Useful gadget for analyzing HTML
https://selectorgadget.com/

@jlperla
Contributor Author

jlperla commented Oct 29, 2018

@JasmineHao
Collaborator

This may be a common problem for a whole class of websites.
I cannot scrape https://www.ratebeer.com/ using the rvest package, perhaps because of cookies. We could leave this for later research.

@jlperla
Contributor Author

jlperla commented Oct 29, 2018

Yeah, I think there are a large number of sites where you really need to run Selenium: it handles cookies and runs the JavaScript (at which point the page can be scraped by the other tools). It would be great if we could show a very minimal example of RSelenium, if it is relatively easy to show.
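
For the cookie part alone (not the JavaScript part), a lighter option than a full browser is an HTTP client that remembers cookies between requests. Below is a minimal sketch in Python using only the standard library. The helper name `make_session` and the user-agent string are made up for illustration, and whether cookies are really what blocks any particular site is an assumption that would need checking.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session(user_agent="course-demo-scraper"):
    """Build a URL opener that stores cookies and sends them back.

    `make_session` and the default user agent are hypothetical names
    chosen for this sketch.
    """
    jar = CookieJar()  # holds any cookies the server sets
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_session()
# A real run would call opener.open(some_url).read() here; cookies set by
# the first response are then sent automatically on later requests.
# Pages that build their content with JavaScript still need a browser
# driver such as Selenium.
```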

Do you guys want to grab the R scraping textbooks from my office?

@chiyahn
Collaborator

chiyahn commented Oct 30, 2018

Relevant repos I have found on GitHub so far:

Useful R packages for data cleaning:

Fun examples (not necessarily economics):

@chiyahn
Collaborator

chiyahn commented Oct 30, 2018

Jasmine and I had some discussion about how the lecture can be delivered:

  • In the first half of the lecture, I teach how data wrangling can be done seamlessly using tidyverse, assuming we already have a clean dataset, by:

    • Exploring some GitHub repos for data that can be used for awesome economic analysis.
    • Learning piping/ggplot/install_github/reshaping (for panel data) by following my tutorial that explores the relationship between democracy and income (which has been studied in the econ literature for a long time) using the democracyData package by Marquez at https://github.com/xmarquez/democracyData.
    • Some preliminary text analysis using tidytext?
  • In the other half of the lecture, Jasmine and I teach how we can actually get the clean dataset with text mining and web scraping by:

    • Replicating Hoberg and Phillips (2016)? We will explore more datasets and papers over the next week to settle this.
    • Web scraping using plain text files
    • Brief introduction to rvest?

@jlperla
Contributor Author

jlperla commented Oct 30, 2018

I think I really want to emphasize the web scraping rather than tidyverse transformations.

The goal should be about building people's confidence that they can (1) scrape numerical data from the web and (2) work with text as data.
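
To make "text as data" concrete, here is a minimal sketch in Python (standard library only) that turns a string into word counts. The stop-word list, the sample sentence, and the function name `word_counts` are all made up for illustration; a real analysis would use a fuller stop-word list and a proper tokenizer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that"}


def word_counts(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and count non-stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Made-up sample text standing in for a scraped document.
sample = "The firm competes in the market for beer. The market is large."
counts = word_counts(sample)
top_word, top_count = counts.most_common(1)[0]
```

Once every document is reduced to a table of counts like this, it can be compared, merged, and regressed on like any other numerical data.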

It is more important for me to show the tools than anything else.

@jlperla
Contributor Author

jlperla commented Oct 30, 2018

If all we did was give a 1.25 hour presentation on how to scrape a couple of websites, I would be happy.

To be clear, we do not need to have an economic application of getting the data, just that we should be scraping data that could be applied to economic problems. For example, you could even take a World Bank (or whatever) page that has a "download" button and say "let's pretend it didn't have that button, and I will show you how you could have gotten the data anyway."
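
As a rough illustration of the "pretend there is no download button" idea, the sketch below pulls the rows out of an HTML table using only Python's standard library. The page content, column names, and values are invented for the example, and `TableParser` and `scrape_table` are hypothetical names; on a real page the HTML would come from a download rather than a string literal.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return all table rows found in `html` as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page content standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Year</th><th>Value</th></tr>
  <tr><td>A</td><td>2017</td><td>1.5</td></tr>
  <tr><td>B</td><td>2017</td><td>2.0</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
```

The first row holds the column names and the rest hold the data, so `records` can be written straight to a CSV file or loaded into a data frame.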

@arnavs
Collaborator

arnavs commented Oct 30, 2018

Just so it's not forgotten, I wanted to link to the notes Jasmine Yang produced on this from a few months ago. If the issue has evolved since then, please feel free to disregard.

https://github.com/ubcecon/computing_and_datascience/blob/master/python_sandbox/Web-Scraping.md

@chiyahn
Collaborator

chiyahn commented Oct 31, 2018

@schrimpf
Member

schrimpf commented Oct 31, 2018 via email

@jlperla
Contributor Author

jlperla commented Oct 31, 2018

@pbaylis Can you give these guys your code to prepare as an Rmd file? I think it would be a nice example code to give people.

That said, I want to stress in class an example with data that is not "inside economics" so they don't think of this stuff as just a novelty.

@pbaylis

pbaylis commented Nov 1, 2018

I don't think it's actually all that clean but sure. Here's the repo. One note - I keep this code in a private repo because I don't want to be seen as encouraging people all over the internet to hammer the AER website (although thanks to Paul, it's considerably more gentle than it could have been). So it's important to talk about being a good scraping citizen when you do this sort of thing: test on a small subset until you know it works, don't parallelize downloading code, and include sleep time when downloading lots of large files or a bunch of websites (which, honestly, my code should do more of).
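
The three habits above (test on a small subset, download sequentially, sleep between requests) can be sketched in a few lines of Python. This is an illustration only, not the code in the attached repo: `polite_fetch`, its parameters, and the two-second default delay are hypothetical choices, and `fetch_with_urllib` is just one possible fetcher.

```python
import time
from urllib.request import urlopen


def fetch_with_urllib(url):
    """One possible fetcher: download the raw bytes of a single URL."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def polite_fetch(urls, fetch=fetch_with_urllib, delay_seconds=2.0,
                 limit=None, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    `limit` restricts the run to the first few URLs, which is handy for
    testing on a small subset before launching the full job.
    """
    selected = list(urls) if limit is None else list(urls)[:limit]
    results = {}
    for i, url in enumerate(selected):
        if i > 0:
            sleep(delay_seconds)  # pause between requests, not before the first
        results[url] = fetch(url)  # strictly sequential, no parallel downloads
    return results
```

A first run might look like `polite_fetch(all_urls, limit=5)`; once that works, drop `limit` and let the full job run slowly.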

econ-program-usage-master.zip

@jlperla
Contributor Author

jlperla commented Nov 1, 2018

@pbaylis Alright Debbie Downer. You environmental economists spend too much time thinking about ethics and the tragedy-of-the-commons. The optimal non-cooperative strategy here is slash-and-burn webscraping.

But we will pass on your bleeding heart messages of being good scraping citizens along with the code!

@JasmineHao
Collaborator

rsDriver seems to have a connection issue, so to deal with cookies it looks like we need to install Docker to run RSelenium:
https://stackoverflow.com/questions/45395849/cant-execute-rsdriver-connection-refused

@jlperla
Contributor Author

jlperla commented Nov 4, 2018

We have given the students basic docker instruction, so we could conceivably pass on the RSelenium example for them...

But I don't think we should use that in the core demo in class (just supplementary links if they want to go further). Let's keep things simple. Also, it is more important to me that we show clean, simple examples than fancy stuff, if that stuff is tricky to set up.

Also @chiyahn and @jasminefish000 I want to make sure you guys are talking and planning things out together. If you are both off doing your own things for this lecture, there might be a lot of replication of effort.

@schrimpf
Member

schrimpf commented Nov 4, 2018 via email
