This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

GSoC 2019 ideas #568

Closed
pypt opened this issue Mar 25, 2019 · 6 comments

pypt commented Mar 25, 2019

The student application period has just started, so it's about time for us to come up with some GSoC ideas.

We can reuse some from last year:

  • Build a tool to do some cool visualizations
  • Create a PostgreSQL-based job queue
    • Probably too hard and not going to happen, but it doesn't hurt to leave it there. Or should we just remove it?
  • Implement a method to detect subtopics of a topic
    • A lot of students are asking about this idea, but I'm not sure I'd be the best mentor for it as (simply put) I don't know much about the subject. Dongge, our GSoC 2017 student, did implement subtopics using Louvain, but it's still unmerged to this day.
  • Do your own freehand project

I'd also add easier, low-priority tasks from our side-projects, e.g. the Ultimate Sitemap Parser:

sentence_splitter Python module:

  • Add more supported languages?

feed_seeker:

@pypt pypt added the question label Mar 25, 2019
@pypt pypt added this to the 18 - March 2019 milestone Mar 25, 2019

pypt commented Mar 25, 2019

Another idea for the Ultimate Sitemap Parser:

  • Rewrite the code to use asyncio: sitemap processing is mostly CPU-bound (all the XML parsing takes a lot of time), but asyncio might still improve performance by something like 15%, and it would also be useful when Crawl-Delay is defined in robots.txt
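
A rough sketch of the Crawl-Delay part of this idea, using only the standard library. It is not how the Ultimate Sitemap Parser works today: `fetch_sitemap`, `crawl_host` and `crawl_all` are hypothetical names, and the download is a stand-in that returns a dummy document. asyncio would not speed up the XML parsing itself; the point is only that while one host's Crawl-Delay is being waited out, requests to other hosts can proceed.

```python
import asyncio
from urllib.robotparser import RobotFileParser


def crawl_delay_from_robots(robots_txt: str, user_agent: str = "*") -> float:
    """Return the Crawl-delay (in seconds) from robots.txt, or 0.0 if none applies."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else 0.0


async def fetch_sitemap(url: str) -> str:
    """Hypothetical stand-in for a real asynchronous HTTP download."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return f"<urlset><!-- fetched {url} --></urlset>"


async def crawl_host(urls, delay: float):
    """Fetch one host's sitemaps in order, sleeping `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            await asyncio.sleep(delay)  # other hosts keep running during this wait
        results.append(await fetch_sitemap(url))
    return results


async def crawl_all(hosts):
    """Crawl all hosts concurrently; `hosts` maps host name -> (urls, delay)."""
    tasks = [crawl_host(urls, delay) for urls, delay in hosts.values()]
    per_host = await asyncio.gather(*tasks)
    return dict(zip(hosts.keys(), per_host))
```

Calling `asyncio.run(crawl_all(hosts))` with a dictionary of hosts returns the fetched documents grouped by host, in the same order as the input URLs.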


pypt commented Mar 26, 2019

Another idea:

readability-lxml is aging fast and is not necessarily still the best library around for extracting the body of an article from an HTML page. More and more articles get loaded using JavaScript due to an ongoing "frontend everywhere!" frenzy. Lastly, various CDNs, e.g. Cloudflare, are blocking our crawler just because our user agent doesn't have JavaScript enabled.

I think inevitably we'll have to switch to running a chromeless browser, loading each and every downloaded story in it, and then applying a well-supported third-party library, e.g. Mozilla's Readability, to extract article title, author and body.

So, an experimental project for the student could be:

  • Set up a chromeless browser
  • Set up Readability
  • Develop an HTTP service that accepts a URL parameter (and/or HTML body), loads it in the browser, runs Readability's magic, and returns the extracted HTML back to the requester.
  • Package everything in a nice Docker image

There are similar projects, e.g. https://github.com/schollz/readable, so we'd first need to do some research into whether such a thing already exists.
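
To make the shape of such a service a bit more concrete, here is a minimal standard-library sketch. Everything in it is an assumption for illustration: `render_page` stands in for loading the URL in a browser without UI, `extract_article` stands in for Readability (it just cuts out an `<article>` element), and the `url` query parameter and JSON response are one possible interface, not a decided one.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def render_page(url: str) -> str:
    """Hypothetical placeholder: a real service would load `url` in a browser."""
    return (
        "<html><body><nav>menu</nav>"
        f"<article><h1>Title</h1><p>Story from {url}</p></article>"
        "</body></html>"
    )


def extract_article(html: str) -> str:
    """Hypothetical placeholder for Readability: keep only the <article> element."""
    start = html.find("<article>")
    end = html.find("</article>")
    if start == -1 or end == -1:
        return ""
    return html[start:end + len("</article>")]


def handle_extract(query: str, render=render_page, extract=extract_article):
    """Map a query string such as 'url=https://...' to (status code, JSON body)."""
    urls = parse_qs(query).get("url")
    if not urls:
        return 400, json.dumps({"error": "missing 'url' parameter"})
    html = render(urls[0])
    return 200, json.dumps({"url": urls[0], "content": extract(html)})


class ExtractHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle_extract()."""

    def do_GET(self):
        status, body = handle_extract(urlparse(self.path).query)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the service until interrupted."""
    HTTPServer((host, port), ExtractHandler).serve_forever()
```

Keeping the request logic in `handle_extract` with the renderer and extractor passed in as arguments means the browser and the extraction library can be swapped out (or faked in tests) without touching the HTTP layer.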

@hroberts
Contributor

Another idea:

Create scaffolding for a new version of our API. Our existing API is written in Perl, is inconsistent among its different major parts, and is goofily un-REST-ish in several places. We would like to replace it with a Python-based API that uses a modern framework for API specification, implementation, and testing.

The work to replace the entire API is much too large for one summer, so this task would be to build enough of the endpoints to demonstrate how to apply the chosen framework to the problem.

@jiahao-c

Hello pypt,
I thought Chromeless was deprecated. Have you considered Puppeteer or Selenium?


pypt commented Apr 1, 2019

Hey @jhcccc! By "chromeless", I had a general browser without chrome (UI) in mind :) Which specific rendering engine (Gecko, WebKit) and/or tool to use for the task would be up to the student.


pypt commented Apr 1, 2019

Final list:

https://docs.google.com/document/d/1GGbGtFOMS07dog4yzglY5hZCDc41ZQjY1RqRKOlW0B4/edit?usp=sharing

Sent it to Ellen and interested students.

@pypt pypt closed this as completed Apr 1, 2019