GSoC 2019 ideas #568
Comments
Another idea for the Ultimate Sitemap Parser:
Another idea: readability-lxml is aging fast and is not necessarily still the best library for extracting the body of an article from an HTML page. More and more articles get loaded using JavaScript due to an ongoing "frontend everywhere!" frenzy. Lastly, various CDNs, e.g. Cloudflare, are blocking our crawler just because our user agent doesn't have JavaScript enabled. I think we'll inevitably have to switch to running a chromeless browser, loading each and every downloaded story in it, and then applying a well-supported third-party library, e.g. Mozilla's Readability, to extract the article's title, author and body. So, an experimental project for the student could be:
There are similar projects, e.g. https://github.com/schollz/readable, so we'd first need to research whether such a thing already exists.
Another idea: create scaffolding for a new version of our API. Our existing API is written in Perl, is inconsistent among its different major parts, and is goofily un-REST-ish in several places. We would like to replace it with a Python-based API that uses a modern framework for API specification, implementation, and testing. The work to replace the entire API is much too large for one summer, so this task would be to build enough of the endpoints to demonstrate how to apply the chosen framework to the problem.
Hello pypt,
Hey @jhcccc! By "chromeless", I had a general browser without chrome (UI) in mind :) Which specific rendering engine (Gecko, WebKit) and/or tool to use for the task would be up to the student.
Final list: https://docs.google.com/document/d/1GGbGtFOMS07dog4yzglY5hZCDc41ZQjY1RqRKOlW0B4/edit?usp=sharing
Sent it to Ellen and interested students.
The student application period has just started, so it's about time for us to come up with some GSoC ideas.
We can reuse some from last year:
I'd also add easier, low-priority tasks from our side-projects, e.g. the Ultimate Sitemap Parser:
`yield` found links instead of `return`ing them (simple concept but would probably require rethinking and rewriting a lot of stuff)
`sentence_splitter` Python module:
`feed_seeker`:
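To make the `yield`-instead-of-`return` idea for the Ultimate Sitemap Parser a bit more concrete, here is a minimal sketch of what a generator-based link extractor could look like. It is not the parser's actual code: the function name `iter_sitemap_links` and the `EXAMPLE_SITEMAP` data are hypothetical, and it only handles a plain `<urlset>` document using the standard library's incremental XML parser.

```python
import io
import xml.etree.ElementTree as ET
from typing import IO, Iterator

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_links(source: IO[bytes]) -> Iterator[str]:
    """Yield each <loc> URL as soon as it is parsed (hypothetical helper).

    The caller can start processing links before the whole sitemap has
    been read, and no full list of links is ever built in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == SITEMAP_NS + "loc":
            url = (elem.text or "").strip()
            if url:
                yield url
        # Drop the element's contents once it has been handled.
        elem.clear()


# Made-up sitemap used only for illustration.
EXAMPLE_SITEMAP = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

if __name__ == "__main__":
    for link in iter_sitemap_links(io.BytesIO(EXAMPLE_SITEMAP)):
        print(link)
```

The "rethinking and rewriting" part is that every caller which currently expects a complete list would have to consume an iterator instead, and anything that needs the total number of links up front would have to change.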