Wrap mozilla's readability package #10

Open
jimjam-slam opened this issue Sep 10, 2024 · 5 comments

@jimjam-slam

Mozilla's readability tool (and library) powers Firefox's Reader View, but it can also be used to extract article text from web pages.

I've had a mind to wrap it for R for a while — we've done stories before based on text scraping where the sources are varied enough that trying to automate it simply with rvest is a recipe for frustration. I think an R wrapper for readability would be a good complement to existing tools (not to mention it would discourage people from simply aiming an LLM at a webpage, which is often overkill).

@deanmarchiori
Collaborator

Hi! Thanks for the topic suggestion.

To help us prepare for the hackathon event, it would be great if you could prepare a quick 30-60 second overview of the topic to introduce it to the group and seek interested collaborators. You can use the prompts below to help with this:

What is the headline idea?

What is the (realistic) outcome being aimed for during the event?

What types of contributions would be welcomed (i.e. specific skills, tasks)?

@janithwanni

Hi there,

I'm sorry for joining the conversation a bit late. I want to help out with this project and I see that Dean has commented on several prompts to be filled in. I can try my best to fill the gaps between the topics and perhaps @jimjam-slam can add more detail to it if there are any changes needed.

What is the headline idea?

Writing an R package to act as a wrapper for the {readability} JavaScript library to simplify text scraping from websites.

What is the (realistic) outcome being aimed for during the event?

A development version of an R package that satisfies the following criteria:

  1. Given a URL, the package should return the human-readable text.
  2. The package should have a simple, intuitive function structure that is easier to use than doing the same task in rvest.
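
For a sense of what criterion 1 involves on the JavaScript side, here is a minimal sketch of the call an R wrapper would probably end up making. It assumes the `@mozilla/readability` and `jsdom` npm packages. The function name `extractText` and the `deps` argument are made up for illustration, and fetching the page from the URL is left out.

```javascript
// Sketch only: `extractText` and `deps` are hypothetical names.
// In real use, pass the actual classes:
//   const { JSDOM } = require("jsdom");
//   const { Readability } = require("@mozilla/readability");
//   extractText(html, url, { JSDOM, Readability });
function extractText(html, url, deps) {
  const { JSDOM, Readability } = deps;

  // Readability works on a DOM document, so build one from the HTML string.
  // Passing the page URL lets relative links resolve against it.
  const dom = new JSDOM(html, { url });

  // parse() returns null when no article content could be found.
  const article = new Readability(dom.window.document).parse();
  return article === null ? null : article.textContent.trim();
}
```

Passing the two classes in as an argument is only there to keep the sketch runnable without installing anything; a real wrapper would more likely bundle or require them directly.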

What types of contributions would be welcomed (i.e. specific skills, tasks)?

  • R package development skills
    • Writing tests
    • Writing documentation (roxygen, vignettes, pkgdown)
    • General R coding skills
  • JavaScript/HTML skills (optional)

It might be helpful to read this as a starting point: https://book.javascript-for-r.com/widgets-intro-intro

@jimjam-slam
Author

Hey @janithwanni, welcome — and thanks for filling the template in! I think it's a great summary of what this project would add and involve 🥳

Although you could potentially use htmlwidgets to integrate the JavaScript package as described in your link, it doesn't strictly produce HTML widget output, so I think these might be better starting points:

I absolutely agree that the primary goal of the package is producing human-readable plain text more easily than with rvest. I've done comparable tasks with complicated rvest logic, scraping a series of pages that have different structures, and it can quickly get rough.

One potential 'stretch goal' (or second version goal) could be to also deliver stripped-back HTML output, as the readability library returns both a plain text version of the article and an HTML version. There are also configuration options that could be supported down the road.
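
As a rough illustration of that stretch goal, the sketch below chooses between the two outputs and forwards configuration options to the library. It assumes the `@mozilla/readability` and `jsdom` npm packages. `extractArticle`, `format`, and `deps` are hypothetical names rather than anything the library defines.

```javascript
// Sketch only: `extractArticle`, `format`, and `deps` are hypothetical names.
// In real use: deps = { JSDOM: require("jsdom").JSDOM,
//                       Readability: require("@mozilla/readability").Readability }
function extractArticle(html, url, deps, { format = "text", options = {} } = {}) {
  if (format !== "text" && format !== "html") {
    throw new Error(`Unknown format: ${format}`);
  }
  const { JSDOM, Readability } = deps;
  const dom = new JSDOM(html, { url });

  // The second constructor argument carries Readability's configuration,
  // for example { charThreshold: 200 } or { keepClasses: true }.
  const article = new Readability(dom.window.document, options).parse();
  if (article === null) return null;

  // `content` is the stripped-back article HTML;
  // `textContent` is the same article with the markup removed.
  return format === "html" ? article.content : article.textContent;
}
```

An R-facing function could expose the same choice as an argument and pass any options through as a list, but that interface is a design decision for the package, not something the library dictates.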

@janithwanni

Thank you so much for the links, @jimjam-slam! I will read these before the hackathon 💪 I had only worked with HTML widget-style stuff before, so I shared the only resource I knew, haha 😄

The stretch goals sound interesting and valuable to me as well!

@jimjam-slam
Author

Our repo: https://github.com/jimjam-slam/readabilityr

The person who beat us to the punch 😅: https://github.com/nanxstats/r-readability-parser
