MSN article returns NULL due to (at least) not recursing into open shadowRoots #926
Comments
I'm not a Readability maintainer but I've done a bit of scraping. You're going to have two main problems with msn.com:
To get past 1, you can use Playwright or similar browser automation. You may already be doing this, not sure. You'll need to wait for the page to have loaded the article JSON and written it into the DOM.

Problem 2 is trickier. To solve that, you'll need to run a script like this one to extract the content from the shadow DOM nodes. You can run the script in the context of the Playwright page with
if (node instanceof Document) {
  return extractHTML(node.documentElement);
}
just before the

Once you've got that HTML from Playwright, pass it to Readability and it will work. You may want to remove elements matching this selector from the DOM first:

Problem 1 isn't really a problem for Readability to solve. Problem 2 arguably might be. It might be nice if Readability was able to dive into shadow DOM elements. Maybe
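For anyone landing here later, the shadow DOM extraction described above can be sketched roughly as follows. This is not the script linked in the comment, only a minimal illustration of the same idea: walk the tree, and wherever an element has an open `shadowRoot`, serialize its contents inline. The name `extractHTML` is borrowed from the snippet above; the helper `serializeChildren` and the use of `nodeType` checks instead of `instanceof` are assumptions made here so the function can also run outside a browser.

```javascript
// Node type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const DOCUMENT_NODE = 9;
const DOCUMENT_FRAGMENT_NODE = 11; // ShadowRoot is a DocumentFragment

const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);
const RAW_TEXT_ELEMENTS = new Set(["script", "style"]);

function escapeText(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function escapeAttr(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

// Hypothetical helper: serialize all children of `node`.
// When `raw` is true (script/style), text is emitted without escaping.
function serializeChildren(node, raw) {
  return Array.from(node.childNodes, (child) =>
    raw && child.nodeType === TEXT_NODE ? child.nodeValue : extractHTML(child)
  ).join("");
}

// Serialize `node` to an HTML string, inlining open shadow roots.
function extractHTML(node) {
  if (node.nodeType === DOCUMENT_NODE) {
    return extractHTML(node.documentElement);
  }
  if (node.nodeType === TEXT_NODE) {
    return escapeText(node.nodeValue);
  }
  if (node.nodeType === DOCUMENT_FRAGMENT_NODE) {
    return serializeChildren(node, false);
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return ""; // comments, doctypes, etc. are dropped
  }

  const tag = node.localName;
  const attrs = Array.from(
    node.attributes || [],
    (a) => ` ${a.name}="${escapeAttr(a.value)}"`
  ).join("");
  if (VOID_ELEMENTS.has(tag)) {
    return `<${tag}${attrs}>`;
  }

  // element.shadowRoot is null for closed roots, so only open ones are inlined.
  const shadow = node.shadowRoot ? extractHTML(node.shadowRoot) : "";
  const light = serializeChildren(node, RAW_TEXT_ELEMENTS.has(tag));
  return `<${tag}${attrs}>${shadow}${light}</${tag}>`;
}
```

Shadow content is emitted before the host's light DOM children, so slotted content may not appear in its rendered order; for feeding Readability that is probably acceptable, but it is a simplification. In Playwright, the function would need to be evaluated inside the page (for example through `page.evaluate`), and that wiring is not shown here.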
Thanks @danielnixon! And yes, sounds like we'd need to update Readability to recurse into shadow roots... It still wouldn't work with closed shadow roots, I expect. I don't know that there's anything we could do about that, but then, I think it'd be unlikely article pages use those for the main article...
I agree it's unlikely. Even open shadow roots are seemingly rare in the sorts of pages one might want to use Readability on. If it came to it, there are ways: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/dom/openOrClosedShadowRoot (assuming we're in a browser addon context)
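To illustrate the open versus closed distinction discussed above: `element.shadowRoot` returns the root only when it is open, and `null` when it is closed. A lookup that falls back to the WebExtensions API from the MDN link could look roughly like this. The function name `getShadowRoot` and the `domApi` parameter are hypothetical; in a content script one would presumably pass `browser.dom`, and in an ordinary page leave it out.

```javascript
// Return the shadow root hosted by `element`, or null if none is reachable.
// `domApi` is a hypothetical, optional parameter: an object exposing
// openOrClosedShadowRoot(element), such as `browser.dom` in an extension.
function getShadowRoot(element, domApi) {
  // Open shadow roots are exposed directly on the host element.
  if (element.shadowRoot) {
    return element.shadowRoot;
  }
  // Closed roots are only reachable through the privileged extension API.
  if (domApi && typeof domApi.openOrClosedShadowRoot === "function") {
    return domApi.openOrClosedShadowRoot(element) || null;
  }
  return null;
}
```

Outside an extension context the fallback is simply skipped, so closed roots stay invisible, which matches the limitation mentioned above.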
MSN article https://www.msn.com/en-us/news/world/south-korean-president-apologizes-for-declaring-martial-law-as-he-faces-impeachment-vote/ar-AA1vpHO2 simply returns null without errors.
I'm new to the library. Is this expected when an article is not readable, or is this a bug?