Crawler: error trying to parse this page: c02.xhtml #335
Comments
@lorenzodifuccia and @ivanpagac, I believe later versions of Python introduced a set of problems that lxml hasn't addressed yet.
Regardless of this external change in lxml, I found an issue in this project's handling of emojis and other special Unicode characters when it asks lxml to parse the document, even on the Python versions with which lxml behaves well. I have addressed the issue in https://github.com/azec-pdx/safaribooks/tree/apiv2.
I've had different behaviors of
@azec-pdx, I'm using an M-series macOS device, and the code on your branch (commit azec-pdx@a2be61e) got me around this same problem. Thank you!
@azec-pdx, thank you. Is there a version of lxml (with Python fixed at 3.9.x) where this error can be avoided? If so, pinning lxml to that version in requirements.txt may let users work around this problem locally until a formal PR resolving it gets merged.
#347 fixed this issue.
However, the URL itself is correct; I can display the page in the browser:
https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml