Crawler: error trying to parse this page: c02.xhtml #335

Open
ivanpagac opened this issue Jan 13, 2023 · 5 comments

@ivanpagac

[13/Jan/2023 12:17:49] ** Welcome to SafariBooks! **
[13/Jan/2023 12:17:49] Logging into Safari Books Online...
[13/Jan/2023 12:17:52] Successfully authenticated.
[13/Jan/2023 12:17:52] Retrieving book info...
[13/Jan/2023 12:17:52] Title: The Rust Programming Language, 2nd Edition
[13/Jan/2023 12:17:52] Authors: Steve Klabnik, Carol Nichols
[13/Jan/2023 12:17:52] Identifier: 9781098156817
[13/Jan/2023 12:17:52] ISBN: 9781098156800
[13/Jan/2023 12:17:52] Publishers: No Starch Press
[13/Jan/2023 12:17:52] Rights: 
[13/Jan/2023 12:17:52] Description: The Rust Programming Language, 2nd Edition is the official guide to Rust 2021: an open source systems programming language that will help you write faster, more reliable software. Rust provides control of low-level details along with high-level ergonomics, allowing you to improve productivity and eliminate the hassle traditionally associated with low-level languages.Klabnik and Nichols, alumni of the Rust Core Team, share their knowledge to help you get the most out of Rustâ??s features so that ...
[13/Jan/2023 12:17:52] Release Date: 2023-02-28
[13/Jan/2023 12:17:52] URL: https://learning.oreilly.com/library/view/the-rust-programming/9781098156817/
[13/Jan/2023 12:17:52] Retrieving book chapters...
[13/Jan/2023 12:17:54] Output directory:
    /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] Book directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] CSSs directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Styles
[13/Jan/2023 12:17:54] Images directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Images
[13/Jan/2023 12:17:54] Downloading book contents... (35 chapters)
[13/Jan/2023 12:17:54] File `cover.xhtml` already exists.
    If you want to download again all the book,
    please delete the output directory '/Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)' and restart the program.
[13/Jan/2023 12:17:54] Document is empty
[13/Jan/2023 12:17:54] Crawler: error trying to parse this page: c02.xhtml (Chapter 2: Programming a Guessing Game)
    From: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
[13/Jan/2023 12:17:54] Last request done:
	URL: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
	DATA: None
	OTHERS: {}

	200

However, the URL itself is correct: I can display the page in the browser.

https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml

@azec-pdx

@lorenzodifuccia && @ivanpagac,

I believe later versions of Python introduced a set of problems that lxml hasn't addressed yet.
I am watching the following:

  1. https://bugs.launchpad.net/lxml/+bug/1949271
  2. https://github.com/Donohue/medium-to-jekyll/pull/4/files
  3. emoji (unicode char) into the html of the page make lxml return empty body Donohue/medium-to-jekyll#3

Independently of this external lxml problem, I found an issue in this project itself: it mishandles emojis and other special Unicode characters when asking lxml to parse the document, even on Python versions where lxml behaves well.
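To see which characters might be involved on a given page, a small standard-library sketch like the one below can list every non-ASCII character in a saved copy of the chapter. The function name `non_ascii_report` is hypothetical, and treating "non-ASCII" as the set worth inspecting is an assumption; the thread does not say exactly which characters trigger the failure.

```python
import unicodedata


def non_ascii_report(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-ASCII character.

    Hypothetical helper for inspection only; it does not parse or modify the page.
    """
    report = []
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F:
            code_point = f"U+{ord(ch):04X}"
            # Fall back to a placeholder for characters without a Unicode name.
            name = unicodedata.name(ch, "UNKNOWN")
            report.append((index, code_point, name))
    return report


if __name__ == "__main__":
    sample = "Rust\u2019s features \U0001F600"
    for index, code_point, name in non_ascii_report(sample):
        print(index, code_point, name)
```

Running it over the text of c02.xhtml (saved from the browser) would show whether the page contains emojis or other characters outside the Basic Multilingual Plane (code points above U+FFFF).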

I have addressed the issue in https://github.com/azec-pdx/safaribooks/tree/apiv2 .
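The branch's actual changes are not reproduced here. As a minimal sketch of one possible workaround, assuming the troublesome characters lie outside the Basic Multilingual Plane, the markup can be pre-processed so that those characters become numeric character references and the parser receives UTF-8 bytes instead of a Python `str`. The helper names `escape_non_bmp` and `to_parser_bytes` are hypothetical.

```python
import re

# Matches any character outside the Basic Multilingual Plane (most emojis).
_NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")


def escape_non_bmp(markup: str) -> str:
    """Replace each non-BMP character with an HTML numeric character reference."""
    return _NON_BMP.sub(lambda match: f"&#x{ord(match.group()):X};", markup)


def to_parser_bytes(markup: str) -> bytes:
    """Escape non-BMP characters, then encode as UTF-8 bytes for the parser.

    Assumption: handing bytes (not str) to the parser avoids the empty-document
    result on affected setups; this is not confirmed by the thread.
    """
    return escape_non_bmp(markup).encode("utf-8")


if __name__ == "__main__":
    page = "<p>Guessing game \U0001F600</p>"
    print(to_parser_bytes(page))
```

The resulting bytes could then be passed to `lxml.html.fromstring`, which accepts byte strings as well as text. Whether this resolves the failure on a particular Python and lxml combination would need to be tested on that machine.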
I confirmed that it works by testing on the books with IDs 9781098156817 and 9781617297274, both of which contain emojis and other offending characters. However, I could only get parsing to work with Python 3.9.16; with Python 3.9.10 it is still broken (I believe because of the additional issue linked above).

Screenshot 2023-03-27 at 9 08 37 AM

Screenshot 2023-03-27 at 8 58 21 AM

@azec-pdx

azec-pdx commented Apr 4, 2023

I've seen lxml behave differently on the same Python version between macOS on an Apple M1 chip and macOS on an Intel chip. On the M1 Mac it errors as described above, and my branch now handles that, but on the Intel Mac it never errors out.
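When comparing machines like this, it may help to attach the exact environment to a report. The sketch below gathers the Python version, CPU architecture, and (if lxml is installed) the lxml and libxml2 versions. `collect_env` is a hypothetical helper name; `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are attributes of `lxml.etree`.

```python
import platform
from importlib import metadata


def collect_env() -> dict:
    """Collect details that may explain platform-specific lxml behavior."""
    info = {
        "python": platform.python_version(),
        "system": platform.system(),
        # Typically "arm64" on Apple silicon and "x86_64" on Intel Macs.
        "machine": platform.machine(),
    }
    try:
        info["lxml"] = metadata.version("lxml")
    except metadata.PackageNotFoundError:
        info["lxml"] = None
    try:
        from lxml import etree

        # Version of libxml2 loaded at runtime vs. the one lxml was built against.
        info["libxml2_runtime"] = ".".join(map(str, etree.LIBXML_VERSION))
        info["libxml2_compiled"] = ".".join(map(str, etree.LIBXML_COMPILED_VERSION))
    except ImportError:
        info["libxml2_runtime"] = None
        info["libxml2_compiled"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Comparing this output between a failing and a working machine would show whether the two differ in lxml or libxml2 version rather than only in CPU architecture.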

@lorenzodifuccia added the "help wanted" and "need more info" (Please provide more info to address the issue) labels on May 5, 2023
@jrwagz

jrwagz commented May 15, 2023

@azec-pdx, I'm using an M-series macOS device, and the code on your branch (commit azec-pdx@a2be61e) got me around this same problem. Thank you!

@trsudarshan

@azec-pdx thank you. Is there a version of lxml (keeping Python fixed at 3.9.x) where this error can be avoided? If so, pinning lxml to that version in requirements.txt may let users work around the problem locally until a formal PR resolving it gets merged.

@dreampuf

#347 fixed this issue.

@lorenzodifuccia added the "wontfix" label and removed the "help wanted" and "need more info" (Please provide more info to address the issue) labels on Oct 30, 2024
tejavegesna added a commit to tejavegesna/safaribooks that referenced this issue Nov 30, 2024