Crawler: error trying to parse this page: c02.xhtml #335

ivanpagac · 2023-01-13T11:21:31Z

[13/Jan/2023 12:17:49] ** Welcome to SafariBooks! **
[13/Jan/2023 12:17:49] Logging into Safari Books Online...
[13/Jan/2023 12:17:52] Successfully authenticated.
[13/Jan/2023 12:17:52] Retrieving book info...
[13/Jan/2023 12:17:52] Title: The Rust Programming Language, 2nd Edition
[13/Jan/2023 12:17:52] Authors: Steve Klabnik, Carol Nichols
[13/Jan/2023 12:17:52] Identifier: 9781098156817
[13/Jan/2023 12:17:52] ISBN: 9781098156800
[13/Jan/2023 12:17:52] Publishers: No Starch Press
[13/Jan/2023 12:17:52] Rights: 
[13/Jan/2023 12:17:52] Description: The Rust Programming Language, 2nd Edition is the official guide to Rust 2021: an open source systems programming language that will help you write faster, more reliable software. Rust provides control of low-level details along with high-level ergonomics, allowing you to improve productivity and eliminate the hassle traditionally associated with low-level languages.Klabnik and Nichols, alumni of the Rust Core Team, share their knowledge to help you get the most out of Rustâ??s features so that ...
[13/Jan/2023 12:17:52] Release Date: 2023-02-28
[13/Jan/2023 12:17:52] URL: https://learning.oreilly.com/library/view/the-rust-programming/9781098156817/
[13/Jan/2023 12:17:52] Retrieving book chapters...
[13/Jan/2023 12:17:54] Output directory:
    /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] Book directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] CSSs directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Styles
[13/Jan/2023 12:17:54] Images directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Images
[13/Jan/2023 12:17:54] Downloading book contents... (35 chapters)
[13/Jan/2023 12:17:54] File `cover.xhtml` already exists.
    If you want to download again all the book,
    please delete the output directory '/Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)' and restart the program.
[13/Jan/2023 12:17:54] Document is empty
[13/Jan/2023 12:17:54] Crawler: error trying to parse this page: c02.xhtml (Chapter 2: Programming a Guessing Game)
    From: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
[13/Jan/2023 12:17:54] Last request done:
	URL: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
	DATA: None
	OTHERS: {}

	200

However, the URL itself is correct; I can display the page in the browser:

https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
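Since the log shows a 200 status next to "Document is empty", it may be worth checking whether the body really is blank or merely contains something the parser rejects. Below is a minimal sketch using only the standard library. The helper name `describe_body` and the idea of flagging code points above U+FFFF are assumptions made for illustration. They are not part of safaribooks.

```python
def describe_body(body: bytes) -> dict:
    """Summarize a raw HTTP response body before handing it to a parser."""
    summary = {
        "length": len(body),
        "blank": not body.strip(),  # True for empty or whitespace-only bodies
        "valid_utf8": True,
        "non_bmp_chars": [],
    }
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        summary["valid_utf8"] = False
        return summary
    # Code points above U+FFFF (most emoji) need four bytes in UTF-8.
    summary["non_bmp_chars"] = sorted({ch for ch in text if ord(ch) > 0xFFFF})
    return summary
```

Feeding it the raw bytes of the failing chapter (for example `response.content` if the request was made with `requests`) would show whether the server returned real content that the parser then failed on.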


azec-pdx · 2023-03-27T16:20:46Z

@lorenzodifuccia and @ivanpagac,

I believe later versions of Python introduced a set of problems that lxml hasn't addressed yet.
I am watching the following:

Independent of that external lxml problem, I found an issue in this project itself: it mishandles emojis and other special Unicode characters when asking lxml to parse the document, even on Python versions where lxml otherwise behaves well.

I have addressed the issue in https://github.com/azec-pdx/safaribooks/tree/apiv2 .
I confirmed the fix by testing on the books with IDs 9781098156817 and 9781617297274, both of which contain emojis and other offending characters. However, I could only get parsing to work with Python 3.9.16; with Python 3.9.10 it is still broken (I believe because of the additional issue linked above).
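For readers who want to experiment locally, here is one possible way to keep emojis away from the parser. This is a hedged sketch and not necessarily what the branch above does. It replaces every non-ASCII character with a numeric character reference before parsing. The helper name `escape_non_ascii` is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def escape_non_ascii(markup: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>Guess the number 🎉</p>"
escaped = escape_non_ascii(sample)

# The parser now receives pure ASCII input, yet the reference still
# resolves to the original emoji in the parsed tree.
parsed = ET.fromstring(escaped)
```

The trade-off is that the escaped markup is harder to read, and text that is not meant to be interpreted as markup (script or style content, for instance) would need separate handling.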

azec-pdx · 2023-04-04T18:41:50Z

I've seen lxml behave differently on the same Python version between macOS on an Apple M1 chip and macOS on an Intel chip. On the M1 machine it errors as described above, which my branch now handles, but on the Intel machine it never errors out.
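Because the behavior seems to depend on the combination of Python version, CPU architecture, and lxml build, it may help to attach those details when reporting results. A small sketch, assuming Python 3.8 or later for `importlib.metadata`; the function name `environment_report` is made up for this example.

```python
import platform
from importlib import metadata


def environment_report() -> dict:
    """Collect the details that differed between the machines in this thread."""
    try:
        lxml_version = metadata.version("lxml")
    except metadata.PackageNotFoundError:
        lxml_version = None  # lxml is not installed in this interpreter
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),  # CPU architecture string reported by the OS
        "system": platform.system(),
        "lxml": lxml_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it with the same interpreter that runs safaribooks makes reports from different machines directly comparable.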

jrwagz · 2023-05-15T02:37:10Z

@azec-pdx, I'm using an M-series macOS device, and the code on your branch (commit azec-pdx@a2be61e) got me around this same problem. Thank you!

trsudarshan · 2023-07-07T23:44:10Z

@azec-pdx thank you. Is there a version of lxml (with Python held at 3.9.x) that avoids this error? If so, pinning that lxml version in requirements.txt would let users work around the problem locally until a formal PR resolving it gets merged.

dreampuf · 2023-09-27T03:44:13Z

#347 fixed this issue.


lorenzodifuccia added the "help wanted" and "need more info" labels on May 5, 2023

lorenzodifuccia added the "wontfix" label and removed the "help wanted" and "need more info" labels on Oct 30, 2024

tejavegesna added a commit to tejavegesna/safaribooks that referenced this issue Nov 30, 2024

fix: Fix Unparasble Chapters

f5d0832
