Zimit WARC files of archives.nyphil.org are full of unrecognized characters #253

Open
benoit74 opened this issue Nov 20, 2023 · 1 comment

Comments

@benoit74
Collaborator

A youzim.it run of https://archives.nyphil.org/ failed after reporting many unrecognized characters.

Task is here.

Command used:

zimit --url=https://archives.nyphil.org/ --name=archives.nyphil.org_67aad441 --zim-file=archives.nyphil.org_67aad441.zim --userAgentSuffix=Youzim.it+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json [email protected]

Final error:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 541, in <module>
    zimit()
  File "/usr/bin/zimit", line 443, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 811, in warc2zim
    return warc2zim.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 433, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 646, in add_items_for_warc_record
    payload_item = WARCPayloadItem(record, self.head_insert, self.css_insert)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 179, in __init__
    self.title = parse_title(self.content)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 714, in parse_title
    soup = BeautifulSoup(content, "html.parser")
  File "/app/zimit/lib/python3.10/site-packages/bs4/__init__.py", line 348, in __init__
    self._feed()
  File "/app/zimit/lib/python3.10/site-packages/bs4/__init__.py", line 434, in _feed
    self.builder.feed(self.markup)
  File "/app/zimit/lib/python3.10/site-packages/bs4/builder/_htmlparser.py", line 377, in feed
    parser.feed(markup)
  File "/usr/lib/python3.10/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.10/html/parser.py", line 178, in goahead
    k = self.parse_html_declaration(i)
  File "/usr/lib/python3.10/html/parser.py", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/usr/lib/python3.10/_markupbase.py", line 144, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "/usr/lib/python3.10/_markupbase.py", line 390, in _scan_name
    raise AssertionError(
AssertionError: expected name token at '<![\x05�\x069�y�\x00"���@��\x11H'
FATAL: exception not rethrown

Before that, the following warning appears many times in the log:

[WARNING] Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
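
One way to spot such records before they reach the HTML parser is to measure how much of a payload turns into REPLACEMENT CHARACTER (U+FFFD) when decoded. The following is a heuristic sketch under stated assumptions: `replacement_ratio` and `looks_like_binary` are hypothetical helpers, UTF-8 and the 5% threshold are arbitrary choices, and it is not confirmed here whether the affected records are binary, compressed, or merely in a different text encoding.

```python
def replacement_ratio(payload: bytes, encoding: str = "utf-8") -> float:
    """Fraction of decoded characters that are U+FFFD replacements."""
    if not payload:
        return 0.0
    # errors="replace" substitutes U+FFFD for every undecodable sequence.
    text = payload.decode(encoding, errors="replace")
    return text.count("\ufffd") / len(text)


def looks_like_binary(payload: bytes, threshold: float = 0.05) -> bool:
    """Heuristic: many replacement characters suggest non-text content.

    The threshold is an assumption chosen for illustration only.
    """
    return replacement_ratio(payload) > threshold
```

A payload flagged this way could be logged with its URL and content type, which would help confirm what the site is actually serving for those records.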
benoit74 self-assigned this on Nov 20, 2023
benoit74 added this to the 2.2.0 milestone on Jun 18, 2024
@benoit74
Collaborator Author

We need to rerun the process with Zimit2 to confirm whether the issue is still present.

benoit74 modified the milestones: 2.2.0, later on Aug 7, 2024