Invalid XML character break docket parsers #348
Comments
Nice find. We've seen this before in other areas, so it's not surprising to see it here too. I did some performance testing on this a while back: https://stackoverflow.com/a/25920392/64911

The code that's in CL to handle this is:

    import re

    def filter_invalid_XML_chars(input):
        """XML allows:

           Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

        This strips out everything else.

        See: http://stackoverflow.com/a/25920392/64911
        """
        if isinstance(input, str):
            # Only do str, unicode, etc.
            return re.sub(
                "[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD"
                "\U00010000-\U0010FFFF]+",
                "",
                input,
            )
        else:
            return input

I'd definitely welcome a PR for this. |
I did some reading on this earlier today before reading your post, and stumbled upon the same SO post... I should have looked more closely at the author. Will work on this over the weekend. |
@mlissner thanks for the link. I've been doing a little digging on this and haven't found a solution that works quite yet. I've got a sample text file with the bad payload from the docket. I'm examining the way I'm still working through this. Side note: I see the code in CL, but I'm not seeing where it is used anywhere in that repo. |
Weird, yeah, looks like it's not used anymore. I suppose we could delete it since it's easy to find again on StackOverflow. Do you need help with your progress? Sounds like you're just checking in, but if you're frustrated maybe somebody can take a look. |
Not just to be contrarian, but I have long been convinced the StackOverflow post does not offer the right solution. |
That's sort of what I'm finding @johnhawkinson. I'll post a PR with the failing test case. |
The traceback on this goes back to a character parsed by
PR: #349
|
Can you elaborate? |
Summary
When a page on PACER (or elsewhere) contains characters that are not in the set of valid XML characters, lxml's html5 parser will fail.
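The failure class can be illustrated without PACER data. The sketch below is a stand-in, not a reproduction: it uses the standard library's `xml.etree.ElementTree` rather than lxml's html5 parser (which reports the problem with a different exception), and `is_parseable` is a hypothetical helper that is not part of juriscraper. It only shows that a single control character, here U+000B, is enough for a strict XML parser to reject otherwise ordinary text.

```python
import xml.etree.ElementTree as ET


def is_parseable(payload: str) -> bool:
    """Return True if a strict XML parser accepts the payload.

    Hypothetical helper for illustration; not part of juriscraper.
    """
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


clean = "<docket>Order granting motion</docket>"
# U+000B (vertical tab) is outside the XML 1.0 Char production.
dirty = "<docket>Order\x0bgranting motion</docket>"

print(is_parseable(clean))
print(is_parseable(dirty))
```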
Tasks
juriscraper/lib/html_utils.py
to escape these characters, probably using some regex so we don't lose too much speed.
Questions
All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
traceback bubble up the stack.
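For the task of escaping these characters with a regex, one possible shape is sketched below. It is only a sketch under assumptions: the issue does not say what the escaped form should look like, so the visible backslash-u form used here is a guess, and `escape_invalid_xml_chars` is a hypothetical name rather than an existing function in `juriscraper/lib/html_utils.py`. The pattern is precompiled so repeated calls do not pay for compilation each time.

```python
import re

# Negated character class built from the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def escape_invalid_xml_chars(text: str) -> str:
    """Replace each XML-invalid character with a visible backslash-u escape.

    Hypothetical helper; the escape format is an assumption.
    """
    return INVALID_XML_CHARS.sub(
        lambda match: "\\u{:04x}".format(ord(match.group())), text
    )
```

Unlike stripping, this keeps a trace of what was in the original payload, which may help when debugging a docket that parses oddly. Valid whitespace such as tabs and newlines passes through unchanged.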