Scrapemark fails to decode hex-encoded HTML entities #9

arshaw · 2011-02-09T05:17:41Z

Reported by [email protected], Aug 11, 2010

Scrape a page that contains a hex-encoded HTML entity (e.g. the title of http://www.youtube.com/videos).

The entity should be decoded; instead, a ValueError is raised.

At the time of writing, the title of the above-mentioned YouTube page is (some whitespace removed for clarity):

<title>YouTube - &#x202a;Most viewed videos&#x202c;&lrm</title>
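
The failure can be reproduced without fetching the page. The sketch below is written for Python 3 (where `chr` replaces Python 2's `unichr`) and assumes the unpatched code passes everything after `&#` straight to `int()`, as the line removed by the patch suggests. `decode_numeric_old` is a hypothetical stand-in, not a scrapemark function.

```python
# Hypothetical stand-in for the unpatched numeric-entity branch:
# it treats whatever follows "&#" as a decimal number.
def decode_numeric_old(ent):
    return chr(int(ent))  # chr() is the Python 3 spelling of unichr()

# A decimal reference such as &#8234; decodes fine.
print(repr(decode_numeric_old("8234")))

# The hex reference &#x202a; from the title does not: int() cannot
# parse "x202a" in base 10 and raises ValueError.
try:
    decode_numeric_old("x202a")
except ValueError as exc:
    print("ValueError:", exc)
```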

Test code below:

#!/usr/bin/env python
import scrapemark

url = "http://www.youtube.com/videos"
data = scrapemark.scrape("<title>{{title}}</title>", url = url)
print data['title']

I've attached a patch:

diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
     def _substitute_entity(m):
         ent = m.group(2)
         if m.group(1) == "#":
-            return unichr(int(ent))
+            # Hex value
+            if ent[0] == 'x':
+                return unichr(int(ent[1:], 16))
+            else:
+                return unichr(int(ent))
         else:
             cp = name2codepoint.get(ent)
             if cp:
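
For readers who want to try the idea outside scrapemark, here is a minimal standalone sketch of the same approach, written for Python 3 (`chr` instead of `unichr`, `html.entities` instead of `htmlentitydefs`). The names `decode_entities` and `_ENTITY_RE` and the regular expression are assumptions made for illustration and are not taken from scrapemark's source. The sketch also goes slightly beyond the patch by accepting an uppercase `X` prefix and by leaving malformed references untouched rather than raising.

```python
import re
from html.entities import name2codepoint

# Assumed pattern (not scrapemark's own): group 1 is "#" for numeric
# references, group 2 is the entity body without "&", "#" or ";".
_ENTITY_RE = re.compile(r"&(#?)(\w+);")


def decode_entities(s):
    """Decode decimal, hex, and named HTML entities in s."""

    def substitute(m):
        ent = m.group(2)
        if m.group(1) == "#":
            try:
                if ent[0] in "xX":
                    # Hex reference such as &#x202a;
                    return chr(int(ent[1:], 16))
                # Decimal reference such as &#8234;
                return chr(int(ent))
            except (ValueError, OverflowError):
                # Malformed or out-of-range reference: leave it as written.
                return m.group(0)
        cp = name2codepoint.get(ent)
        # Unknown named entities are also left as written.
        return chr(cp) if cp is not None else m.group(0)

    return _ENTITY_RE.sub(substitute, s)


title = decode_entities("YouTube - &#x202a;Most viewed videos&#x202c;&lrm;")
print(repr(title))
```

On Python 3.4 and later, the standard library's `html.unescape` covers the same ground and is likely the simpler choice when a custom decoder is not required.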
