The entities marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas it should probably strive for more compatibility with browsers, which interpret them according to CP1252.
A quick example:
In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'
whereas modern browsers usually show it as an ellipsis …:
In [5]: u'\u2026'
Out[5]: '…'
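For illustration, the browser-style interpretation of a single numeric reference can be reproduced with the standard library alone. This is a minimal Python 3 sketch, not lxml's or any browser's actual code: the helper name `browser_charref` is hypothetical, and it assumes Python's built-in `cp1252` codec is an acceptable stand-in for the mapping browsers apply to references in the 128-159 range.

```python
def browser_charref(codepoint):
    """Return the character a browser would likely show for &#codepoint;."""
    # Only the C1 range (0x80..0x9F) is reinterpreted as CP1252.
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few byte values undefined;
            # fall through and return the code point unchanged.
            pass
    return chr(codepoint)


print(repr(browser_charref(133)))
```

With this helper, reference 133 maps to U+2026 (the ellipsis) rather than U+0085.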
Maybe one could use the parser target interface to intercept the data and replace the chars, but I don't know about the processing penalty.
Sample code:
>>> import string
>>>
>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>>
>>> table = {unichr(i): r for i, r in replacementCharacters.items()}
>>>
>>> def charref_replace(s):
...     out = u''
...     for c in s:
...         if c in table:
...             out += table[c]
...         else:
...             out += c
...     return out
...
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     def data(self, data):
...         return super(ReservedReplacementTarget, self).data(charref_replace(data))
...
>>> parser = lxml.etree.HTMLParser(target=ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, &#133; world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
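On the processing-penalty question, one option that is probably cheaper than building the string character by character is `str.translate` with a precomputed table. The following is a minimal Python 3 sketch under stated assumptions: it uses Python's built-in `cp1252` codec instead of html5lib's `replacementCharacters`, so the five byte values that codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through unchanged, and the names `CP1252_TABLE` and `charref_replace_fast` are made up for this example. No timing has been measured here.

```python
def build_cp1252_table():
    """Map each C1 code point (U+0080..U+009F) to its CP1252 character."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 codec: leave the code point as-is.
            continue
    return table


# Built once at import time; str.translate accepts a dict keyed by code point.
CP1252_TABLE = build_cp1252_table()


def charref_replace_fast(s):
    """Replace C1 control characters in s with their CP1252 equivalents."""
    return s.translate(CP1252_TABLE)


print(charref_replace_fast("hello, \x85 world!"))
```

A function like this could be called from `data()` in the target class above in place of `charref_replace`.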