Passing Unicode directly raises TypeError in textanalyzer #11
Perhaps I'm doing something wrong, but it's worth checking.

My input to the ReadabilityTool is unicode utf-8 text. The input is already encoded, and I received a TypeError when trying to run the tests on it. It appears the logic at line 130 in textanalyzer.py expects to perform an encoding that has already been performed.

Is there something I need to configure in order to make the module expect Unicode by default?
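For illustration, a hypothetical reproduction of the failure might look like this — the import path, constructor, and method name are assumptions for the sketch, not taken from the report:

```python
# Hypothetical reproduction (import path and method name assumed):
from nltk_contrib.readability.readabilitytests import ReadabilityTool

text = u"D\u00e9j\u00e0 vu, all over again."  # a unicode string, not bytes
tool = ReadabilityTool()
tool.ARI(text)  # textanalyzer tries to re-encode internally and a TypeError surfaces
```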
Hi @DeaconDesperado, I don't have experience with nltk_contrib and textanalyzer.py, but "unicode" and "utf-8 text" are in some sense antonyms, not synonyms, because "utf8-encoded" means "binary". So what is your input exactly: a unicode string or a binary utf8-encoded string? Based on the _setEncoding method, it looks like textanalyzer expects a binary utf8-encoded string, so if you have unicode input, encoding it to utf8 before passing it to readability should make the algorithm work.

Based on your code snippet, it looks to me that the unicode handling in nltk_contrib.readability is bad; it really should accept only unicode (and not binary data). There are many such issues fixed in NLTK 3.0 alpha, but nltk_contrib was not updated accordingly. Contributions are appreciated!
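Concretely, the suggested workaround would look roughly like the minimal sketch below, assuming the same import path and an illustrative method name:

```python
# Minimal sketch of the workaround (import path and method name assumed):
from nltk_contrib.readability.readabilitytests import ReadabilityTool

unicode_text = u"Some text with non-ASCII characters: \u00e9, \u00fc, \u00f1"
tool = ReadabilityTool()
tool.ARI(unicode_text.encode("utf-8"))  # hand textanalyzer binary UTF-8, as it expects
```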
@kmike Thanks! The input was in fact unicode text, so I did as you suggested and performed the encoding in my application code before handing it off to readability. We used the contrib module in a pretty awesome R&D project last Friday and I'd love to contribute. I'm going to work on amending the handling in my fork and will file a pull request ASAP.
@kmike I know you said that you haven't much used this particular module in contrib, but what do you think would be the most desirable handling of unicode here? Should the analyzer only check the encoding once and perhaps store it as an instance property? The problem seems to be that the functions are called separately in the analyzer, so the check happens over and over. We should enforce unicode as input, agreed, but where do we go from there?
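For illustration, the instance-property idea floated here might look like the sketch below — the class and method names are hypothetical, not nltk_contrib's actual API:

```python
# Hypothetical sketch: normalize the input once in __init__ instead of
# re-checking the encoding inside every analysis function.
class TextAnalyzer(object):
    def __init__(self, text, encoding="utf-8"):
        if isinstance(text, bytes):       # binary input: decode exactly once
            text = text.decode(encoding)
        self.text = text                  # unicode from here on
        self.encoding = encoding          # kept as an instance property

    def word_count(self):
        return len(self.text.split())     # every method works on unicode
```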
I think that the analyzer should only accept unicode text and leave the task of decoding to the user, and that almost every ".encode" / ".decode" is in an incorrect place now :) Encoding/decoding should be done as close as possible to IO: e.g. when a file is read or written, or when a web page is downloaded. If some code doesn't perform IO, it shouldn't ever encode or decode text. Algorithms that work with text should work on unicode data (that is already decoded), and they should produce unicode data (that could be encoded later).

It could be reasonable for URLextracter to guess the encoding of a web page and decode the data to unicode, but there are dedicated modules for this task (e.g. chardet or UnicodeDammit from BeautifulSoup) that handle many more cases than the current _setEncoding methods.
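The principle reads roughly like this in code — a minimal sketch under assumed file names, using chardet (mentioned above) for detection at the input boundary:

```python
# Sketch of "decode at IO, work in unicode, encode at IO" (paths hypothetical).
import codecs
import chardet  # third-party encoding detector mentioned above

def read_page(path):
    # Input boundary: read raw bytes, guess the encoding, decode exactly once.
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)["encoding"] or "utf-8"
    return raw.decode(guess)

def word_count(text):
    # Pure algorithm: unicode in, number out; no encode/decode anywhere here.
    return len(text.split())

def write_report(path, count):
    # Output boundary: encode exactly once, at write time.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(u"%d words\n" % count)
```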
Thanks for the excellent writeup, I agree with all your points 👍 I did some experimentation with this yesterday and noticed what you mean about amending the encoding in the analyzer. I'm going to have another look at this today from the perspective of your guidelines.