make_tokenizer doesn't deal with binary tokenizers #46

gsnedders · 2016-06-28T23:44:50Z

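from funcparserlib.lexer import make_tokenizer

# A single token spec whose pattern is a bytes regex
tokenize = make_tokenizer([
    (u'x', (br'\xff\n',)),
])

tokens = list(tokenize(b"\xff\n"))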

throws:

  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/tests/test_parsing.py", line 76, in test_tokenize_bytes
    tokens = list(tokenize(b"\xff\n"))
  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/lexer.py", line 107, in f
    t = match_specs(compiled, str, i, (line, pos))
  File "/Users/gsnedders/Documents/other-projects/funcparserlib/funcparserlib/funcparserlib/lexer.py", line 91, in match_specs
    nls = value.count(u'\n')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

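match_specs needs to handle both unicode and bytes line feed characters. It currently calls value.count(u'\n') unconditionally, so a bytes value is implicitly decoded as ASCII before counting, which fails on any non-ASCII byte such as 0xff.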
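One possible shape for the fix, as a minimal sketch rather than the actual patch: pick the newline literal that matches the type of the matched value, then do the line/column bookkeeping with it. The helper names `newline_for` and `advance_position` are hypothetical and not part of funcparserlib, and the column arithmetic is an assumption about how the end position would be tracked.

```python
def newline_for(value):
    """Return a line feed of the same type as value (bytes or text)."""
    # On Python 2, bytes is an alias of str, so this also covers str there.
    return b"\n" if isinstance(value, bytes) else u"\n"


def advance_position(value, line, pos):
    """Return the (line, pos) reached after consuming value.

    Hypothetical helper: counts line feeds without mixing bytes and text,
    so no implicit ASCII decoding can happen.
    """
    nl = newline_for(value)
    nls = value.count(nl)
    if nls == 0:
        # No line break: stay on the same line and move the column forward.
        return line, pos + len(value)
    # At least one line break: column is the length of the trailing segment.
    return line + nls, len(value) - value.rfind(nl) - 1


# Usage: the failing input from this report, plus a text value.
end_bytes = advance_position(b"\xff\n", 1, 0)
end_text = advance_position(u"a\nbc", 1, 0)
```

The same type check would work inline in match_specs; pulling it into a helper just keeps the bytes/text decision in one place.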

vlasovskikh self-assigned this Jun 30, 2016

gsnedders mentioned this issue Apr 3, 2023

First attempt at linter html5lib/html5lib-tests#83

Merged
