
Replace the blib2to3 tokenizer with pytokens #4536

Open · wants to merge 27 commits into base: main
Conversation

tusharsadhwani
Contributor

@tusharsadhwani tusharsadhwani commented Dec 22, 2024

Description

Replaces black's tokenizer with a from-scratch rewrite done by me. We could vendor the code into black itself, but my recommendation would be to either pin it or keep it as a regular dependency: that way the tokenizer can be used by multiple tools for perfect compatibility.

Resolves #4520
Resolves #970
Resolves #3700

Tests passing so far: 381/381 (!)
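For context, the contract a replacement tokenizer must honor can be illustrated with CPython's stdlib tokenize module (a sketch using the stdlib, not pytokens' own API):

```python
# Illustration using CPython's stdlib tokenizer, standing in for pytokens:
# a replacement tokenizer must reproduce this token stream exactly for
# Black's parser to behave identically.
import io
import tokenize

src = "x = 1  # comment\n"
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]
print(names)
# ['NAME', 'OP', 'NUMBER', 'COMMENT', 'NEWLINE', 'ENDMARKER']
```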

@tusharsadhwani
Contributor Author

@JelleZijlstra with this, the test suite is fully passing. Primer is failing (mostly just because some file in hypothesis failed to parse); I'll be working on that, but the code should be good for a first review.

@MeGaGiGaGon
Contributor

I don't think we currently have any tests for this, but I just linked the above two issues here because they are the same bug in the parser, where `\r` characters cause issues. The smallest reproduction is `{\r}`, and since this is a parser rewrite, hopefully it can be solved. Note that this is currently only observable by directly calling internal methods, due to how black reads input.
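To make the expected behaviour concrete (my sketch, using CPython itself as the reference): a lone `\r` is normalized to a line ending before tokenizing, so the `{\r}` reproduction is just an empty dict display.

```python
# CPython normalizes a lone "\r" to a newline before tokenizing, so the
# minimal reproduction "{\r}" parses as an empty dict display -- the
# behaviour a compatible tokenizer must match.
import ast

tree = ast.parse("{\r}", mode="eval")
print(type(tree.body).__name__)
# Dict
```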

@tusharsadhwani
Contributor Author

Thanks for linking these; I'll make sure they parse identically to how CPython does it.

@tusharsadhwani
Contributor Author

Okay, primer is fixed, and all tests are green.

@JelleZijlstra
Collaborator

Might this fuzzer failure indicate a bug?

Falsifying example: test_idempotent_any_syntatically_valid_python(
    src_contents='\n#\r#',
    mode=Mode(target_versions=set(),
     line_length=88,
     string_normalization=False,
     is_pyi=False,
     is_ipynb=False,
     skip_source_first_line=False,
     magic_trailing_comma=False,
     python_cell_magics=set(),
     preview=False,
     unstable=False,
     enabled_features=set()),  # or any other generated value
)
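The property this fuzzer checks is idempotence: running the formatter on its own output must change nothing. A minimal sketch of the property, with a hypothetical toy normalizer standing in for black.format_str (for illustration only):

```python
# Sketch of the idempotence property the fuzzer checks: format(format(s))
# must equal format(s). The "formatter" here is a toy line-ending
# normalizer, a hypothetical stand-in for black.format_str.
def toy_format(src: str) -> str:
    # Normalize \r\n and lone \r to \n, and ensure a trailing newline.
    out = src.replace("\r\n", "\n").replace("\r", "\n")
    return out if out.endswith("\n") else out + "\n"

src = "\n#\r#"  # the falsifying example found by the fuzzer
once = toy_format(src)
twice = toy_format(once)
assert once == twice  # idempotent on this input
```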

        token: pytokens.Token, source: str, prev_token: Optional[pytokens.Token]
    ) -> pytokens.Token:
        r"""
        Black treats `\\\n` at the end of a line as a 'NL' token, while it
Collaborator

That doesn't sound particularly intentional, I'd be open to changing Black to remove this divergence.

Contributor Author

Feel free to! I can give you a test case with the expected behaviour.

Contributor Author

So, this is enough as a test case actually:

a \
  b

But the reason this probably exists is to support formatting this file:

class Plotter:
\
    pass

class AnotherCase:
  \
    """Some
    \
    Docstring
    """

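For reference, CPython's stdlib tokenizer can be used to check the first case directly (my illustration): the escaped newline is consumed as a line continuation and produces no NL token at all, which is the divergence under discussion.

```python
# How CPython's own tokenizer handles the "a \<newline>  b" case above:
# the backslash-escaped newline is consumed as a line continuation, so
# no NL token is emitted for it -- unlike Black's current behaviour.
import io
import tokenize

src = "a \\\n  b\n"
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]
print(names)
# ['NAME', 'NAME', 'NEWLINE', 'ENDMARKER']
```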
@tusharsadhwani
Contributor Author

Might this fuzzer failure indicate a bug?

Yeah, but I'm pretty sure it is a bug in CPython. For now we can work around it in the tokenizer, though. I'll add a flag in pytokens to fix it.
