Normalize line endings in multiline strings #74

DavisVaughan · 2024-11-27T13:27:23Z

Why do they even do this?

I imagine the idea is that typically the line ending doesn't really matter, it becomes part of the Trivia set, but is not part of an actual node. But here, the line ending gets parsed into the actual string contents. I assume they want to ensure that the actual nodes of the tree look the same across OSes regardless of line endings, and standardizing on \n inside multiline strings is the way to achieve that.

The text was updated successfully, but these errors were encountered:

lionel- · 2024-11-27T16:55:58Z

The LSP normalises line endings here:

air/crates/lsp/src/documents.rs

Line 55 in 003c31b

let (contents, endings) = LineEndings::normalize(contents);

DavisVaughan · 2024-11-27T18:06:45Z

I find this kind of interesting because:

For LSP workflows:

LSP normalizes to all \n endings on the way in (both normal line endings and these weird ones in multiline string ones)
We format
LSP writes out with \r\n endings

For cli workflows:

We format, replacing \r\n in multiline strings with \n, but otherwise leave the \r\n as is

So in cli workflows you can end up with what we want, which is \r\n line endings everywhere except in the multiline strings, which have \n line endings. But in LSP workflows you can't ever end up with the "right" thing on the way out, the last step of replacing all \n with \r\n means that on the way out even multiline strings get \r\n endings again.

There's a decent chance this is why biome retains the original line endings in the LSP side of things

DavisVaughan · 2024-11-29T20:46:50Z

I've been ruminating on this for a few days and researching some things. Here's the guiding principle of what i think we should do:

At the earliest possible input boundary, convert from input line endings to \n. At the latest possible output boundary, convert from \n to LineEnding

We should apply this holistically throughout air.

Practically, the "earliest possible input boundary" is at file read time (or file modification time in the LSP). We could do this in the lexer/parser, but I think it is going to be important for us to have a "correct" standalone String of text to also use for other things, and ensure it is in sync with what's in the parser.

The LSP effectively does this already with normalize() as @lionel- showed above. We need to:

Also normalize() before parsing.
Turn handling of \r\n into an error in the parser, to avoid this ever happening again.

Something I did not like in the example of mine above is that the previous "ideal" scenario put you in a state where you had \r\n line endings everywhere except in multiline strings. That seems like a path to madness.

Instead I think the ideal scenario is that air ensures that you only have exactly one type of line endings in your file.

I brought up earlier that its not great that in this case

"foo
bar"

There is a line ending captured by R's parser, and it could be \n on unix and \r\n on Windows. It's true that that isn't great, but air provides a solution already!

Simply set LineEnding: Lf and run air and boom now you have consistent \n line endings across all platforms, even on Windows. You can go as far as enforcing this at CI time once we implement something like air format --check.

You can also combine that with a .gitattributes file with * text=auto eol=lf like we did in air to ensure that git checks out that files on Windows with an lf ending, avoiding any possible weirdness (i think its fine even if you dont, air just runs slightly slower)

This also doesn't seem to be something that R users have struggled with in the past.

Pros

Simple to handle on the implementation side
Simple to explain to users (they get exactly 1 type of line ending, not some mixed beast in some cases)
Exactly 1 type of line ending
LineEnding is the source of truth of that line ending type
Plays nicely with biome's formatter without having to change anything
Basically also what rust-analyzer (and actually rust core) does. We can use their exact normalization tooling, which is very cheap because removing \r only shrinks the string, so no new allocations are required!

Cons

It's technically possible to get in a scenario where you check out on Windows with CRLF endings, you write a multiline string in a file, you send it to R, and it has \r\n endings in the string, where some other string else generated on a mac had \n endings. But this sounds wildly improbable - especially since LineEnding: Lf is actually the default.

Interesting historical note. Our normalize() function comes from rust-analyzer, but the author also put it in Rust itself through this issue/PR rust-lang/rust#62865 rust-lang/rust#62948

lionel- · 2024-12-02T08:40:02Z

So in cli workflows you can end up with what we want, which is \r\n line endings everywhere except in the multiline strings, which have \n line endings

I was thinking that what we want is replacing line endings everywhere, including in multiline strings which are not special in that regard, and then restoring them back on the way out? Like the LSP does.

Turn handling of \r\n into an error in the parser, to avoid this ever happening again.

hmm I'm not sure about this. The parser can be used in other contexts than just the cli formatter and the LSP and it seems a bit restrictive to force callers to normalize. The platform sensitive line endings can be a footgun when trivia token matter but for many potential usages the trivia token are not depended upon. I don't have a strong opinion though, I could go with the strict approach.

Simply set LineEnding: Lf and run air and boom now you have consistent \n line endings across all platforms

Is this a biome setting? Can we add Preserve that would use the LSP approach?

DavisVaughan · 2024-12-02T20:10:22Z

@lionel- and I learned that, in practice, a lot of my fears around R parser behavior are unfounded

Even on Windows, it looks like this:

"foo
bar"

does end up evaluating to "foo\nbar" pretty much everywhere, including all of the following scenarios:

In Positron, at the Console or using CMD+Enter from a file with CRLF endings
- Due to usage of convert_line_endings() in handle_execute_request()
In Positron, in a readline() response
- Due to usage of convert_line_endings() in handle_input_reply()
In RStudio, at the Console or using CMD+Enter from a file with CRLF endings
In Rterm
When reading in files with parse(file =), due to this replacer

The key point is that the implementer of the R_ReadConsole() hook is the one in charge of doing this normalization.

It is possible for \r\n to slip in, and the R parser does not remove it. For example, if you remove the usage of convert_line_endings() in handle_execute_request() AND you take out the ark code that splits the input into individual lines(), then you can get this on Windows:

> x <- "a
+ b"
> x
[1] "a\r\nb"

We are not worried about this case. In practice the worst thing that would probably happen in this nightmare scenario is...exactly what would happen today. We'd normalize \r\n to \n, format, and remit as \r\n, back to exactly where we started.

So in air that means that it should be totally fine for us to standardize to unix line endings on the way in, and convert to a single type of line endings on the way out.

We don't have to be strict in the parser, I was just hoping that turning this into an error there would keep us from accidentally forgetting to normalize and not realizing it until a Windows user reports an issue.

DavisVaughan mentioned this issue Dec 2, 2024

Normalize all line endings to Unix on the way in #78

Merged

DavisVaughan closed this as completed in #78 Dec 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalize line endings in multiline strings #74

Normalize line endings in multiline strings #74

DavisVaughan commented Nov 27, 2024

lionel- commented Nov 27, 2024

DavisVaughan commented Nov 27, 2024 •

edited

Loading

DavisVaughan commented Nov 29, 2024 •

edited

Loading

lionel- commented Dec 2, 2024

DavisVaughan commented Dec 2, 2024 •

edited

Loading

Normalize line endings in multiline strings #74

Normalize line endings in multiline strings #74

Comments

DavisVaughan commented Nov 27, 2024

Why do they even do this?

lionel- commented Nov 27, 2024

DavisVaughan commented Nov 27, 2024 • edited Loading

DavisVaughan commented Nov 29, 2024 • edited Loading

Pros

Cons

lionel- commented Dec 2, 2024

DavisVaughan commented Dec 2, 2024 • edited Loading

DavisVaughan commented Nov 27, 2024 •

edited

Loading

DavisVaughan commented Nov 29, 2024 •

edited

Loading

DavisVaughan commented Dec 2, 2024 •

edited

Loading