Skip to content

Normalize line endings in multiline strings #74

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
DavisVaughan opened this issue Nov 27, 2024 · 5 comments · Fixed by #78
Closed

Normalize line endings in multiline strings #74

DavisVaughan opened this issue Nov 27, 2024 · 5 comments · Fixed by #78

Comments

@DavisVaughan
Copy link
Collaborator

See also #69 (comment)

We have this CI failures on Windows https://github.com/posit-dev/air/actions/runs/12042018483/job/33574939950?pr=69

With this test

---- formatter::r::specs::r::value::string_value_r stdout ----
thread 'formatter::r::specs::r::value::string_value_r' panicked at C:\Users\runneradmin\.cargo\git\checkouts\biome-d49ce8898b420a50\2648fa4\crates\biome_formatter\src\builders.rs:394:5:
The content 'r"("some raw string
business")"' contains an unsupported '\r' line terminator character but text must only use line feeds '\n' as line separator. Use '\n' instead of '\r' and '\r\n' to insert a line break in strings.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Which formats a file containing this string

r"("some raw string
business")"

Note that this is a multiline string. The raw string there isn't critical to the reprex, it's just what we happened to have there.

This fails because:

  • On windows, git checks the file out with CRLF line endings. Note this puts a \r\n inside the string contents.
  • The formatter fails inside located_token_text() for a string because of this no-\r assertion

Both biome and ruff use this assertion, so I imagine it is for a good reason.

The way both of them handle this is by using variants of normalize_string() here:
https://github.com/biomejs/biome/blob/cd1c8ec4249e8df8d221393586d664537c9fddb2/crates/biome_formatter/src/token/string.rs#L48

Note that the docs say one of the things it does is replace \r\n with just \n.

They call this at fmt() time for a string node, before it actually drops through to located_token_text(), so they get a chance to "fix up" the string here. We could probably create our own version of this and strip out everything except the \r\n conversion, until we decide to do more normalization.

Why do they even do this?

I imagine the idea is that typically the line ending doesn't really matter, it becomes part of the Trivia set, but is not part of an actual node. But here, the line ending gets parsed into the actual string contents. I assume they want to ensure that the actual nodes of the tree look the same across OSes regardless of line endings, and standardizing on \n inside multiline strings is the way to achieve that.

@lionel-
Copy link
Collaborator

lionel- commented Nov 27, 2024

The LSP normalises line endings here:

let (contents, endings) = LineEndings::normalize(contents);

@DavisVaughan
Copy link
Collaborator Author

DavisVaughan commented Nov 27, 2024

I find this kind of interesting because:

For LSP workflows:

  • LSP normalizes to all \n endings on the way in (both normal line endings and these weird ones in multiline string ones)
  • We format
  • LSP writes out with \r\n endings

For cli workflows:

  • We format, replacing \r\n in multiline strings with \n, but otherwise leave the \r\n as is

So in cli workflows you can end up with what we want, which is \r\n line endings everywhere except in the multiline strings, which have \n line endings. But in LSP workflows you can't ever end up with the "right" thing on the way out, the last step of replacing all \n with \r\n means that on the way out even multiline strings get \r\n endings again.

There's a decent chance this is why biome retains the original line endings in the LSP side of things

@DavisVaughan
Copy link
Collaborator Author

DavisVaughan commented Nov 29, 2024

I've been ruminating on this for a few days and researching some things. Here's the guiding principle of what i think we should do:

At the earliest possible input boundary, convert from input line endings to \n. At the latest possible output boundary, convert from \n to LineEnding

We should apply this holistically throughout air.

Practically, the "earliest possible input boundary" is at file read time (or file modification time in the LSP). We could do this in the lexer/parser, but I think it is going to be important for us to have a "correct" standalone String of text to also use for other things, and ensure it is in sync with what's in the parser.

The LSP effectively does this already with normalize() as @lionel- showed above. We need to:

  • Also normalize() before parsing.
  • Turn handling of \r\n into an error in the parser, to avoid this ever happening again.

Something I did not like in the example of mine above is that the previous "ideal" scenario put you in a state where you had \r\n line endings everywhere except in multiline strings. That seems like a path to madness.

Instead I think the ideal scenario is that air ensures that you only have exactly one type of line endings in your file.

I brought up earlier that its not great that in this case

"foo
bar"

There is a line ending captured by R's parser, and it could be \n on unix and \r\n on Windows. It's true that that isn't great, but air provides a solution already!

Simply set LineEnding: Lf and run air and boom now you have consistent \n line endings across all platforms, even on Windows. You can go as far as enforcing this at CI time once we implement something like air format --check.

You can also combine that with a .gitattributes file with * text=auto eol=lf like we did in air to ensure that git checks out that files on Windows with an lf ending, avoiding any possible weirdness (i think its fine even if you dont, air just runs slightly slower)

This also doesn't seem to be something that R users have struggled with in the past.


Pros

  • Simple to handle on the implementation side
  • Simple to explain to users (they get exactly 1 type of line ending, not some mixed beast in some cases)
  • Exactly 1 type of line ending
  • LineEnding is the source of truth of that line ending type
  • Plays nicely with biome's formatter without having to change anything
  • Basically also what rust-analyzer (and actually rust core) does. We can use their exact normalization tooling, which is very cheap because removing \r only shrinks the string, so no new allocations are required!

Cons

  • It's technically possible to get in a scenario where you check out on Windows with CRLF endings, you write a multiline string in a file, you send it to R, and it has \r\n endings in the string, where some other string else generated on a mac had \n endings. But this sounds wildly improbable - especially since LineEnding: Lf is actually the default.

Interesting historical note. Our normalize() function comes from rust-analyzer, but the author also put it in Rust itself through this issue/PR rust-lang/rust#62865 rust-lang/rust#62948

@lionel-
Copy link
Collaborator

lionel- commented Dec 2, 2024

So in cli workflows you can end up with what we want, which is \r\n line endings everywhere except in the multiline strings, which have \n line endings

I was thinking that what we want is replacing line endings everywhere, including in multiline strings which are not special in that regard, and then restoring them back on the way out? Like the LSP does.

Turn handling of \r\n into an error in the parser, to avoid this ever happening again.

hmm I'm not sure about this. The parser can be used in other contexts than just the cli formatter and the LSP and it seems a bit restrictive to force callers to normalize. The platform sensitive line endings can be a footgun when trivia token matter but for many potential usages the trivia token are not depended upon. I don't have a strong opinion though, I could go with the strict approach.

Simply set LineEnding: Lf and run air and boom now you have consistent \n line endings across all platforms

Is this a biome setting? Can we add Preserve that would use the LSP approach?

@DavisVaughan
Copy link
Collaborator Author

DavisVaughan commented Dec 2, 2024

@lionel- and I learned that, in practice, a lot of my fears around R parser behavior are unfounded

Even on Windows, it looks like this:

"foo
bar"

does end up evaluating to "foo\nbar" pretty much everywhere, including all of the following scenarios:

  • In Positron, at the Console or using CMD+Enter from a file with CRLF endings
  • In Positron, in a readline() response
  • In RStudio, at the Console or using CMD+Enter from a file with CRLF endings
  • In Rterm
  • When reading in files with parse(file =), due to this replacer

The key point is that the implementer of the R_ReadConsole() hook is the one in charge of doing this normalization.

It is possible for \r\n to slip in, and the R parser does not remove it. For example, if you remove the usage of convert_line_endings() in handle_execute_request() AND you take out the ark code that splits the input into individual lines(), then you can get this on Windows:

> x <- "a
+ b"
> x
[1] "a\r\nb"

We are not worried about this case. In practice the worst thing that would probably happen in this nightmare scenario is...exactly what would happen today. We'd normalize \r\n to \n, format, and remit as \r\n, back to exactly where we started.


So in air that means that it should be totally fine for us to standardize to unix line endings on the way in, and convert to a single type of line endings on the way out.

We don't have to be strict in the parser, I was just hoping that turning this into an error there would keep us from accidentally forgetting to normalize and not realizing it until a Windows user reports an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants