
Fix R code cell parser and add # %% support (#3709) #3858

Merged: 6 commits merged into posit-dev:main on Jul 8, 2024

Conversation

kylebutts (Contributor)

Changes:

  • The R parser would end cells after the first blank line; this behavior is now removed
  • `# %%` is now recognized as a valid cell demarcation in R
  • cellDecorationSetting now matches Python (only the currently active cell is highlighted). Note that previously the whole program was highlighted gray, so the information was meaningless.
  • Tests added to cover these changes

Remaining enhancement:
Complete R support would recognize `#'` comments as markdown cells. This feature is fully supported by knitr::spin and by quarto render for a script.

This would require more substantial changes (which I would be happy to implement). For example:

# %% 
#| output: false
1 + 1 

#' Markdown comment

2 + 2

This example has three cells: a code cell, a markdown cell, and another code cell. But detecting them requires seeing the markdown comment, ending the first code cell, and starting the second.

knitr::spin and quarto render treat anything after a markdown comment as a new cell, so this behavior should probably be matched. In this example, `output: false` does not apply to `2 + 2`, but it might seem like it does, because running the code currently runs both `1 + 1` and `2 + 2`.

The main reason this would be a substantial change is that the R parser would have to be specialized, and the parser would no longer be "generic" as seems to be currently planned for it.


github-actions bot commented Jul 3, 2024

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@kylebutts (Contributor, Author)

I have read the CLA Document and I hereby sign the CLA

@seeM (Contributor) left a comment

This is great, thanks for the PR! Tests pass locally and it works well in a dev build 🎉 :

Screen.Recording.2024-07-05.at.09.58.30.mov

Just one question about the regexp before I approve.

BTW I couldn't find a spec on the R #' syntax and it seems I misunderstood it. Do you know if a spec exists?

Also, could you please share more about:

The main reason this would be a substantial change is that the R parser would have to be specialized and the parser would no longer be "generic" as it seems like is currently planned for it.

I arrived at the current interface as the simplest possible thing to support what was needed by both Python and R – although you've pointed out that the R implementation was wrong on my end.

We can totally change the CellParser interface as needed to support these requirements. We can even change which component is implemented by each language (doesn't have to be the parser). The main goal here is that, in future, any language can very easily add "code cell" support. Do you have any suggestions?

@kylebutts (Contributor, Author)

Sorry this is so long. I did a bit of a deep dive and tried to be complete :-)

From what I can tell (looking at R, Python, Julia, and Stata), there are two designs for code cells:

  1. Jupytext style: # %% and # %% [markdown]. This is what Python uses in VSCode and what quarto uses for Julia and Python when rendering scripts.

    • This naturally converts to Jupyter notebooks
    • The parser can be really simple: you only need to identify the starting lines and split up the whole file (see the sketch after the example below). As you'll see, the alternative format creates many annoyances
  2. Weave.jl and knitr::spin syntax.

    • Markdown is created using #' syntax
    • Cells are all non-#' lines. They are sometimes split with # %%, but need not be. E.g.:
knitr::spin(knit = FALSE, text = "
1 + 1 

#' # Header
#' 
#' Paragraph

2 + 2
") |> cat(sep = "\n")
#> 
#> ```{r}
#> 1 + 1 
#> ```
#> 
#> # Header
#> 
#> Paragraph
#> 
#> ```{r}
#> 2 + 2
#> ```
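For reference, here is a minimal sketch of the type-1 (Jupytext-style) splitting mentioned in point 1. The names and the exact regex are illustrative assumptions, not the extension's actual implementation:

```typescript
// Minimal sketch of Jupytext-style cell splitting (illustrative only).
interface Cell {
    type: 'code' | 'markdown';
    startLine: number; // inclusive
    endLine: number;   // inclusive
}

// A cell starts at any line beginning with `# %%`; `# %% [markdown]` marks a markdown cell.
const CELL_START = /^#\s*%%(\s*\[markdown\])?/;

function splitJupytextCells(lines: string[]): Cell[] {
    const cells: Cell[] = [];
    for (let i = 0; i < lines.length; i++) {
        const match = lines[i].match(CELL_START);
        if (!match) {
            continue;
        }
        // A new marker closes the previous cell on the preceding line.
        if (cells.length > 0) {
            cells[cells.length - 1].endLine = i - 1;
        }
        cells.push({
            type: match[1] ? 'markdown' : 'code',
            startLine: i,
            endLine: lines.length - 1,
        });
    }
    // Note: lines before the first marker are not assigned to any cell in this sketch.
    return cells;
}
```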

There are two bugs that I think can happen with a regex-based approach, which is why actually parsing the language could be necessary.

First, you are within a string, and it just so happens that you have a # %% at the start of a line:

# %% 
"""
String
# %% 
String continues
"""
  • This is easy to detect, but you need to keep track of when a line is within a string (the R parser, for example, can tell you whether something is a string or a comment). A sketch of this state tracking follows below.
  • In the short term, fixing this is probably not needed because it is quite an edge case.
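A minimal sketch of what that state tracking could look like, assuming Python-style triple-quoted strings as in the example above (a real R implementation would need R's own string rules, or the R parser itself):

```typescript
// Only treat `# %%` as a cell start when we are not inside a multi-line string.
function findCellStartLines(lines: string[]): number[] {
    const starts: number[] = [];
    let insideString = false;
    for (let i = 0; i < lines.length; i++) {
        if (!insideString && /^#\s*%%/.test(lines[i])) {
            starts.push(i);
        }
        // An odd number of `"""` delimiters on a line toggles the string state.
        const delimiters = (lines[i].match(/"""/g) ?? []).length;
        if (delimiters % 2 === 1) {
            insideString = !insideString;
        }
    }
    return starts;
}
```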

Second, R has roxygen2-style documentation, which looks like:

# %% 
#' The length of a string
#'
#' Technically this returns the number of "code points", in a string. One
#' code point usually corresponds to one character, but not always. For example,
#' a u with an umlaut might be represented as a single character or as the
#' combination of a u and an umlaut.
#'
#' @inheritParams str_detect
#' @return A numeric vector giving number of characters (code points) in each
#'    element of the character vector. Missing strings have missing length.
#' @seealso [stringi::stri_length()] which this function wraps.
#' @export
#' @examples
#' str_length(letters)
#' str_length(NA)
#' str_length(factor("abc"))
#' str_length(c("i", "like", "programming", NA))
str_length <- function(string) {
}

# %% 
code ...
  • This obviously creates a tension: roxygen2 comments would get treated as markdown, but they are code documentation. knitr::spin, for example, treats them as markdown, and any cell options you provide get lost (see "[spin] Better support for roxygen2 style comments", yihui/knitr#2317).
  • I think a simple heuristic would be: if you see #' @ in the comments immediately preceding an expression, treat the block as a doc comment.
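A sketch of that heuristic (the function name is made up for illustration):

```typescript
// Treat a block of consecutive `#'` lines as roxygen2 documentation (not a
// markdown cell) if any line in the block contains a roxygen tag like `@export`.
function isRoxygenBlock(commentLines: string[]): boolean {
    return commentLines.some((line) => /^#'\s*@\w+/.test(line));
}
```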

I believe a parser that could do both syntaxes would go like this:
parse_lines:

  • if the line has # %% or #+, add it to the list "cell start lines"
  • if the line has # %% [markdown], add it to the list "markdown start lines"
  • if the line has #', add it to the list "markdown comment lines"
  • if the line has #' @, add it to the list "roxygen2 comment lines"

create_cells:

  • iterating over the lines, create cells based on the parsed information (with type 1 vs. type 2 detected by the file name and the presence of # %% [markdown]?)

Overall, I think I could code up a draft parser that works for both cases (a rough sketch of the line-classification pass is below), but I'm not sure how to make this logic generic. Alternatively, we could ask each LSP to parse a document and return code-chunk and markdown-chunk ranges; the extension side would then request cell information and add code lenses. Caching this information is probably useful (and is currently not implemented in the extension).
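A rough sketch of the parse_lines classification described above (the names and patterns are assumptions, not a finished design; create_cells would then walk these classifications and emit code and markdown cells):

```typescript
type LineKind =
    | 'cellStart'        // `# %%` or `#+`
    | 'markdownStart'    // `# %% [markdown]`
    | 'markdownComment'  // `#' ...`
    | 'roxygenComment'   // `#' @tag ...`
    | 'other';

function classifyLine(line: string): LineKind {
    if (/^#\s*%%\s*\[markdown\]/.test(line)) { return 'markdownStart'; }
    if (/^#\s*%%/.test(line) || /^#\+/.test(line)) { return 'cellStart'; }
    if (/^#'\s*@/.test(line)) { return 'roxygenComment'; }
    if (/^#'/.test(line)) { return 'markdownComment'; }
    return 'other';
}
```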


As an aside, nbconvert run on a .ipynb creates a type-1 script with cells starting with # In[ ]:, with either a space or a number inside the brackets (# In\[( |[0-9])*\]:). VSCode's extension recognizes these. There is no concept of a markdown cell in this format, so that information is lost.

@kylebutts (Contributor, Author)

@seeM, is it possible for the parser to observe the tree-sitter parse data? I have written an R parser (it could probably use a bit of clean-up), and I can adapt it to the tree-sitter data structure:
https://gist.github.com/kylebutts/81aabc5e61735593c0d6f2f4a93c3594

@seeM (Contributor) commented Jul 8, 2024

@kylebutts Thank you for your contributions and for the detailed write-up, amazing! 😄

Note that CI seems to be failing on the latest run.

Summary

Just to double-check my understanding, is this an accurate summary of where we are right now:

  1. Before this PR: we assumed that R code cells start with a #+ and end with a blank line. That was very wrong – my bad!
  2. With this PR: R code cells start with either #+ or # %%, and end when a new code cell starts (an illustrative pattern is sketched after this list).
  3. We don't yet support R markdown cells, but: R markdown cells start with #' – although you're suggesting that roxygen comments should be an exception, and we can use the heuristic of finding a #' @ in the same block to identify them.
  4. We also don't support the Python # In [ ]: code cell format yet.
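To make point 2 concrete, the cell-start test could look something like the following. This is an assumed illustration, not necessarily the exact pattern in extensions/positron-code-cells/src/parser.ts:

```typescript
// Matches lines starting with `#+` (knitr::spin style) or `# %%` (Jupytext style).
const R_CELL_START = /^#(\+|\s*%%)/;

R_CELL_START.test('#+ setup');    // true
R_CELL_START.test('# %% plot');   // true
R_CELL_START.test('# a comment'); // false
```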

Moving forward

Here's my suggested path forward:

  1. Once CI is passing, merge this PR since it fixes the obviously broken R code cell parser.

  2. Make a follow-up PR that moves language-specific logic to the appropriate language pack extensions, without changing any of the actual behavior. This will also make the LSP, and thus tree-sitter, more accessible.

    I think the way to do this is:

    1. Have the positron-code-cells activate function return an object with a registerCodeCellProvider method. We can decide on the specific API of the CodeCellProvider, but I'm now leaning toward something very generic that takes a document and returns "code cells" with their ranges and types (see the interface sketch after this list).

      We can define the extension API type in a .d.ts file and add that to the tsconfig.json of any other extensions that use it.

    2. Have the Python and R language packs define and register their own cell parsers via vscode.extensions.getExtension('vscode.positron-code-cells') and registerCodeCellProvider.

  3. Consider moving to a tree sitter approach for R. Tbh I'm not sure whether the string edge-case is worth moving to tree sitter, and IIUC all of the other issues can be addressed with a regex approach.

    In any case, we could do this with a pattern we've used before:

    1. Add a custom method to our LSP. This will be written in Rust. We may even have helper code to simplify your implementation. Example: https://github.com/posit-dev/ark/blob/2fc537f567934451d11d52d946f171d7ded0805e/crates/ark/src/lsp/statement_range.rs.
    2. Register an LSP extension in the positron-r extension. This is where we'd call registerCellParser:
      private registerPositronLspExtensions(client: LanguageClient) {
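To make step 2 concrete, here is a sketch of what the extension API's .d.ts could look like. The shape is an assumption based on the description above, not a settled design:

```typescript
// positron-code-cells.d.ts — sketch only; the real API shape is still to be decided.
import * as vscode from 'vscode';

export interface CodeCell {
    range: vscode.Range;
    type: 'code' | 'markdown';
}

export interface CodeCellProvider {
    provideCodeCells(document: vscode.TextDocument): CodeCell[] | undefined;
}

// What the positron-code-cells activate() function would return.
export interface PositronCodeCellsApi {
    registerCodeCellProvider(languageId: string, provider: CodeCellProvider): vscode.Disposable;
}
```

A language pack would then do something like `vscode.extensions.getExtension('vscode.positron-code-cells')?.exports.registerCodeCellProvider('r', provider)` during its own activation (again, names are assumed for illustration).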

@kylebutts (Contributor, Author) commented Jul 8, 2024

Addresses #3709. Good to merge!

Adding to Moving Forward, I also think the extension could do a better job of caching the code cell computation. As it stands, the code is parsed multiple times every time the cursor is moved (add trace('Parsing cells!'); to parseCells and then move your cursor around). I think it should not be too hard to address: have a top-level CellManager for each file that parses cells on editor updates, and then decorations, code lenses, etc. all just ask the CellManager for the cell information.
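A minimal sketch of that idea, assuming a per-document manager keyed off the document version (the names are made up for illustration):

```typescript
import * as vscode from 'vscode';

// Parse once per document version; decorations, code lenses, etc. read from the cache.
class CellManager {
    private cachedVersion = -1;
    private cachedCells: vscode.Range[] = [];

    constructor(private readonly parseCells: (doc: vscode.TextDocument) => vscode.Range[]) { }

    getCells(document: vscode.TextDocument): vscode.Range[] {
        if (document.version !== this.cachedVersion) {
            this.cachedCells = this.parseCells(document);
            this.cachedVersion = document.version;
        }
        return this.cachedCells;
    }
}
```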

@seeM (Contributor) left a comment

🎉

@seeM merged commit 6363fe6 into posit-dev:main on Jul 8, 2024 (2 checks passed)
@github-actions bot locked and limited the conversation to collaborators on Jul 8, 2024
@seeM (Contributor) commented Jul 8, 2024

Merged, thanks again @kylebutts! 😄

I agree re caching!
