
Fix R code cell parser and add # %% support (#3709) #3858

Merged: 6 commits merged into posit-dev:main on Jul 8, 2024

Conversation

kylebutts (Contributor)

Changes:

  • The R parser would end cells after the first blank line; this behavior is now removed
  • `# %%` is now recognized as a valid cell demarcation in R
  • cellDecorationSetting now matches Python (only the currently active cell is highlighted). Note that previously the whole program was highlighted gray, so the information was meaningless.
  • Tests added to cover these changes

Remaining enhancement:
Complete R support would recognize `#'` comments as markdown cells. This feature is fully supported by knitr::spin and by quarto render for a script.

This would require more substantial changes (which I would be happy to implement). For example:

# %% 
#| output: false
1 + 1 

#' Markdown comment

2 + 2

This example has three cells: a code cell, a markdown cell, and another code cell. But detecting them requires seeing the markdown comment, ending the first code cell, and starting the second.

knitr::spin and quarto render treat anything after a markdown comment as a new cell, so this behavior should probably be matched. In this example, `output: false` does not apply to `2 + 2`, but it might seem like it does, because running the code currently runs both `1 + 1` and `2 + 2`.

The main reason this would be a substantial change is that the R parser would have to be specialized, and the parser would no longer be "generic" as seems to be currently planned for it.


github-actions bot commented Jul 3, 2024

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@kylebutts (Contributor, Author)

I have read the CLA Document and I hereby sign the CLA

@seeM (Contributor) left a comment

This is great, thanks for the PR! Tests pass locally and it works well in a dev build 🎉 :

Screen.Recording.2024-07-05.at.09.58.30.mov

Just one question about the regexp before I approve.

BTW I couldn't find a spec on the R #' syntax and it seems I misunderstood it. Do you know if a spec exists?

Also, could you please share more about:

The main reason this would be a substantial change is that the R parser would have to be specialized and the parser would no longer be "generic" as it seems like is currently planned for it.

I arrived at the current interface as the simplest possible thing to support what was needed by both Python and R – although you've pointed out that the R implementation was wrong on my end.

We can totally change the CellParser interface as needed to support these requirements. We can even change which component is implemented by each language (doesn't have to be the parser). The main goal here is that, in future, any language can very easily add "code cell" support. Do you have any suggestions?

@kylebutts (Contributor, Author)

Sorry this is so long. I did a bit of a deep dive and tried to be complete :-)

From what I can tell (looking at R, Python, Julia, and Stata), there are two designs for code cells:

  1. Jupytext style: # %% and # %% [markdown]. This is what Python uses in VSCode and what quarto uses for Julia and Python when rendering scripts.

    • This naturally converts to Jupyter notebooks
    • The parser can be really simple: you only need to identify the starting lines and split up the whole file (see the sketch after the example below). As you'll see, the alternative format creates many annoyances
  2. Weave.jl and knitr::spin syntax.

    • Markdown is created using #' syntax
    • Cells are all non-#' lines. They are sometimes split with # %%, but need not be. E.g.:
knitr::spin(knit = FALSE, text = "
1 + 1 

#' # Header
#' 
#' Paragraph

2 + 2
") |> cat(sep = "\n")
#> 
#> ```{r}
#> 1 + 1 
#> ```
#> 
#> # Header
#> 
#> Paragraph
#> 
#> ```{r}
#> 2 + 2
#> ```
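For reference, here is a minimal sketch of the type-1 (Jupytext-style) splitting mentioned in point 1. The names and the exact regex are illustrative assumptions, not the extension's actual implementation:

```typescript
// Minimal sketch of Jupytext-style cell splitting (illustrative only).
interface Cell {
    type: 'code' | 'markdown';
    startLine: number; // inclusive
    endLine: number;   // inclusive
}

// A cell starts at any line beginning with `# %%`; `# %% [markdown]` marks a markdown cell.
const CELL_START = /^#\s*%%(\s*\[markdown\])?/;

function splitJupytextCells(lines: string[]): Cell[] {
    const cells: Cell[] = [];
    for (let i = 0; i < lines.length; i++) {
        const match = lines[i].match(CELL_START);
        if (!match) {
            continue;
        }
        // A new marker closes the previous cell on the preceding line.
        if (cells.length > 0) {
            cells[cells.length - 1].endLine = i - 1;
        }
        cells.push({
            type: match[1] ? 'markdown' : 'code',
            startLine: i,
            endLine: lines.length - 1,
        });
    }
    // Note: lines before the first marker are not assigned to any cell in this sketch.
    return cells;
}
```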

There are two bugs that I think can happen with a regex-based approach, which is why actually parsing the language could be necessary.

First, you are within a string, and it just so happens that you have a # %% at the start of a line:

# %% 
"""
String
# %% 
String continues
"""
  • This is easy to detect, but you need to keep track of when a line is within a string (the R parser, for example, can tell you whether something is a string or a comment). A sketch of this state tracking follows below.
  • In the short term, fixing this is probably not needed because it is quite an edge case.
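A minimal sketch of what that state tracking could look like, assuming Python-style triple-quoted strings as in the example above (a real R implementation would need R's own string rules, or the R parser itself):

```typescript
// Only treat `# %%` as a cell start when we are not inside a multi-line string.
function findCellStartLines(lines: string[]): number[] {
    const starts: number[] = [];
    let insideString = false;
    for (let i = 0; i < lines.length; i++) {
        if (!insideString && /^#\s*%%/.test(lines[i])) {
            starts.push(i);
        }
        // An odd number of `"""` delimiters on a line toggles the string state.
        const delimiters = (lines[i].match(/"""/g) ?? []).length;
        if (delimiters % 2 === 1) {
            insideString = !insideString;
        }
    }
    return starts;
}
```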

Second, R has roxygen2-style documentation, which looks like:

# %% 
#' The length of a string
#'
#' Technically this returns the number of "code points", in a string. One
#' code point usually corresponds to one character, but not always. For example,
#' a u with an umlaut might be represented as a single character or as the
#' combination of a u and an umlaut.
#'
#' @inheritParams str_detect
#' @return A numeric vector giving number of characters (code points) in each
#'    element of the character vector. Missing strings have missing length.
#' @seealso [stringi::stri_length()] which this function wraps.
#' @export
#' @examples
#' str_length(letters)
#' str_length(NA)
#' str_length(factor("abc"))
#' str_length(c("i", "like", "programming", NA))
str_length <- function(string) {
}

# %% 
code ...
  • This obviously creates a tension: roxygen2 comments would get treated as markdown, but they are code documentation. knitr::spin, for example, treats them as markdown, and any cell options you provide get lost (see "[spin] Better support for roxygen2 style comments", yihui/knitr#2317).
  • I think a simple heuristic would be: if you see #' @ in the comments immediately preceding an expression, treat the block as a doc comment.
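A sketch of that heuristic (the function name is made up for illustration):

```typescript
// Treat a block of consecutive `#'` lines as roxygen2 documentation (not a
// markdown cell) if any line in the block contains a roxygen tag like `@export`.
function isRoxygenBlock(commentLines: string[]): boolean {
    return commentLines.some((line) => /^#'\s*@\w+/.test(line));
}
```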

I believe a parser that could do both syntaxes would go like this:
parse_lines:

  • if the line has # %% or #+, add it to the list "cell start lines"
  • if the line has # %% [markdown], add it to the list "markdown start lines"
  • if the line has #', add it to the list "markdown comment lines"
  • if the line has #' @, add it to the list "roxygen2 comment lines"

create_cells:

  • iterating over the lines, create cells based on the parsed information (with type 1 vs. type 2 detected by the file name and the presence of # %% [markdown]?)

Overall, I think I could code up a draft parser that works for both cases (a rough sketch of the line-classification pass is below), but I'm not sure how to make this logic generic. Alternatively, we could ask each LSP to parse a document and return code-chunk and markdown-chunk ranges; the extension side would then request cell information and add code lenses. Caching this information is probably useful (and is currently not implemented in the extension).
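A rough sketch of the parse_lines classification described above (the names and patterns are assumptions, not a finished design; create_cells would then walk these classifications and emit code and markdown cells):

```typescript
type LineKind =
    | 'cellStart'        // `# %%` or `#+`
    | 'markdownStart'    // `# %% [markdown]`
    | 'markdownComment'  // `#' ...`
    | 'roxygenComment'   // `#' @tag ...`
    | 'other';

function classifyLine(line: string): LineKind {
    if (/^#\s*%%\s*\[markdown\]/.test(line)) { return 'markdownStart'; }
    if (/^#\s*%%/.test(line) || /^#\+/.test(line)) { return 'cellStart'; }
    if (/^#'\s*@/.test(line)) { return 'roxygenComment'; }
    if (/^#'/.test(line)) { return 'markdownComment'; }
    return 'other';
}
```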


As an aside, nbconvert run on a .ipynb creates a type-1 script with cells starting with # In[ ]:, with either a space or a number inside the brackets (# In\[( |[0-9])*\]:). VSCode's extension recognizes these. There is no concept of a markdown cell in this format, so that information is lost.

@kylebutts (Contributor, Author)

@seeM, is it possible for the parser to observe the tree-sitter parse data? I have written an R parser (it could probably use a bit of clean-up), and I can adapt it to the tree-sitter data structure:
https://gist.github.com/kylebutts/81aabc5e61735593c0d6f2f4a93c3594

@seeM (Contributor) commented Jul 8, 2024

@kylebutts Thank you for your contributions and for the detailed write-up, amazing! 😄

Note that CI seems to be failing on the latest run.

Summary

Just to double-check my understanding, is this an accurate summary of where we are right now:

  1. Before this PR: we assumed that R code cells start with a #+ and end with a blank line. That was very wrong – my bad!
  2. With this PR: R code cells start with either #+ or # %%, and end when a new code cell starts (an illustrative pattern is sketched after this list).
  3. We don't yet support R markdown cells, but: R markdown cells start with #' – although you're suggesting that roxygen comments should be an exception, and we can use the heuristic of finding a #' @ in the same block to identify them.
  4. We also don't support the Python # In [ ]: code cell format yet.
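To make point 2 concrete, the cell-start test could look something like the following. This is an assumed illustration, not necessarily the exact pattern in extensions/positron-code-cells/src/parser.ts:

```typescript
// Matches lines starting with `#+` (knitr::spin style) or `# %%` (Jupytext style).
const R_CELL_START = /^#(\+|\s*%%)/;

R_CELL_START.test('#+ setup');    // true
R_CELL_START.test('# %% plot');   // true
R_CELL_START.test('# a comment'); // false
```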

Moving forward

Here's my suggested path forward:

  1. Once CI is passing, merge this PR since it fixes the obviously broken R code cell parser.

  2. Make a follow-up PR that moves language-specific logic to the appropriate language pack extensions, without changing any of the actual behavior. This will also make the LSP, and thus tree-sitter, more accessible.

    I think the way to do this is:

    1. Have the positron-code-cells activate function return an object with a registerCodeCellProvider method. We can decide on the specific API of the CodeCellProvider, but I'm now leaning toward something very generic that takes a document and returns "code cells" with their ranges and types (see the interface sketch after this list).

      We can define the extension API type in a .d.ts file and add that to the tsconfig.json of any other extensions that use it.

    2. Have the Python and R language packs define and register their own cell parsers via vscode.extensions.getExtension('vscode.positron-code-cells') and registerCodeCellProvider.

  3. Consider moving to a tree sitter approach for R. Tbh I'm not sure whether the string edge-case is worth moving to tree sitter, and IIUC all of the other issues can be addressed with a regex approach.

    In any case, we could do this with a pattern we've used before:

    1. Add a custom method to our LSP. This will be written in Rust. We may even have helper code to simplify your implementation. Example: https://github.com/posit-dev/ark/blob/2fc537f567934451d11d52d946f171d7ded0805e/crates/ark/src/lsp/statement_range.rs.
    2. Register an LSP extension in the positron-r extension. This is where we'd call registerCellParser:
      private registerPositronLspExtensions(client: LanguageClient) {
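To make step 2 concrete, here is a sketch of what the extension API's .d.ts could look like. The shape is an assumption based on the description above, not a settled design:

```typescript
// positron-code-cells.d.ts — sketch only; the real API shape is still to be decided.
import * as vscode from 'vscode';

export interface CodeCell {
    range: vscode.Range;
    type: 'code' | 'markdown';
}

export interface CodeCellProvider {
    provideCodeCells(document: vscode.TextDocument): CodeCell[] | undefined;
}

// What the positron-code-cells activate() function would return.
export interface PositronCodeCellsApi {
    registerCodeCellProvider(languageId: string, provider: CodeCellProvider): vscode.Disposable;
}
```

A language pack would then do something like `vscode.extensions.getExtension('vscode.positron-code-cells')?.exports.registerCodeCellProvider('r', provider)` during its own activation (again, names are assumed for illustration).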

@kylebutts (Contributor, Author) commented Jul 8, 2024

Addresses #3709. Good to merge!

Adding to Moving Forward, I also think the extension could do a better job of caching the code cell computation. As it stands, the code is parsed multiple times every time the cursor is moved (add trace('Parsing cells!'); to parseCells and then move your cursor around). I think it should not be too hard to address: have a top-level CellManager for each file that parses cells on editor updates, and then decorations, code lenses, etc. all just ask the CellManager for the cell information.
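A minimal sketch of that idea, assuming a per-document manager keyed off the document version (the names are made up for illustration):

```typescript
import * as vscode from 'vscode';

// Parse once per document version; decorations, code lenses, etc. read from the cache.
class CellManager {
    private cachedVersion = -1;
    private cachedCells: vscode.Range[] = [];

    constructor(private readonly parseCells: (doc: vscode.TextDocument) => vscode.Range[]) { }

    getCells(document: vscode.TextDocument): vscode.Range[] {
        if (document.version !== this.cachedVersion) {
            this.cachedCells = this.parseCells(document);
            this.cachedVersion = document.version;
        }
        return this.cachedCells;
    }
}
```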

@seeM (Contributor) left a comment

🎉

@seeM merged commit 6363fe6 into posit-dev:main on Jul 8, 2024 (2 checks passed)
@github-actions bot locked and limited the conversation to collaborators on Jul 8, 2024
@seeM (Contributor) commented Jul 8, 2024

Merged, thanks again @kylebutts! 😄

I agree re caching!
