feat: add support for HTML, Markdown and Typst files #127

Rolv-Apneseth · 2024-12-22T15:30:06Z

Continuing #117, as per our discussion.

This PR adds parsers for HTML, Markdown and Typst files, which are used in the CLI to support different file types. Auto detection per file is used by default, which will just choose a parser based on the file's extension.

I think the parsers are working OK, but wanted to open this PR as soon as possible so you can let me know what you think, and to see if you have some files that maybe this doesn't work great for.

I decided to take the same path that pyLanguagetool took for Markdown files, by first converting them to HTML, and then using a separate library for parsing that (html_parser in our case). I figured we'd probably want support for HTML eventually anyway, but you can let me know if you think I should take a similar approach to the HTML and Typst parsers.

As for the HTML and Typst parsers, I've inserted some "placeholders" in place of e.g. code blocks, links etc., which LanguageTool ignores, e.g. _code_, _math_, _link_. This seems to work well enough but if you have any other ideas on how to tackle these let me know and I can have a look.
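To make the placeholder idea concrete, here is a minimal sketch, not the parser from this PR. It assumes a toy scanner over Typst-like source in which backticks delimit raw text and dollar signs delimit math; the function name `insert_placeholders` is hypothetical, and a real implementation would presumably walk a proper syntax tree instead.

```rust
/// Replace backtick-delimited raw spans with `_code_` and dollar-delimited
/// math with `_math_`, leaving all other characters untouched.
/// Hypothetical helper for illustration only.
pub fn insert_placeholders(source: &str) -> String {
    let mut out = String::with_capacity(source.len());
    let mut chars = source.chars();

    while let Some(c) = chars.next() {
        let placeholder = match c {
            '`' => "_code_",
            '$' => "_math_",
            _ => {
                out.push(c);
                continue;
            }
        };
        // Skip everything up to and including the matching closing delimiter.
        // An unterminated span swallows the rest of the input.
        for inner in chars.by_ref() {
            if inner == c {
                break;
            }
        }
        out.push_str(placeholder);
    }
    out
}
```

The obvious drawback of this approach is that the checked text no longer lines up with the original file, so any error offsets refer to the rewritten string rather than the source.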

codspeed-hq · 2024-12-22T15:39:57Z

CodSpeed Performance Report

Merging #127 will not alter performance

Comparing Rolv-Apneseth:detect_filetype (9f7b66e) with v3 (1665a9d)

Summary

✅ 6 untouched benchmarks

Rolv-Apneseth · 2024-12-22T15:57:09Z

Any idea what's going on with the tests? cargo nextest run --all-features --no-capture doesn't produce any failures for me locally

jeertmans

Hi @Rolv-Apneseth, thanks for your PR!

See my comments and let me know what you think :-)

jeertmans · 2024-12-24T10:55:46Z

src/cli/check.rs

+                            "md" | "mkd" | "mdwn" | "mdown" | "mdtxt" | "mdtext" | "markdown" => {
+                                FileType::Markdown
+                            },
+
+                            "html" | "htm" => FileType::Html,


Is there any documentation or link that lists all these different extensions for the same file type?

Rolv-Apneseth · 2024-12-27

I believe I just copied this list from https://superuser.com/a/285878. Maybe I should also add mdx as mentioned in a comment, or do you want to cut it down to just md and markdown?

jeertmans · 2024-12-28

I think it's better to either stick to simplicity, or to match the default behavior of some popular tool, like ripgrep's default list.

Rolv-Apneseth · 2024-12-28

Alright, well I'll copy that list then.
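For readers following along, extension-based detection of this kind can be sketched as below. This is an illustration under assumptions, not the code that was merged: `FileType::Markdown` and `FileType::Html` appear in the diff above, while the `Raw` and `Typst` variant names, the `detect_file_type` function, and the case-insensitive matching are hypothetical. The Markdown extensions are the ones from the diff, so the final list copied from ripgrep may differ.

```rust
use std::path::Path;

/// File types a parser can be chosen for (variant names partly assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Raw,
    Markdown,
    Html,
    Typst,
}

/// Pick a file type from the file extension, falling back to raw text.
pub fn detect_file_type(path: &Path) -> FileType {
    // Lowercase the extension so that `README.MD` is treated like `README.md`.
    // (Whether the real CLI does this is an assumption.)
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(str::to_ascii_lowercase);

    match ext.as_deref() {
        Some("md" | "mkd" | "mdwn" | "mdown" | "mdtxt" | "mdtext" | "markdown") => {
            FileType::Markdown
        }
        Some("html" | "htm") => FileType::Html,
        Some("typ") => FileType::Typst,
        // Unknown or missing extensions are checked as plain text.
        _ => FileType::Raw,
    }
}
```

Keeping the fallback as raw text means that files with unknown extensions are still checked, just without any markup being skipped.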

jeertmans · 2024-12-24T10:58:48Z

src/parsers.rs

Wouldn't it be better to return a vector of data annotations here? That way (1) we avoid memory allocation and (2) we can benefit from LT's support for annotated data. The latter also has the advantage that LT is going to compute the correct location of an error in the text, and we can then print annotated errors with respect to the raw content of the file.

Rolv-Apneseth · 2024-12-27

I'm not too familiar with how the data annotation side of things works, and I had issues trying to use it with the current setup that splits the requests, but I can explore it a bit more if you want.

Some notes / questions though:

1. Should all requests sent by the CLI for files be data annotations, including text, or does that need to be handled separately from the other file types?
2. The requests will be a lot larger than the current plain text ones, right? So only much smaller files would be supported.
3. I will also need to change the current approach for Markdown files, as that wouldn't make sense any more.

jeertmans · 2024-12-28

1. I think all special types should be checked using data annotation. That excludes raw text files.
2. That can be an issue with very large files, but let's not bother too much with that. It can be solved by automatically splitting the request into many (which can itself be a problem if we send too many requests to the public API, but people should host their own server).
3. Probably, yes. Let me know if I can help!
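As a rough illustration of what "data annotations" means here: LanguageTool's HTTP API accepts a `data` parameter containing JSON of the form `{"annotation": [...]}`, where each entry is either `text` (checked) or `markup` (skipped, optionally with `interpretAs`). The sketch below builds that JSON by hand so that it has no dependencies. The `Annotation` enum, `json_escape`, and `to_data_json` are hypothetical names for this example and are not the types used by this crate.

```rust
/// One piece of a LanguageTool `data` request (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Annotation {
    /// Text that should be checked.
    Text(String),
    /// Markup that should be skipped, optionally interpreted as other text
    /// (for example a `<br>` interpreted as a newline).
    Markup { markup: String, interpret_as: Option<String> },
}

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize annotations into the JSON expected by the `data` parameter.
pub fn to_data_json(annotations: &[Annotation]) -> String {
    let items: Vec<String> = annotations
        .iter()
        .map(|a| match a {
            Annotation::Text(t) => format!("{{\"text\":\"{}\"}}", json_escape(t)),
            Annotation::Markup { markup, interpret_as: None } => {
                format!("{{\"markup\":\"{}\"}}", json_escape(markup))
            }
            Annotation::Markup { markup, interpret_as: Some(i) } => format!(
                "{{\"markup\":\"{}\",\"interpretAs\":\"{}\"}}",
                json_escape(markup),
                json_escape(i)
            ),
        })
        .collect();
    format!("{{\"annotation\":[{}]}}", items.join(","))
}
```

Because the markup entries stay in the request, reported offsets refer to the original file content, which is the advantage described in the review comment above. A code span would simply become a `Markup` entry with `interpret_as: None` instead of a placeholder string.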


jeertmans · 2024-12-24T11:08:58Z

src/parsers.rs

+                "code" => {
+                    txt.push_str("_code_");
+                    return;
+                },
+                "a" => {
+                    txt.push_str("_link_");
+                    return;
+                },
+                "pre" => {
+                    txt.push_str("_pre_");
+                    txt.push_str("\n\n");


This is related to my general comment on this file: here you could simply return a Markup annotation, with no interpret_as, and LT should ignore it, I think.

Rolv-Apneseth · 2024-12-24T11:15:35Z

> Hi @Rolv-Apneseth, thanks for your PR!
>
> See my comments and let me know what you think :-)

Hey @jeertmans, I'll get back to this in a couple days, thanks for the review though

jeertmans · 2024-12-24T11:37:49Z

> > Hi @Rolv-Apneseth, thanks for your PR!
> > See my comments and let me know what you think :-)
>
> Hey @jeertmans, I'll get back to this in a couple days, thanks for the review though

No issue, Merry Christmas and happy end of year!

Rolv-Apneseth · 2024-12-24T12:52:40Z

> No issue, Merry Christmas and happy end of year!

Thanks, you too :)

jeertmans · 2024-12-28T15:43:57Z

> Any idea what's going on with the tests? cargo nextest run --all-features --no-capture doesn't produce any failures for me locally

Looks strange! Sometimes, the online version of the LT API returns different results than the one you can host locally. Did you run tests on a self-hosted API?

Rolv-Apneseth · 2024-12-28T16:27:46Z

> Looks strange! Sometimes, the online version of the LT API returns different results than the one you can host locally. Did you run tests on a self-hosted API?

Yes, it was with a self-hosted API. Looks like these errors are on the current v3 branch too though (#117), so I don't think these changes are related.

Rolv-Apneseth · 2025-01-05T21:41:34Z

Hey @jeertmans, sorry been busy the last couple days. I had a go there at changing the typst parser to use data annotations, could you have a look and let me know if that's what you had in mind? It's not perfect of course but seems to work well enough. I only have one typst file to work with though (had to look into what that file type even is for this PR)

Rolv-Apneseth · 2025-01-05T21:58:44Z

Oh sorry - forgot to mention I haven't done anything with splitting the data annotation requests, so you'll need to test against small files

jeertmans · 2025-01-17T17:12:20Z

> Oh sorry - forgot to mention I haven't done anything with splitting the data annotation requests, so you'll need to test against small files

Hi @Rolv-Apneseth, sorry for the delay. I looked at your PR and it looks nice!

Just a small question: is there any motivation to keep String for HTML and Markdown, and only use Data for Typst?

Rolv-Apneseth · 2025-01-17T17:25:29Z

Hey @jeertmans, it was related to the comment I made above:

> I had a go there at changing the typst parser to use data annotations, could you have a look and let me know if that's what you had in mind? It's not perfect of course but seems to work well enough. I only have one typst file to work with though (had to look into what that file type even is for this PR)

If you think the approach is alright, I can have a go at doing the same thing for Markdown and HTML.

And as I mentioned, I haven't implemented any kind of splitting for the data annotation request so it will just fail for a lot of files due to the length.
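The missing splitting step could look roughly like the following greedy batching sketch. It is an assumption-laden illustration rather than anything in this PR: the function name `split_into_batches` and the `max_chars` parameter are hypothetical, pieces are plain string slices instead of real annotation types, and the actual request limit of a given LanguageTool server is not encoded here.

```rust
/// Greedily group pieces into batches whose total length (in characters)
/// stays within `max_chars`. A single piece longer than the limit still
/// gets a batch of its own. Hypothetical helper for illustration only.
pub fn split_into_batches<'a>(pieces: &[&'a str], max_chars: usize) -> Vec<Vec<&'a str>> {
    let mut batches = Vec::new();
    let mut current: Vec<&'a str> = Vec::new();
    let mut current_len = 0;

    for &piece in pieces {
        let len = piece.chars().count();
        // Start a new batch when adding this piece would exceed the limit.
        if !current.is_empty() && current_len + len > max_chars {
            batches.push(std::mem::take(&mut current));
            current_len = 0;
        }
        current.push(piece);
        current_len += len;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Two caveats apply to any scheme like this: offsets returned for later batches have to be shifted by the length of the preceding batches before they are reported against the file, and splitting in the middle of a sentence can hide or create errors, so a real implementation would likely prefer paragraph or sentence boundaries.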

jeertmans · 2025-01-27T14:31:29Z

I think I'll merge like that, and we can still improve the model later. Let's focus first on more "fundamental" features, and improve those file-format specific features later.

Thanks!

Rolv-Apneseth · 2025-01-27T15:08:19Z

Oh - I thought this file format stuff was kind of the last step towards v3. Could you clarify what steps are next?

jeertmans · 2025-01-27T17:31:10Z

> Oh - I thought this file format stuff was kind of the last step towards v3. Could you clarify what steps are next?

This might, but let me take a look at this (the original PR), and see what we (you) have achieved so far, as I lack a stepped-back view of that big rewrite. I will have some time this week, so let me ping you back once it's done ;-)

Rolv-Apneseth · 2025-01-27T22:11:38Z

Sure thing

Rolv-Apneseth added 3 commits December 22, 2024 15:12

feat: add parsers for HTML, markdown and typst files

00cac59

feat: in the CLI, use different parsers based on file type

5dbbcbd

fix: formatting

f49fde0

Rolv-Apneseth added 2 commits December 22, 2024 15:43

fix: bump MSRV

13503e2

fix: features ordering

7d23240

jeertmans reviewed Dec 24, 2024


refactor: update markdown file extensions

e3be501

refactor: use data annotations for typst files

9f7b66e

jeertmans merged commit 82bd45b into jeertmans:v3 Jan 27, 2025
10 of 24 checks passed

Rolv-Apneseth deleted the detect_filetype branch January 27, 2025 22:11

Rolv-Apneseth mentioned this pull request Feb 8, 2025

chore(lib): fully refactor the library for v3 #117

Open

14 tasks
