URLs with parenthesis break parser #74

matklad · 2023-05-24T07:36:18Z

Wikipedia links often have ( in them, which breaks djot. Consider this example:

[Paxos](https://en.wikipedia.org/wiki/Paxos_(computer_science))

[Paxos](<https://en.wikipedia.org/wiki/Paxos_(computer_science)>)

doc
  para
    link destination="https://en.wikipedia.org/wiki/Paxos_(computer_science"
      str text="Paxos"
    str text=")"
  para
    link destination="<https://en.wikipedia.org/wiki/Paxos_(computer_science)>"
      str text="Paxos"

Three problems here:

1. Neither one works. It would perhaps be nice if the first one just worked.
2. We don’t document how to escape URLs in the syntax description docs.
3. The behavior of the two examples seems inconsistent: we parse the angle-bracketed form, but then use it literally, angle brackets included, as the href.

clbarnes · 2023-06-02T12:56:19Z

Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?
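
For anyone taking the escaping route in the meantime, a minimal sketch of doing it programmatically. The helper name `escapeParens` is made up for illustration and is not part of djot. Note that JavaScript's built-in `encodeURI` and `encodeURIComponent` leave `(` and `)` untouched, so the replacement has to be done explicitly:

```typescript
// Hypothetical helper (not part of djot): percent-encode parentheses in a URL
// so that the link destination contains no literal "(" or ")".
function escapeParens(url: string): string {
  return url.replace(/\(/g, "%28").replace(/\)/g, "%29");
}

const url = "https://en.wikipedia.org/wiki/Paxos_(computer_science)";
const link = `[Paxos](${escapeParens(url)})`;
console.log(link);
// [Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29)
```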

mikekasprzak · 2024-02-02T19:25:30Z

> Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?

I don't think we can reasonably expect this. My users would trip on this frequently. Thanks Wikipedia.

I don't like having to count parentheses, but I can't think of a better solution.

jgm · 2024-02-03T04:48:05Z

I agree, it would be better to change "contain the link destination (URL) in parentheses" to "contain the link destination (URL) in balanced parentheses" (or something to that effect) in the syntax description and alter the parser to handle this. (Also add a test.)

jgm · 2024-02-03T04:55:39Z

Odd. This works with the current djot.js parser:

[hi](url_(u))
<p><a href="url_(u)">hi</a></p>

jgm · 2024-02-03T04:56:17Z

But:

[hi](url_(u_u))
<p><a href="url_(u_u">hi</a>)</p>

jgm · 2024-02-03T04:57:35Z

So, bottom line is that we've been trying to support balanced parens in URLs, but the potentially matching underscores are confusing the parser.
So this is just a bug in djot.js.

jgm · 2024-02-03T07:28:08Z

What's happening here is that we're using the delimiter stack to match parens, and when the two _ get matched as emphasis (prior to detecting that it's a link), the ( between them is removed from the delimiter stack.
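
A toy model of that interaction, to make the failure concrete. This is not the djot.js code: the function name `toyDestination` and the simplified rules (any second `_` closes the first, and a match drops everything above the opener) are assumptions chosen only to reproduce the two examples above.

```typescript
// Toy model only, NOT the djot.js implementation.
// `afterOpenParen` is the text following "](" in a candidate link.
// One shared stack holds both "(" and "_" openers.
function toyDestination(
  afterOpenParen: string
): { destination: string; rest: string } | null {
  const stack: string[] = [];
  for (let i = 0; i < afterOpenParen.length; i++) {
    const c = afterOpenParen[i];
    if (c === "(") {
      stack.push("(");
    } else if (c === "_") {
      const opener = stack.lastIndexOf("_");
      if (opener >= 0) {
        // Emphasis matched: the opener and everything above it,
        // including any "(", are dropped from the stack.
        stack.length = opener;
      } else {
        stack.push("_");
      }
    } else if (c === ")") {
      const open = stack.lastIndexOf("(");
      if (open >= 0) {
        stack.length = open; // closes a nested paren
      } else {
        // No "(" left on the stack: treat this ")" as the end of the link.
        return {
          destination: afterOpenParen.slice(0, i),
          rest: afterOpenParen.slice(i + 1),
        };
      }
    }
  }
  return null; // destination never closed
}

console.log(toyDestination("url_(u))"));   // destination "url_(u)", rest ""
console.log(toyDestination("url_(u_u))")); // destination "url_(u_u", rest ")"
```

In the second call the inner `(` has already been discarded by the time its `)` arrives, so that `)` is taken as the end of the destination.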

faelys · 2024-02-03T09:55:08Z

I hope I'm not dragging in a dead horse to beat it, but it really bothers me that emphasis markers are somehow dragged into URLs. I wish djot had three clear kinds of elements (blocks, inline text, raw text) and URLs were in the latter where no inline parsing or matching happens.

Admittedly this worsens the current issue if URLs are raw to the point of not parsing parentheses, though that can be lifted at the syntax level, e.g. with (<…>) as in the OP, or by escaping in the same way ticks are escaped in raw inline spans: adding more consecutive delimiters (parentheses here).

jgm · 2024-02-03T17:43:15Z

Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.

If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.

Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.
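
For comparison, a rough sketch of the try-then-backtrack strategy, written in TypeScript rather than Haskell. All names here (`tryParseLink`, `parseInline`, the `Link` shape) are hypothetical, and the sketch ignores escapes, nested brackets in the label, and every other inline construct. It only shows the control flow: attempt the whole link, and on failure fall back to treating `[` as plain text.

```typescript
interface Link {
  label: string;
  destination: string;
  end: number; // index just past the closing ")"
}

// Hypothetical helper: try to parse "[label](destination)" at `start`.
// Returns null on failure so the caller can backtrack to `start`.
function tryParseLink(src: string, start: number): Link | null {
  if (src[start] !== "[") return null;
  const labelEnd = src.indexOf("]", start + 1);
  if (labelEnd < 0 || src[labelEnd + 1] !== "(") return null;
  let depth = 0; // nesting of parens inside the destination
  for (let i = labelEnd + 2; i < src.length; i++) {
    const c = src[i];
    if (c === "(") {
      depth++;
    } else if (c === ")") {
      if (depth === 0) {
        return {
          label: src.slice(start + 1, labelEnd),
          destination: src.slice(labelEnd + 2, i),
          end: i + 1,
        };
      }
      depth--;
    }
  }
  return null; // destination never closed
}

function parseInline(src: string): Array<string | Link> {
  const out: Array<string | Link> = [];
  let text = "";
  let pos = 0;
  while (pos < src.length) {
    const link = src[pos] === "[" ? tryParseLink(src, pos) : null;
    if (link) {
      if (text) { out.push(text); text = ""; }
      out.push(link);
      pos = link.end;
    } else {
      // Backtrack: the "[" (or any other character) is literal text.
      text += src[pos];
      pos++;
    }
  }
  if (text) out.push(text);
  return out;
}

console.log(parseInline("[hi](url_(u_u))")); // one link, destination "url_(u_u)"
console.log(parseInline("[hi](url_(u"));     // one string, unchanged
```

Because the destination is scanned on its own, underscores inside it never interact with the paren count.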

jgm · 2024-02-03T17:46:27Z

I think the best fix here would be not to abuse the delimiter stack to keep track of matching parens (since this really only has a purpose in links), but to create a new data structure for this.
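
Something along these lines, perhaps. This is only a sketch of the idea, not a patch against djot.js: the class name `DestinationParens` and the driver `destinationEnd` are invented for illustration, and a real one-pass parser would feed the tracker from its existing event loop.

```typescript
// Hypothetical structure: tracks paren nesting for one pending link
// destination, independently of the emphasis delimiter stack.
class DestinationParens {
  private depth = 0;

  // Record a "(" seen inside the destination.
  open(): void {
    this.depth++;
  }

  // Record a ")". Returns true when it closes the destination itself,
  // false when it only closes a nested "(".
  close(): boolean {
    if (this.depth === 0) return true;
    this.depth--;
    return false;
  }
}

// Minimal driver: index of the ")" that ends a destination starting at
// `start`, or -1 if it is never closed.
function destinationEnd(src: string, start: number): number {
  const parens = new DestinationParens();
  for (let i = start; i < src.length; i++) {
    if (src[i] === "(") {
      parens.open();
    } else if (src[i] === ")" && parens.close()) {
      return i;
    }
    // "_" and other emphasis markers never touch `parens`.
  }
  return -1;
}

const dest = "url_(u_u))";
console.log(dest.slice(0, destinationEnd(dest, 0))); // url_(u_u)
```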

faelys · 2024-02-03T18:59:00Z

> Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.
>
> If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.
>
> Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.

From where I sit, there is an issue with djot's syntax that manifests in this issue and a few others (which I discussed in detail in jgm/djot#247), and which I (maybe wrongly) attribute to inline parsing leaking into raw text parsing.

In other words, we can have link parsing without backtracking, because we can know that [label](… starts a link, exactly like we currently know that ``… starts an inline code span, and it would be obvious if both these … were of the same nature, distinct from inline text elements.

Anyway, I've already made my case in the discussion linked above, I promise I will try harder to not bother you with it again.

andersk mentioned this issue Jun 13, 2023

Intraword emphasis jgm/djot#101

Open

matklad mentioned this issue Jul 22, 2023

Fix broken link on Notes-on-paxos matklad/matklad.github.io#128

Closed

jgm transferred this issue from jgm/djot Feb 3, 2024
