Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URLs with parenthesis break parser #74

Open
matklad opened this issue May 24, 2023 · 11 comments
Open

URLs with parenthesis break parser #74

matklad opened this issue May 24, 2023 · 11 comments

Comments

@matklad
Copy link
Contributor

matklad commented May 24, 2023

Wikipedia links often have ( in them, which breaks djot. Consider this example:

[Paxos](https://en.wikipedia.org/wiki/Paxos_(computer_science))

[Paxos](<https://en.wikipedia.org/wiki/Paxos_(computer_science)>)

doc
  para
    link destination="https://en.wikipedia.org/wiki/Paxos_(computer_science"
      str text="Paxos"
    str text=")"
  para
    link destination="<https://en.wikipedia.org/wiki/Paxos_(computer_science)>"
      str text="Paxos"

Three problems here:

  • Neither one works. It perhaps would be nice if the first one just worked.
  • We don’t document how to escape URLs in the syntax description docs
  • The behavior of two examples seems self-inconsistent. We parse angle bracketed structure, but then we use it literally as href.
@clbarnes
Copy link

clbarnes commented Jun 2, 2023

Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?

@mikekasprzak
Copy link

Maybe just documenting that brackets need to be URL-escaped, i.e. () = %28%29?

I don't think we can reasonably expect this. My users would trip on this frequently. Thanks Wikipedia.

I don't like having to parenthesis count, but I can't think of a better solution.

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

I agree, it would be better to change "contain the link destination (URL) in parentheses" to "contain the link destination (URL) in balanced parentheses" (or something to that effect) in the syntax description and alter the parser to handle this. (Also add a test.)

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

Odd. This works with the current djot.js parser:

[hi](url_(u))
<p><a href="url_(u)">hi</a></p>

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

But:

[hi](url_(u_u))
<p><a href="url_(u_u">hi</a>)</p>

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

So, bottom line is that we've been trying to support balanced parens in URLs, but the potentially matching underscores are confusing the parser.
So this is just a bug in djot.js.

@jgm jgm transferred this issue from jgm/djot Feb 3, 2024
@jgm
Copy link
Owner

jgm commented Feb 3, 2024

What's happening here is that we're using the delimiter stack to match parens, and when the two _ get matched as emphasis (prior to detecting that it's a link), the ( between them is removed from the delimiter stack.

@faelys
Copy link

faelys commented Feb 3, 2024

I hope I'm not dragging in a dead horse to beat it, but it really bothers me that emphasis markers are somehow dragged into URLs. I wish djot had three clear kinds of elements (blocks, inline text, raw text) and URLs were in the latter where no inline parsing or matching happens.

Admittedly this worsens the current issue if URLs are raw to the point of not parsing parentheses, though that can be lifted at the syntax level (e.g. with (<…>) in the OP or escaping in the same way ticks are escaped in raw inline spans, by adding more consecutive delimiters (parentheses here).

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.

If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.

Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.

@jgm
Copy link
Owner

jgm commented Feb 3, 2024

I think the best fix here would be not to abuse the delimiter stack to keep track of matching parens (since this really only has a purpose in links), but to create a new data structure for this.

@faelys
Copy link

faelys commented Feb 3, 2024

Well, the issue is that in commonmark.js we try to do a one-pass parse without backtracking. So, when we hit [label](..., we still don't know if it's going to be a link until it is closed. Until then, we have to keep track of potential emphasis.

If we didn't care about backtracking, we could do things differently. In fact, I do do things differently in my Haskell djot parser: I just try to parse the link destination, and if it fails I backtrack.

Anyway, there are a lot of different parsing strategies, but as far as I can see this is not an issue with djot's syntax.

From where I sit, I see an issue with djot's syntax that manifests in this issue and a few others (which I discussed in detail in jdm/djot#247), and that I (maybe wrongly) linked to inline parsing leaking into raw text parsing.

In other words, we can have link parsing without backtracking, because we can know that [label](… starts a link, exactly like we currently know that ``… starts a inline code span, and it would be obvious if both these were of the same nature, distinct from inline text elements.

Anyway, I've already made my case in the discussion linked above, I promise I will try harder to not bother you with it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants