Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird trimming of inline element text causes invalid markdown #82

Closed
4 tasks done
necauqua opened this issue Feb 3, 2024 · 11 comments
Closed
4 tasks done

Weird trimming of inline element text causes invalid markdown #82

necauqua opened this issue Feb 3, 2024 · 11 comments
Labels
👀 no/external This makes more sense somewhere else 👎 phase/no Post cannot or will not be acted on

Comments

@necauqua
Copy link

necauqua commented Feb 3, 2024

Initial checklist

Affected packages and versions

hast-util-to-mdast==10.1.0

Link to runnable example

https://codesandbox.io/p/devbox/dry-http-xx97sg

Steps to reproduce

Well, minimal example is converting HTML like this:
<p>some text with <em> spaced emphasis </em> in between</p>
with a direct html->hast->mdast->markdown pipeline generates this (invalid) markdown:
some text with *spaced emphasis *in between

Expected behavior

Given that the browser seems to trim the contents of <em> in such case, this?.
some text with *spaced emphasis* in between

Actual behavior

Well the space is in the wrong node somehow after hast-util-to-mdast call.
Not sure if other nodes are affected yup, at least strong and del are too, assuming it's inline nodes

Affected runtime and version

all?.

Affected package manager and version

No response

Affected OS and version

No response

Build and bundle tools

No response

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Feb 3, 2024
@necauqua
Copy link
Author

necauqua commented Feb 3, 2024

Okay maybe this issue should be moved to rehype-minify-whitespace as it seems that it's the one doing the hello <em> italic <em> world -> hello <em>italic <em>world thing

@ChristianMurphy
Copy link
Member

Welcome @necauqua! 👋
You may want to checkout https://github.com/orgs/syntax-tree/discussions/60#discussioncomment-2111096 there are some space combinations of <strong> and <em> in HTML which cannot be represented in CommonMark

@necauqua
Copy link
Author

necauqua commented Feb 4, 2024

Yeeeeah, I kind of figured it's probably like that.

Although I'm not sure the hello <em> italic <em> world -> hello <em>italic <em>world minification is the correct default

My usecase is the same as the few people there, I have some meh html from WYSIWYG editor I'm upgrading to markdown on the fly.

Looks like I'm keeping the workaround of "if inline node ends with whitespace and it's sibling is a text node move the space there", although that's one extra dfs pass, eh - because it's okay if some odd html generates some broken markdown, but my data has a ton of hello <em> italic <em> world

@wooorm
Copy link
Member

wooorm commented Feb 5, 2024

Okay maybe this issue should be moved to rehype-minify-whitespace as it seems that it's the one doing the hello <em> italic <em> world -> hello <em>italic <em>world thing

I believe that behavior is correct. That is how browsers collapse whitespace.
Things typically work left to right. The first space is shown. More spaces don’t show.
If you hove over the em with your devtools, you should see the devtools outline a box, which does not include the earlier space, but does include the later space.

Closing this as a duplicate of syntax-tree/mdast-util-to-markdown#12. I think the approach I outlined there will solve it.

@wooorm wooorm closed this as not planned Won't fix, can't repro, duplicate, stale Feb 5, 2024
@wooorm wooorm added the 👀 no/external This makes more sense somewhere else label Feb 5, 2024

This comment has been minimized.

@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Feb 5, 2024
@necauqua
Copy link
Author

necauqua commented Feb 5, 2024

hover over the em with your devtools

Huh, I was looking at it in the devtools but I didn't notice that's actually the case, I see, thanks

Technically this is the hast-util-to-mdast issue then, since when it sees such minified html it breaks - but as you pointed out it seems like there's no universal fix.
In my case, I seem to patch the majority of my issues with moving whitespace out of em and also adding a few <wbr> here and there (also you render them as an actual invisible zwsp and I changed the handler to output the &#x200B; like in your approach)
It's meh, but seems to work ok

@wooorm
Copy link
Member

wooorm commented Feb 5, 2024

Technically this is the hast-util-to-mdast issue then, since when it sees such minified html it breaks

Why? Nothing breaks there.

there's no universal fix

There is a universal fix: the one I pointed to. To generate markdown that has those spaces your messy HTML has too, the ones a browser sees too


If you want to clean dirty HTML, that’s fine, but that’s something else, and not done in this project or in to-markdown.
Also very complex: what is dirty, what is intended? That needs a human to work on it.

@necauqua
Copy link
Author

necauqua commented Feb 5, 2024

Maybe my wording was wrong (yeah it was all over the place) but it generates broken markdown?.
So hello <em>italic <em>world -> hello *italic *world, where italic does not get italic

By no universal fix I meant without using a bunch of entities, since as I saw you mentioning on other issues that you strive to generate readable markdown with no html remains, which means that that universal fix is out of scope

Yeah in the end I concluded that my html is dirty as it couldn't be just cleanly converted, and I work around that by doing some cleaning of my html first 😅

@wooorm
Copy link
Member

wooorm commented Feb 5, 2024

Yes. I mean character references (entities). Those are part of markdown: hello *italic&#32;*&#119;orld -> hello italic world.

“Readable” would indeed be nice, but as folks can use character references in their input markdown, and it’s valid and fine, we’d have to use character references in the output too.

@necauqua
Copy link
Author

necauqua commented Feb 5, 2024

Wait it just occured to me that syntax-tree/mdast-util-to-markdown#12 (and that it is a to-markdown issue) is not closed and you said it will solve this issue and you're not opposed to having those entities (I mean in the case the input mdast was dirty like that), all of my comments are moot

Thanks, sorry for dragging this on 😅

And huh, once the to-markdown issue is solved like that I can avoid my janky cleaning which does not cover 100% corner cases

@wooorm
Copy link
Member

wooorm commented Feb 5, 2024

Yes! No worries ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👀 no/external This makes more sense somewhere else 👎 phase/no Post cannot or will not be acted on
Development

No branches or pull requests

3 participants