Issues with list items #357

q00u · 2020-10-23T17:09:18Z

Issue 1: Headers inside list items.

Consider:

<li><h4>Header inside list item</h4></li>

This renders in html as:

In turndown, this becomes:

*   ####Header inside list item

Which renders as:

####Header inside list item

I propose that headers inside list items be treated as strong:

*   **Header inside list item**

Which renders as:

Header inside list item

Issue 2: Standalone headers following other non-list-item nodes.

Consider:

<img src="https://picsum.photos/id/1/200" /><li>Standalone list item</li>

This renders in html as:

In turndown, this becomes:

![](https://picsum.photos/id/1/200)*   Standalone list item

Which renders as:
* Standalone list item

I propose that list items that follow non-list-items should be placed on a new line:

![](https://picsum.photos/id/1/200)
*   Standalone list item

Which renders as:

Standalone list item

pavelhoral · 2020-12-05T23:35:52Z

Just an observation: The second example is invalid HTML. <li> element is only valid inside <ul> or <ol>. The first example, unfortunatelly, seems like a valid HTML.

q00u · 2020-12-05T23:40:03Z

True, but browsers will still render list items outside of lists. If the goal is for the resulting markdown to most-closely render as the original html renders, then it should be moved to a new line just like the browser does.

dchacke · 2023-08-27T15:42:56Z

I can reproduce the first issue:

<ol>
  <li>
    <h1>foo</h1>
  </li>
</ol>

is converted to:

1.  foo
    ===

Which a markdown parser (I'm using this one) turns into:

<ol>
<li>foo
===</li>
</ol>

So the heading is lost.

I don't think turning a heading into bold text is a good idea (#358). It may retain some semblance of a heading visually, but semantically and structurally, it does not adequately replace the original HTML. In addition, the same issue seems to extend to lis having any block-level content as their first child, in which case the suggested fix doesn't generalize. Consider blockquotes:

<ol>
  <li>
    <blockquote>foo</blockquote>
  </li>
</ol>

...is converted into...

1.  > foo

Passed back into a markdown parser, that is turned into:

<ol>
<li>> foo</li>
</ol>

So it treats the part "> foo" as regular text and keeps the ">" sign.

The good news is that there's no need to turn headings into bold text because there's a more general fix. For the first code block in this comment, turndown should instead return this:

1. 

    foo
    ===

That is, with one space after the period and four spaces before the "f" in "foo". Then the markdown parser recognizes the heading properly:

<ol>
<li><h1>foo</h1></li>
</ol>

The same fix works for blockquotes as well:

1. 

    > foo

is turned into:

<ol>
<li><blockquote>
  <p>foo</p>
</blockquote></li>
</ol>

As a temporary workaround, all list items could have this spacing, but that looks odd and is unnecessary when a list item starts with inline content.

dchacke · 2023-08-27T18:44:20Z

Here's a workaround for the consumer side to fix the first issue for now. I created it with the help of AI and only lightly tested it. I'm a Turndown noob but maybe it helps someone:

let td = new TurndownService();

td.addRule('listItemWithBlockAsFirstChild', {
  filter: node => {
    return (
      node.nodeName === 'LI' &&
      node.firstChild &&
      // Block elements listed on https://www.w3schools.com/htmL/html_blocks.asp
      [
        'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div',
        'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form',
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main',
        'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot',
        'ul', 'video'
      ]
      .map(t => t.toUpperCase())
      .includes(node.firstChild.nodeName)
    );
  },
  replacement: (content, node, options) => {
    let indicator;

    if (node.parentNode.nodeName === 'OL') {
      let count = 1; // Start counting from 1
      let sibling = node.previousElementSibling;

      while (sibling) {
        if (sibling.nodeName === 'LI') {
          count++;
        }

        sibling = sibling.previousElementSibling;
      }

      indicator = `${count}.`;
    } else {
      indicator = options.bulletListMarker;
    }

    let innerMarkdown = td.turndown(node.innerHTML);

    // Add one space behind the indicator and four spaces underneath
    return `${indicator} \n\n    ${innerMarkdown}\n`;
  }
});

In short, the filter applies to any li that has a block-level element as its first child.

We then determine the indicator (either n. with some positive integer n for ordered lists or the given bullet-list marker for unordered lists). If the li is part of an ordered list, we determine the indicator by counting the previous li elements in the same list, increasing a counter for each such previous element. If the li is part of an unordered list instead, we just use the given bullet-list marker.

Lastly, we parse the li's inner HTML and construct the final markdown string: the indicator followed by a space, then two line breaks, the last of which is followed by four spaces.

(I wrote this explanation, not an AI.)

Fixes the first issue described in mixmark-io#357. The fix the described in more detail here: https://github.com/mixmark-io/turndown/issues/357\#issuecomment-1694699312

Fixes the first issue mentioned in mixmark-io#357. I describe the bug in more detail here: mixmark-io#357 (comment)

dchacke · 2023-08-27T20:48:12Z

Disregard previous workaround in favor of this fix as per my PR #447:

rules.listItem = {
  filter: 'li',
  replacement: function (content, node, options) {
    content = content
      .replace(/^\n+/, '') // remove leading newlines
      .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
      .replace(/\n/gm, '\n    ') // indent
    var prefix = options.bulletListMarker + '   '
    var parent = node.parentNode

    // Check if the first child is a block-level element
    var blockElements = [
      'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div',
      'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form',
      'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main',
      'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot',
      'ul', 'video'
    ].map(function (t) {
      return t.toUpperCase()
    })

    var firstChildIsBlock = node.firstChild && blockElements.includes(node.firstChild.nodeName)

    if (parent.nodeName === 'OL') {
      var start = parent.getAttribute('start')
      var index = Array.prototype.indexOf.call(parent.children, node)
      prefix = (start ? Number(start) + index : index + 1) + '.  '
    }

    return (
      prefix + (firstChildIsBlock ? '\n\n    ' : '') + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
    )
  }
}

q00u linked a pull request Oct 23, 2020 that will close this issue

Alter list items (standalone and with header) #358

Open

dchacke added a commit to dchacke/turndown that referenced this issue Aug 27, 2023

Fix bug where li with first child that's a block rendered incorrectly

6c40fac

Fixes the first issue mentioned in mixmark-io#357. I describe the bug in more detail here: mixmark-io#357 (comment)

dchacke mentioned this issue Aug 27, 2023

Fix bug where li with first child that's a block rendered incorrectly #447

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with list items #357

Issues with list items #357

q00u commented Oct 23, 2020

pavelhoral commented Dec 5, 2020

q00u commented Dec 5, 2020

dchacke commented Aug 27, 2023 •

edited

Loading

dchacke commented Aug 27, 2023 •

edited

Loading

dchacke commented Aug 27, 2023 •

edited

Loading

Issues with list items #357

Issues with list items #357

Comments

q00u commented Oct 23, 2020

Issue 1: Headers inside list items.

Issue 2: Standalone headers following other non-list-item nodes.

pavelhoral commented Dec 5, 2020

q00u commented Dec 5, 2020

dchacke commented Aug 27, 2023 • edited Loading

dchacke commented Aug 27, 2023 • edited Loading

dchacke commented Aug 27, 2023 • edited Loading

dchacke commented Aug 27, 2023 •

edited

Loading

dchacke commented Aug 27, 2023 •

edited

Loading

dchacke commented Aug 27, 2023 •

edited

Loading