Nested List Formatting Issue #84

Open
Techman opened this issue Feb 23, 2023 · 1 comment

Comments

Techman commented Feb 23, 2023

Hello,

I am currently experimenting with this library for parsing HTML blog post content from an RSS feed for use on Reddit. While this library does not claim to cover Reddit's version of Markdown, it is overall fairly compatible with a few overrides here and there depending on the desired behavior.

However, I have been running into an issue with nested lists, particularly unordered lists (though it may apply to ordered lists too). In the Markdown output, nested unordered lists are not indented under their parent items.

For example, a section toward the bottom of this article is rendered as such in the output, with no indentation:

#### Lost Sector - Master

* Old
+ Small chance of up to two Enhancement Cores.

* New
+ **Two Enhancement Cores** and a **medium chance** at one more.
+ **One Enhancement Prism** and a **medium chance** at one more.

If one indents the lists (the convention I use is 4 spaces), it begins to show properly on Reddit.

#### Lost Sector - Master

* Old
    + Small chance of up to two Enhancement Cores.

* New
    + **Two Enhancement Cores** and a **medium chance** at one more.
    + **One Enhancement Prism** and a **medium chance** at one more.

I wrote a small demonstration/test script that I hope helps in reproducing the issue. It requires FeedParser, Beautiful Soup, LXML, and Markdownify 0.11.6.

chrispy-snps (Collaborator) commented Jan 14, 2024

@Techman - first, note that the nested list structure used on that page is invalid HTML5 per the specification. It is structured like this:

<ul>
  <li>Top 1</li>
  <ul>
    <li>Mid 1-1</li>
  </ul>
  <li>Top 2</li>
  <ul>
    <li>Mid 2-1</li>
    <li>Mid 2-2</li>
  </ul>
</ul>

However, each nested <ul> element should sit inside the <li> it belongs to, not next to it as a direct child of the outer <ul>. You can verify this using the W3C Markup Validation Service.
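For anyone who wants to check a feed for this pattern before converting it, here is a minimal sketch using only the standard library's `html.parser`. The names `NestedListChecker` and `count_misplaced_lists` are hypothetical helpers, not part of markdownify, and the check only looks at `<ul>`, `<ol>`, and `<li>` tags, so treat it as a rough lint and not as a validator.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol"}


class NestedListChecker(HTMLParser):
    """Counts <ul>/<ol> elements that are direct children of another list."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open <ul>/<ol>/<li> elements only
        self.misplaced = 0  # lists opened directly inside another list

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # An open <li> is implicitly closed by the next <li>.
            if self.stack and self.stack[-1] == "li":
                self.stack.pop()
            self.stack.append(tag)
        elif tag in LIST_TAGS:
            # A list whose nearest open list-related ancestor is a list
            # (and not an <li>) is in the wrong place.
            if self.stack and self.stack[-1] in LIST_TAGS:
                self.misplaced += 1
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if (tag == "li" or tag in LIST_TAGS) and tag in self.stack:
            # Pop back to, and including, the matching open element.
            while self.stack.pop() != tag:
                pass


def count_misplaced_lists(html):
    """Return how many lists are direct children of another list."""
    checker = NestedListChecker()
    checker.feed(html)
    checker.close()
    return checker.misplaced
```

A non-zero count suggests the HTML needs to be restructured (for example with Beautiful Soup) before the nesting can be carried over into Markdown.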

That being said, let's correct it as follows:

from markdownify import markdownify as md

html = """
<ul>
  <li>Top 1
    <ul>
      <li>Mid 1-1</li>
    </ul>
  </li>
  <li>Top 2
    <ul>
      <li>Mid 2-1</li>
      <li>Mid 2-2</li>
    </ul>
  </li>
</ul>
"""
print(md(html))

We get reasonable Markdown for it:

* Top 1
        + Mid 1-1
* Top 2
        + Mid 2-1
        + Mid 2-2

If I remove all the whitespace between the tags (which is what the page source looked like when I inspected it):

html = '<ul><li>Top 1<ul><li>Mid 1-1</li></ul></li><li>Top 2<ul><li>Mid 2-1</li><li>Mid 2-2</li></ul></li></ul>'
print(md(html))

I also get reasonable Markdown results:

* Top 1
        + Mid 1-1
* Top 2
        + Mid 2-1
        + Mid 2-2
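The nested items above are indented further than the 4 spaces mentioned earlier in the thread. If the indentation in your own output turns out to be tab characters (an assumption, nothing in this thread confirms it), a small post-processing step could normalize it. The `tabs_to_spaces` helper below is hypothetical and not part of markdownify.

```python
def tabs_to_spaces(markdown, width=4):
    """Replace each leading tab on every line with `width` spaces.

    Assumes nested list items are indented with tabs; tabs that appear
    later in a line are left untouched.
    """
    lines = []
    for line in markdown.split("\n"):
        stripped = line.lstrip("\t")
        depth = len(line) - len(stripped)  # number of leading tabs
        lines.append(" " * (width * depth) + stripped)
    return "\n".join(lines)
```

It would be applied to the converted string, for example `tabs_to_spaces(md(html))`.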

So I think this is not a bug.
