docs: script to convert HTML manual pages to markdown #4620

Open. neteler wants to merge 12 commits into main.

neteler (Member) commented Oct 31, 2024

Script to recursively convert all .html files to .md (GitHub Flavored Markdown).
This is relevant not only for the conversion in GRASS-core but also for GRASS-Addons.

(see related #3849)

Suggestions needed for:

neteler added the manual (Documentation related issues) and docs labels, added this to the 8.5.0 milestone, and self-assigned this on Oct 31, 2024
echoix (Member) commented Oct 31, 2024

Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)

If it's supposed to be used only once, does it need to be part of the repo?

If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)

neteler (Member, Author) commented Oct 31, 2024

> Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)
>
> If it's supposed to be used only once, does it need to be part of the repo?

I see the following use cases:

  • bulk conversion of all core manual pages (one time unless we need to optimize this script for better markdown output/modifications of HTML residuals, ...)
  • bulk conversion of all addon manual pages (same as for core)
  • helper script for addons in non-standard repos (see for example the "grass-gis-addons" tag: https://github.com/topics/grass-gis-addons). This should be offered long-term, IMHO.

> If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)

Yes, an idea is to do that with multiple PRs:

  • convert HTML to MD, keep the HTML (PR 1, 2, ... n, i.e., submit in chunks to avoid too large PRs)
  • have a (short?) interim phase of both HTML and MD in parallel, esp. for quality control
  • remove the HTML files in a different PR, keeping MD only

The "rename" comment I don't fully understand.

echoix (Member) commented Oct 31, 2024

> > Is this only to have the script in the repo? After that, is it supposed to be one time use or always used? (I assume that the docs in the repo will be in markdown instead of html soon..)
> >
> > If it's supposed to be used only once, does it need to be part of the repo?
>
> I see the following use cases:
>
>   • bulk conversion of all core manual pages (one time unless we need to optimize this script for better markdown output/modifications of HTML residuals, ...)
>   • bulk conversion of all addon manual pages (same as for core)
>   • helper script for addons in non-standard repos (see for example the "grass-gis-addons" tag: https://github.com/topics/grass-gis-addons). This should be offered long-term, IMHO.

I see the value of having this available for use outside of a one-time use, like add-ons outside of the osgeo/grass-addons repo.

> > If ever we convert all of our html files to markdown (+formatting), I suggest to have an intermediate PR that only does the rename, to have the GitHub history +blame follow the file, instead of deleting and adding files (which would happen if the markdown formatting would have a lot of changes + renames)
>
> Yes, an idea is to do that with multiple PRs:
>
>   • convert HTML to MD, keep the HTML (PR 1, 2, ... n, i.e., submit in chunks to avoid too large PRs)
>   • have a (short?) interim phase of both HTML and MD in parallel, esp. for quality control
>   • remove the HTML files in a different PR, keeping MD only
>
> The "rename" comment I don't fully understand.

I imagined that the switch from HTML to Markdown as the docs source would be done at once, with no interim period with both formats in the repo. Thus, to help navigating the history for the future, I was suggesting to have an interim commit in the main branch that renamed all html files to change the extension of html to md, without any content changes, and directly after, applying the conversion to md in these md files that are in fact html. HTML can be used in Markdown to some extent. However these two must be done right after the other, as I don't expect the html builds to be working in that interim commit. But I think it will greatly help with navigating the history, since one can keep going back through the commits of the renamed file (instead of stopping at the rename).

wenzeslaus (Member) commented:

> I was suggesting to have an interim commit in the main branch that renamed all html files to change the extension of html to md, without any content changes, and directly after, applying the conversion to md in these md files that are in fact html.

The history would be nice and the HTML==MD is a nice trick, but...

> However these two must be done right after the other, as I don't expect the html builds to be working in that interim commit.

...I'm afraid we can't just replace the server infrastructure for HTML with Markdown/mkdocs at the same time as merging the PR, so I think the change needs to be gradual in one way or another.

echoix (Member) commented Nov 1, 2024

If the build process generates the html from the md, what changes would be needed on the web server infrastructure?

Maybe a test with a staging instance might help. But I'll keep thinking about what might be best here..

neteler (Member, Author) commented Nov 8, 2024

Valid point about the git history.

Suggestion:

  • we git move the .html to .md (so history is kept) and commit
  • we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit
  • we replace HTML content of the now .md file with the markdown content and commit
  • we compare both (quality control)
  • eventually we drop the .html file
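
The first three steps of this suggestion can be sketched as a small shell function. This is only an illustration under stated assumptions: `migrate_page` and `convert_page` are hypothetical names, and `convert_page` is a placeholder for the real pandoc-based conversion, which is not shown in this conversation.

```shell
#!/bin/sh
# Sketch only: convert_page is a hypothetical stand-in for the real converter.
convert_page() {
    # Placeholder output: a Markdown title derived from the file name.
    printf '# %s\n' "$(basename "$1" .html)"
}

# migrate_page FILE.html : run the rename, copy-back, and convert steps.
migrate_page() {
    html=$1
    md=${html%.html}.md

    # 1. Rename without content changes so git records a pure rename.
    git mv "$html" "$md"
    git commit -q -m "docs: rename $html to $md"

    # 2. Copy back to .html for the parallel phase (this copy has no history).
    cp "$md" "$html"
    git add "$html"
    git commit -q -m "docs: restore $html for the parallel phase"

    # 3. Replace the HTML content of the .md file with Markdown.
    convert_page "$html" > "$md.tmp" && mv "$md.tmp" "$md"
    git add "$md"
    git commit -q -m "docs: convert $md to Markdown"
}
```

After running `migrate_page r.example.html` in a repository, `git log --follow -- r.example.md` should continue past the rename commit into the history of the original HTML file, which is the point of doing the rename as its own commit.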

echoix (Member) commented Nov 8, 2024

> Valid point about the git history.
>
> Suggestion:
>
>   • we git move the .html to .md (so history is kept) and commit
>   • we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit
>   • we replace HTML content of the now .md file with the markdown content and commit
>   • we compare both (quality control)
>   • eventually we drop the .html file

That's what I was thinking about if we needed to have html too in parallel (vs a clean cutoff, switching to Md at once).

utils/grass_html2md.sh (review comments, outdated and resolved)

# TODO: path to LUA file setting to be improved (./utils/pandoc_codeblock.lua)
#wget https://raw.githubusercontent.com/OSGeo/grass/refs/heads/main/utils/pandoc_codeblock.lua -O "${TMP}/pandoc_codeblock.lua"
TMP="utils"
Review comment (Member):

Hardcoding utils below is simpler. For testing in combination with the other PR, assuming you wget right is a very safe assumption.

neteler (Member, Author) replied:

The question is: hardcoding yes, but how?
With a relative path, the script will not work in standard or other addon repos.

neteler (Member, Author) replied:

... postponing this for now...
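
One possible way around the relative-path problem is to resolve the filter relative to the location of the script rather than the current directory. The sketch below assumes, hypothetically, that `pandoc_codeblock.lua` is stored next to the script; `filter_path` is an illustrative name and not part of the PR.

```shell
#!/bin/sh
# filter_path SCRIPT_PATH : print the absolute path of pandoc_codeblock.lua,
# assuming (hypothetically) that it is stored next to the script itself.
filter_path() {
    script_dir=$(CDPATH= cd -- "$(dirname -- "$1")" && pwd) || return 1
    printf '%s/pandoc_codeblock.lua\n' "$script_dir"
}

# Typical call from inside the script: filter_path "$0"
```

The result could then be handed to pandoc's `--lua-filter` option. This only helps when the Lua file is shipped together with the script; for addon repos that carry just the script, downloading the filter (as in the commented-out wget line) would still be needed.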

wenzeslaus (Member) commented:

>   • we git move the .html to .md (so history is kept) and commit
>   • we copy it back to .html (to keep them for a little while in parallel; these files do not have the git history), add and commit

We could prepare that in one PR and then enable "Allow rebase merging" for a couple of minutes to merge that PR with its two commits. This would not break or work around the CI.

wenzeslaus (Member) commented:

The rename way seems also appealing because it is more natural: Even with several conversions done already here, I get plenty of HTML tags, some perhaps need to stay.

cd dist.x86_64-pc-linux-gnu/docs/mkdocs/source
grep -Eor '<[a-zA-Z][^>]*>' *.md | grep -Ev '<https?://[^>]+>'
...
r.in.wms.md:<div data-align="center" style="margin: 10px">
r.in.xyz.md:<sup>
r.li.cwed.md:<span class="small">
...
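
To judge which leftover tags matter most, the same search can be aggregated by tag name. A minimal sketch, assuming the generated Markdown files sit under one directory; `count_html_tags` is a hypothetical helper name. It reuses the two patterns from the command above: the first finds opening tags, the second drops Markdown autolinks such as `<https://...>`.

```shell
#!/bin/sh
# count_html_tags DIR : count leftover opening HTML tags in DIR's .md files.
# Output is one "count tagname" line per tag, most frequent first.
# Markdown autolinks (<http://...>, <https://...>) are not counted.
count_html_tags() {
    find "$1" -name '*.md' -exec grep -Eoh '<[a-zA-Z][^>]*>' {} + |
        grep -Ev '^<https?://[^>]+>$' |
        sed -E 's/^<([a-zA-Z][a-zA-Z0-9]*).*/\1/' |
        sort | uniq -c | sort -rn
}
```

For example, `count_html_tags dist.x86_64-pc-linux-gnu/docs/mkdocs/source` would summarize the tags in the directory used above.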

utils/grass_html2md.sh (review comments, outdated and resolved)
neteler (Member, Author) commented Nov 15, 2024

Using the latest version of this script, I still get a lot of warnings:

...
WARNING -  Doc file 'db.execute.md' contains a link 'topic_attribute_table.html', but the target is not found among documentation files.
WARNING -  Doc file 'db.execute.md' contains a link 'keywords.html#SQL', but the target 'keywords.html' is not found among documentation files. Did
           you mean 'keywords.md#SQL'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'database.html', but the target is not found among documentation files. Did you mean
           'database.md'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'topic_import.html', but the target is not found among documentation files. Did you mean
           'topic_import.md'?
WARNING -  Doc file 'db.in.ogr.md' contains a link 'keywords.html#attribute%20table', but the target 'keywords.html' is not found among documentation
           files. Did you mean 'keywords.md#attribute%20table'?
WARNING -  Doc file 'db.login.md' contains a link 'database.html', but the target is not found among documentation files. Did you mean 'database.md'?
WARNING -  Doc file 'db.login.md' contains a link 'topic_connection_settings.html', but the target is not found among documentation files.
...

The reason is (example, see KEYWORDS section where .md should be present rather than .html):

---
name: db.in.ogr
description: Imports attribute tables in various formats.
keywords: database, import, attribute table
---

# db.in.ogr

## NAME

***db.in.ogr*** - Imports attribute tables in various formats.

### KEYWORDS

[database](database.html),
[import](topic_import.html),
[attribute table](keywords.html#attribute%20table)

### SYNOPSIS

**db.in.ogr**
...

Seems we overlooked that in #3849? Any idea, @landam?

As the script of this PR runs before the parser is invoked a change in man/build_md.py or around might be needed?
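
A quick way to see which pages are affected is to list the relative links that still end in .html. This is a sketch only, assuming the generated Markdown files sit under one directory; `list_html_links` is a hypothetical helper and not part of this PR. Link targets containing a colon (for example `https:`) are skipped, so external URLs are not reported.

```shell
#!/bin/sh
# list_html_links DIR : print "file:](target.html...)" for each relative
# Markdown link in DIR's .md files that still points to a .html page.
# Targets containing ":" (http:, https:, mailto:) do not match.
list_html_links() {
    find "$1" -name '*.md' -exec \
        grep -EoH ']\([^):#]+\.html(#[^)]*)?\)' {} +
}
```

For the db.in.ogr example above, this would report `database.html`, `topic_import.html` and `keywords.html#attribute%20table`.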

landam (Member) commented Nov 23, 2024

> Seems we overlooked that in #3849? Any idea, @landam?
>
> As the script of this PR runs before the parser is invoked a change in man/build_md.py or around might be needed?

For the record, solved by #4740

landam requested a review from wenzeslaus on November 23, 2024, 14:56

neteler (Member, Author) commented Nov 24, 2024

Change in f91a111 to avoid undesired escaping of $ in plain text:

Example: dist.x86_64-pc-linux-gnu/docs/mkdocs/site/grass.html#examples

Before:

image

After:

image

…ll URLs as before; fix %20 to dash for mkdocs
echoix (Member) commented Dec 5, 2024

Is the link conversion regex only applied to our links, not all external references to other websites?

neteler (Member, Author) commented Dec 6, 2024

> Is the link conversion regex only applied to our links, not all external references to other websites?

With the sed part URLs with "http[s]" will not be modified, only relative links pointing to other internal manual pages.
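
The sed expression used in the script is not shown in this conversation. As an illustration of the behavior described, here is one way such a rewrite could look; it is a sketch under assumptions, not the PR's code. `convert_internal_links` is a hypothetical name, and instead of negating an "http" match it simply refuses any link target that contains a colon.

```shell
#!/bin/sh
# convert_internal_links : filter stdin to stdout, rewriting relative
# Markdown link targets from page.html[#anchor] to page.md[#anchor].
# Targets containing ":" (http:, https:, mailto:) never match the pattern,
# so external URLs pass through unchanged.
convert_internal_links() {
    sed -E 's/]\(([^):#]+)\.html(#[^)]*)?\)/](\1.md\2)/g'
}
```

Usage would be along the lines of `convert_internal_links < db.in.ogr.md`. The "%20 to dash" anchor fix mentioned in the commit title is not covered by this sketch.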

echoix dismissed wenzeslaus’s stale review on December 9, 2024, 22:19: Requested changes were addressed

echoix (Member) commented Dec 9, 2024

> > Is the link conversion regex only applied to our links, not all external references to other websites?
>
> With the sed part URLs with "http[s]" will not be modified, only relative links pointing to other internal manual pages.

I didn't catch the "not" part in the sed call/syntax, so your explanation makes sense in this case

echoix (Member) previously approved these changes Dec 9, 2024 and left a comment:

To prepare for the migration, it is a good idea to have this merged, and properly tested inside the further PRs that also prepare the migration. Adjustments could then be added when finding more edge cases before the final switch. This script is an internal migration tool helper as I see it.

neteler added a commit to neteler/grass that referenced this pull request Dec 20, 2024
Test submission of conversion of all HTML manual pages to markdown using the `pandoc` based converter script (see OSGeo#4620).

For figure code conversion issues, see OSGeo#4864
Labels: docs, manual (Documentation related issues)
Project status: In Progress
4 participants