Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace hpricot with nokogiri in wordpressdotcom importer #555

Merged
merged 1 commit into from
Jan 21, 2025

Conversation

parkr
Copy link
Member

@parkr parkr commented Jan 13, 2025

Fixes #554.

hpricot is no longer supported and doesn't build on some modern systems like Mac. Our last use of Hpricot was wordpressdotcom importer, which is now converted to use nokogiri.

I actually created a simple website on wordpress.com and exported the XML file to test this. It now matches the old system. I added the exported XML as a mock here so that we could run the importer in tests and assert on the output.

Copy link
Member

@ashmaroli ashmaroli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby changes LGTM except for some style issues in changes to the test file.

However I came across an issue on running the importer with the mock XML at my end. All the imported posts have the following markup which IMHO, would be best if addressed prior to merging:

<p><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><br />
<!-- wp:paragraph --><html><body></p>

<!DOCTYPE>, <html>, <body> as children of a <p> tag is a 👎

test/test_wordpressdotcom_importer.rb Outdated Show resolved Hide resolved
@ashmaroli
Copy link
Member

A small UI/UX issue that could be addressed in a future pull request prior to release:

Current

bundle exec jekyll-import wordpressdotcom --source sitetitle.wordpress.2025-01-19.000.xml

Downloading images for  Blog Post 1
https://parkrmoore.wordpress.com/wp-content/uploads/2025/01/a-bright-night-sky-full-of-stars-and-the-milky.png
                OK!
   Imported 1 pages
   Imported 7 posts
Imported 1 wp_global_styless    # --> wp_global_styless ??? 
Imported 6 attachments
Imported 1 wp_navigations

Ideal

bundle exec jekyll-import wordpressdotcom --source sitetitle.wordpress.2025-01-19.000.xml

   Downloading images for Blog Post 1
            https://parkrmoore.wordpress.com/wp-content/uploads/2025/01/a-bright-night-sky-full-of-stars-and-the-milky.png
            OK!
   Imported 1 pages
   Imported 7 posts
   Imported 1 wp_global_styles
   Imported 6 attachments
   Imported 1 wp_navigations

@parkr
Copy link
Member Author

parkr commented Jan 20, 2025

Output now looks like this:

       Downloading: images for Blog Post 1
https://parkrmoore.wordpress.com/wp-content/uploads/2025/01/a-bright-night-sky-full-of-stars-and-the-milky.png 
                OK! 
           Imported 1 pages
           Imported 7 posts
           Imported 1 wp_global_styless
           Imported 6 attachments
           Imported 1 wp_navigations

@parkr
Copy link
Member Author

parkr commented Jan 20, 2025

However I came across an issue on running the importer with the mock XML at my end. All the imported posts have the following markup which IMHO, would be best if addressed prior to merging:

I would tend to agree, but since this is an existing issue, it might be better to fix in a separate PR that's more focused on changing that.

            content = Nokogiri::HTML(item.text_for("content:encoded"))         

...

              f.puts Util.wpautop(content.to_html)                             

Perhaps it would be better to use Util.wpautop(item.text_for("content:encoded"), but the issue is that the image downloads actually modify the <img> tag's src to point to the new Jekyll asset path. I think this deserves more work in a separate PR to get right.

@parkr parkr requested a review from ashmaroli January 20, 2025 18:23
@ashmaroli
Copy link
Member

Output now looks like this:

       Downloading: images for Blog Post 1
https://parkrmoore.wordpress.com/wp-content/uploads/2025/01/a-bright-night-sky-full-of-stars-and-the-milky.png 
                OK! 
           Imported 1 pages
           Imported 7 posts
           Imported 1 wp_global_styless
           Imported 6 attachments
           Imported 1 wp_navigations

I am not in favor of having wp_global_styless (2s') printed instead of just wp_global_styles.. perhaps in a future pull request..

@parkr
Copy link
Member Author

parkr commented Jan 21, 2025

I believe it's now fixed for that case.

       Downloading: images for Blog Post 1
https://parkrmoore.wordpress.com/wp-content/uploads/2025/01/a-bright-night-sky-full-of-stars-and-the-milky.png 
Already in cache. Clean assets folder if you want a redownload. 
           Imported 1 pages
           Imported 7 posts
           Imported 1 wp_global_styles
           Imported 6 attachments
           Imported 1 wp_navigations

hpricot is no longer supported and doesn't build on some modern systems
like Mac. Our last use of Hpricot was wordpressdotcom importer, which is
now converted to use nokogiri.

Rephrase wordpressdotcom importer alternative caveat.  The
wordpressdotcom importer does import pages, posts, images, etc so
previous statement was untrue.
@parkr
Copy link
Member Author

parkr commented Jan 21, 2025

@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit 693652d into master Jan 21, 2025
7 checks passed
@jekyllbot jekyllbot deleted the hpricot-bye-bye branch January 21, 2025 01:13
jekyllbot added a commit that referenced this pull request Jan 21, 2025
github-actions bot pushed a commit that referenced this pull request Jan 21, 2025
Parker Moore: Replace hpricot with nokogiri in wordpressdotcom (#555)

Merge pull request 555
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hpricot no longer builds and should be replaced, maybe by nokogiri
3 participants