HTML5 namespaces do not propagate to parent nodes when adding nodes to a document #2647

Open
dan42 opened this issue Sep 14, 2022 · 4 comments


dan42 commented Sep 14, 2022

After parsing a document that contains <svg> elements, it's possible to traverse the elements with xpath(".//svg:svg").

But if we start with a document that has no <svg> elements and then add <svg> elements to it, the svg namespace is not added to the document, so the above xpath raises an "Undefined namespace prefix" error.

To illustrate:

require "nokogiri"
p Nokogiri::VERSION

tags = %[<div><svg><use xlink:href="..."></use></svg>]
doc1 = Nokogiri::HTML5.parse("<section>"+tags)
doc2 = Nokogiri::HTML5.parse("<section>")
doc2.at("section").children = tags

[doc1, doc2].each do |doc|
  puts "",doc
  [doc, doc.at_css("div")].each do |base|
    p [base.name, base.namespaces]
    %w[ .//svg:svg  .//@xlink:href ].each do |x|
      p x => (base.xpath(x).size rescue $!)
    end
  end
end

Output:

"1.14.0.dev"

<html><head></head><body><section><div><svg><use xlink:href="..."></use></svg></div></section></body></html>
["document", {"xmlns:svg"=>"http://www.w3.org/2000/svg", "xmlns:xlink"=>"http://www.w3.org/1999/xlink"}]
{".//svg:svg"=>1}
{".//@xlink:href"=>1}
["div", {"xmlns:svg"=>"http://www.w3.org/2000/svg", "xmlns:xlink"=>"http://www.w3.org/1999/xlink"}]
{".//svg:svg"=>1}
{".//@xlink:href"=>1}

<html><head></head><body><section><div><svg><use xlink:href="..."></use></svg></div></section></body></html>
["document", {}]
{".//svg:svg"=>#<Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix: .//svg:svg>}
{".//@xlink:href"=>#<Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix: .//@xlink:href>}
["div", {"xmlns:svg"=>"http://www.w3.org/2000/svg", "xmlns:xlink"=>"http://www.w3.org/1999/xlink"}]
{".//svg:svg"=>#<Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix: .//svg:svg>}
{".//@xlink:href"=>#<Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix: .//@xlink:href>}

As we can see above, even though doc1 and doc2 have the same structure, doc2.namespaces returns an empty hash, and namespaced xpath queries raise an error for doc2, even when run from the div element that claims to have the namespaces.

Now, it's probably better anyway to use css("svg") instead of xpath(".//svg:svg"). But I don't think there's an alternative to xpath(".//@xlink:href"); at least css("[xlink:href]") results in Nokogiri::CSS::SyntaxError.

@dan42 dan42 added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Sep 14, 2022
@flavorjones
Member

Hi, thanks for opening this issue.

What's going on with the namespaces?

This code snippet clarifies what's going on:

#! /usr/bin/env ruby

require "nokogiri"

svg = <<~SVG
  <div>
    <svg height="100" width="100">
      <circle cx="50" cy="50" r="40" stroke="black" stroke-width="3" />
    </svg>
  </div>
SVG

parsed_doc = Nokogiri::HTML5.parse(<<~HTML)
  <html>
    <body>
      <section>
        #{svg}
      </section>
    </body>
  </html>
HTML

assembled_doc = Nokogiri::HTML5.parse(<<~HTML)
  <html>
    <body>
      <section>
      </section>
    </body>
  </html>
HTML

assembled_doc.at_css("section").children = svg

parsed_doc.at_css("circle").ancestors.map do |a|
  [a.name, (a.namespace_definitions rescue nil)]
end
# => [["svg", []],
#     ["div", []],
#     ["section", []],
#     ["body", []],
#     ["html",
#      [#(Namespace:0x3c {
#         prefix = "svg",
#         href = "http://www.w3.org/2000/svg"
#         })]],
#     ["document", nil]]

assembled_doc.at_css("circle").ancestors.map do |a|
  [a.name, (a.namespace_definitions rescue nil)]
end
# => [["svg", []],
#     ["div",
#      [#(Namespace:0x50 {
#         prefix = "svg",
#         href = "http://www.w3.org/2000/svg"
#         })]],
#     ["section", []],
#     ["body", []],
#     ["html", []],
#     ["document", nil]]

The thing I'd like to point out here is that

  • the namespace definition is attached to the document root in the first doc,
  • and attached to the top-most node of the fragment subtree in the second doc.

When the libxml2 tree is created in ext/nokogiri/gumbo.c, these namespaces are added to the top-most node of either the document or the fragment, and the convention we currently follow is that we don't move namespace definitions around during reparenting.

Why this impacts an xpath query without explicit namespaces

base.xpath(".//svg:svg") doesn't work because, in the general case, Nokogiri expects users to explicitly provide the namespaces used in the query (see https://nokogiri.org/rdoc/Nokogiri/XML/Node.html#method-i-xpath). For example, changing your query to this works in all cases:

base.xpath(x, {"svg"=>"http://www.w3.org/2000/svg", "xlink"=>"http://www.w3.org/1999/xlink"})

base.xpath(".//svg:svg") works on the first document (without explicit namespaces) because Nokogiri implicitly uses any namespaces defined on the document root if no namespaces are passed to #xpath. (I'll also note that we worked pretty hard to make sure that CSS queries work properly without explicit namespaces in HTML5, see #2376 for discussion and #2403 for implementation details.)

So: you've still got a way to search properly!
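
To avoid repeating the bindings at every call site, one option is a small wrapper that always passes them. This is only a sketch: HTML5_FOREIGN_NAMESPACES and xpath_with_foreign_namespaces are hypothetical names (not part of Nokogiri), and it assumes `node` responds to #xpath(query, namespaces) the way Nokogiri::XML::Node is documented to.

```ruby
# Hypothetical constant: prefix => URI for the two namespaces used in this issue.
HTML5_FOREIGN_NAMESPACES = {
  "svg"   => "http://www.w3.org/2000/svg",
  "xlink" => "http://www.w3.org/1999/xlink",
}.freeze

# Hypothetical helper: run an XPath query with explicit namespace bindings,
# so the result does not depend on where namespace definitions sit in the tree.
# `extra` lets the caller add or override bindings for a single query.
def xpath_with_foreign_namespaces(node, query, extra = {})
  node.xpath(query, HTML5_FOREIGN_NAMESPACES.merge(extra))
end
```

Since this is equivalent to the explicit call shown above, xpath_with_foreign_namespaces(doc2, ".//svg:svg") should behave the same way for both the parsed and the assembled document.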

Possible alternative behaviors

What to do with namespaces during reparenting has been a 🔥 hot 🔥 topic over the years, and unfortunately the specifications do not provide any guidance. We've tried to establish some behavior and implement it as consistently as we can. I will very readily admit that some of the decisions we made may have been wrong; see #1200 for an example of behavior unrelated to this issue that I'd like to change.

Now: I don't think we're obviously doing the wrong thing here. I would be comfortable closing this and saying "behaving as designed," particularly since you can explicitly pass the relevant namespaces to the #xpath call.

But I also think that, because there are a finite number of legal namespaces in HTML5, there's an opportunity to step back and ask whether we can make some assumptions that make this more user-friendly for the HTML5 use case.

Idea 1: when querying HTML5 with xpath, always implicitly include the three legal namespaces

What if xpath queries on HTML5 documents implicitly used the following namespaces:

{
  "svg" => "http://www.w3.org/2000/svg",
  "math" => "http://www.w3.org/1998/Math/MathML",
  "xlink" => "http://www.w3.org/1999/xlink",
}

Then users might never have to specify namespaces, and this would work in all of the cases above:

base.xpath(".//svg:circle")
base.xpath(".//math:mrow")
base.xpath(".//@xlink:href")
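
From the caller's side, Idea 1 could behave roughly like the wrapper below. This is a hypothetical user-land sketch, not how Nokogiri implements (or would implement) it: ImplicitForeignNamespaces and DEFAULTS are made-up names, and it assumes the receiver's #xpath accepts query strings followed by an optional namespace Hash.

```ruby
# Hypothetical module: supplies the three HTML5 foreign-content prefixes
# whenever the caller does not pass a namespace Hash of its own.
module ImplicitForeignNamespaces
  DEFAULTS = {
    "svg"   => "http://www.w3.org/2000/svg",
    "math"  => "http://www.w3.org/1998/Math/MathML",
    "xlink" => "http://www.w3.org/1999/xlink",
  }.freeze

  def xpath(*args)
    # An explicit Hash from the caller wins; otherwise append the defaults.
    args += [DEFAULTS] unless args.any? { |arg| arg.is_a?(Hash) }
    super(*args)
  end
end

# Usage (assumes the nokogiri gem is loaded and `doc` is an HTML5 document):
#   doc.extend(ImplicitForeignNamespaces)
#   doc.xpath(".//svg:circle")
```

Note that extending a single object only changes that object's #xpath; nodes returned from it would still need explicit namespaces. A real implementation would presumably live inside the library's own search code.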

Idea 2: when reparenting a node with any of these namespaces, copy (move?) them to the document root

As mentioned above, when the libxml2 tree is created in ext/nokogiri/gumbo.c, these namespaces are added to the top-most node.

We could extend this behavior to reparenting so that, when a node is reparented into an HTML5 document, any of these namespaces will be copied (or moved?) up to the document root. Then the current behavior of Node#xpath would pick them up implicitly, and the location of the namespace definitions would be consistent across the two cases presented.
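
As a rough illustration of Idea 2 from user code, the helper below copies namespace definitions from the top-most elements of an inserted subtree onto the document root. It is a sketch under assumptions: hoist_namespace_definitions is a hypothetical name, it copies rather than moves, and it relies on Node#namespace_definitions, Namespace#prefix, Namespace#href, and Node#add_namespace_definition behaving as documented.

```ruby
# Hypothetical helper: copy every prefixed namespace definition declared on
# `nodes` (e.g. the top-most elements of a freshly inserted fragment) to `root`.
# Definitions without a prefix (default namespaces) are skipped, because
# changing the root's default namespace is outside the scope of this sketch.
def hoist_namespace_definitions(nodes, root)
  nodes.each do |node|
    node.namespace_definitions.each do |ns|
      next if ns.prefix.nil?
      root.add_namespace_definition(ns.prefix, ns.href)
    end
  end
  root
end

# Usage (assumes the nokogiri gem is loaded and `assembled_doc` is built as above):
#   section = assembled_doc.at_css("section")
#   hoist_namespace_definitions(section.element_children, assembled_doc.root)
```

The intent is that, once the definitions are on the root, the implicit lookup in Node#xpath would find the prefixes without the caller passing them.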


I don't feel strongly about either of these approaches. Maybe @stevecheckoway can weigh in?

@flavorjones flavorjones added topic/HTML5 topic/namespaces and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Sep 15, 2022
@flavorjones flavorjones changed the title [bug] namespaces do not propagate to parent nodes when adding nodes to a document HTML5 namespaces do not propagate to parent nodes when adding nodes to a document Sep 15, 2022
@dan42
Author

dan42 commented Sep 15, 2022

Thank you for the detailed and insightful answer. And I didn't know about explicitly providing the namespaces to xpath, so thank you for fixing my ignorance.

I like your idea 1. With this it would be possible to query for xpath .//svg:circle without raising an error even if there's no <svg> element in the doc.

@flavorjones
Member

I like option 1 better, too (after sleeping on it), but I really would like Steve's feedback because he's gone deeper on HTML5 foreign element namespaces than I have.

@flavorjones
Member

@stevecheckoway What do you think of option 1 above, implicitly including the svg, math, and xlink namespaces in xpath queries on HTML5 documents?

@flavorjones flavorjones added this to the v1.15.0 milestone Nov 22, 2022
@flavorjones flavorjones modified the milestones: v1.15.0, v1.16.0 Apr 28, 2023
@flavorjones flavorjones modified the milestones: v1.17.0, v1.18.0 Dec 8, 2024
@flavorjones flavorjones modified the milestones: v1.18.0, v1.19.0 Dec 16, 2024
2 participants