parsing behavior related to a vulnerability #3351

flavorjones · 2024-06-04T14:54:35Z

Note from maintainers: the potential vulnerability described below has been mitigated in rails-html-sanitizer v1.6.1. This issue was private until 2024-12-04 when it was moved into the public Nokogiri repository.

@stevecheckoway, I wanted to discuss a thing that came up in the context of a vulnerability report against Rails, that I think is either a problem with the spec or a bug in the gumbo parser (and I'm honestly not sure which).

Here's an explanation of what's going on, as concisely as I can explain:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

# a few notes of explanation of what the parser is currently doing:
#
# 1. mtext is a MathML text integration point
# 2. ... so table is parsed as HTML content
# 3. mglyph is parsed as HTML content *but* isn't allowed in the "in table" parsing context
# 4. ... so the token is foster parented under the mtext element
# 5. style is also parsed as HTML content, and is placed under the mglyph element
input = "<math><mtext><table><mglyph><style><img src=: onerror=alert(1)>"

frag1 = Nokogiri::HTML5.fragment(input, nil, max_errors: 10)

puts frag1.errors.map(&:to_s).grep(/isn't allowed here/)
# => 1:21: ERROR: Start tag 'mglyph' isn't allowed here. Currently open tags: html, math, mtext, table.
#    <math><mtext><table><mglyph><style><img src=: onerror=alert(1)>
#                        ^

pp frag1
# => #(DocumentFragment:0x348 {
#      name = "#document-fragment",
#      children = [
#        #(Element:0x35c {
#          name = "math",
#          namespace = #(Namespace:0x370 { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }),
#          children = [
#            #(Element:0x384 {
#              name = "mtext",
#              namespace = #(Namespace:0x370 { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }),
#              children = [
#                #(Element:0x398 {
#                  name = "mglyph",
#                  children = [
#                    #(Element:0x3ac { name = "style", children = [ #(Text "<img src=: onerror=alert(1)>")] })]
#                  }),
#                #(Element:0x3c0 { name = "table" })]
#              })]
#          })]
#      })
#

puts frag1.to_html
# => <math><mtext><mglyph><style><img src=: onerror=alert(1)></style></mglyph><table></table></mtext></math>

The reason this is contributing to a security problem is that we can't round-trip that DOM through the parser. Continuing the above snippet ...

# second time around, the parser's behavior is:
#
# 1. mtext, mglyph, and style are all parsed as foreign content
# 2. and the "<img ...>" string is parsed as foreign content! (not raw text data)
# 3. the img tag isn't allowed as a child of the mathml style element
# 4. ... and is foster parented under the mtext element 
frag2 = Nokogiri::HTML5.fragment(frag1.to_html, nil, max_errors: 10)

puts frag2.errors.map(&:to_s).grep(/isn't allowed here/)
# => 1:29: ERROR: Start tag 'img' isn't allowed here. Currently open tags: html, math, mtext, mglyph, style.
#    <math><mtext><mglyph><style><img src=: onerror=alert(1)></style></mglyph><table></table></mtext></math>
#                                ^
#    1:57: ERROR: End tag 'style' isn't allowed here. Currently open tags: html, math, mtext.
#    <math><mtext><mglyph><style><img src=: onerror=alert(1)></style></mglyph><table></table></mtext></math>
#                                                            ^
#    1:65: ERROR: End tag 'mglyph' isn't allowed here. Currently open tags: html, math, mtext.
#    <math><mtext><mglyph><style><img src=: onerror=alert(1)></style></mglyph><table></table></mtext></math>
                                                                ^
pp frag2
# => #(DocumentFragment:0x3d4 {
#      name = "#document-fragment",
#      children = [
#        #(Element:0x3e8 {
#          name = "math",
#          namespace = #(Namespace:0x3fc { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }),
#          children = [
#            #(Element:0x410 {
#              name = "mtext",
#              namespace = #(Namespace:0x3fc { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }),
#              children = [
#                #(Element:0x424 {
#                  name = "mglyph",
#                  namespace = #(Namespace:0x3fc { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }),
#                  children = [
#                    #(Element:0x438 {
#                      name = "style",
#                      namespace = #(Namespace:0x3fc { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" })
#                      })]
#                  }),
#                #(Element:0x44c {
#                  name = "img",
#                  attribute_nodes = [
#                    #(Attr:0x460 { name = "src", value = ":" }),
#                    #(Attr:0x474 { name = "onerror", value = "alert(1)" })]
#                  }),
#                #(Element:0x488 { name = "table" })]
#              })]
#          })]
#      })

puts frag2.to_html
# => <math><mtext><mglyph><style></style></mglyph><img src=":" onerror="alert(1)"><table></table></mtext></math>

And now the img tag's onerror handler will run in a browser.

I have a few questions here ...

1. mglyph foster parenting in parse 1

In the first parse, when mglyph is foster-parented, is placing it under mtext the correct behavior? The spec says in § 13.2.6.4.9:

Enable foster parenting, process the token using the rules for the "in body" insertion mode, and then disable foster parenting.

So I think the spec is saying "parse this as HTML content even though the parent is mtext" and so maybe this is a bug in the spec?

The spec also describes how to find the target for foster parenting in § 13.2.6.1, and that guidance seems to also say "use mtext as the parent node" even though it's got very different parsing rules.

In fact, mtext appears in the list of nodes that break the "have a particular element in scope" chain in § 13.2.4.2. Shouldn't the spec include a check for "having an HTML element in scope" in the foster parenting instructions?

This seems like the parser is following the spec correctly, but the spec is incomplete. WDYT?

I can imagine two possible solutions:

pick a different node to foster-parent under (which seems hard to specify)
make sure that the fostered token is re-parsed in the correct insertion mode (in our case, this would be "as a child of mtext, a MathML integration point" which according to § 13.2.6 is "in foreign content")

2. style foster parenting in parse 1

It seems like the style tag is then also foster parented under the new mglyph element.

I don't know whether this is right or not. The foster parenting instructions say "... and then disable foster parenting" but it seems like gumbo didn't disable it? Maybe I'm misreading the spec.

The text was updated successfully, but these errors were encountered:

stevecheckoway · 2024-06-04T19:24:45Z

TL;DR (1) I think there's a spec bug around mglyph (and probably malignmark) in that they are only handled correctly when they directly appear as children of mtext (or other MathML text integration points) and not in other cases including foster parenting (as happens here) and also when handled as descendants of a MathML text integration point but not direct children. (2) HTML with parse errors are likely to not round trip through serialization.

I'm not surprised that documents with errors cannot round-trip through parsing and serialization. Even documents without errors may not (e.g., textarea elements which I believe we explicitly do not follow the standard in order to get to round trip correctly when they start with newlines). The Serializing HTML fragments algorithm contains this warning.

Warning!
It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.

Parsing

As I read the standard, what should happen is <math><mtext><table><mglyph><style><img src=: onerror=alert(1)> should be parsed like

math:math
└─ math:mtext
   ├─ mglyph
   │  └─ style
   │     └─ "<img src=: onerror=alert(1)>"
   └─ table

I believe that matches the output you gave.

In the first parse, when mglyph is foster-parented, is placing it under mtext the correct behavior? The spec says in § 13.2.6.4.9:
Enable foster parenting, process the token using the rules for the "in body" insertion mode, and then disable foster parenting.
So I think the spec is saying "parse this as HTML content even though the parent is mtext" and so maybe this is a bug in the spec?

Placingmglyph under the mtext is correct but note that it's not inserting a math:mglyph element but a nonstandard HTML element with the tag name mglyph.

In fact, mtext appears in the list of nodes that break the "have a particular element in scope" chain in § 13.2.4.2. Shouldn't the spec include a check for "having an HTML element in scope" in the foster parenting instructions?

I'm not sure what's supposed to happen when a foreign element is specified inside a table. The problem is that the <mglyph> token is handled according to the "in body" insertion mode but it should probably be handled according to the tree construction dispatcher with the current insertion mode set to "in body." Using the tree construction dispatcher would handle that specific case.

This seems like the parser is following the spec correctly, but the spec is incomplete. WDYT?

I agree.

It seems like the style tag is then also foster parented under the new mglyph element.

I don't think foster parenting what's happening actually. When the mglyph element is inserted, it is added to the stack of open elements which makes it the current node so that when the <style> token is encountered, a style element is inserted as a child of the mglyph element:

Emitted start tag style.
Original text = <style>.
Handling style token @1:29 in insertion mode 8.
Current node: <mglyph>.
Inserting <style> element (@0x600001536300) from token.

Insertion mode 8 is "in table" which seems strange because the mglyph is not a child of any tables at this point but this is correct. Something like a <tbody> or <tr> rather than <style> would cause the parser to clear the stack back to a table context and so <math><mtext><table><mglyph><tbody>... would put the tbody element as a child of the table and not the mglyph!

The foster parenting instructions say "... and then disable foster parenting" but it seems like gumbo didn't disable it?

I think this is being handled correctly.

nokogiri/gumbo-parser/src/parser.c

Line 3581 in 4cceeac

state->_foster_parent_insertions = false;

Serializing

According to Firefox, the first fragment serializes as

<math><mtext><mglyph><style><img src=: onerror=alert(1)>
</style></mglyph><table></table></mtext></math>

which is similar to, but not exactly what Nokogiri is giving us:

<math><mtext><mglyph><style><img src=: onerror=alert(1)></style></mglyph><table></table></mtext></math>

(~~I can't explain the newline in Firefox's output. I don't think it should be there but I also don't think it matters.~~ The newline comes from the HTML file ending in a newline.)

As you note, this fragment is parsed with the style being parsed in foreign context and no special rules apply. The img is also parsed in foreign context and this gives rise to the first parse error about img not being allowed. Now it's not foster parenting that occurs, but how img start tokens are handled. Specifically, it pops elements from the stack of open elements until the current node is a MathML text integration point, an HTML integration point, or an element in the HTML namespace. In this case, it'll pop the style and mglyph elements before inserting the img as a normal HTML element:

Emitted start tag img.
Original text = <img src=: onerror=alert(1)>.
Handling img token @1:29 in insertion mode 6.
Current node: <style>.
Handling foreign content.
Adding parse error.
Popping style node.
Popping mglyph node.
Inserting <img> element (@0x60000151ba20) from token.

flavorjones · 2024-06-04T19:44:55Z

@stevecheckoway Thanks for the thoughtful response.

I'm not surprised that documents with errors cannot round-trip

I'm not, either, and I didn't mean to imply an expectation that they should roundtrip, just highlighting that round-tripped is necessary to see the vulnerability.

Placingmglyph under the mtext is correct but note that it's not inserting a math:mglyph element but a nonstandard HTML element with the tag name mglyph.

Understood. Maybe if I was going to roll up my long initial post, it would be "I think this should be a math:mglyph because it's been foster-parented by mtext". Maybe we can dig into this for a second ...

The spec says

Enable foster parenting, process the token using the rules for the "in body" insertion mode

and it doesn't make sense to me that this doesn't take into account the insertion mode that would normally be demanded by the foster parent. §13.2.6 appears to say the token should be processed in foreign content -- it's the mglyph token, and the adjusted current node is mtext. Why is this tree construction guidance not in effect during foster parenting?

stevecheckoway · 2024-06-04T20:15:11Z

@stevecheckoway Thanks for the thoughtful response.

I'm not surprised that documents with errors cannot round-trip

I'm not, either, and I didn't mean to imply an expectation that they should roundtrip, just highlighting that round-tripped is necessary to see the vulnerability.

Placingmglyph under the mtext is correct but note that it's not inserting a math:mglyph element but a nonstandard HTML element with the tag name mglyph.

Understood. Maybe if I was going to roll up my long initial post, it would be "I think this should be a math:mglyph because it's been foster-parented by mtext". Maybe we can dig into this for a second ...

Yeah, I agree that this seems like the most reasonable behavior but note that something like <mtext><span><mglyph .../></span></mtext> is also going to be creating an mglyph rather than a math:mglyph so in that sense, <table><mglyph> and <span><mglyph> are going to be treated the same way.

I was going to test how browsers handle this, but no browser actually supports mglyph.

The spec says

Enable foster parenting, process the token using the rules for the "in body" insertion mode

and it doesn't make sense to me that this doesn't take into account the insertion mode that would normally be demanded of the foster parent. §13.2.6 appears to say the token should be processed in foreign content -- it's the mglyph token, and the adjusted current node is mtext. Why is this tree construction guidance not in effect during foster parenting?

Yes, I think that's the bug in the spec.

flavorjones · 2024-06-04T21:49:07Z

note that something like <mtext><span><mglyph .../></span></mtext> is also going to be creating an mglyph rather than a math:mglyph

Sure, I understand the use case in this example is a subtree that's parse as HTML (not MathML) and I get it.

no browser actually supports mglyph

LOL 🤷

Yes, I think that's the bug in the spec

OK, so just to make sure I understand, let me try to rephrase what I think we agree on ... The act of foster parenting mglyph -- admittedly in an HTML context as originally written -- should cause the token to be re-parsed as a child of mtext and so "in foreign content" and thus it should end up as math:mglyph.

If that's a rough statement that we think should be true and reflected in the spec ... what do you think the next steps should be?

How would we start a conversation about this? Open an issue on https://github.com/whatwg/html ?
Should we try to modify the behavior of libgumbo in the meantime? Or should we wait for this to become "settled law" first? Asking primarily because I still need to mitigate the Rails vulnerability somehow.

stevecheckoway · 2024-06-04T22:41:28Z

OK, so just to make sure I understand, let me try to rephrase what I think we agree on ... The act of foster parenting mglyph -- admittedly in an HTML context as originally written -- should cause the token to be re-parsed as a child of mtext and so "in foreign content" and thus it should end up as math:mglyph.

I think the behavior you describe is the best option with respect to serializing and reparsing.

The current behavior is self-consistent in that any HTML element appearing between the mtext (or other MathML text integration point) and the mglyph causes the mglyph to be treated as an HTML node, regardless of foster parenting. Changing the behavior of foster parenting will lead to a change in behaviors for <span><mglyph …/></span> and <table><mglyph …/></table> in that the latter will cause the mglyph to be a math:mglyph and the former will cause the mglyph to be an HTML element. The WG might view that as undesirable.

If that's a rough statement that we think should be true and reflected in the spec ... what do you think the next steps should be?
* How would we start a conversation about this? Open an issue on https://github.com/whatwg/html ?

I think it makes sense to open an issue and ask what the expected behavior is here.

* Should we try to modify the behavior of libgumbo in the meantime? Or should we wait for this to become "settled law" first? Asking primarily because I still need to mitigate the Rails vulnerability somehow.

My inclination is the parser should match the standard.

Can you say more about the vulnerability? I assume this comes from parsing, validating, and serializing attacker-controlled HTML and the problem is that the DOM produced by serializing is different from the DOM used in validation. If that's the case, then I think the best approach is to

Parse the HTML into a DOM;
Serialize the DOM;
Parse the serialized DOM;
Validate the DOM from step 3; and
If valid, output the serialization from step 2.

For security, I'd also ensure that the serialization of the DOM from step 3 matches the serialization from step 2.

I realize this is not ideal since parsing needs to happen twice and (if the security step is followed), serialized twice. One of the serializations can be skipped by comparing the DOMs produced in steps 1 and 3 directly, but that's more complicated and error-prone.

stevecheckoway · 2024-06-05T16:39:00Z

One thought along these lines. Gumbo currently has flags per node tracking a few things including whether a node was foster-parented out of a table, moved by the adoption agency algorithm, cloned by the adoption agency algorithm, created due to reconstructed formatting elements, and a few other things that seem less important.

Currently, nothing uses these flag. As the gumbo source has evolved, I don't think close attention was paid to setting these flags when creating nodes.

One option would be to track at the document level whether a node was adopted, reparented, etc. and expose that information to Ruby. Then, the 5 steps I gave above can be simplified since the DOM from step 1 should match the DOM from step 3 so long as no nodes were adopted or foster parented.

Going this route would require

Setting a bool to true whenever adopting or foster parenting (or maybe only foster parenting can lead to problems?) occurs.
Testing that the bool is set.
Fuzzing to ensure that when the bool isn't set, then serializing, parsing, and serializing gives identical results.

Step 1 seems easy (without actually checking). Step 2 involves finding every place foster parenting can occur and testing. Step 3 is probably easy, but I don't have experience with libFuzzer. I've only used AFL to fuzz gumbo and that was long enough ago, that I'm not sure I even have the infrastructure saved somewhere.

flavorjones · 2024-06-05T16:57:09Z

Can you say more about the vulnerability?

The original post mimics the full lifecycle of Rails sanitizing an untrusted input, and a user browser parsing the serialized result.

The first parse is of the untrusted content. Once parsed, Rails::HTML::Sanitizer traverses the DOM looking for disallowed tags and attributes, and removes them. An attribute like onerror=alert(1) on the img tag would be disallowed and removed. However, in this case there's no img tag or onerror attribute; there's only a rawtext child of the style tag.

Once serialized by Rails and then parsed by a user's browser (the second parse), style is parsed as math:style, and so now there's an img tag (foster-parented under mtext) with a dangerous onerror attribute that is executed by the browser.

Reparsing may sufficient for the attack described above; but I suspect it's possible to craft attack payloads with subtrees recursively nested within style tags that would necessitate iteratively parsing multiple times to mitigate.

I'd prefer to just prune big pieces of the DOM than add complexity to the sanitizer simply to try to optimally handle these kinds of cases. The attack vectors only exist if developers override the default allowlist and allow math or svg and style (and mtext, and mglyph ... you get the picture). The default allowlist is very conservative, and I'm not convinced anyone actually configures their Rails app in this way.

One idea I want to explore is for the sanitizer to remove any "style" tag who has an ancestor element that is foreign content (i.e., that has a namespace). Or, another idea is to simply throw our hands up and say "look, if you override the defaults to allow ((math or svg) and style), that's an insecure configuration and you're on your own" (similar to what would happen if someone allowlisted the script element).

flavorjones · 2024-06-05T17:07:41Z

(While writing my response, I missed your most recent response above)

I like the idea of tracking more detailed parse state (like "fostered"), but that feels like a much bigger lift than I think we should put into this particular vulnerability type. Like I mentioned above, the configurations required are not realistic and I personally question whether it's worth the effort (by us) to mitigate risks with a scalpel that we could more easily mitigate with a machete.

flavorjones · 2024-12-04T13:49:06Z

Now that rails-html-sanitizer v1.6.1 has shipped a mitigation for this vulnerability, I've moved the issue into the public repository so the conversation is public.

flavorjones transferred this issue from another repository Dec 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parsing behavior related to a vulnerability #3351

parsing behavior related to a vulnerability #3351

flavorjones commented Jun 4, 2024 •

edited

Loading

stevecheckoway commented Jun 4, 2024

flavorjones commented Jun 4, 2024 •

edited

Loading

stevecheckoway commented Jun 4, 2024

flavorjones commented Jun 4, 2024

stevecheckoway commented Jun 4, 2024

stevecheckoway commented Jun 5, 2024

flavorjones commented Jun 5, 2024 •

edited

Loading

flavorjones commented Jun 5, 2024

flavorjones commented Dec 4, 2024

parsing behavior related to a vulnerability #3351

parsing behavior related to a vulnerability #3351

Comments

flavorjones commented Jun 4, 2024 • edited Loading

1. mglyph foster parenting in parse 1

2. style foster parenting in parse 1

stevecheckoway commented Jun 4, 2024

Parsing

Serializing

flavorjones commented Jun 4, 2024 • edited Loading

stevecheckoway commented Jun 4, 2024

flavorjones commented Jun 4, 2024

stevecheckoway commented Jun 4, 2024

stevecheckoway commented Jun 5, 2024

flavorjones commented Jun 5, 2024 • edited Loading

flavorjones commented Jun 5, 2024

flavorjones commented Dec 4, 2024

flavorjones commented Jun 4, 2024 •

edited

Loading

flavorjones commented Jun 4, 2024 •

edited

Loading

flavorjones commented Jun 5, 2024 •

edited

Loading