HTML4::DocumentFragment truncates text in a <div> tag at about 10mb #2941
@bensherman Thanks for asking this question.

Diagnosis

libxml2 has some soft limits that you're running into here. The bad news is that it's not obvious what's happening. The good news, though, is that if you know what you're looking for, it's easy to detect when this is happening and it's easy to work around it.

Explanation

Let's start with a slightly simpler reproduction:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)
  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  puts
end

which outputs:
Detection

Let's start by checking if the parser has reported any errors:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html)
  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

which outputs:
Aha! So we've hit the soft limits.

Configure the parser

Let's tell libxml2 to turn off the safety limits and handle huge inputs:

require "nokogiri"

TIMES = 9999983

(0..1).each do |j|
  text = "a" * (TIMES + j)
  html = "<div>" + text + "</div>"
  parsed4 = Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }
  puts "length of source html: #{html.length}"
  puts "length of parsed html: #{parsed4.to_html.length}"
  pp parsed4.errors
  puts
end

which outputs:
So now the full text is parsed correctly. Hopefully this helps you!

Epilogue

This is weird to have to work around, and I'm open to suggestions on how to balance the competing concerns here.

libxml2's strategy over the years has been to adopt a set of reasonably large size limits that protect against unbounded memory allocation while also enabling most inputs to be handled correctly. This has been "good enough" for Nokogiri for a pretty long time. It might also be interesting to think about whether libgumbo should have similar soft limits. @stevecheckoway is this something we can chat about, maybe in a new issue?
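Putting the detection and the workaround together, one possible pattern looks like this. This is a sketch, not code from the thread: the file name is a placeholder, and the "Huge input lookup" message text is an assumption that may differ across libxml2 versions.

```ruby
require "nokogiri"

html = File.read("big.html") # hypothetical input; any large HTML string works

# Hypothetical helper: returns true if libxml2 reported hitting one of its
# resource limits while parsing. The "Huge input lookup" message text is an
# assumption and may vary by libxml2 version.
def libxml2_truncated?(fragment)
  fragment.errors.any? { |e| e.message.include?("Huge input lookup") }
end

fragment = Nokogiri::HTML4::DocumentFragment.parse(html)
if libxml2_truncated?(fragment)
  # Re-parse with the soft limits disabled, as shown above.
  fragment = Nokogiri::HTML4::DocumentFragment.parse(html) { |config| config.huge }
end
```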
Thanks for the quick response, we can work this into our code.
edit: I will follow along to see how the libgumbo work goes - thanks again.
I'm not entirely sure what the limit is designed to do. The way Gumbo tries to handle parsing a malicious document is to adhere to the standard as closely as we can. Barring standards bugs (which do exist and get fixed with some regularity), the HTML standard specifies a way to parse any sequence of bytes. A malicious document could probably cause Gumbo to allocate a lot of memory, particularly with errors. We have soft limits on the number of errors and tree depth because they were requested by a downstream project. Currently, if gumbo fails to allocate memory, the whole process is killed. It may be worth thinking about how to make that happen. Since gumbo doesn't do anything like expand XML external entities (e.g., the billion laughs attack), it may require a very large document.
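For reference, a sketch of how those gumbo-side soft limits surface in Nokogiri's HTML5 API. The max_errors and max_tree_depth keyword names come from Nokogiri's documented HTML5 parse options; the values here are purely illustrative.

```ruby
require "nokogiri"

html = "<div>" + "a" * 11_000_000 + "</div>"

# Unlike libxml2, gumbo has no input-length limit, but it does expose soft
# limits on error collection and tree depth.
frag = Nokogiri::HTML5.fragment(html, max_errors: 10, max_tree_depth: 400)

puts frag.to_html.length # the full text is preserved; no ~10mb truncation here
pp frag.errors
```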
@stevecheckoway Thanks for the thoughtful reply. The libxml2 limits are in place to cap memory allocations done for an untrusted document. They halt the parser based on a few different limits, either length or depth or even individual text node size. It's enough to prevent an OOM condition. I think I'm primarily concerned about unbounded memory allocation if a large untrusted document is parsed. It seems like it could be trivial to craft a long-enough document to trigger an OOM condition, which could be used for a denial-of-service attack.
@flavorjones Gotcha! That does make sense. We'll have to think about threading allocation failure all the way back to the main gumbo parse functions. Makes me pine for Rust's
I don't think it's particularly urgent that we address this, but I have created a new issue #2949 so we don't forget and can continue the conversation.
It looks like there is different behavior for Nokogiri::HTML4::DocumentFragment vs HTML5 - if you have more than 10mb of text inside a <div> tag in the HTML4 parser, some of it goes missing. It looks like there is a buffer in the parser with about a 10mb limit, and the data after that point gets dropped - the closing </div> tag remains in the output, so it's not truncating the final string. Quickly written code that uses text converted to escaped HTML shows where it breaks:
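(The original script isn't reproduced here; a minimal sketch of the same comparison, assuming CGI.escapeHTML for the escaping step, might look like this.)

```ruby
require "cgi"
require "nokogiri"

# Build well over 10mb of escaped text inside a single <div>.
source = "a & b < c " * 1_500_000
html   = "<div>#{CGI.escapeHTML(source)}</div>"

html4 = Nokogiri::HTML4::DocumentFragment.parse(html)
html5 = Nokogiri::HTML5.fragment(html)

# Without the :huge option, the HTML4 result comes back shorter than the source.
puts "source text length:       #{source.length}"
puts "HTML4 parsed text length: #{html4.at_css('div').text.length}"
puts "HTML5 parsed text length: #{html5.at_css('div').text.length}"
```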
Environment

A quick env to reproduce. It happens on older versions too. This is from the docker image ruby:3.2.2, running only gem install nokogiri and then the above script.