From eb1572ce7f7a6e97ec44c27568286345c2a7748e Mon Sep 17 00:00:00 2001
From: Roland Shoemaker
Date: Fri, 14 Apr 2023 12:13:47 -0700
Subject: [PATCH] html: another shot at security doc

Be clearer about the operation of the tokenizer and the parser (and
their differences), and be explicit about the need for re-serialization
when they are being used in security contexts.

Change-Id: Ieb8f2a9d4806fb7a8849a15671667396e81c53b9
Reviewed-on: https://go-review.googlesource.com/c/net/+/484795
Auto-Submit: Roland Shoemaker
Reviewed-by: Damien Neil
Run-TryBot: Roland Shoemaker
TryBot-Result: Gopher Robot
---
 html/doc.go | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/html/doc.go b/html/doc.go
index 5ff8480cf..2466ae3d9 100644
--- a/html/doc.go
+++ b/html/doc.go
@@ -99,14 +99,20 @@ Care should be taken when parsing and interpreting HTML, whether full documents
 or fragments, within the framework of the HTML specification, especially with
 regard to untrusted inputs.
 
-This package provides both a tokenizer and a parser. Only the parser constructs
-a DOM according to the HTML specification, resolving malformed and misplaced
-tags where appropriate. The tokenizer simply tokenizes the HTML presented to it,
-and as such does not resolve issues that may exist in the processed HTML,
-producing a literal interpretation of the input.
-
-If your use case requires semantically well-formed HTML, as defined by the
-WHATWG specification, the parser should be used rather than the tokenizer.
+This package provides both a tokenizer and a parser, which implement the
+tokenization, and tokenization and tree construction stages of the WHATWG HTML
+parsing specification respectively. While the tokenizer parses and normalizes
+individual HTML tokens, only the parser constructs the DOM tree from the
+tokenized HTML, as described in the tree construction stage of the
+specification, dynamically modifying or extending the document's DOM tree.
+
+If your use case requires semantically well-formed HTML documents, as defined by
+the WHATWG specification, the parser should be used rather than the tokenizer.
+
+In security contexts, if trust decisions are being made using the tokenized or
+parsed content, the input must be re-serialized (for instance by using Render or
+Token.String) in order for those trust decisions to hold, as the process of
+tokenization or parsing may alter the content.
 */
 package html // import "golang.org/x/net/html"