html: sync changes from std
Before golang/go@324513b (2012-01-04), std "html" and what is now
"golang.org/x/net/html" were the same package. Since then (strictly,
since golang/go@4e0749a (2012-05-29)), the escape/unescape code that
they share has been drifting apart, each copy receiving separate
improvements.

This CL cherry-picks over all of the changes that std "html" has seen.

When applying golang/go@5b92028 (https://golang.org/cl/10172) I had to
get a touch creative.  That commit inlined `unescape()` into
`UnescapeString()`, removing the original `unescape()`.  However, over
here in x/net, we have other callers of `unescape()`, so we can't
remove it, and duplicating it would also be bad.  Simply wrapping it
instead of duplicating it would repeat the first call to `IndexByte()`
(first as `strings.IndexByte()`, then as `bytes.IndexByte()`); as
minor as that performance regression would be, I don't want anything
to go backward.  So, I've pulled out an `unescapeInner()` function
that takes the initial `i` as an argument and that both `unescape()`
and `UnescapeString()` call.
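
The shape of that refactoring can be sketched as follows. This is a
minimal, self-contained illustration and not the code in this CL:
`decodeAt` is a hypothetical stand-in for `unescapeEntity()` that only
understands `&amp;`, and `inner`, `unescapeBytes`, and `unescapeString`
are illustrative names. The point is that each entry point searches for
the first `&` exactly once, in its own input type, and hands the index
to the shared inner function.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// decodeAt is a hypothetical stand-in for unescapeEntity: it only knows
// "&amp;". It reads the reference starting at b[src], writes the decoded
// byte at b[dst], and returns the advanced positions.
func decodeAt(b []byte, dst, src int) (int, int) {
	if bytes.HasPrefix(b[src:], []byte("&amp;")) {
		b[dst] = '&'
		return dst + 1, src + len("&amp;")
	}
	b[dst] = b[src] // unknown reference: copy the '&' through unchanged
	return dst + 1, src + 1
}

// inner assumes the caller has already found the first '&' at index i,
// so the initial search is never repeated.
func inner(b []byte, i int) []byte {
	dst, src := decodeAt(b, i, i)
	for src < len(b) {
		j := bytes.IndexByte(b[src:], '&')
		if j < 0 {
			// No more references: move the tail down and stop.
			dst += copy(b[dst:], b[src:])
			break
		}
		// Move the run of plain bytes down in one copy, then decode.
		copy(b[dst:], b[src:src+j])
		dst, src = decodeAt(b, dst+j, src+j)
	}
	return b[:dst]
}

// unescapeBytes is the []byte entry point; it searches with bytes.IndexByte.
func unescapeBytes(b []byte) []byte {
	if i := bytes.IndexByte(b, '&'); i >= 0 {
		return inner(b, i)
	}
	return b
}

// unescapeString is the string entry point; it searches with
// strings.IndexByte and converts to []byte only if there is work to do.
func unescapeString(s string) string {
	if i := strings.IndexByte(s, '&'); i >= 0 {
		return string(inner([]byte(s), i))
	}
	return s
}

func main() {
	fmt.Println(unescapeString("a&amp;b &amp; c"))          // a&b & c
	fmt.Println(string(unescapeBytes([]byte("no escapes")))) // no escapes
}
```

Had `unescapeString` simply called `unescapeBytes`, the input would
have been scanned for its first `&` twice.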

This is the counterpart to https://golang.org/cl/580896, and so also
includes the doc-fix for `UnescapeString()` requested at
https://go-review.googlesource.com/c/go/+/580896/comment/cc8b5704_b1899241/

golang/go@a025e1c :

    Author: Shawn Smith <[email protected]>
    Date:   Wed Dec 18 10:20:25 2013 -0800

    html: add tests for UnescapeString edge cases

    R=golang-dev, gobot, bradfitz
    CC=golang-dev
    https://golang.org/cl/40810044

golang/go@2d9a50b :

    Author: Didier Spezia <[email protected]>
    Date:   Fri May 8 16:38:08 2015 +0000

    html: simplify and optimize escape/unescape

    The html package uses some specific code to escape special characters.
    Actually, the strings.Replacer can be used instead, and is much more
    efficient. The converse operation is more complex but can still be
    slightly optimized.

    Credits to Ken Bloom ([email protected]), who first submitted a
    similar patch at https://codereview.appspot.com/141930043

    Added benchmarks and slightly optimized UnescapeString.

    benchmark                   old ns/op     new ns/op     delta
    BenchmarkEscape-4           118713        19825         -83.30%
    BenchmarkEscapeNone-4       87653         3784          -95.68%
    BenchmarkUnescape-4         24888         23417         -5.91%
    BenchmarkUnescapeNone-4     14423         157           -98.91%

    benchmark                   old allocs     new allocs     delta
    BenchmarkEscape-4           9              2              -77.78%
    BenchmarkEscapeNone-4       0              0              +0.00%
    BenchmarkUnescape-4         2              2              +0.00%
    BenchmarkUnescapeNone-4     0              0              +0.00%

    benchmark                   old bytes     new bytes     delta
    BenchmarkEscape-4           24800         12288         -50.45%
    BenchmarkEscapeNone-4       0             0             +0.00%
    BenchmarkUnescape-4         10240         10240         +0.00%
    BenchmarkUnescapeNone-4     0             0             +0.00%

    Fixes #8697

    Change-Id: I208261ed7cbe9b3dee6317851f8c0cf15528bce4
    Reviewed-on: https://go-review.googlesource.com/9808
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>
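
As a rough illustration of the `strings.Replacer` approach described
in that commit, here is a minimal sketch. The name `escaper` and the
reduced set of pairs are choices made for this example; the real
replacement table is the one in the escape.go diff below.

```go
package main

import (
	"fmt"
	"strings"
)

// escaper is built once and reused; a Replacer is safe for concurrent use.
// Replaced text is not rescanned, so the "&" in "&lt;" is not escaped again.
var escaper = strings.NewReplacer(
	`&`, "&amp;",
	`<`, "&lt;",
	`>`, "&gt;",
)

func main() {
	// Replace returns a new string.
	fmt.Println(escaper.Replace("a < b && c > d")) // a &lt; b &amp;&amp; c &gt; d

	// WriteString streams the result to any io.Writer instead.
	var sb strings.Builder
	if _, err := escaper.WriteString(&sb, "<p>"); err != nil {
		panic(err)
	}
	fmt.Println(sb.String()) // &lt;p&gt;
}
```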

golang/go@a3c0730 :

    Author: Carlos C <[email protected]>
    Date:   Wed Jun 17 23:51:54 2015 +0200

    html: add examples to the functions

    Change-Id: I129d70304ae4e4694d9217826b18b341e3834d3c
    Reviewed-on: https://go-review.googlesource.com/11201
    Reviewed-by: Andrew Gerrand <[email protected]>

golang/go@5b92028 :

    Author: Ingo Oeser <[email protected]>
    Date:   Sat May 9 17:55:05 2015 +0200

    html: speed up UnescapeString

    Add benchmarks for sparsely escaped and densely escaped strings.
    Then speed up the sparse unescaping part heavily by using IndexByte and
    copy to skip the parts containing no escaping very fast.

    Unescaping densely escaped strings is slower because of
    the new function call overhead. But sparsely encoded strings are seen
    more often in the utf8 enabled web.

    We win part of the speed back by looking up entityName differently.

    	benchmark                  old ns/op    new ns/op    delta
    	BenchmarkEscape                31680        31396   -0.90%
    	BenchmarkEscapeNone             6507         6872   +5.61%
    	BenchmarkUnescape              36481        48298  +32.39%
    	BenchmarkUnescapeNone            332          325   -2.11%
    	BenchmarkUnescapeSparse         8836         3221  -63.55%
    	BenchmarkUnescapeDense         30639        32224   +5.17%

    Change-Id: If606cb01897a40eefe35ba98f2ff23bb25251606
    Reviewed-on: https://go-review.googlesource.com/10172
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>
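
That commit message does not spell out what "differently" means.
Judging from the escape.go diff below, the entity name is kept as a
byte slice and converted only inside the map index expression, a form
for which the Go compiler can avoid allocating a temporary string.
The sketch below shows that lookup form; `entities` and `lookup` are
hypothetical names and the table is a tiny stand-in for the real one.

```go
package main

import "fmt"

// entities is a tiny hypothetical stand-in for the package's entity map.
var entities = map[string]rune{
	"lt;":  '<',
	"gt;":  '>',
	"amp;": '&',
}

// lookup converts name to a string only at the point of indexing,
// mirroring the entity[string(entityName)] form used in the diff.
func lookup(name []byte) (rune, bool) {
	r, ok := entities[string(name)]
	return r, ok
}

func main() {
	b := []byte("&lt;tag")
	r, ok := lookup(b[1:4]) // the bytes "lt;"
	fmt.Printf("%q %v\n", r, ok) // '<' true
}
```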

golang/go@a44c425 :

    Author: Brad Fitzpatrick <[email protected]>
    Date:   Sun Apr 10 14:51:07 2016 +0000

    html: fix typo in UnescapeString string docs

    Fixes #15221

    Change-Id: I9e927a2f604213338b4572f1a32d0247c58bdc60
    Reviewed-on: https://go-review.googlesource.com/21798
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@6dae588 :

    Author: Seiji Takahashi <[email protected]>
    Date:   Thu Aug 3 22:08:55 2017 +0900

    html: updated entity spec link

    Fixes #21194

    Change-Id: Iac5187335df67f90f0f47c7ef6574de147c2ac9b
    Reviewed-on: https://go-review.googlesource.com/52970
    Reviewed-by: Avelino <[email protected]>
    Reviewed-by: Brad Fitzpatrick <[email protected]>

golang/go@740e589 :

    Author: Brad Fitzpatrick <[email protected]>
    Date:   Tue Jul 31 21:37:35 2018 +0000

    html: lazily populate Unescape tables

    Saves ~105KB of heap for callers who don't use html.UnescapeString.
    (EscapeString is much more common).

    Also saves 70KB of binary size, because now the linker can do dead
    code elimination. (because #2559 is still open and global maps always
    generate init code)

    Fixes #26727
    Updates #6853

    Change-Id: I18fe9a273097e2c7e0cb7f88205cae1bb60fa89b
    Reviewed-on: https://go-review.googlesource.com/127075
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    Reviewed-by: Emmanuel Odeke <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>
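
The lazy-population pattern from that commit can be sketched with
`sync.Once`. The names `table`, `tableOnce`, `populate`, and `lookup`
are hypothetical and the three-entry map stands in for the real entity
tables; only the `Once.Do` call at the top of the reader mirrors what
the diff below does with `populateMapsOnce.Do(populateMaps)`.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	table     map[string]rune // nil until first use
	tableOnce sync.Once
)

// populate builds the table. Because the map is assigned inside a function
// and is not a package-level map literal, no init-time code is needed for it.
func populate() {
	table = map[string]rune{
		"lt;":  '<',
		"gt;":  '>',
		"amp;": '&',
	}
}

// lookup makes sure the table exists before reading it. Do runs populate
// at most once, even when called from several goroutines.
func lookup(name string) rune {
	tableOnce.Do(populate)
	return table[name]
}

func main() {
	fmt.Println(table == nil)          // true: nothing allocated yet
	fmt.Println(string(lookup("lt;"))) // <
	fmt.Println(table == nil)          // false
}
```

Callers that never reach `lookup` never pay for the table.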

golang/go@4ad1355 :

    Author: Romain Baugue <[email protected]>
    Date:   Tue Apr 30 13:51:05 2019 +0200

    html: add a Fuzz function

    Adds a sample Fuzz test function to package html based on
    https://github.com/dvyukov/go-fuzz-corpus/blob/master/stdhtml/main.go

    Updates #19109
    Updates #31309

    Change-Id: I8c49fff8f70fc8a8813daf1abf0044752003adbb
    Reviewed-on: https://go-review.googlesource.com/c/go/+/174301
    Reviewed-by: Brad Fitzpatrick <[email protected]>
    Run-TryBot: Brad Fitzpatrick <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>

golang/go@52c4488 :

    Author: fujimoto kyosuke <[email protected]>
    Date:   Sun Jan 12 06:49:19 2020 +0000

    html: update URL in comment

    The comment contained a link that had a file name and ID that no longer existed, so change to the URL of the corresponding part of the latest page.

    Change-Id: I74e0885aabf470facc39b84035f7a83fef9c6a8e
    GitHub-Last-Rev: 5681c84d9f1029449da6860c65a1d9a128296e85
    GitHub-Pull-Request: golang/go#36514
    Reviewed-on: https://go-review.googlesource.com/c/go/+/214181
    Run-TryBot: Ian Lance Taylor <[email protected]>
    TryBot-Result: Gobot Gobot <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@d4b2638 :

    Author: Russ Cox <[email protected]>
    Date:   Fri Feb 19 18:35:10 2021 -0500

    all: go fmt std cmd (but revert vendor)

    Make all our package sources use Go 1.17 gofmt format
    (adding //go:build lines).

    Part of //go:build change (#41184).
    See https://golang.org/design/draft-gobuild

    Change-Id: Ia0534360e4957e58cd9a18429c39d0e32a6addb4
    Reviewed-on: https://go-review.googlesource.com/c/go/+/294430
    Trust: Russ Cox <[email protected]>
    Run-TryBot: Russ Cox <[email protected]>
    TryBot-Result: Go Bot <[email protected]>
    Reviewed-by: Jason A. Donenfeld <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>

golang/go@f229e70 :

    Author: Russ Cox <[email protected]>
    Date:   Wed Aug 25 12:48:26 2021 -0400

    all: go fix -fix=buildtag std cmd (except for bootstrap deps, vendor)

    When these packages are released as part of Go 1.18,
    Go 1.16 will no longer be supported, so we can remove
    the +build tags in these files.

    Ran go fix -fix=buildtag std cmd and then reverted the bootstrapDirs
    as defined in src/cmd/dist/buildtool.go, which need to continue
    to build with Go 1.4 for now.

    Also reverted src/vendor and src/cmd/vendor, which will need
    to be updated in their own repos first.

    Manual changes in runtime/pprof/mprof_test.go to adjust line numbers.

    For #41184.

    Change-Id: Ic0f93f7091295b6abc76ed5cd6e6746e1280861e
    Reviewed-on: https://go-review.googlesource.com/c/go/+/344955
    Trust: Russ Cox <[email protected]>
    Run-TryBot: Russ Cox <[email protected]>
    TryBot-Result: Go Bot <[email protected]>
    Reviewed-by: Bryan C. Mills <[email protected]>

golang/go@200a01f :

    Author: Tobias Klauser <[email protected]>
    Date:   Wed May 10 17:08:59 2023 +0200

    html: convert fuzz test to native Go fuzzing

    Convert the existing gofuzz based fuzz test to a testing.F based fuzz
    test.

    Change-Id: Ieae69ba7fb17bd54d95c7bb2f4ed04c323c9f15f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/494195
    TryBot-Result: Gopher Robot <[email protected]>
    Reviewed-by: Ian Lance Taylor <[email protected]>
    Reviewed-by: Cherry Mui <[email protected]>
    Auto-Submit: Tobias Klauser <[email protected]>
    Run-TryBot: Tobias Klauser <[email protected]>
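
The property that the fuzz test exercises can also be checked by hand.
The sketch below uses the standard library's "html" package for
convenience; `roundTrips` and `converseHolds` are illustrative helper
names, not functions from this CL.

```go
package main

import (
	"fmt"
	"html"
)

// roundTrips reports whether escaping and then unescaping s yields s again.
// This is the direction the documentation says always holds.
func roundTrips(s string) bool {
	return html.UnescapeString(html.EscapeString(s)) == s
}

// converseHolds reports whether unescaping and then escaping s yields s
// again. The documentation says this is not always true.
func converseHolds(s string) bool {
	return html.EscapeString(html.UnescapeString(s)) == s
}

func main() {
	fmt.Println(roundTrips(`<a href="x">Tom & Jerry's</a>`)) // true
	fmt.Println(converseHolds("&quot;"))                      // false: re-escaped as "&#34;"
}
```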
LukeShu committed Jul 8, 2024
1 parent e2310ae commit 757e15b
Showing 6 changed files with 2,418 additions and 2,301 deletions.
4,484 changes: 2,248 additions & 2,236 deletions html/entity.go

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions html/entity_test.go
@@ -9,7 +9,15 @@ import (
 	"unicode/utf8"
 )

+func init() {
+	UnescapeString("") // force load of entity maps
+}
+
 func TestEntityLength(t *testing.T) {
+	if len(entity) == 0 || len(entity2) == 0 {
+		t.Fatal("maps not loaded")
+	}
+
 	// We verify that the length of UTF-8 encoding of each value is <= 1 + len(key).
 	// The +1 comes from the leading "&". This property implies that the length of
 	// unescaped text is <= the length of escaped text.
109 changes: 45 additions & 64 deletions html/escape.go
@@ -12,7 +12,7 @@ import (

 // These replacements permit compatibility with old numeric entities that
 // assumed Windows-1252 encoding.
-// https://html.spec.whatwg.org/multipage/syntax.html#consume-a-character-reference
+// https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
 var replacementTable = [...]rune{
 	'\u20AC', // First entry is what 0x80 should be replaced with.
 	'\u0081',
@@ -135,14 +135,14 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 		break
 	}

-	entityName := string(s[1:i])
-	if entityName == "" {
+	entityName := s[1:i]
+	if len(entityName) == 0 {
 		// No-op.
 	} else if attribute && entityName[len(entityName)-1] != ';' && len(s) > i && s[i] == '=' {
 		// No-op.
-	} else if x := entity[entityName]; x != 0 {
+	} else if x := entity[string(entityName)]; x != 0 {
 		return dst + utf8.EncodeRune(b[dst:], x), src + i
-	} else if x := entity2[entityName]; x[0] != 0 {
+	} else if x := entity2[string(entityName)]; x[0] != 0 {
 		dst1 := dst + utf8.EncodeRune(b[dst:], x[0])
 		return dst1 + utf8.EncodeRune(b[dst1:], x[1]), src + i
 	} else if !attribute {
@@ -151,7 +151,7 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 			maxLen = longestEntityWithoutSemicolon
 		}
 		for j := maxLen; j > 1; j-- {
-			if x := entity[entityName[:j]]; x != 0 {
+			if x := entity[string(entityName[:j])]; x != 0 {
 				return dst + utf8.EncodeRune(b[dst:], x), src + j + 1
 			}
 		}
@@ -165,24 +165,34 @@ func unescapeEntity(b []byte, dst, src int, attribute bool) (dst1, src1 int) {
 // unescape unescapes b's entities in-place, so that "a&lt;b" becomes "a<b".
 // attribute should be true if parsing an attribute value.
 func unescape(b []byte, attribute bool) []byte {
-	for i, c := range b {
-		if c == '&' {
-			dst, src := unescapeEntity(b, i, i, attribute)
-			for src < len(b) {
-				c := b[src]
-				if c == '&' {
-					dst, src = unescapeEntity(b, dst, src, attribute)
-				} else {
-					b[dst] = c
-					dst, src = dst+1, src+1
-				}
-			}
-			return b[0:dst]
-		}
+	populateMapsOnce.Do(populateMaps)
+	if i := bytes.IndexByte(b, '&'); i >= 0 {
+		return unescapeInner(b, i, attribute)
 	}
 	return b
 }
+
+func unescapeInner(b []byte, i int, attribute bool) []byte {
+	dst, src := unescapeEntity(b, i, i, attribute)
+	for len(b[src:]) > 0 {
+		if b[src] == '&' {
+			i = 0
+		} else {
+			i = bytes.IndexByte(b[src:], '&')
+		}
+		if i < 0 {
+			dst += copy(b[dst:], b[src:])
+			break
+		}
+
+		if i > 0 {
+			copy(b[dst:], b[src:src+i])
+		}
+		dst, src = unescapeEntity(b, dst+i, src+i, attribute)
+	}
+	return b[:dst]
+}

 // lower lower-cases the A-Z bytes in b in-place, so that "aBc" becomes "abc".
 func lower(b []byte) []byte {
 	for i, c := range b {
@@ -274,66 +284,37 @@ func escapeCommentString(s string) string {
 	return buf.String()
 }

-const escapedChars = "&'<>\"\r"
+var htmlEscaper = strings.NewReplacer(
+	`&`, "&amp;",
+	`'`, "&#39;", // "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
+	`<`, "&lt;",
+	`>`, "&gt;",
+	`"`, "&#34;", // "&#34;" is shorter than "&quot;".
+	"\r", "&#13;",
+)

 func escape(w writer, s string) error {
-	i := strings.IndexAny(s, escapedChars)
-	for i != -1 {
-		if _, err := w.WriteString(s[:i]); err != nil {
-			return err
-		}
-		var esc string
-		switch s[i] {
-		case '&':
-			esc = "&amp;"
-		case '\'':
-			// "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
-			esc = "&#39;"
-		case '<':
-			esc = "&lt;"
-		case '>':
-			esc = "&gt;"
-		case '"':
-			// "&#34;" is shorter than "&quot;".
-			esc = "&#34;"
-		case '\r':
-			esc = "&#13;"
-		default:
-			panic("unrecognized escape character")
-		}
-		s = s[i+1:]
-		if _, err := w.WriteString(esc); err != nil {
-			return err
-		}
-		i = strings.IndexAny(s, escapedChars)
-	}
-	_, err := w.WriteString(s)
+	_, err := htmlEscaper.WriteString(w, s)
 	return err
 }

 // EscapeString escapes special characters like "<" to become "&lt;". It
-// escapes only five such characters: <, >, &, ' and ".
+// escapes only six such characters: <, >, &, ', ", and \r.
 // UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
 // always true.
 func EscapeString(s string) string {
-	if strings.IndexAny(s, escapedChars) == -1 {
-		return s
-	}
-	var buf bytes.Buffer
-	escape(&buf, s)
-	return buf.String()
+	return htmlEscaper.Replace(s)
 }

 // UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a
 // larger range of entities than EscapeString escapes. For example, "&aacute;"
-// unescapes to "á", as does "&#225;" and "&xE1;".
+// unescapes to "á", as does "&#225;" and "&#xE1;".
 // UnescapeString(EscapeString(s)) == s always holds, but the converse isn't
 // always true.
 func UnescapeString(s string) string {
-	for _, c := range s {
-		if c == '&' {
-			return string(unescape([]byte(s), false))
-		}
+	populateMapsOnce.Do(populateMaps)
+	if i := strings.IndexByte(s, '&'); i >= 0 {
+		return string(unescapeInner([]byte(s), i, false))
 	}
 	return s
 }
22 changes: 22 additions & 0 deletions html/escape_example_test.go
@@ -0,0 +1,22 @@
+// Copyright 2015 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package html_test
+
+import (
+	"fmt"
+	"html"
+)
+
+func ExampleEscapeString() {
+	const s = `"Fran & Freddie's Diner" <tony@example.com>`
+	fmt.Println(html.EscapeString(s))
+	// Output: &#34;Fran &amp; Freddie&#39;s Diner&#34; &lt;tony@example.com&gt;
+}
+
+func ExampleUnescapeString() {
+	const s = `&quot;Fran &amp; Freddie&#39;s Diner&quot; &lt;tony@example.com&gt;`
+	fmt.Println(html.UnescapeString(s))
+	// Output: "Fran & Freddie's Diner" <tony@example.com>
+}
74 changes: 73 additions & 1 deletion html/escape_test.go
@@ -4,7 +4,10 @@

 package html

-import "testing"
+import (
+	"strings"
+	"testing"
+)

 type unescapeTest struct {
 	// A short description of the test case.
@@ -64,6 +67,24 @@ var unescapeTests = []unescapeTest{
 		"Footnote&#x87;",
 		"Footnote‡",
 	},
+	// Handle single ampersand.
+	{
+		"copySingleAmpersand",
+		"&",
+		"&",
+	},
+	// Handle ampersand followed by non-entity.
+	{
+		"copyAmpersandNonEntity",
+		"text &test",
+		"text &test",
+	},
+	// Handle "&#".
+	{
+		"copyAmpersandHash",
+		"text &#",
+		"text &#",
+	},
 }

 func TestUnescape(t *testing.T) {
@@ -95,3 +116,54 @@ func TestUnescapeEscape(t *testing.T) {
 		}
 	}
 }
+
+var (
+	benchEscapeData     = strings.Repeat("AAAAA < BBBBB > CCCCC & DDDDD ' EEEEE \" ", 100)
+	benchEscapeNone     = strings.Repeat("AAAAA x BBBBB x CCCCC x DDDDD x EEEEE x ", 100)
+	benchUnescapeSparse = strings.Repeat(strings.Repeat("AAAAA x BBBBB x CCCCC x DDDDD x EEEEE x ", 10)+"&amp;", 10)
+	benchUnescapeDense  = strings.Repeat("&amp;&lt; &amp; &lt;", 100)
+)
+
+func BenchmarkEscape(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(EscapeString(benchEscapeData))
+	}
+}
+
+func BenchmarkEscapeNone(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(EscapeString(benchEscapeNone))
+	}
+}
+
+func BenchmarkUnescape(b *testing.B) {
+	s := EscapeString(benchEscapeData)
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(s))
+	}
+}
+
+func BenchmarkUnescapeNone(b *testing.B) {
+	s := EscapeString(benchEscapeNone)
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(s))
+	}
+}
+
+func BenchmarkUnescapeSparse(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(benchUnescapeSparse))
+	}
+}
+
+func BenchmarkUnescapeDense(b *testing.B) {
+	n := 0
+	for i := 0; i < b.N; i++ {
+		n += len(UnescapeString(benchUnescapeDense))
+	}
+}
22 changes: 22 additions & 0 deletions html/fuzz_test.go
@@ -0,0 +1,22 @@
+// Copyright 2019 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package html
+
+import "testing"
+
+func FuzzEscapeUnescape(f *testing.F) {
+	f.Fuzz(func(t *testing.T, v string) {
+		e := EscapeString(v)
+		u := UnescapeString(e)
+		if u != v {
+			t.Errorf("EscapeString(%q) = %q, UnescapeString(%q) = %q, want %q", v, e, e, u, v)
+		}
+
+		// As per the documentation, this isn't always equal to v, so it makes
+		// no sense to check for equality. It can still be interesting to find
+		// panics in it though.
+		EscapeString(UnescapeString(v))
+	})
+}
