- in readme?
- as separate
htmlnorm
documentation?- dispersed in htmlnorm.test.md
htmlnorm
's purpose is to facilitate testing of HTML output by not letting unimportant differences between output values and a test case's expected value result in a failed test.I did my best to imagine all the kinds of "unimportant differences" that are possible, but I may have missed some. If you have a case where
htmlnorm
comes up short, please file an issue! I will get to it as soon as possible!
htmlnorm
's raison d'être is testing. This means:
-
It will be maximally conservative if there is any ambiguity.
Specifically, if there is any ambiguity as to the semantics of the input that is not disambiguated by the HTML5 spec (this one?), then
htmlnorm
will return the input as-is, untouched. This ensures thathtmlnorm
doesn't inadvertently allow tests to pass that shouldn't pass because it chose the wrong interpretation. -
"beautification" is not a primary concern, and is done only in service of testing.
That said, you may really like the pretty printing
htmlnorm
performs. -
There is no guaranteed stability of how
htmlnorm
formats from version to version, though it will only change when justified -- to further improve its purpose.- stability is not required for testing, because
htmlnorm
is intended to be applied to both the test output and the expected value -- in other words they will both be normalized by the current version ofhtmlnorm
and comparable. The output ofhtmlnorm
is not intended to be preserved as test artifacts or fixtures.
- stability is not required for testing, because
As someone pointed out about JS-Beautifier:
Your formatter is meant to take existing HTML source code and make it prettier by improving it, while still maintaining certain aesthetic decisions of the author (e.g. whether the author places extra newlines between paragraphs, etc.).
In other words, it doesn't normalize, just beautifies some things, and leaves alone other author choices, meaning two HTML docs that are actually semantically identical will not always be reformatted or reduced to a normalized or canonical form.
document another difference between pretty printing and testing-oriented normalization:
whenever there are two semantically equivalent ways of expressing the same thing, then they both should be normalized to the exact same form (whether it is one of the two or a third form).
For example, the first form below looks great, and a pretty printer might leave it alone. But a pretty printer might also leave the second one alone, thinking: given the context(inside single quotes), the internal quotes must remain escaped. But for htmlnorm
that is result in failed tests. So it MUST choose one of the two and ALWAYS transform to that form. It would have to choose between (a) ALWAYS choosing double quotes for the outer quotes and escaping as necessary or (b) employ complex logic, changing the outer quotes to single when the value contains double (and only double).
<img title='the meaning of "grok"'>
<img title="the meaning of "grok"">
Three basic decisions/operations on whitespace:
- When to COLLAPSE to single space
- When to REMOVE as run entirely.
- When to INSERT to produce hierarchical structure.
- outside of PRE, we collapse all whitespace runs to a single space.
- STANDARD: we remove leading and trailing whitespace of any BLOCK's contents
- CONSERVATIVE: ❓❓❓
- STANDARD:
- between any sibling BLOCKS (including anonymous inline text blocks)
- between a CONTAINER BLOCK open/close tags and it's contents
- CONSERVATIVE:
- same as STANDARD, but only if there is/was whitespace in that spot in the source originally. In other words, it's more a REPLACE than an INSERT. If there was no whitespace, we can't INSERT any without potentially altering how it is rendered --- and normalizing HTML CANNOT impact the visual rendering.
- We don't insert whitespace where it does not exist in the source. This includes adding newlines or indentation to format the HTML source as a nice hierarchical structure. We will only do so when we can (i.e. whitespace already exists)
- we don't remove whitespace where it does exist in the source. But we
will:
- reduces runs of whitespace between inline elements to a single space
- replace the whitespace with a different run of whitespace in order to reformat the HTML source as a nice hierarchical structure, with blocks on their own lines.
So just as one wouldn't insert a space or newline in the middle of the
word "solidarity" when laying out a sentence, one doesn't insert a space
or newline between the <q>
and the inner text of "<q>a quote</q>
because the ends up rendering a a quotation mark and you'd end up with
" a quote "
, which is obviously wrong and not what the author intended.
- That a normally block element won't be recast as inline by CSS
- That a normally block element won't be recast as inline by CSS ❓❓❓
- That tags that wrap an inline element don't also correspond to glyph(s), <--- actually we do this for STANDARD mode too a border or other visible manifestation where the a space between the tag and its inner content can be ignored. [??? also between it and outer adjacencies?]
- Since a block can be recasts as inline, #3 applies to blocks.
- this corresponds with browser/CSS defaults
- this is also the most conservative, as
htmlnorm
's rules for inlines are more conservative than for blocks (though they are identical in CONSERVATIVE mode)
STANDARD
mode assumes HTML blocks elements have CSS block
flow. [STANDARD
mode assumes that each HTML defined element will have the same flow semantics as defined by the XXX default CSS.] That inline elements will remain inline and block will remain block. This means we can insert whitespace between block-level elements even if there is no whitespace in the source.
-
Insert whitespace (newlines and spaces) between HTML blocks elements with newlines and spaces to form a visual hierarchy. Do so even if the source HTML has no whitespace between the tags, e.g.
<li><p>1</p><p>2</p></li>
will be rendered as:
<li> <p>1</p> <p>2</p> <li>
-
Collapse all other contiguous whitespace runs to a single space. "Whitespace" here means spaces, tabs and newlines.
-
🟧 Remove leading and trailing whitespace within each leaf block element
As a semi-conservative precaution, we do treat whitespace runs spanning an inline tag as a single run, even when the default CSS rule for that tag produces no visual element that separates the whitespace on either side. This rule is consistent with how browsers parse the HTML into the DOM1. It has no impact on the ability to reformat blocks onto separate lines with indentation to create a visual hierarchy in the source.
CONSERVATIVE
mode assumes that CSS rules might alter the flow property of any normally block element to inline
. This means the existence of whitespace or the lack thereof is significant even between sibling element tags, and so cannot be arbitrarily altered. Specifically:
- Never add some where there was none.
- Never reduce to none where there was some.
In other words, HTML whitespace semantics are such that all whitespace reduces down to a binary either-or:
-
EITHER there is whitespace between to things, with the amount not mattering:
<p>1</p> <p>2</p> <p>1</p> <p>2</p> <p>1</p> <p>2</p>
-
OR there is not:
<p>1</p><p>2</p>
CONSERVATIVE
mode is careful not to change the sources condition form one to the other.
which means:
- ✅ never collapse white space to
zero space
. Any run of consecutive whitespace characters (space, tabs, newlines) can be collapse, replaced by, or expanded into to any another other consecutive run of whitespace characters and the document remains semantically identical. - ✅ never introduce space between to elements separated by
zero space
in the source. This means, for example, when the opening tag of a paragraph abuts the closing tag a preceding paragraph, with no space between them), they are on the same line and must remain so. (One is free to break the text for these to paragraphs across lines anywhere there exists a space in the content of either paragraph.)
The only exception is around PRE and BR tags. These two tags are void tags with such explicit whitespace semantics that it makes no sense for a CSS rule to alter their meaning.
Otherwise the same as standard
mode.
The impact of CONSERVATIVE
mode is that blocks whose tags have no whitespace between them cannot be broken onto separate lines or indented, limiting the ability to reformat those blocks into a visual hierarchy. Such "run-on" block elements will remain on a single line, potentially a very long one.
Conservative
mode may seem like a theoretical and unnecessary option, but as the examples below show, it matters in read world situations:
People accepting the reality of it when writing HTML / CSS with specific rendering goals in mind:
More:
-
✅ When does whitespace matter in HTML?
- Question is well written, illustrating the problem well.
- 🌈 It has the true answer: It all depends on the CSS rules. It has a link to the CSS spec.
-
✅ Browser white space rendering
- is really a duplicate of the above question (already reported)
- The accepted answer is spot on.
-
✅ White space removed by minifier tools affects look and feel of HTML
-
The question is well written.
-
It points out that there are minifiers that a make bad assumptions.
-
The top answer is also well written. It points out that the only TOTALLY safe thing is to not mess with any whitespace at all, even collapsing, because there is a CSS rule that makes all whitespace significant: ``white-space:pre`. But outside of that, this is who to do it:
it should collapse runs of whitespace in phrasing content, but not strip them:
<div><span class="surround-brackets"> Title </span></div>
...the effective minified equivalent assuming normal CSS white-space processing.
-
-
✅ Is minifying your HTML, CSS, and Javascript a bad idea?
- Same as previous.
- 🥃 WRITE A NEW ANSWER FOR THIS, based on my
htmlnorm
research, perpahaps pointing to my explainer doc in this repo.
-
✅ CSS word-spacing property doesn't work after HTML minified with PHP
- same deal
Even more:
custom
mode is builds off of standard
, but allows you to change the inline
vs leaf block
vs container block
property for any element.
https://html.spec.whatwg.org/multipage/syntax.html#optional-tags
Four possible options:
-
AS-IS
(default)- no closing tags will be added or removed
- BUT the formatting of the document will reflect implicit closure, whether per HTML rules or mimicking browser handling of tag soup
-
EXPLICIT
-
all unclosed tags are normalized via insertion of explicit close tags:
-
first, per HTML5 rules, some tags are implicitly closed open the opening of others (e.g. successive
p
tags without intervening close tags, asp
cannot contain ap
) -
mimic browser handling of tag-soup. e.g. an unclosed blockquote will be automatically closed if its parent container terminates (including termination of the root doc container)
I think htmlparser2 does this so it should be easy to implement.
-
- if the user code being tested should never be allowed to produce tag soup. It will hide such usage, even if your expected values are properly formed, since the actually will be silently "fixed" before the test comparison.
- if your tests need to distinguish between explicit and implicit tag closure.
Use either
-
AS-IS
mode.- Nice diff
- your test expected values can express whether or not a closing tag should occur.
In other words, you can test a library that is expected to produce HTML abbreviated form, or explicit element closure, and be able to test this difference, with easy to read diffs when the expectation isn't met
-
STRICT MODE
(if the diffs don't matter)Use this when you want the tests to barf when ANY HTML is used that isn't perfectly formed with explicit closed tags (OR not barf, but your expected out is to be compared to the actual literally, char for char)
-
-
HTML5 IMPLICIT
- removes close tags for elements in contexts that support implicit closure, per HTML5 specs
🚩 SAME "DO NOT use" caveat as
EXPLICIT_CLOSE
mode.🏈 I think this one is not so important, and since it isn't trivial to implement, punt for now
-
STRICT MODE
/SAFE MODE
- if a document is missing any close tags,
htmlnorm
will simply return the input unchanged. This is useful for test contexts where the user does not want tests to pass if any tag soup fixes were employed and/or wants to test cases where tag soup is expected.
- if a document is missing any close tags,
- Even if not totally orthogonal, we can let users figure out which combinations don't make sense?
- or maybe we should disallow those, to proect the users
- OR does CONSERVATIVE mode always choose one of the above?
htmlnorm
handles HTML fragments, does not convert into HTML doc
hidden elements (e.g. <head>
)
The only problem is that this only applies to non-hidden elements. Things like
<head>
are considered "hidden" so I'm determining the best way to find a semantic grouping that would include the right hidden elements that we expect to be formatted as block in the HTML source code.
📜 governing standard docs:
-
The W3C ceded authority over the HTML and DOM standards to WHATWG on 28 May 2019, as it considered that having two standards is harmful.[37][38][39][3] The HTML Living Standard is now authoritative. However, W3C will still participate in the development process of HTML.
HTML (MDN) will be referenced when its explanation is more succinct and intelligible, but when there is any disagreement, the HTML Living Standard supersedes.
By topic:
-
Parsing
-
✅ Defaults
- CSS Default Browser Values for HTML Elements <-- a lot of minifiers/pretty printers assume this default CSS, and thus can break a design that uses different CSS.
- W3: Default style sheet for HTML 4
-
✅ Kinds/Classes
-
https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements
-
https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements
-
https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
-
Whatwg spec: Kinds of content
-
Whatwg spec: Palpable content
-
Whatwg spec: Kinds of elements
-
Whatwg spec: HTML Living Standard - Rendering (HTML5)
See particular sections such as:
-
W3:
inline-level: Content that participates in inline layout. Specifically, inline-level boxes and text runs.
block-level: Content that participates in block layout. Specifically, block-level boxes.
-
Summarized list, maybe useful if official specs don't summarize this conveniently: HTML: The 16 Content Categories and Their Elements · Jens Oliver Meiert
-
-
CSS3 draft
-
CSS Text
-
Defines terms "collapse" and "preserve". Includes White Space Processing Rules.
-
CSS Display
-
-
W3 CSS
The ==block container== also generates a ==root inline box==, which is an ==anonymous inline box== that holds all of its inline-level contents. (Thus, all text in an inline formatting context is directly contained by an inline box, whether the root inline box or one of its descendants.) The root inline box inherits from its parent block container, but is otherwise unstyleable.
Note: ==Empty inline boxes== still have margins, padding, borders, and a line-height, and thus influence these calculations just like boxes with content.
-
display
Clarified that empty text objects are ignored for CSS rendering.
[css-display] Do empty text nodes generate text runs? #1808
empty inline boxes count as content that causes a line box to exist
They do, but shouldn't they be treated as "not existing for any other [than positioning out-of-flow descendants] purpose" phantom line boxes?
That depends on if they are assigned any other styles. :)
-
There is no standard that I'm aware of that says, "here is how HTML source code should be formatted". But for the most part I think most users would agree that that the HTML source code formatting should reflect the rendering of the HTML in the browser for the most part. (That's a reasonable starting point, anyway, although users of course may want to override some things.)
Browser rendering is based on CSS, which defaults to
display: inline
for unknown elements. You can read an explanation of this on Stack Overflow, but in short the CSSdisplay
property has an initial value ofinline
as per the CSS 2.1 spec. The CSS3 Display Module says the same thing.
==An important thing to note in this: It's based on CSS, but CSS can be configured to do ANYTHING.==
beautifier/js-beautify#1732 (comment):
Yes, I'm pretty sure I'm the one who suggested to use HTML5 "phrasing content" in #840. The definition says basically that it's stuff in a paragraph, so for the most part I think those do reflect the right things.
Now that I'm looking more closely at it, I'm ==not sure it's so straightforward with nested things==. For example "phrasing content" includes
<area>
inside a map, which someone might want to be placed on individual lines. And what about<param>
inside<object>
and<source>
inside<video>
? From the MDN examples, it looks like they would look nicer on separate lines, even though<object>
and<video>
are "phrasing" elements. But then again if the<object>
had no<param>
children, we wouldn't want it by itself break a paragraph.I'm almost thinking my original idea of ==a "container" classification might be useful==. A container would:
- Make all of its children be automatically considered "block" elements.
- It would be considered a "block" element if it contained children; otherwise it would be considered an inline element.
But that is getting pretty complicated. And besides I'm not sure how I would want these sort of elements formatted. ==Maybe they would be fine just formatted inline in a paragraph.==
I'm still not 100% sure how I would want ==
<iframe>
to be formatted, even though it's "phrasing content"==. And what about<frame>
inside a<frameset>
(although those are obsolete and should not be used)?I think the biggest doubt I have is things like ==
<dataset><option>
, and of course<option>
within a<select>
.====For now I'm going with this definition of "block element":==
- ==All the elements that should default to
diplay: block
in the browser.==- ==All the elements that should default to
diplay: list-item
in the browser.==- ==All the elements HTML5 specifies as "metadata content".==
- ==The HTML
<head>
element.==For a pretty complex HTML document, that matches almost exactly the js-beautify output using my algorithm (except for
<figcaption>
as mentioned). The only other difference is the order of attributes, which my algorithm changes to be consistent based upon certain rules.For me that will work for now, and I'll keep thinking about what want to do with the nested things like
<option>
and<param>
and such.
I've since learned that in HTML5 it is perfectly valid to put block level elements inside link tags eg:
<a href="#"> <h1>Heading</h1> <p>Paragraph.</p> </a>This is actually really useful if you want a large block of HTML to be a link.
If you read the accepted answer to the question, it is clear that for block level elements in sequence (be it originally a block or an inline coerced into a block by containing a block), zero-space
and space
have the same effect: the blocks are distinct and separate, and thus rendered.
Link to browser-whitespace-study.html
and also to htmlnorm.test.html
(perhaps renamed)
- perhaps combine these two files
- remove
htmlnorm.test.html
from.gitignore
and check in. (changed behavior on code changes will show up as changes in this file to be checked in) - 🚩 Including these two files MIGHT replace some of the README stuff below.
Footnotes
-
🟧 See Open Question about whether browsers collapse whitespace before CSS sees anything ↩