A parser for HTML5 based on the official W3C specification.
Suppose the HTML source text is:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>My test page</title>
</head>
<body>
<img src="images/firefox-icon.png" alt="My test image">
</body>
</html>
We can use the following code to parse the HTML source into an HtmlNode list:
let sourceText = ...
let doctype,nodes = HtmlUtils.parseDoc sourceText
Here doctype is a string extracted from the doctype tag, and nodes is an HtmlNode list.
All parsing stages in the package are public, so you are free to compose them to meet your own requirements. The parser is highly configurable; see the HtmlUtils source code.
To parse only the HTML structure without changing the content, use HtmldocCompiler.compile. In fact, HtmlUtils.parseDoc is defined as follows:
let parseDoc (txt:string) =
    let doctype,nodes =
        txt
        |> HtmldocCompiler.compile
    let nodes =
        nodes
        |> List.map Whitespace.removeWS
        |> Whitespace.trimWhitespace
        |> List.map HtmlCharRefs.unescapseNode
    doctype,nodes
Knowing the above definition, you can tailor the parsing result to your needs by composing only the steps you want.
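As a minimal sketch of such a customization, the variant below reuses the same steps as parseDoc but leaves out the HtmlCharRefs.unescapseNode step, so character references stay as written in the source. The name parseDocKeepCharRefs is hypothetical and not part of the library. The sketch assumes the modules are in scope exactly as in the definition above (you may need to open the package's namespace first) and that the remaining steps behave the same when the last one is dropped.

```fsharp
// Hypothetical helper, not part of the library:
// same pipeline as HtmlUtils.parseDoc, minus the character-reference step.
let parseDocKeepCharRefs (txt: string) =
    // Structural parse only; content is left unchanged.
    let doctype, nodes =
        txt
        |> HtmldocCompiler.compile
    // Same whitespace handling that parseDoc applies.
    let nodes =
        nodes
        |> List.map Whitespace.removeWS
        |> Whitespace.trimWhitespace
    doctype, nodes
```

Calling parseDocKeepCharRefs sourceText would then return a doctype and node list in the same shape as HtmlUtils.parseDoc sourceText.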
To generate HTML source text:
Render.stringifyNode
Render.stringifyDoc
HtmlUtils.stringifyNode
HtmlUtils.stringifyDoc
Some transforms:
BrRemover.splitByBr
HrRemover.splitByHr
You can parse a string through the functions in the HtmlUtils module.
You can also use the tokenizer to get a token sequence:
let tokens = HtmlTokenizer.tokenize txt
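For a quick look at what the tokenizer produces, the tokens can be printed one per line. This is only a sketch: it assumes the result of HtmlTokenizer.tokenize is an enumerable collection (a list, array, or seq), and it uses the generic %A formatter because the token type is not shown here.

```fsharp
// Assumption: `tokens` can be enumerated with Seq.iter.
let tokens = HtmlTokenizer.tokenize txt

// Print each token using F#'s structural formatter.
tokens |> Seq.iter (printfn "%A")
```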
The main structure types are defined as follows: