Faithful mode #448

Open
wants to merge 3 commits into
base: master
9 changes: 8 additions & 1 deletion README.md
@@ -60,6 +60,9 @@ var turndownService = new TurndownService({ option: 'value' })
| `linkStyle` | `inlined` or `referenced` | `inlined` |
| `linkReferenceStyle` | `full`, `collapsed`, or `shortcut` | `full` |
| `preformattedCode` | `false` or [`true`](https://github.com/lucthev/collapse-whitespace/issues/16) | `false` |
| `renderAsPure` | `true` or `false` | `true` |

The `renderAsPure` option controls how this library handles HTML that can't be represented exactly in pure Markdown. For example, `<em style="color:red">bang</em>` can be rendered as the pure Markdown `*bang*`, but this loses the red color. It can also be rendered as HTML embedded in Markdown, `<em style="color:red">bang</em>`, which is exact but less readable. Setting `renderAsPure` to `true` chooses the simple, lossy rendering; setting it to `false` chooses the verbose, exact rendering.
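
As a rough illustration of this trade-off, the sketch below models both renderings for a single `<em>` element. The `renderEm` helper and its plain-object input are hypothetical stand-ins for illustration; they are not part of Turndown's API.

```javascript
// Hypothetical helper, not part of Turndown: models how one <em> element
// would be rendered under each setting of `renderAsPure`.
function renderEm (element, renderAsPure) {
  var names = Object.keys(element.attributes)
  // Pure mode always emits Markdown; faithful mode emits Markdown only
  // when no attributes would be lost.
  if (renderAsPure || names.length === 0) {
    return '*' + element.text + '*'
  }
  // Faithful mode: keep the original HTML so the attributes survive.
  var attrs = names.map(function (name) {
    return ' ' + name + '="' + element.attributes[name] + '"'
  }).join('')
  return '<em' + attrs + '>' + element.text + '</em>'
}

var bang = { text: 'bang', attributes: { style: 'color:red' } }
console.log(renderEm(bang, true)) // *bang*
console.log(renderEm(bang, false)) // <em style="color:red">bang</em>
```

An element with no attributes renders as Markdown under either setting, since nothing is lost.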

### Advanced Options

@@ -176,6 +179,10 @@ filter: function (node, options) {
}
```

### `pureAttributes` Dict|Function

The `pureAttributes` property defines which attributes of an HTML element can be rendered using pure Markdown. For example, an `<a>` tag can carry an `href` attribute with any value; `pureAttributes: {href: undefined}` expresses this, since a value of `undefined` allows any value for that attribute. For additional flexibility, `pureAttributes` also accepts a `function (node, options)`, which can modify `node.renderAsPure` and/or return a dict of allowed attributes.
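
The dict semantics can be sketched as follows. This is a minimal model under stated assumptions: `attributesArePure` is a hypothetical name, and `actual` is a plain object standing in for an element's attributes rather than a real DOM attribute map.

```javascript
// Sketch of the dict semantics only. Returns true when every attribute the
// element actually carries is permitted by the `allowed` dict.
function attributesArePure (actual, allowed) {
  return Object.keys(actual).every(function (name) {
    // The attribute must be listed, and either any value is allowed
    // (`undefined`) or the value must match exactly.
    return name in allowed &&
      (allowed[name] === undefined || allowed[name] === actual[name])
  })
}

var linkAttributes = { href: undefined, title: undefined }
console.log(attributesArePure({ href: 'https://example.com' }, linkAttributes)) // true
console.log(attributesArePure({ href: '/a', target: '_blank' }, linkAttributes)) // false
```

In the second call the `target` attribute is not listed, so the link could not be written as pure Markdown without losing it.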

### `replacement` Function

The replacement function determines how an element should be converted. It should return the Markdown string for a given node. The function is passed the node's content, the node itself, and the `TurndownService` options.
@@ -227,7 +234,7 @@ To avoid the complexity and the performance implications of parsing the content

If you are confident in doing so, you may want to customise the escaping behaviour to suit your needs. This can be done by overriding `TurndownService.prototype.escape`. `escape` takes the text of each HTML element and should return a version with the Markdown characters escaped.

Note: text in code elements is never passed to`escape`.
Note: text in code elements is never passed to `escape`.

## License

3 changes: 2 additions & 1 deletion src/collapse-whitespace.js
@@ -37,6 +37,7 @@ function collapseWhitespace (options) {
var isPre = options.isPre || function (node) {
return node.nodeName === 'PRE'
}
var renderAsPure = options.renderAsPure

if (!element.firstChild || isPre(element)) return

@@ -80,7 +81,7 @@ function collapseWhitespace (options) {
// Drop protection if set previously.
keepLeadingWs = false
}
} else {
} else if (renderAsPure) {
node = remove(node)
continue
}
54 changes: 54 additions & 0 deletions src/commonmark-rules.js
@@ -1,3 +1,4 @@
import Node from './node'
import { repeat } from './utilities'

var rules = {}
@@ -47,6 +48,24 @@ rules.blockquote = {

rules.list = {
filter: ['ul', 'ol'],
pureAttributes: function (node, options) {
// When rendering in faithful mode, check that all children are `<li>` elements that can be faithfully rendered. If not, this must be rendered as HTML.
if (!options.renderAsPure) {
var childrenPure = Array.prototype.reduce.call(node.childNodes,
(previousValue, currentValue) =>
previousValue &&
currentValue.nodeName === 'LI' &&
(new Node(currentValue, options)).renderAsPure, true
)
if (!childrenPure) {
// If any of the children must be rendered as HTML, then this node must also be rendered as HTML.
node.renderAsPure = false
return
}
}
// Allow a `start` attribute if this is an `ol`.
return node.nodeName === 'OL' ? {start: undefined} : {}
},

replacement: function (content, node) {
var parent = node.parentNode
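
The child check in the list rule's `pureAttributes` function above can be sketched in isolation. In this sketch, plain objects stand in for DOM nodes and carry a precomputed `renderAsPure` flag, whereas the diff derives that flag by wrapping each child in `new Node(child, options)`. The name `childrenArePureListItems` is hypothetical.

```javascript
// Returns true only if every child is an <li> that can itself be rendered
// as pure Markdown. `reduce` is called via `Array.prototype` because real
// `childNodes` collections are array-like, not arrays.
function childrenArePureListItems (childNodes) {
  return Array.prototype.reduce.call(childNodes, function (allPure, child) {
    return allPure && child.nodeName === 'LI' && child.renderAsPure
  }, true)
}

var pureList = [
  { nodeName: 'LI', renderAsPure: true },
  { nodeName: 'LI', renderAsPure: true }
]
var impureList = [
  { nodeName: 'LI', renderAsPure: true },
  { nodeName: 'LI', renderAsPure: false }
]
console.log(childrenArePureListItems(pureList)) // true
console.log(childrenArePureListItems(impureList)) // false
```

A single impure or non-`<li>` child is enough to force the whole list to be emitted as HTML in faithful mode.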
@@ -89,6 +108,15 @@ rules.indentedCodeBlock = {
)
},

pureAttributes: function (node, options) {
// Check the purity of the child block(s) which contain the code.
node.renderAsPure = options.renderAsPure || (node.renderAsPure && (
// There's only one child (the code element), and it's pure.
new Node(node.firstChild, options)).renderAsPure && node.childNodes.length === 1 &&
// There's only one child of this code element, and it's text.
node.firstChild.childNodes.length === 1 && node.firstChild.firstChild.nodeType === 3)
},

replacement: function (content, node, options) {
return (
'\n\n ' +
@@ -108,6 +136,22 @@ rules.fencedCodeBlock = {
)
},

pureAttributes: function (node, options) {
// Check the purity of the child code element.
var firstChild = new Node(node.firstChild, options)
var className = firstChild.getAttribute('class') || ''
var language = (className.match(/language-(\S+)/) || [null, ''])[1]
// Allow the matched classname as pure Markdown. Compare using the `className` attribute, since the `class` attribute returns an object, not an easily-comparable string.
if (language) {
firstChild.renderAsPure = firstChild.renderAsPure || firstChild.className === `language-${language}`
}
node.renderAsPure = options.renderAsPure || (node.renderAsPure &&
// There's only one child (the code element), and it's pure.
firstChild.renderAsPure && node.childNodes.length === 1 &&
// There's only one child of this code element, and it's text.
node.firstChild.childNodes.length === 1 && node.firstChild.firstChild.nodeType === 3)
},

replacement: function (content, node, options) {
var className = node.firstChild.getAttribute('class') || ''
var language = (className.match(/language-(\S+)/) || [null, ''])[1]
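
The language extraction used by the fenced code block rule, together with the purity comparison added in this hunk, can be tried on plain strings. The helper names below are hypothetical; the regular expression and fallback array are the ones used in the diff.

```javascript
// Extracts the language from a class string such as "language-js".
// The `[null, '']` fallback makes the result '' when nothing matches.
function languageFromClass (className) {
  return ((className || '').match(/language-(\S+)/) || [null, ''])[1]
}

// A class attribute is representable in a fenced block's info string only
// if it consists of exactly one `language-*` class and nothing else.
function classIsOnlyLanguage (className) {
  var language = languageFromClass(className)
  return language !== '' && className === 'language-' + language
}

console.log(languageFromClass('language-js')) // js
console.log(classIsOnlyLanguage('language-js')) // true
console.log(classIsOnlyLanguage('hljs language-js')) // false
```

The last case shows why the comparison matters: the extra `hljs` class would be lost in a fenced block, so faithful mode must keep the HTML.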
@@ -151,6 +195,8 @@ rules.inlineLink = {
)
},

pureAttributes: {href: undefined, title: undefined},

replacement: function (content, node) {
var href = node.getAttribute('href')
var title = cleanAttribute(node.getAttribute('title'))
@@ -168,6 +214,8 @@ rules.referenceLink = {
)
},

pureAttributes: {href: undefined, title: undefined},

replacement: function (content, node, options) {
var href = node.getAttribute('href')
var title = cleanAttribute(node.getAttribute('title'))
@@ -232,6 +280,11 @@ rules.code = {
return node.nodeName === 'CODE' && !isCodeBlock
},

pureAttributes: function (node, options) {
// An inline code block must contain only text to be rendered as Markdown.
node.renderAsPure = options.renderAsPure || (node.renderAsPure && node.firstChild.nodeType === 3 && node.childNodes.length === 1)
},

replacement: function (content) {
if (!content) return ''
content = content.replace(/\r?\n|\r/g, ' ')
@@ -247,6 +300,7 @@ rules.code = {

rules.image = {
filter: 'img',
pureAttributes: {alt: undefined, src: undefined, title: undefined},

replacement: function (content, node) {
var alt = cleanAttribute(node.getAttribute('alt'))
24 changes: 24 additions & 0 deletions src/node.js
@@ -5,6 +5,30 @@ export default function Node (node, options) {
node.isCode = node.nodeName === 'CODE' || node.parentNode.isCode
node.isBlank = isBlank(node)
node.flankingWhitespace = flankingWhitespace(node, options)
// When true, this node will be rendered as pure Markdown; false indicates it will be rendered using HTML. A value of true can indicate either that the source HTML can be perfectly captured as Markdown, or that the source HTML will be approximated as Markdown by discarding some HTML attributes (options.renderAsPure === true). Note that the value computed below is an initial estimate, which may be updated by a rule's `pureAttributes` property.
node.renderAsPure = options.renderAsPure || node.attributes === undefined || node.attributes.length === 0
// Given a dict of attributes that an HTML element may contain and still be convertible to pure Markdown, update the `node.renderAsPure` attribute. The keys of the dict define allowable attributes; the values define the value allowed for that key. If the value is `undefined`, then any value is allowed for the given key.
node.addPureAttributes = (d) => {
// Only perform this check if the node isn't pure and there's something to check. Note that `d.length` is always `undefined` (JavaScript is fun).
if (!node.renderAsPure && Object.keys(d).length) {
// Check to see how many of the allowed attributes match the actual attributes.
let allowedLength = 0
for (const [key, value] of Object.entries(d)) {
if (key in node.attributes && (value === undefined || node.attributes[key].value === value)) {
++allowedLength
}
}
// If the lengths are equal, then every attribute matched with an allowed attribute: this node is representable in pure Markdown.
if (node.attributes.length === allowedLength) {
node.renderAsPure = true
}
}
}

// Provide a means to escape HTML to conform to Markdown's requirements. This happens only inside preformatted code blocks, where `collapseWhitespace` avoids removing newlines.
node.cleanOuterHTML = () => node.outerHTML.replace(/\n/g, '&#10;').replace(/\r/g, '&#13;')
// Output the provided string if `node.renderAsPure`; otherwise, output `node.outerHTML`.
node.ifPure = (str) => node.renderAsPure ? str : node.cleanOuterHTML()
return node
}
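
The two helpers at the end of this hunk can be exercised without a DOM. The sketch below rewrites them as standalone functions over a plain object with `renderAsPure` and `outerHTML` fields; that object shape is an assumption made for illustration, since in the diff both helpers are methods attached to the node itself.

```javascript
// Replace raw newlines and carriage returns with character references so
// that embedded HTML stays on one line in the Markdown output.
function cleanOuterHTML (outerHTML) {
  return outerHTML.replace(/\n/g, '&#10;').replace(/\r/g, '&#13;')
}

// Emit the Markdown string for a pure node, or the cleaned HTML otherwise.
function ifPure (node, markdown) {
  return node.renderAsPure ? markdown : cleanOuterHTML(node.outerHTML)
}

var impure = { renderAsPure: false, outerHTML: '<pre class="x">a\nb</pre>' }
var pure = { renderAsPure: true, outerHTML: '<em>hi</em>' }
console.log(ifPure(impure, 'ignored')) // <pre class="x">a&#10;b</pre>
console.log(ifPure(pure, '*hi*')) // *hi*
```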

3 changes: 2 additions & 1 deletion src/root-node.js
@@ -20,7 +20,8 @@ export default function RootNode (input, options) {
element: root,
isBlock: isBlock,
isVoid: isVoid,
isPre: options.preformattedCode ? isPreOrCode : null
isPre: options.preformattedCode ? isPreOrCode : null,
renderAsPure: options.renderAsPure
})

return root
117 changes: 101 additions & 16 deletions src/turndown.js
@@ -4,6 +4,45 @@ import { extend, trimLeadingNewlines, trimTrailingNewlines } from './utilities'
import RootNode from './root-node'
import Node from './node'
var reduce = Array.prototype.reduce
// Taken from `commonmark.js/lib/common.js`.
var TAGNAME = '[A-Za-z][A-Za-z0-9-]*'
var ATTRIBUTENAME = '[a-zA-Z_:][a-zA-Z0-9:._-]*'
var UNQUOTEDVALUE = "[^\"'=<>`\\x00-\\x20]+"
var SINGLEQUOTEDVALUE = "'[^']*'"
var DOUBLEQUOTEDVALUE = '"[^"]*"'
var ATTRIBUTEVALUE =
'(?:' +
UNQUOTEDVALUE +
'|' +
SINGLEQUOTEDVALUE +
'|' +
DOUBLEQUOTEDVALUE +
')'
var ATTRIBUTEVALUESPEC = '(?:' + '\\s*=' + '\\s*' + ATTRIBUTEVALUE + ')'
var ATTRIBUTE = '(?:' + '\\s+' + ATTRIBUTENAME + ATTRIBUTEVALUESPEC + '?)'
var OPENTAG = '<' + TAGNAME + ATTRIBUTE + '*' + '\\s*/?>'
var CLOSETAG = '</' + TAGNAME + '\\s*[>]'
var HTMLCOMMENT = '<!-->|<!--->|<!--(?:[^-]+|-[^-]|--[^>])*-->'
var PROCESSINGINSTRUCTION = '[<][?][\\s\\S]*?[?][>]'
var DECLARATION = '<![A-Z]+' + '[^>]*>'
var CDATA = '<!\\[CDATA\\[[\\s\\S]*?\\]\\]>'
var HTMLTAG =
'(?:' +
OPENTAG +
'|' +
CLOSETAG +
'|' +
// Note: Turndown removes comments, so this portion of the regex isn't
// necessary, but doesn't cause problems.
HTMLCOMMENT +
'|' +
PROCESSINGINSTRUCTION +
'|' +
DECLARATION +
'|' +
CDATA +
')'
// End of copied commonmark code.
var escapes = [
[/\\/g, '\\\\'],
[/\*/g, '\\*'],
@@ -17,7 +56,28 @@ var escapes = [
[/\]/g, '\\]'],
[/^>/g, '\\>'],
[/_/g, '\\_'],
[/^(\d+)\. /g, '$1\\. ']
[/^(\d+)\. /g, '$1\\. '],
// Per [section 6.6 of the CommonMark spec](https://spec.commonmark.org/0.30/#raw-html),
// Raw HTML, CommonMark recognizes and passes through HTML-like tags and
// their contents. Therefore, Turndown needs to escape text that would parse
// as an HTML-like tag. This regex recognizes these tags and escapes them by
// inserting a leading backslash.
[new RegExp(HTMLTAG, 'g'), '\\$&'],
// Likewise, [section 4.6 of the CommonMark spec](https://spec.commonmark.org/0.30/#html-blocks),
// HTML blocks, requires the same treatment.
//
// This regex was copied from `commonmark.js/lib/blocks.js`, the
// `reHtmlBlockOpen` variable. We only need regexps for patterns not matched
// by the previous pattern, so this doesn't need all expressions there.
//
// TODO: this is too aggressive; it should only recognize this pattern at
// the beginning of a line of CommonMark source; these will recognize the
// pattern at the beginning of any inline or block markup. The approach I
// tried was to put this in `commonmark-rules.js` for the `paragraph` and
// `heading` rules (the only block beginning-of-line rules). However, text
// outside a paragraph/heading doesn't get escaped in this case.
[/^<(?:script|pre|textarea|style)(?:\s|>|$)/i, '\\$&'],
[/^<[/]?(?:address|article|aside|base|basefont|blockquote|body|caption|center|col|colgroup|dd|details|dialog|dir|div|dl|dt|fieldset|figcaption|figure|footer|form|frame|frameset|h[123456]|head|header|hr|html|iframe|legend|li|link|main|menu|menuitem|nav|noframes|ol|optgroup|option|p|param|section|source|summary|table|tbody|td|tfoot|th|thead|title|tr|track|ul)(?:\s|[/]?[>]|$)/i, '\\$&']
]
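
To see the effect of the inline-tag escape in isolation, the sketch below rebuilds only the open-tag and close-tag portions of the pattern, using the same pieces copied from `commonmark.js`, and applies the same `'\\$&'` replacement. It deliberately omits comments, processing instructions, declarations, CDATA and the block-level patterns, and `escapeHtmlLikeTags` is a hypothetical name.

```javascript
var TAGNAME = '[A-Za-z][A-Za-z0-9-]*'
var ATTRIBUTENAME = '[a-zA-Z_:][a-zA-Z0-9:._-]*'
var ATTRIBUTEVALUE = '(?:[^"\'=<>`\\x00-\\x20]+|\'[^\']*\'|"[^"]*")'
var ATTRIBUTE = '(?:\\s+' + ATTRIBUTENAME + '(?:\\s*=\\s*' + ATTRIBUTEVALUE + ')?)'
var OPENTAG = '<' + TAGNAME + ATTRIBUTE + '*\\s*/?>'
var CLOSETAG = '</' + TAGNAME + '\\s*[>]'
var tagPattern = new RegExp('(?:' + OPENTAG + '|' + CLOSETAG + ')', 'g')

// Prefix every tag-like match with a backslash so that a CommonMark parser
// treats it as literal text rather than raw HTML.
function escapeHtmlLikeTags (text) {
  return text.replace(tagPattern, '\\$&')
}

console.log(escapeHtmlLikeTags('Use <b>bold</b> here')) // Use \<b>bold\</b> here
console.log(escapeHtmlLikeTags('1 < 2 and 3 > 2')) // 1 < 2 and 3 > 2
```

Text that merely contains `<` and `>` is left alone, because a tag name must start with a letter immediately after the `<`.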

export default function TurndownService (options) {
@@ -36,14 +96,18 @@ export default function TurndownService (options) {
linkReferenceStyle: 'full',
br: ' ',
preformattedCode: false,
// Should the output be pure (pure Markdown, with no HTML blocks; this discards any HTML input that can't be represented in "pure" Markdown) or faithful (any input HTML that can't be exactly duplicated using Markdown remains HTML in the resulting output)? This is `true` by default, following the original author's design.
renderAsPure: true,
blankReplacement: function (content, node) {
return node.isBlock ? '\n\n' : ''
},
keepReplacement: function (content, node) {
return node.isBlock ? '\n\n' + node.outerHTML + '\n\n' : node.outerHTML
},
defaultReplacement: function (content, node) {
return node.isBlock ? '\n\n' + content + '\n\n' : content
defaultReplacement: function (content, node, options) {
// A hack: for faithful output, always produce the HTML, rather than the content. To get this, tell the node it's impure.
node.renderAsPure = options.renderAsPure
return node.isBlock ? '\n\n' + node.ifPure(content) + '\n\n' : node.ifPure(content)
}
}
this.options = extend({}, defaults, options)
@@ -156,25 +220,44 @@ TurndownService.prototype = {

function process (parentNode) {
var self = this
return reduce.call(parentNode.childNodes, function (output, node) {
node = new Node(node, self.options)

var replacement = ''
if (node.nodeType === 3) {
replacement = node.isCode ? node.nodeValue : self.escape(node.nodeValue)
} else if (node.nodeType === 1) {
replacement = replacementForNode.call(self, node)
}
// Note that the root node passed to Turndown isn't translated -- only its children, since the root node is simply a container (a div or body tag) of items to translate. Only the root node's `renderAsPure` attribute is undefined; treat it as pure, since we never translate this node.
if (parentNode.renderAsPure || parentNode.renderAsPure === undefined) {
return reduce.call(parentNode.childNodes, function (output, node) {
node = new Node(node, self.options)

var replacement = ''
// Is this a text node?
if (node.nodeType === 3) {
replacement = node.isCode ? node.nodeValue : self.escape(node.nodeValue)
// Is this an element node?
} else if (node.nodeType === 1) {
replacement = replacementForNode.call(self, node)
// In faithful mode, return the contents for these special cases.
} else if (!self.options.renderAsPure) {
if (node.nodeType === 4) {
replacement = `<![CDATA[${node.nodeValue}]]>`
} else if (node.nodeType === 7) {
replacement = `<?${node.nodeValue}?>`
} else if (node.nodeType === 8) {
replacement = `<!--${node.nodeValue}-->`
} else if (node.nodeType === 10) {
replacement = `<!${node.nodeValue}>`
}
}

return join(output, replacement)
}, '')
return join(output, replacement)
}, '')
} else {
// If the `parentNode` represented itself as raw HTML, that contains all the contents of the child nodes.
return ''
}
}
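
The faithful-mode handling of non-element, non-text nodes in `process` can be sketched as a standalone function. Plain objects stand in for DOM nodes, and `serializeSpecialNode` is a hypothetical name. One assumption differs from the diff: for a processing instruction the sketch also writes the instruction's target, which the DOM exposes as `target`, because `nodeValue` holds only the data that follows it. Doctype nodes are left out.

```javascript
// Standard DOM `nodeType` constants.
var CDATA_SECTION_NODE = 4
var PROCESSING_INSTRUCTION_NODE = 7
var COMMENT_NODE = 8

// Returns the source text for a special node, or '' for any node type
// this sketch doesn't handle.
function serializeSpecialNode (node) {
  switch (node.nodeType) {
    case CDATA_SECTION_NODE:
      return '<![CDATA[' + node.nodeValue + ']]>'
    case PROCESSING_INSTRUCTION_NODE:
      // Assumption: emit the target followed by the data.
      return '<?' + node.target + ' ' + node.nodeValue + '?>'
    case COMMENT_NODE:
      return '<!--' + node.nodeValue + '-->'
    default:
      return ''
  }
}

console.log(serializeSpecialNode({ nodeType: 8, nodeValue: ' note ' })) // <!-- note -->
console.log(serializeSpecialNode({ nodeType: 4, nodeValue: 'x < y' })) // <![CDATA[x < y]]>
```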

/**
* Appends strings as each rule requires and trims the output
* @private
* @param {String} output The conversion output
* @returns A trimmed version of the ouput
* @returns A trimmed version of the output
* @type String
*/

@@ -199,12 +282,14 @@ function postProcess (output) {

function replacementForNode (node) {
var rule = this.rules.forNode(node)
node.addPureAttributes((typeof rule.pureAttributes === 'function' ? rule.pureAttributes(node, this.options) : rule.pureAttributes) || {})
var content = process.call(this, node)
var whitespace = node.flankingWhitespace
if (whitespace.leading || whitespace.trailing) content = content.trim()
return (
whitespace.leading +
rule.replacement(content, node, this.options) +
// If this node contains impure content, then it must be replaced with HTML. In this case, the `content` doesn't matter, so it's passed as an empty string.
(node.renderAsPure ? rule.replacement(content, node, this.options) : this.options.defaultReplacement('', node, this.options)) +
whitespace.trailing
)
}