HtmlSanitizerUtil

Mark Reeves edited this page Jan 4, 2018 · 6 revisions

HtmlSanitizerUtil provides a basic HTML sanitizer using AntiSamy. It offers two levels of sanitization: a strict one optimized for input sanitization of user-entered content through WTextArea, and a more permissive one recommended for output sanitization. Both may be overridden by a particular application.

Input sanitization

HtmlSanitizerUtil is used to sanitize user input from WTextArea in rich-text mode. This uses the default AntiSamy Policy and is the same as calling com.github.bordertech.wcomponents.util.HtmlSanitizerUtil.sanitize(String) or com.github.bordertech.wcomponents.util.HtmlSanitizerUtil.sanitize(String, false). The default Policy works with the default implementation of tinyMCE used by WTextArea and is quite strict.

Implementing input sanitization

This Policy may be used for custom input sanitization. One would merely override setData(Object) in an extension of WComponent, or any similar input method in a custom component, to call the sanitize routine. For example:

// In this case we are processing user input from an HTML form. We will assume the
// input data is a String but it would be easy to handle other cases.
// For the sake of argument we are also going to assume this data is going to be reused
// at some time in the future and may not be escaped so NEEDS to be safe.
@Override
public void setData(final Object data) {
  if (data instanceof String) {
    String dataString = (String) data;

    if (Util.empty(dataString)) {
      // no need to sanitize an empty-ish string.
      super.setData(data);
    } else {
      try {
        // sanitize input to get rid of potentially harmful HTML.
        super.setData(HtmlSanitizerUtil.sanitize(dataString));
      } catch (ScanException | PolicyException e) {
        // If the Sanitizer throws an error we are not able to sanitize
        // Options are:
        // * throw an Exception
        // * do nothing;
        // * save it anyway (in which case why sanitize?);
        // * set data to "";
        // * set data to null; or
        // * set data to something which is like the original but safer.
        // Let's do the last one - it is probably cleverest.
        super.setData(StringEscapeUtils.escapeXml10(dataString));
      }
    }
  } else {
    // we either have null (which is normal) or something
    // completely unexpected. Do something sensible with it but this
    // is out of scope for this example.
  }
}
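
The catch block above falls back to escaping the original String when sanitization fails. As a minimal sketch only, that pattern could be factored into a small helper. Nothing here is part of WComponents: the class, the ThrowingSanitizer interface and the hand-rolled escapeXml method are hypothetical stand-ins (the example above uses StringEscapeUtils.escapeXml10 for the escaping step).

```java
// Hypothetical helper: none of these names exist in WComponents.
final class SafeSanitizeSketch {

    /** Stand-in for any sanitizer call that may throw a checked exception. */
    interface ThrowingSanitizer {
        String sanitize(String html) throws Exception;
    }

    /** Minimal XML escaping used as the fallback when sanitization fails. */
    static String escapeXml(final String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Returns the sanitized HTML, or the escaped original if the sanitizer throws.
     * Null and empty input are returned unchanged as there is nothing to sanitize.
     */
    static String sanitizeOrEscape(final String html, final ThrowingSanitizer sanitizer) {
        if (html == null || html.isEmpty()) {
            return html;
        }
        try {
            return sanitizer.sanitize(html);
        } catch (Exception e) {
            // Could not sanitize: fall back to something like the original but safer.
            return escapeXml(html);
        }
    }
}
```

Assuming the single-argument sanitize method described on this page, a setData override might then reduce to something like super.setData(SafeSanitizeSketch.sanitizeOrEscape(dataString, HtmlSanitizerUtil::sanitize)).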

It is possible to use the permissive Policy as an input Policy by calling HtmlSanitizerUtil.sanitize(String, true) but this could be rather dangerous.

Modifying input sanitization

WComponents currently uses a single AntiSamy Policy for all input sanitization. The location of the AntiSamy Policy configuration file is set using the WComponents property com.github.bordertech.wcomponents.AntiSamy.config. The default Policy is quite strict, but not as strict as the AntiSamy tinyMCE policy on which it is based. To use a different input Policy in an application, add the Policy XML file to the application's resource bundle and then set the property com.github.bordertech.wcomponents.AntiSamy.config to point to that Policy.

The input sanitization Policy may need to be modified (usually relaxed) if rich-text WTextAreas in the application are configured to allow a wider range of options than the default. The default Policy does not, for example, allow setting the class attribute on an element and has an extremely small set of allowed values in the style attribute.

Output sanitization

Output sanitization can be used in WTextArea, WText and WLabel on a per-instance basis. It will occur if (and only if) the output string is not XML escaped. HTML sanitization can be quite time and/or resource hungry so it should not be turned on as a matter of course. To this end there is a property which may be used to turn on output sanitization per instance of these components using setSanitizeOnOutput(boolean).

A method is provided to undertake less-strict sanitization, which is recommended for output sanitization of unescaped HTML of dubious origin. This Policy uses an XML config file set by the WComponents property com.github.bordertech.wcomponents.AntiSamyLax.config. The default implementation of this Policy is quite permissive and should be used for output sanitization of unescaped components when the original source of the HTML is unknown. If the source is known and trusted it may be safe to output without sanitization.

The following sample shows the use of the setter on a WTextArea.

WTextArea taRichText = new WTextArea();
// Turns on output sanitization. It is safe to set this flag
// as sanitization is only undertaken if it is needed.
taRichText.setSanitizeOnOutput(true);
// Output sanitization is unnecessary if the WTextArea is not rich text
// as the content will always be escaped.
taRichText.setRichTextArea(true);
// Output sanitization is only needed if the WTextArea is read only;
// otherwise the content will be escaped.
taRichText.setReadOnly(true);
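
The comments in the sample above imply that, for a WTextArea, output sanitization only has any effect when three conditions hold together. The following is a small model of that rule as described on this page, not the library's own code; the class and method names are hypothetical.

```java
// Hypothetical model of the conditions described above; not WComponents code.
final class OutputSanitizeConditionSketch {

    /**
     * @param sanitizeOnOutput the flag set with setSanitizeOnOutput(boolean)
     * @param richText whether the text area is in rich-text mode
     * @param readOnly whether the text area is read only
     * @return true only when unescaped HTML would be output and the flag is set
     */
    static boolean sanitizesOnOutput(final boolean sanitizeOnOutput,
            final boolean richText, final boolean readOnly) {
        // Plain-text or editable content is escaped, so there is nothing to sanitize.
        boolean outputsUnescapedHtml = richText && readOnly;
        return sanitizeOnOutput && outputsUnescapedHtml;
    }
}
```

Under this model, setting the flag on a plain-text or editable WTextArea costs nothing because the sanitizer is never reached.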

Modifying output sanitization

WComponents uses a single AntiSamy Policy for all output sanitization. This may be extended in the future if required. The output sanitization Policy is an XML file and the default is quite lax. A different Policy for a specific application may be implemented by adding a Policy XML file to the application's resource bundle and setting the property com.github.bordertech.wcomponents.AntiSamyLax.config to point to it. Note that if the Policy cannot be found when the sanitizer is called, the sanitizer will fall back to the stricter Policy (as used for WTextArea input sanitization).
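
The fallback to the stricter Policy can be pictured with the following minimal sketch. It is not the WComponents implementation: the class, method and file names are hypothetical, and the set of available resource names stands in for whatever lookup the library performs.

```java
import java.util.Set;

// Hypothetical model of the fallback rule; not WComponents code.
final class PolicyFallbackSketch {

    /** Hypothetical resource names for the two Policies. */
    static final String STRICT_POLICY = "strict-policy.xml";
    static final String LAX_POLICY = "lax-policy.xml";

    /**
     * Chooses which Policy to apply.
     *
     * @param lax true if the caller asked for the permissive (output) Policy
     * @param availableResources names of the Policy files that can be found
     * @return the lax Policy if requested and present, otherwise the strict Policy
     */
    static String choosePolicy(final boolean lax, final Set<String> availableResources) {
        if (lax && availableResources.contains(LAX_POLICY)) {
            return LAX_POLICY;
        }
        // Either the strict Policy was requested or the lax one is missing.
        return STRICT_POLICY;
    }
}
```

One consequence of this rule is that a mistyped Policy path fails safe: output is sanitized more strictly than intended rather than not at all.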

Applying output sanitization to a custom component

HtmlSanitizerUtil exposes two static methods:

  • com.github.bordertech.wcomponents.util.HtmlSanitizerUtil.sanitize(String)
  • com.github.bordertech.wcomponents.util.HtmlSanitizerUtil.sanitize(String, Boolean)

Both return the sanitized HTML as a String. The first method always uses the "stricter" input Policy. The second uses the more relaxed Policy if the Boolean argument is true.

Each sanitize method is merely a String manipulator and may, therefore, be used at any stage where a (potentially insecure) HTML String is available. It can be used when getting data from a data source, from a service call, from the component's internal methods or immediately prior to rendering the component (though this is not recommended).

Within a WComponent renderer, for example, one may use it in a manner similar to the following. Note, however, that it is recommended that the sanitizer be called before the render phase because it can throw Exceptions.

import com.github.bordertech.wcomponents.util.HtmlSanitizerUtil;
import com.github.bordertech.wcomponents.util.StringEscapeHTMLToXMLUtil;
import org.owasp.validator.html.PolicyException;
import org.owasp.validator.html.ScanException;
//...

XmlStringBuilder xml = renderContext.getWriter();
// let us assume the component being implemented has
// method to get its content as a String
String dirtyHtml = myComponent.getDataAsString();

if (dirtyHtml != null) {
  try {
    String safeHtml = HtmlSanitizerUtil.sanitize(dirtyHtml, true);
    if (null != safeHtml) {
      xml.print(StringEscapeHTMLToXMLUtil.unescapeToXML(safeHtml));
    }
  }
  catch (ScanException | PolicyException e) {
    // this is why it is a bad idea to do this in a renderer...
    // now we have to do something sensible (see next example).
  }
}

In the above example we have to handle the Exceptions thrown by AntiSamy during the render process when we may already have an open socket and be streaming content to a browser. It would be much better to do all of this in the component's method to get its data, such as:

// In this example we are going to suppose a component which is able
// to render unescaped HTML to the browser.
// ...

// In the component's class:
public String getDataAsString() {
  // We have some way to get the component's data. Do not care what it is,
  // this is the way WComponents do it.
  Object data = this.getData();

  // we do not want to bother trying to sanitize null
  if (data == null) {
    return null; // One may want to return an empty String
  }

  // one could, of course, know data is already a String Object and
  // merely cast it. Another option, if one is expecting a String,
  // would be to do an instanceof test above.
  String dataString = data.toString();

  try {
    return HtmlSanitizerUtil.sanitize(dataString, true);
  } catch (PolicyException | ScanException ex) {
    // If the Sanitizer throws an error we are not able to sanitize
    // so we have options:
    // 1. return null;
    // 2. throw a new Exception; or
    // 3. be sensible and escape the String. Remember the renderer will
    // not escape the content of this component.
    return StringEscapeUtils.escapeXml10(dataString);
  }
}
//...

// in the component's render class
// Yes, I have cheated and MyComponent extends WComponent, it is
// just an example!
// ...
@Override
public void doRender(final WComponent component,
    final WebXmlRenderContext renderContext) {
  MyComponent myComponent = (MyComponent) component;
  XmlStringBuilder xml = renderContext.getWriter();

  String htmlString = myComponent.getDataAsString();

  if (htmlString != null) {
    // If we are outputting unencoded content it must still be XML valid.
    xml.print(StringEscapeHTMLToXMLUtil.unescapeToXML(htmlString));
  }
}

To do

Related components

Further information