robpvn edited this page Mar 25, 2013 · 12 revisions

What can you do with JDOM 2.x? Manipulating XML can be done for many reasons, but almost all cases rely on a traditional input/manipulate/output concept. This page describes the most common things you may want to do with XML, and describes how JDOM can help.

Overview

JDOM is an in-memory representation of an XML document. XML consists of elements (which have attributes), text data, 'entity' references, processing instructions, and comments. XML documents can also have a DocType declaration, Comments, and Processing Instructions before the root element.

Elements and Attributes are 'named', and the full name for an element or attribute consists of its 'local' name, and its namespace URI. Think of these like first and last names respectively. The XML specification requires that Namespace URIs are referenced by a 'prefix', so there needs to be a Namespace declaration that links the Namespace URI to a prefix.
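As a minimal sketch of how a local name, a namespace URI and a prefix fit together in JDOM, the following creates a namespaced Element and reads the parts back. The prefix "ex", the URI and the class name NamespaceSketch are made up for illustration only.

```java
import org.jdom2.Element;
import org.jdom2.Namespace;

final class NamespaceSketch {

    // Hypothetical namespace: both the prefix and the URI are illustrative.
    static final Namespace EX = Namespace.getNamespace("ex", "http://example.com/ns");

    // Creates a detached Element with local name "item" in the EX namespace.
    static Element buildItem() {
        return new Element("item", EX);
    }

    public static void main(String[] args) {
        Element item = buildItem();
        System.out.println(item.getName());            // the 'local' name
        System.out.println(item.getNamespaceURI());    // the namespace URI
        System.out.println(item.getNamespacePrefix()); // the prefix linked to the URI
        System.out.println(item.getQualifiedName());   // prefix and local name together
    }
}
```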

Finally, XML content can be contained within a Document which can hold some 'meta' data for the XML.

JDOM has classes that represent each of these XML concepts: Document, DocType, Namespace, Element, Attribute, Comment, ProcessingInstruction, and Text.

'Text' is an interesting and sometimes confusing concept in XML. It is the data that happens between Element tags. Officially, the normal text between tags is called 'Parsed Character Data', or PCDATA. In XML terms and in this context, 'Parsed' means that the characters '<', '>' and '&' are treated specially (they are tokens that introduce 'child' XML structures). XML allows you to designate portions of 'Text' to be 'unparsed' in which case, it is just 'Character Data', or CDATA. CDATA is just like PCDATA, except the parser will not expect 'child' XML content to be embedded in it.

JDOM has the Text class to represent PCDATA, and it also has a subclass of Text called CDATA to represent those special times when the character content should not be parsed.

XML Data is naturally a 'hierarchy'. The XML document can have a single child element, and that element can contain child content. Some of that child content could be child elements. Any child element could, in turn, have its own child content. This lends itself to having a natural 'tree' model for representing the XML hierarchy. The top element in the tree is called the 'root'.

JDOM models the tree hierarchy using a Parent/Child type linkage between the XML structures. All structures which can contain child content (Document and Element) implement the Parent interface. All structures which can be child content extend the Content abstract class.
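These relationships can be checked with plain instanceof tests. The sketch below is only an illustration; the class name HierarchySketch and the helper roles(...) are hypothetical and not part of JDOM.

```java
import org.jdom2.CDATA;
import org.jdom2.Content;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.Parent;
import org.jdom2.Text;

final class HierarchySketch {

    // Hypothetical helper: reports which side(s) of the Parent/Content
    // linkage a node can take part in.
    static String roles(Object node) {
        boolean parent = node instanceof Parent;
        boolean content = node instanceof Content;
        if (parent && content) return "Parent and Content";
        if (parent) return "Parent only";
        if (content) return "Content only";
        return "neither";
    }

    public static void main(String[] args) {
        System.out.println("Document: " + roles(new Document()));
        System.out.println("Element:  " + roles(new Element("e")));
        System.out.println("Text:     " + roles(new Text("t")));
        // CDATA is a subclass of Text, so it is Content as well.
        System.out.println("CDATA is Text: " + (new CDATA("c") instanceof Text));
    }
}
```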

The JDOM Tree

JDOM maintains a strict parent-child type relationship. Parent-type JDOM instances (Parent) have methods to access their content, and Child-type JDOM instances (Content) have methods to access their Parent. If a Content instance has a null parent it is said to be 'detached'.

You add Content to a Parent instance by using the addContent(*) methods as well as a number of other convenient mechanisms. Content can be attached to only one Parent at any one time, but it is quite legitimate, and common, to detach some Content from one place, and re-attach it at another.
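Moving Content from one Parent to another could look like the sketch below. The class name DetachSketch, the helper move(...) and the element names are assumptions made for this illustration.

```java
import org.jdom2.Content;
import org.jdom2.Element;

final class DetachSketch {

    // Hypothetical helper: detaches 'child' from its current Parent (if any)
    // and attaches it to 'newParent'. Adding Content that is still attached
    // elsewhere would fail, so it is detached first.
    static void move(Content child, Element newParent) {
        child.detach();
        newParent.addContent(child);
    }

    public static void main(String[] args) {
        Element a = new Element("a");
        Element b = new Element("b");
        Element item = new Element("item");

        a.addContent(item);
        System.out.println("parent before: " + item.getParentElement().getName());

        move(item, b);
        System.out.println("parent after:  " + item.getParentElement().getName());
        System.out.println("content left in a: " + a.getContentSize());
    }
}
```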

Elements and Documents are Parent instances. Elements are also Content. Text, CDATA, EntityRef, ProcessingInstruction, Comment, and DocType are Content only.

JDOM exposes the parent-child nature of the tree using a number of different mechanisms:

  • Parent instances have the addContent(*) methods
  • Parent instances have the removeContent(*) methods
  • Parent instances have the getContentSize() and getContent(int) methods
  • Parent instances have the getContent() and getContent(Filter) methods which return 'live' Lists of the Parent's Content
    • modifications (adding, removing, setting) to the List are reflected immediately in the Parent
    • modifications (adding, removing) to the Parent are reflected immediately in all Lists
    • iterators and sub-lists are also modifiable.
  • Element instances have the getChild(*) and getChildren(*) methods.
  • Content instances have the getParent(), getParentElement() and getDocument() methods.
  • Content instances have the detach() method which removes that instance from its Parent.
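The 'live' behaviour of the returned Lists can be seen in a small sketch like this one. The class name LiveListSketch and the element names are placeholders chosen for the example.

```java
import java.util.List;
import org.jdom2.Element;

final class LiveListSketch {

    // Builds a 'root' Element with two child Elements, 'a' and 'b'.
    static Element sample() {
        Element root = new Element("root");
        root.addContent(new Element("a"));
        root.addContent(new Element("b"));
        return root;
    }

    // Adds through the List, then reports the size seen by the Parent.
    static int parentSizeAfterListAdd() {
        Element root = sample();
        List<Element> kids = root.getChildren();
        kids.add(new Element("c"));
        return root.getContentSize();
    }

    // Removes through the Parent, then reports the size seen by the List.
    static int listSizeAfterParentRemove() {
        Element root = sample();
        List<Element> kids = root.getChildren();
        root.removeChild("a");
        return kids.size();
    }

    public static void main(String[] args) {
        System.out.println("parent size after list add: " + parentSizeAfterListAdd());
        System.out.println("list size after parent remove: " + listSizeAfterParentRemove());
    }
}
```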

Creating New Content

What if we wanted to build an 'ant' build.xml file? This will be a simple ant file that compiles the Java source files from "./src" to "./classes".

We want the output to look like (have a look at the 'Retrieving String Content' section for some important notes):

<project default="compile">
  <target name="compile">
    <mkdir dir="./classes" />
    <javac srcdir="./src" destdir="./classes" includes="**/*.java" />
    <echo>Build Complete!</echo>
  </target>
</project>

We can build the JDOM content with:

Element mkdir = new Element("mkdir");
mkdir.setAttribute("dir","./classes");

Element javac = new Element("javac");
javac.setAttribute("srcdir", "./src");
javac.setAttribute("destdir", "./classes");
javac.setAttribute("includes", "**/*.java");

Element echo = new Element("echo");
echo.addContent(new Text("Build Complete!"));

Element compile = new Element("target");
compile.setAttribute("name","compile");
compile.addContent(mkdir);
compile.addContent(javac);
compile.addContent(echo);

Element project = new Element("project");
project.setAttribute("default", "compile");
project.addContent(compile);

Document antbuild = new Document(project);

The antbuild instance now contains a full XML tree representing the ant build.xml file.
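Because setAttribute(...), addContent(...) and setText(...) on Element return the Element itself, the same kind of tree can also be written as chained calls. The following is one possible sketch of that style; the class name FluentBuildSketch and the method buildAntFile() are hypothetical.

```java
import org.jdom2.Document;
import org.jdom2.Element;

final class FluentBuildSketch {

    // Builds the same ant build.xml structure using chained calls.
    static Document buildAntFile() {
        Element target = new Element("target")
                .setAttribute("name", "compile")
                .addContent(new Element("mkdir").setAttribute("dir", "./classes"))
                .addContent(new Element("javac")
                        .setAttribute("srcdir", "./src")
                        .setAttribute("destdir", "./classes")
                        .setAttribute("includes", "**/*.java"))
                .addContent(new Element("echo").setText("Build Complete!"));

        Element project = new Element("project")
                .setAttribute("default", "compile")
                .addContent(target);

        return new Document(project);
    }

    public static void main(String[] args) {
        Element root = buildAntFile().getRootElement();
        System.out.println("root: " + root.getName());
        System.out.println("tasks in target: " + root.getChild("target").getChildren().size());
    }
}
```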

Querying Content

There are all sorts of ways you may want to query the JDOM Tree.

Direct navigation

Normally the programmer has a very good idea of what the XML content will look like. For example, in our build.xml file, the programmer may want to get the dependencies of the default ant target. This is a complex task, with data in attributes, elements, and more attributes.

The following example shows one way (an ugly way) to get the results:

Element root = antbuild.getRootElement();
String deftarget = root.getAttributeValue("default", "all");
for (int i = 0; i < root.getContentSize(); i++) {
    Content content = root.getContent(i);
    if (content instanceof Element) {
        Element element = (Element)content;
        if ("target".equals(element.getName()) &&
                deftarget.equals(element.getAttribute("name").getValue())) {
            System.out.println("The default target " + deftarget + 
                    " has dependencies " + element.getAttributeValue("depends"));
        }
    }
}

Loops and Scans

JDOM is designed to make scanning the tree easy. The above ugly example can be simplified by using the looping/scanning options. For example, the exact same results as above can be accomplished with:

Element root = antbuild.getRootElement();
String deftarget = root.getAttributeValue("default", "all");
for (Element target : root.getChildren("target")) {
    if (deftarget.equals(target.getAttributeValue("name"))) {
        System.out.println("The default target " + deftarget + 
                " has dependencies " + target.getAttributeValue("depends"));
    }
}

Using the same build example we have above, what if we wanted to get a list of the ant 'targets', and print their names? This example shows how you can query the Element content of a parent Element. In this case, it uses the getChildren(String) method which returns all child Elements with the given name (as a List<Element>).

Element root = antbuild.getRootElement();
for (Element target : root.getChildren("target")) {
  System.out.println("We have target " + target.getAttributeValue("name"));
}

Accessing Attribute Values

Getting attribute values is a common operation. You can see in the example above how the getAttributeValue("name") is used. The getAttributeValue(*) methods are special because they first check to see if the Attribute is defined, and only then check the attribute value. This makes them convenient to use. The getAttributeValue(*) methods are also available in a way that returns a special default value if the attribute was not defined on the Element. As an example, ant build files allow an optional 'description' attribute for targets, but the example build file does not set one. Our query code can be modified to print the description, or a meaningful message if there is none:

Element root = antbuild.getRootElement();
for (Element target : root.getChildren("target")) {
  System.out.println("Target " + target.getAttributeValue("name") +
       " has description: " + target.getAttributeValue("description", "none"));
}

Retrieving String Content

Retrieving character content from an Element is also easy. In most cases, the programmer has a very good idea of the XML structure they are accessing. In our example, we are processing a build.xml file. One of the items in that build is the 'echo' task. What if we wanted to get that text value?

Element root = antbuild.getRootElement();
Element target = root.getChild("target"); // gets the first 'target'
Element echo = target.getChild("echo"); // gets the first 'echo'
String message = echo.getText();
System.out.println("echo has message: " + message);

Note that because we know our example document has only got one 'target' element, and there's just the one 'echo' element, we can get away with using the simple getChild(String) method.
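When the child may be missing, getChild(String) returns null, so chained calls need a guard. Element also has getChildText(String), which returns null when there is no such child. A small sketch of using it with a fallback follows; the class name ChildTextSketch and the helper echoMessage(...) are hypothetical.

```java
import org.jdom2.Element;

final class ChildTextSketch {

    // Builds a 'target' Element that holds a single 'echo' task.
    static Element sampleTarget() {
        Element target = new Element("target").setAttribute("name", "compile");
        target.addContent(new Element("echo").setText("Build Complete!"));
        return target;
    }

    // Hypothetical helper: the text of the first 'echo' child, or 'fallback'
    // when the target has no 'echo' child at all.
    static String echoMessage(Element target, String fallback) {
        String text = target.getChildText("echo");
        return text != null ? text : fallback;
    }

    public static void main(String[] args) {
        System.out.println(echoMessage(sampleTarget(), "none"));
        System.out.println(echoMessage(new Element("target"), "none"));
    }
}
```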

Note about the example ant build.xml
When the ant build file example was introduced, it showed the 'desired' XML result, and the JDOM/Java code to get that result. The reality is that the code does not produce the desired result.

This is the 'desired' result:

<project default="compile">
  <target name="compile">
    <mkdir dir="./classes" />
    <javac srcdir="./src" destdir="./classes" includes="**/*.java" />
    <echo>Build Complete!</echo>
  </target>
</project>

and this is a better representation of the actual result the JDOM/Java code produces:

<project default="compile"><target name="compile"><mkdir dir="./classes" /><javac srcdir="./src" destdir="./classes" includes="**/*.java" /><echo>Build Complete!</echo></target></project>

The significance of this difference is in the Text content. In the 'desired' result there is a lot of whitespace (newlines and indenting) which makes the XML more readable for humans. That whitespace would be represented by JDOM Text instances. The String "Build Complete!" is also represented by a Text instance. JDOM will normally represent one continuous character section as a single Text instance, but there are some cases in which you get consecutive Text instances. In the example, we used the code: echo.addContent(new Text("Build Complete!"));. It would be quite legal, and it would be 'identical' (in an XML sense, not a JDOM sense), to express the same thing as:

echo.addContent(new Text("Build"));
echo.addContent(new Text(" "));
echo.addContent(new Text("Complete!"));

This type of circumstance happens relatively frequently, especially when CDATA is involved (remember, CDATA is a subclass of Text). Here is a completely different example:

<root>This is an unparsed <![CDATA[<Element/>]]> in some text</root>

If the above XML was 'loaded' by JDOM then the result would be a 'root' Element with three child Content items, a Text, a CDATA and another Text. But, remember, that CDATA is also Text!

The getText() method is a massive simplification of XML Text processing, and it is only appropriate to use in limited ways. In this simplified example with the CDATA, if we were to run root.getText() we would get "This is an unparsed <Element/> in some text".
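To make the Text/CDATA/Text layout concrete, the sketch below builds the same content by hand instead of parsing it, then compares the number of Content items with what getText() returns. The class name MixedTextSketch is an assumption for this example.

```java
import org.jdom2.CDATA;
import org.jdom2.Element;
import org.jdom2.Text;

final class MixedTextSketch {

    // Builds a 'root' Element holding a Text, a CDATA and another Text.
    static Element buildRoot() {
        Element root = new Element("root");
        root.addContent(new Text("This is an unparsed "));
        root.addContent(new CDATA("<Element/>"));
        root.addContent(new Text(" in some text"));
        return root;
    }

    public static void main(String[] args) {
        Element root = buildRoot();
        // Three separate Content items...
        System.out.println("items: " + root.getContentSize());
        // ...but getText() joins them into a single String.
        System.out.println("text:  " + root.getText());
    }
}
```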

What if we wanted to change the 'echo' message to something else?

Element root = antbuild.getRootElement();
Element target = root.getChild("target"); // gets the first 'target'
Element echo = target.getChild("echo");
String message = echo.getText();
System.out.println("echo has message: " + message);
echo.setText("Compile Complete!");

Note that getText() gets all the character content for an Element, but not that Element's child elements. This is different to the concept of Element.getValue() which merges the Element's text with all of the child Elements' Text values recursively.

Thus, there are essentially three ways to get the Text content from an Element:

  1. Using getContent(*) and filtering for the Text items.
  2. Using getText() and its variants (Trim, Normalize - which format the text) for the Element.
  3. Using getValue() which recursively scans the Element and its child Elements and concatenates the Text items.
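The three approaches can be compared side by side on a small mixed-content Element. This is only a sketch; the class name TextAccessSketch and the sample content are made up for the comparison.

```java
import java.util.List;
import org.jdom2.Element;
import org.jdom2.Text;
import org.jdom2.filter.Filters;

final class TextAccessSketch {

    // Builds a 'p' Element holding: "  Hello ", a 'b' child Element with
    // the text "big", and then "  world ".
    static Element sample() {
        Element p = new Element("p");
        p.addContent("  Hello ");
        p.addContent(new Element("b").setText("big"));
        p.addContent("  world ");
        return p;
    }

    public static void main(String[] args) {
        Element p = sample();

        // 1. Filter the content list for Text items (CDATA included).
        List<Text> pieces = p.getContent(Filters.text());
        System.out.println("text items: " + pieces.size());

        // 2. getText() and its variants look only at this Element's own text.
        System.out.println("[" + p.getText() + "]");
        System.out.println("[" + p.getTextTrim() + "]");
        System.out.println("[" + p.getTextNormalize() + "]");

        // 3. getValue() also descends into child Elements.
        System.out.println("[" + p.getValue() + "]");
    }
}
```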

Setting String Content

Text content can be added the regular way by calling the addContent(Text) method. JDOM provides a shortcut method for this, the addContent(String) method.

If you want to wipe out the complete contents for an Element, and replace it all with a single Text item, you can use the Element.setText(String) method.
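The difference between appending with addContent(String) and replacing with setText(String) can be sketched as follows. The class name SetTextSketch and the sample content are illustrative assumptions.

```java
import org.jdom2.Element;

final class SetTextSketch {

    // Builds an 'echo' Element with mixed content: text, a child Element, text.
    static Element mixedEcho() {
        Element echo = new Element("echo");
        echo.addContent("Build ");                            // appends a Text item
        echo.addContent(new Element("b").setText("nearly"));  // appends a child Element
        echo.addContent(" Complete!");                        // appends another Text item
        return echo;
    }

    // Replaces everything inside 'echo' with a single Text item.
    static Element replaced() {
        Element echo = mixedEcho();
        echo.setText("Compile Complete!");
        return echo;
    }

    public static void main(String[] args) {
        System.out.println("items before setText: " + mixedEcho().getContentSize());
        Element echo = replaced();
        System.out.println("items after setText:  " + echo.getContentSize());
        System.out.println("text after setText:   " + echo.getText());
    }
}
```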

XPath

We could do huge amounts of JDOM access using XPath queries. A previous build.xml example illustrated reporting all the existing target items in the build.xml file. We could do the same thing using XPath queries:

XPathFactory xpfac = XPathFactory.instance();
XPathExpression<Attribute> xp = xpfac.compile("//target/@name", Filters.attribute());
for (Attribute att : xp.evaluate(antbuild)) {
  System.out.println("We have target " + att.getValue());
}

Input

You have a file you want the XML from. You just want to load up the XML without validating the content.

  File file = new File("path/to/file.xml");
  SAXBuilder sax = new SAXBuilder();
  Document doc = sax.build(file);

The SAXBuilder has a build(String) method which assumes that the String value is a URL reference. Since file paths are also valid URIs, the above example could thus be simplified to:

  SAXBuilder sax = new SAXBuilder();
  Document doc = sax.build("path/to/file.xml");

What if the source XML is in a web-based location?

  // parse the JDOM build.xml file
  SAXBuilder sax = new SAXBuilder();
  URL url = new URL(
        "https://raw.github.com/hunterhacker/jdom/master/build.xml");
  Document doc = sax.build(url);

What if you have a String that contains XML (instead of being a reference to some XML)?

  String myxml = "<root>mytext</root>";
  SAXBuilder sax = new SAXBuilder();
  Document doc = sax.build(new StringReader(myxml));
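The snippets above leave out exception handling: the build(...) methods declare JDOMException and IOException. One way a complete parse-from-String helper might look is sketched below. The class name ParseStringSketch, the helper parseOrNull(...) and the choice to return null for bad input are assumptions made for illustration, not a JDOM convention.

```java
import java.io.IOException;
import java.io.StringReader;
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;

final class ParseStringSketch {

    // Hypothetical helper: parses XML held in a String and returns null
    // when the text cannot be parsed as XML.
    static Document parseOrNull(String xml) {
        try {
            return new SAXBuilder().build(new StringReader(xml));
        } catch (JDOMException e) {
            // Raised for XML problems, for example input that is not well-formed.
            return null;
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseOrNull("<root>mytext</root>");
        System.out.println("root text: " + doc.getRootElement().getText());
        System.out.println("bad input parsed: " + (parseOrNull("<root>") != null));
    }
}
```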

What about Validation?

If the XML has an Embedded DocType reference you want to validate against, then specify that the SAXBuilder should validate using the DTD.

  File file = new File("path/to/file.xml");
  SAXBuilder sax = new SAXBuilder(XMLReaders.DTDVALIDATING);
  Document doc = sax.build(file);

If the document has XSD Schema validating specifications then you can enable the 'simple' Schema validating code:

  File file = new File("path/to/file.xml");
  SAXBuilder sax = new SAXBuilder(XMLReaders.XSDVALIDATING);
  Document doc = sax.build(file);

Output

JDOM supports output to a number of different targets. JDOM content can be output in a 'text' document to OutputStreams and Writers. Additionally, it can be converted to DOM nodes, and similarly it can be used as a source of SAX and StAX events (streams).

Not only do you have a large number of output formats, but you can control what the XML content on that output will look like.

You choose the type of output by selecting the appropriate 'Outputter', and you choose what the output should look like by configuring an appropriate Format.

For example, if you have the input XML (in a String object) <root><child>kid</child></root>

First, we parse that input to a JDOM Document:

  String myxml = "<root><child>kid</child></root>";
  SAXBuilder sb = new SAXBuilder();
  Document doc = sb.build(new StringReader(myxml));

Now, we output that XML to the screen:

  XMLOutputter xout = new XMLOutputter();
  xout.output(doc, System.out);

and we get:

<?xml version="1.0" encoding="UTF-8"?>
<root><child>kid</child></root>

If we want to change the indenting of that document, we can use a 'pretty' format.

  XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
  xout.output(doc, System.out);

and we get:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child>kid</child>
</root>
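A Format can also be adjusted before it is handed to the outputter, and XMLOutputter can write to a String instead of a stream. The sketch below assumes we want four-space indenting, no XML declaration and "\n" line endings; those choices and the class name FormatSketch are illustrative only.

```java
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

final class FormatSketch {

    // Renders the document with a customised 'pretty' Format.
    static String render(Document doc) {
        Format format = Format.getPrettyFormat()
                .setIndent("    ")          // four spaces per nesting level
                .setOmitDeclaration(true)   // leave out the XML declaration
                .setLineSeparator("\n");    // fixed line endings
        return new XMLOutputter(format).outputString(doc);
    }

    // Builds the same small document used in the output examples.
    static Document sample() {
        Element root = new Element("root");
        root.addContent(new Element("child").setText("kid"));
        return new Document(root);
    }

    public static void main(String[] args) {
        System.out.print(render(sample()));
    }
}
```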

What if we wanted to convert our JDOM Document to a DOM Document?

  DOMOutputter dout = new DOMOutputter();
  dout.setFormat(Format.getPrettyFormat());
  org.w3c.dom.Document domdoc = dout.output(doc);