
MWDumper UTF-8 ArrayIndexOutOfBounds Exception

Sometimes mwdumper raises an error like the following:

...

376,000 pages (14,460.426/sec), 376,000 revs (14,460.426/sec)
377,000 pages (14,458.848/sec), 377,000 revs (14,458.848/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
make: *** [/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.sql] Error 1

In this case it is the 20th part of the dump. It is probably not actually an encoding issue; after all, utf8thread.c would have caught that. To find out which article caused the error, try dumping XML to XML with the command

java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar   --format=xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml > /tmp/just-a-copy.xml

Then look at the end of /tmp/just-a-copy.xml for the last article:

$ tac /tmp/just-a-copy.xml | grep "<title>" -m 1

This should show you the last article processed (the point where things went south). In xml-parse.sh you will find a couple of useful tools for dealing with it. Most of the time you just want to run:

$  data/xml-parse.sh Cranopsis\ bocourti > /tmp/striped-dump.xml

This will write the entire XML from $ORIGINAL_XML, omitting the article named Cranopsis bocourti, to /tmp/striped-dump.xml.
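
If you are curious what the stripping amounts to, the sketch below drops a single <page> block by its <title>, assuming the usual <page>...</page> layout of MediaWiki dumps. It is only an illustration of the idea; the actual implementation in xml-parse.sh may differ.

# drop the <page> block whose <title> matches, keep everything else line-for-line
awk -v title="Cranopsis bocourti" '
    index($0, "<page>")  { inpage = 1; buf = "" }
    inpage               { buf = buf $0 "\n"
                           if (index($0, "</page>")) {
                               if (!index(buf, "<title>" title "</title>")) printf "%s", buf
                               inpage = 0
                           }
                           next }
    { print }
' "$ORIGINAL_XML" > /tmp/striped-dump.xml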

This can all be quite straightforwardly automated (and I think it actually is in some branch), but since I don't know exactly what I am fixing, I feel more comfortable doing it by hand. Please contact me or open an issue if you run into this and want it automated.
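
For reference, automating it would essentially just chain the steps above in a loop, along the lines of the untested sketch below. The mwdumper.jar path and the dump filename are placeholders, and it assumes xml-parse.sh picks up the dump to strip from $ORIGINAL_XML as described above.

# work on a copy so the original part stays untouched (placeholder filenames)
cp enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml /tmp/working.xml
export ORIGINAL_XML=/tmp/working.xml
while ! java -jar mwdumper.jar --format=xml "$ORIGINAL_XML" > /tmp/just-a-copy.xml; do
    # last article that made it through before the crash
    bad=$(tac /tmp/just-a-copy.xml | grep -m 1 "<title>" \
          | sed 's/.*<title>\(.*\)<\/title>.*/\1/')
    echo "stripping offending article: $bad"
    data/xml-parse.sh "$bad" > /tmp/striped-dump.xml
    mv /tmp/striped-dump.xml "$ORIGINAL_XML"    # next round strips the already-stripped copy
done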

Alternative approaches

For more details on how I approached this issue, and for a couple of potential alternative solutions, take a look at issue 3. Most notably, I tried substituting the article in place with spaces (see page_remover.c), but strangely enough that didn't work.
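
For the record, the in-place idea is simply to overwrite the offending <page>...</page> block with spaces so that none of the byte offsets in the file change. page_remover.c does this in C; the line below expresses the same idea with standard tools, where START and LEN are hypothetical variables holding the byte offset and length of the block.

# overwrite LEN bytes at offset START with spaces, without changing the file size
head -c "$LEN" /dev/zero | tr '\0' ' ' \
    | dd of="$ORIGINAL_XML" bs=1 seek="$START" conv=notrunc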