Troubleshooting
Sometimes mwdumper raises an error like the following:

```
... 376,000 pages (14,460.426/sec), 376,000 revs (14,460.426/sec)
377,000 pages (14,458.848/sec), 377,000 revs (14,458.848/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
	at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
make: *** [/scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.sql] Error 1
```
In this case it is the 20th dump part. Despite appearances, it is probably not an encoding issue; utf8thread.c would have caught that. To find out which article caused the error, try dumping XML to XML with the command:

```
java -jar /scratch/cperivol/wikipedia-mirror/tools/mwdumper.jar --format=xml /scratch/cperivol/wikipedia-mirror/drafts/wikipedia-parts/enwiki-20131202-pages-articles20.xml-p011125004p013324998.fix.xml > /tmp/just-a-copy.xml
```
Then look at the end of /tmp/just-a-copy.xml for the last article:

```
$ tac /tmp/just-a-copy.xml | grep "<title>" -m 1
```

This shows the last article that was processed, i.e. the point where things went south. In xml-parse.sh you will find a couple of useful tools for dealing with it. Most of the time you just want to run:
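To get just the article name out of that last `<title>` line (handy if you want to feed it to the next step), a small helper along these lines works; `last_title` is not part of the repo, just a sketch:

```shell
# Hypothetical helper (not part of mwdumper or this repo): print the title
# of the last article that made it into the copied dump before the crash.
last_title() {
  # tac reverses the file so `grep -m 1` stops at the *last* <title>,
  # then sed strips the surrounding XML tags.
  tac "$1" | grep -m 1 "<title>" | sed -e 's/.*<title>//' -e 's#</title>.*##'
}

# Tiny stand-in for /tmp/just-a-copy.xml, just to demonstrate the helper.
printf '<page><title>First article</title></page>\n<page><title>Last article</title></page>\n' > /tmp/sample.xml
last_title /tmp/sample.xml   # prints: Last article
```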
```
$ data/xml-parse.sh Cranopsis\ bocourti > /tmp/striped-dump.xml
```

This writes the entire XML from $ORIGINAL_XML to /tmp/striped-dump.xml, omitting the article named Cranopsis bocourti.
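xml-parse.sh itself remains the authoritative tool; purely as an illustration of the kind of filtering it performs, a page-stripping filter might look like the following. This is a sketch under the assumption that `<page>` and `</page>` appear on their own lines, as they do in MediaWiki dump XML:

```shell
# Hypothetical sketch only: buffer each <page>...</page> element and drop
# the one whose <title> matches the first argument.
strip_page() {
  awk -v title="<title>$1</title>" '
    /<page>/             { inpage = 1; buf = ""; skip = 0 }
    inpage               { buf = buf $0 ORS
                           if (index($0, title)) skip = 1 }
    /<\/page>/ && inpage { if (!skip) printf "%s", buf
                           inpage = 0; next }
    !inpage              { print }
  ' "$2"
}

# Demonstration on a tiny stand-in dump.
cat > /tmp/mini-dump.xml <<'EOF'
<mediawiki>
<page>
<title>Keep me</title>
</page>
<page>
<title>Cranopsis bocourti</title>
</page>
</mediawiki>
EOF
strip_page "Cranopsis bocourti" /tmp/mini-dump.xml > /tmp/mini-stripped.xml
```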
This can all be automated quite straightforwardly (and actually is, in some branch, I think), but not knowing exactly what I am fixing, I feel more comfortable doing it by hand. Please contact me or open an issue if you run into this and want it automated.
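For reference, the shape such automation could take is sketched below. This is only an illustration of the retry loop: `convert` here is a stand-in that fails on a "poison" article, where the real loop would invoke mwdumper (`java -jar mwdumper.jar --format=xml "$DUMP"`) and strip the culprit with data/xml-parse.sh:

```shell
# Stand-in dump with one article that makes the (simulated) converter choke.
printf '<page><title>Good</title></page>\n<page><title>Poison</title></page>\n' > /tmp/dump.xml

# Stand-in for the mwdumper invocation: copies pages until it hits the
# poison article, then fails, mimicking the mid-dump crash.
convert() {
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case "$line" in *"<title>Poison"*) return 1 ;; esac
  done < "$1"
}

until convert /tmp/dump.xml > /tmp/copy.xml; do
  # The last <title> that made it into the copy is the suspect article.
  bad=$(tac /tmp/copy.xml | grep -m 1 "<title>" | sed -e 's/.*<title>//' -e 's#</title>.*##')
  # Crude one-line-per-page strip; the real workflow uses data/xml-parse.sh.
  grep -v "<title>$bad</title>" /tmp/dump.xml > /tmp/dump.stripped.xml
  mv /tmp/dump.stripped.xml /tmp/dump.xml
done
```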
For more details on how I approached this issue, and for a couple of potential alternative solutions, take a look at issue 3. Most notably, I tried substituting the article with spaces in place (see page_remover.c), but strangely enough that didn't work.