Parsing quality
As of version 0.1.0, Infoboxer implements most of Wikipedia markup. Most of the markup that is not implemented is rarely used and would still produce valid and reasonable output (for example, the <source> tag is treated as a "usual" HTML tag, which is nearly OK in most cases), and in any case it will be implemented in an upcoming version.
I've tried the parser on several complex pages and everything was fine. In the future we may need a large and diverse "test dataset" of complicated pages. For now, if you encounter any bugs or inconsistencies, just show them to me.
Still, there may be failures on:
- pages that use <source> tags extensively (with programming language code) -- Infoboxer will try to parse the contents of the tag, and the results are unpredictable;
- template definitions (i.e. pages like Template:Episode list) -- most of the features used there are not implemented.
Also, really complicated embedded HTML may produce strange results, but such HTML is not seen on a typical Wikipedia page.
Note that Infoboxer's main target is information extraction from existing wiki(pedia) articles. Therefore, it never tries to "parse the markup anyway": if the markup is seriously inconsistent, a ParseError is raised.
In my opinion, this is the most reasonable behavior: if a page has badly broken markup, it is definitely better not to try to extract information from it at all.
This solution, though, has some drawbacks:
- you can hardly use Infoboxer as a backend for your own MediaWiki-like software: unlike the Wikipedia editor, it will not try to "show somehow" the strangest markup, it will just raise an error;
- it may not be very useful for some small and marginal MediaWiki-based wikis, where nobody monitors and validates markup quality.
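
For batch extraction, the fail-fast behavior described above usually means wrapping each parse in a rescue and skipping the pages that fail. The sketch below only illustrates that pattern and does not use Infoboxer itself: the ParseError class here is a local stand-in (the real class name and namespace in Infoboxer may differ), and strict_parse and parse_all are hypothetical helpers. The toy parser checks nothing but balanced template braces, which is an assumption made for the example, not a description of how Infoboxer detects inconsistent markup.

```ruby
# Stand-in for the error a strict parser raises on broken markup.
# Assumption: the real Infoboxer class may be named or namespaced differently.
class ParseError < StandardError; end

# Toy strict parser (NOT Infoboxer's algorithm): it only verifies that
# template braces are balanced and raises instead of guessing.
def strict_parse(markup)
  depth = 0
  markup.scan(/\{\{|\}\}/) do |token|
    depth += token == '{{' ? 1 : -1
    raise ParseError, 'unexpected }}' if depth < 0
  end
  raise ParseError, 'unclosed {{' unless depth.zero?
  markup
end

# Batch extraction: keep the pages that parse, record the ones that do not.
def parse_all(pages)
  parsed = {}
  failed = {}
  pages.each do |title, markup|
    begin
      parsed[title] = strict_parse(markup)
    rescue ParseError => e
      failed[title] = e.message
    end
  end
  [parsed, failed]
end

pages = {
  'Good page'   => '{{Infobox country|name=Argentina}} Some text.',
  'Broken page' => '{{Infobox country|name=Argentina Some text.'
}

parsed, failed = parse_all(pages)
puts "parsed: #{parsed.keys.inspect}"
puts "failed: #{failed.inspect}"
```

Collecting the failures instead of silently dropping them keeps a record of which pages had markup broken enough to be rejected, so they can be reviewed or reported later.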