
Wikicompiler is a fully extensible Python library that compiles and evaluates text from Wikipedia dumps. You can extract text, do text analysis, or even evaluate the AST (Abstract Syntax Tree) yourself.


iwasingh/Wikicompiler


Wikicompiler

The Wikipedia compiler



Wikicompiler (a.k.a. WCC) is a fully extensible library that helps you compile Wikitext. You can use it for text analysis, text extraction, document preprocessing, and so on. Internally, the library implements a recursive descent parser that parses and evaluates the wikicode; you can customize this process if you want. Check out the examples.

Requirements

  • Python >= 3.6

Basic Usage

Extract clean text:

 from wikicompiler import compiler as c
 
 text = "==Hello World=="  # any Wikitext string
 wcc = c.Compiler()
 wcc.compile(text)
 

You can listen for specific events emitted by the compiler. Let's say you want to grab all the links from a page:

 links = []
 wcc.on(lambda node: links.append(node), c.ParseTypes.LINK) 

Done! Check out the Examples section for more information.
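
The listener mechanism above can be pictured with a small, self-contained sketch. This is not WCC's implementation: `NodeType` and `Emitter` are hypothetical stand-ins for `c.ParseTypes` and the compiler, and only the `on(callback, type)` argument order is taken from the snippet above.

```python
from collections import defaultdict
from enum import Enum, auto


class NodeType(Enum):
    # Hypothetical stand-in for c.ParseTypes
    LINK = auto()
    TEXT = auto()


class Emitter:
    """Toy event hub: callbacks are registered per node type."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, callback, node_type):
        # Same argument order as wcc.on(callback, type) shown above
        self._listeners[node_type].append(callback)

    def emit(self, node_type, node):
        # Call every listener registered for this node type
        for callback in self._listeners[node_type]:
            callback(node)


emitter = Emitter()
links = []
emitter.on(links.append, NodeType.LINK)

# Simulate a compiler visiting three nodes
emitter.emit(NodeType.TEXT, "Hello ")
emitter.emit(NodeType.LINK, "Python (programming language)")
emitter.emit(NodeType.LINK, "Wikipedia")
```

Only the two link nodes end up in `links`; the text node is ignored because nothing is registered for its type.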

AST

If you want the AST instead, use the parser directly:

 from wikicompiler.parser import parser as p
 
 text="==Hello World=="
 parser = p.Parser()
 ast = parser.parse(text)

Then you can visit that AST and write your own evaluator.
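
As a rough idea of what a custom evaluator might look like, here is a minimal depth-first walk. The `Node` class, its fields, and the node kinds (`heading`, `template`, `text`) are assumptions made for this sketch; WCC's real node classes may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical node shape, not WCC's actual class
    kind: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node) -> str:
    """Depth-first walk that renders headings and text and skips templates."""
    if node.kind == "template":
        return ""
    if node.kind == "text":
        return node.value
    inner = "".join(evaluate(child) for child in node.children)
    if node.kind == "heading":
        # Render headings in upper case on their own line
        return inner.strip().upper() + "\n"
    return inner


ast = Node("root", children=[
    Node("heading", children=[Node("text", "Hello World")]),
    Node("template", children=[Node("text", "infobox")]),
    Node("text", "Body text."),
])
rendered = evaluate(ast)
```

Each node kind gets its own branch, so supporting a new kind means adding one more case to `evaluate`.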

Grammar

You can pass your own grammar to the parser and evaluate the AST yourself. You can also use combinators to write your own rules; check out the Grammar and the combinators.

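 class MyGrammar:
     # This is important! The parser treats this as the starting symbol
     def expression(self):
         # Must return a function that accepts a parser.
         # seq/expect come from the library's combinators; Heading2 is assumed
         # to be one of its symbols (imports omitted here)
         return seq(expect(Heading2.start), self.mytext, expect(Heading2.end))

     def mytext(self):
         return p.Node(p.TextP('My static node'))


 # expression is an instance method, so instantiate the grammar first
 parser = p.Parser(grammar=MyGrammar().expression())
 parser.parse(text)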
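
To show the idea behind combinators such as `seq` and `expect`, here is a self-contained sketch. These are hypothetical re-implementations written for illustration, not the library's own functions, and `TokenStream` is an assumed stand-in for the parser state.

```python
class TokenStream:
    """Minimal parser state: a token list plus a cursor."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0


def expect(token):
    """Return a rule that consumes `token` or fails with None."""
    def rule(stream):
        if stream.pos < len(stream.tokens) and stream.tokens[stream.pos] == token:
            stream.pos += 1
            return token
        return None
    return rule


def seq(*rules):
    """Return a rule that succeeds only if every sub-rule succeeds in order."""
    def rule(stream):
        start = stream.pos
        results = []
        for sub_rule in rules:
            result = sub_rule(stream)
            if result is None:
                stream.pos = start  # backtrack on failure
                return None
            results.append(result)
        return results
    return rule


heading = seq(expect("=="), expect("Hello World"), expect("=="))
matched = heading(TokenStream(["==", "Hello World", "=="]))

failed_stream = TokenStream(["==", "Hello World"])
failed = heading(failed_stream)
```

Every rule is a function that takes the parser state, which is why a grammar method "must return a function that accepts a parser". On failure, `seq` restores the cursor so another rule can be tried from the same position.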

Lexer

Check out the symbol definitions (https://github.com/iwasingh/Wikicompiler/blob/master/lexer/symbols.py) and the lexer's symbol table. WCC ships with some basic symbols. You can extend the symbol table, but then you have to change the grammar too. First, define a symbol (tag):

from wikicompiler.lexer import lexer as l
from wikicompiler.lexer import symbols as s


class MyCustomTag(s.Tag):
    start = s.Token('LINK_START', r'\[\[')
    end = s.Token('LINK_END', r'\]\]')

    def __init__(self):
        super().__init__(MyCustomTag.start, MyCustomTag.end)


# And then define the symbol in the table
l.definition(l.Symbol.RESERVED)(MyCustomTag)


# Or, if you need to do other things when matching the token
@l.definition(l.Symbol.RESERVED)
class MyLexCustomTag(MyCustomTag):
    def __init__(self):
        super().__init__()

    def match(self, text, pos, **kwargs):
        # Do something
        # Must return (Match, Token)
        raise NotImplementedError  # placeholder: replace with your matching logic
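
The `(Match, Token)` contract mentioned above can be illustrated with a small regex-based lexer. This is a sketch under assumptions: the `Token` class and `tokenize` function below are hypothetical and only mimic the idea of a token table, they are not WCC's lexer.

```python
import re


class Token:
    """Hypothetical token: a name plus a compiled regular expression."""

    def __init__(self, name, pattern):
        self.name = name
        self.regex = re.compile(pattern)

    def match(self, text, pos):
        # Pattern.match(text, pos) only matches starting exactly at `pos`
        found = self.regex.match(text, pos)
        return (found, self) if found else None


def tokenize(text, tokens):
    """Scan left to right, trying each token in table order."""
    pos, out = 0, []
    while pos < len(text):
        for token in tokens:
            result = token.match(text, pos)
            if result:
                found, tok = result
                out.append((tok.name, found.group()))
                pos = found.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return out


table = [
    Token("LINK_START", r"\[\["),
    Token("LINK_END", r"\]\]"),
    Token("TEXT", r"[^\[\]]+"),
]
tokens = tokenize("see [[Python]] docs", table)
```

Table order matters here: reserved symbols are tried before the catch-all `TEXT` token, so `[[` is never swallowed as plain text.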

Examples

Under tests you can find some examples that are used to test everything out. If you want to see what the outputs look like, the examples directory contains:

  • data: input data for the scripts
  • outputs: the parsed data

In particular, I made 2 examples.

Notes

Wikitext cannot be fully parsed by a context-free parser (the kind implemented in this library) because Wikitext is based on a context-sensitive grammar. For most use cases, though, the grammar rules can be simplified with little information loss. For example, templates are simplified and ignored by default during evaluation, but the template node is still produced, which means you can parse the template (the text inside it) in your own way when you reach that node. If you really need a result close to the actual Wikipedia page rendered in the browser, check the MediaWiki specification or use their API directly, which gives you the properly rendered page.
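
As an illustration of the kind of simplification described above, the following sketch drops templates, including nested ones, and keeps the surrounding text. It is a standalone helper written for this example (`strip_templates` is a hypothetical name), not the way WCC's evaluator handles templates.

```python
def strip_templates(wikitext: str) -> str:
    """Drop every {{...}} span, including nested ones, keeping other text."""
    out = []
    depth = 0  # how many templates are currently open
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep characters only when outside every template
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


sample = "Rome {{Infobox|name={{lang|it|Roma}}}}is the capital of Italy."
cleaned = strip_templates(sample)
```

The information inside the template is lost, which is the trade-off the note above describes: the output is clean text, not a faithful rendering of the page.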

License

MIT
