The Wikipedia compiler
Wikicompiler (a.k.a. WCC) is a fully extensible library that helps you compile Wikitext: you can use it for text analysis, text extraction, document preprocessing and so on. Under the hood it implements a recursive descent parser that parses and evaluates the wikicode (you can customize this process if you want). Check out the examples.
Requirements:
- Python >= 3.6
Extract clean text:
```python
from wikicompiler import compiler as c

text = "==Hello World=="  # any Wikitext source
wcc = c.Compiler()
wcc.compile(text)
```
You can listen for specific events emitted by the compiler. Let's say you want to grab all the links from a page:
```python
links = []
wcc.on(lambda node: links.append(node), c.ParseTypes.LINK)
```
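Putting the pieces together, a full run could look like the sketch below. The input is an arbitrary Wikitext snippet, and the exact shape of the collected link nodes depends on the parser's node classes, so inspect them before relying on any attribute:

```python
from wikicompiler import compiler as c

wcc = c.Compiler()
links = []
# Collect every link node the compiler encounters
wcc.on(lambda node: links.append(node), c.ParseTypes.LINK)

wcc.compile("[[Category:Programming]] some text [[Python (language)|Python]]")
print(links)  # the collected link nodes
```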
Done! Check out the Examples section for more info.
If you want the AST instead, you can do it the following way:
```python
from wikicompiler.parser import parser as p

text = "==Hello World=="
parser = p.Parser()
ast = parser.parse(text)
```
Then you can visit that AST and write your own evaluator.
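For example, a depth-first walk over the tree might look like this. This is a minimal sketch: it assumes each AST node exposes its children via a `children` attribute, which you should verify against the parser's node classes:

```python
def visit(node, depth=0):
    # Print the node, then recurse into its children.
    # The `children` attribute is an assumption; adapt it to the
    # actual node interface exposed by wikicompiler's parser.
    print('  ' * depth, node)
    for child in getattr(node, 'children', []):
        visit(child, depth + 1)

visit(ast)
```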
You can also pass your own grammar to the parser and evaluate the AST yourself. Furthermore, you can use combinators to write your own rules; check out the Grammar and the combinators.
```python
# seq, expect and Heading2 are provided by wikicompiler
# (see the Grammar and the combinators mentioned above)

class MyGrammar:
    # This is important! The parser will treat this as the starting symbol
    def expression(self):
        # Must return a function that accepts a parser
        return seq(expect(Heading2.start), self.mytext, expect(Heading2.end))

    def mytext(self):
        return p.Node(p.TextP('My static node'))

parser = p.Parser(grammar=MyGrammar().expression())
parser.parse(text)
```
You should check out the symbol definitions (https://github.com/iwasingh/Wikicompiler/blob/master/lexer/symbols.py) and the lexer symbols. WCC ships with some basic symbols; you can extend the symbol table, but you will obviously have to change the grammar too. First you have to define a symbol (tag):
```python
from wikicompiler.lexer import lexer as l
from wikicompiler.lexer import symbols as s

class MyCustomTag(s.Tag):
    start = s.Token('LINK_START', r'\[\[')
    end = s.Token('LINK_END', r']]')

    def __init__(self):
        super().__init__(MyCustomTag.start, MyCustomTag.end)

# And then define the symbol in the table
l.definition(l.Symbol.RESERVED)(MyCustomTag)

# Or, if you need to do other things when matching the token:
@l.definition(l.Symbol.RESERVED)
class MyLexCustomTag(MyCustomTag):
    def __init__(self):
        super().__init__()

    def match(self, text, pos, **kwargs):
        # Do something
        # Must return (Match, Token)
        ...
```
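The README only says that `match` must return a `(Match, Token)` pair, so here is a hedged sketch of what a custom match could look like. It assumes the `Match` is a `re.Match`-like object for the text at `pos` and that `(None, None)` signals no match; verify both against the lexer module before using it:

```python
import re

@l.definition(l.Symbol.RESERVED)
class MyRegexCustomTag(MyCustomTag):
    # Hypothetical pattern; reuses the start token's regex
    PATTERN = re.compile(r'\[\[')

    def match(self, text, pos, **kwargs):
        # Assumption: return the regex match plus the token it matched,
        # or (None, None) when nothing matches at this position.
        m = self.PATTERN.match(text, pos)
        if m:
            return m, MyCustomTag.start
        return None, None
```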
Under tests you can find some examples that are used to test everything out. If you want to see what the outputs look like, check the examples directory; in particular there are two examples:
- Metadata extraction (links and categories): input | output | code
- Text extraction: input | output | code
Wikitext cannot be fully handled by a context-free parser (the one implemented in this library) because Wikitext is based on a context-sensitive grammar. For most use cases, however, the grammar rules can be simplified with a little information loss. For example, templates are simplified and ignored by default during evaluation, but their nodes are still produced, which means you can parse a template's inner text in your own manner when you reach its node. If you really need a result close to the actual Wikipedia page as rendered in the browser, you should check out the MediaWiki specification or use their API directly, which gives you the page properly rendered.
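For instance, if the compiler emits an event for template nodes, you could intercept them the same way as links and process their inner text yourself. `ParseTypes.TEMPLATE` below is an assumed event name; check `c.ParseTypes` for the events the compiler actually emits:

```python
from wikicompiler import compiler as c

templates = []
wcc = c.Compiler()
# Assumption: TEMPLATE is a member of c.ParseTypes, like LINK above.
wcc.on(lambda node: templates.append(node), c.ParseTypes.TEMPLATE)
wcc.compile("{{Infobox language | name = Python }}")

for node in templates:
    ...  # parse the template's inner text in your own manner
```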