This is Boris Johnson from UK! 🇬🇧 🇺🇦

Instagram: @borisjohnsonuk

And this is a PDF to Markdown converter based on PyMuPDF.

Quick start

$ make build
$ make linux_example
$ make macos_example
$ make windows_example # not tested yet

Features

Boris can convert your PDF file to Markdown.
It supports images, code blocks, blockquotes, titles, and all other kinds of font flags! (Unknown font flags are easy to add manually.)
Text is now extracted carefully!
It works great inside a Docker image. Outside Docker, the fitz lib did not work on my M1 Mac. Don't f*ckn know why :(
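
A minimal sketch of what decoding font flags can look like, assuming "font flags" here means the integer `flags` bit field that PyMuPDF attaches to each text span. The bit values follow the PyMuPDF documentation as I understand it; the `FLAG_NAMES` table and `decode_flags` helper are hypothetical names for illustration and are not taken from this repository.

```python
# Bit values of a PyMuPDF span's "flags" field (assumed from the PyMuPDF docs).
FLAG_NAMES = {
    1: "superscript",
    2: "italic",
    4: "serifed",
    8: "monospaced",
    16: "bold",
}


def decode_flags(flags: int) -> list:
    """Return the names of all known bits that are set in `flags`."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


print(decode_flags(20))  # serifed + bold
print(decode_flags(8))   # monospaced, a plausible hint for a code block
```

Supporting a flag that is not in the table yet would then mean adding one more entry to `FLAG_NAMES`.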

More

The first thing Boris Johnson does is open your PDF file via a Document reference. Then he walks through each Page reference; each one represents a single page of the PDF file.
Then, iterating over the pages, Boris calls .extractJSON for each page. This expands the byte representation of the page into JSON dictionaries that hold all the data on that page.
This code can be tracked here
Then Boris processes each Block in its original order.
Here Boris does the initial parsing and sorts each block by its type, picture or text, so that the source blocks are easier to convert into Markdown later.
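
The block sorting step can be sketched roughly like this. The sample string below is hand-written and only imitates the shape of PyMuPDF's JSON text output (real pages carry more keys, such as bounding boxes); the type codes 0 for text and 1 for image are PyMuPDF's, while `classify_blocks` is a hypothetical helper and not the function used in this repository.

```python
from __future__ import annotations

import json

# Hand-written sample, shaped like one page of PyMuPDF JSON output (assumption).
SAMPLE_PAGE_JSON = """
{
  "width": 595, "height": 842,
  "blocks": [
    {"type": 0, "lines": [{"spans": [
      {"font": "Helvetica-Bold", "size": 18.0, "flags": 16, "text": "Title"}]}]},
    {"type": 1, "ext": "png", "width": 64, "height": 64},
    {"type": 0, "lines": [{"spans": [
      {"font": "Courier", "size": 10.0, "flags": 8, "text": "print(1)"}]}]}
  ]
}
"""

IMAGE_BLOCK = 1  # PyMuPDF block type code for images; 0 means text


def classify_blocks(page_json: str) -> list[tuple[str, dict]]:
    """Return (kind, block) pairs, keeping the page's original block order."""
    page = json.loads(page_json)
    result = []
    for block in page["blocks"]:
        kind = "image" if block["type"] == IMAGE_BLOCK else "text"
        result.append((kind, block))
    return result


kinds = [kind for kind, _ in classify_blocks(SAMPLE_PAGE_JSON)]
print(kinds)  # ['text', 'image', 'text']
```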
Then Boris binds the result into a doubly linked list.
In what follows, we treat each Page as a doubly linked list and each piece of PDF content on that page as a node of the list. It's simple.
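
A minimal sketch of that structure, assuming each node carries its text, font name, and font size. The names `Node` and `PageList` are hypothetical and chosen for illustration; the classes in this repository may look different.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through neighbors
class Node:
    text: str
    font: str
    size: float
    prev: Optional[Node] = field(default=None, repr=False)
    next: Optional[Node] = field(default=None, repr=False)


class PageList:
    """One page: a doubly linked list of content nodes."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None

    def append(self, text: str, font: str, size: float) -> Node:
        node = Node(text, font, size, prev=self.tail)
        if self.tail is None:
            self.head = node        # first node on the page
        else:
            self.tail.next = node   # link the old tail forward
        self.tail = node
        return node

    def __iter__(self) -> Iterator[Node]:
        node = self.head
        while node is not None:
            yield node
            node = node.next


page = PageList()
page.append("Title", "Helvetica-Bold", 18.0)
page.append("Body", "Times-Roman", 11.0)
page.append("x = 1", "Courier", 10.0)

middle = page.head.next
print(middle.prev.text, "|", middle.text, "|", middle.next.text)
```

Every node can reach both neighbors directly, which is the property the conversion step relies on.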

This is convenient because later, when transforming the content, we need to know the state of the current node and of its neighbors. That is how we know when it's time to open a Markdown tag and when to close it. Each piece of content carries its own font type and font size (we parse both), and from that data we know which block of text belongs to which tag. We then gradually open the necessary tags, concatenate the lines, and close a tag when the font type of the current node differs from the font type of the previous node. If the fonts are different, the nodes belong to different tags. This is why I introduced the doubly linked list: while going through each node, we can check the font type of the previous and next nodes and decide whether to open or close a tag right now.
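
The open/close rule can be sketched as follows. This is an illustration under assumptions, not the repository's code: the `TAGS` table that maps font names to Markdown markers is invented for the example, and font size is ignored to keep it short.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical mapping from font name to (opening, closing) Markdown markers.
TAGS = {
    "Helvetica-Bold": ("**", "**"),
    "Times-Italic": ("*", "*"),
    "Courier": ("`", "`"),
}


@dataclass(eq=False)
class Node:
    text: str
    font: str
    prev: Optional[Node] = field(default=None, repr=False)
    next: Optional[Node] = field(default=None, repr=False)


def link(spans: list[tuple[str, str]]) -> Optional[Node]:
    """Chain (text, font) pairs into a doubly linked list; return the head."""
    head = prev = None
    for text, font in spans:
        node = Node(text, font, prev=prev)
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head


def render(head: Optional[Node]) -> str:
    out = []
    node = head
    while node is not None:
        open_tag, close_tag = TAGS.get(node.font, ("", ""))
        # Open a tag when the previous node is missing or uses another font.
        if node.prev is None or node.prev.font != node.font:
            out.append(open_tag)
        out.append(node.text)
        # Close it when the next node is missing or uses another font.
        if node.next is None or node.next.font != node.font:
            out.append(close_tag)
        node = node.next
    return "".join(out)


SPANS = [
    ("Hello ", "Times-Roman"),
    ("bold ", "Helvetica-Bold"),
    ("words", "Helvetica-Bold"),
    (" and ", "Times-Roman"),
    ("code", "Courier"),
]
markdown = render(link(SPANS))
print(markdown)  # Hello **bold words** and `code`
```

The two adjacent bold nodes share a single pair of markers because the tag is opened only at the first node of a run and closed only at the last.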

Then we introduce an entity called ContentProcessor, which takes a doubly linked list of content as input and encapsulates all further processing and conversion of the PDF content into its Markdown representation.

codefather-labs/borisjohnsonuk

Folders and files

Latest commit

History

Repository files navigation

This is Boris Johnson from UK! 🇬🇧 🇺🇦

And this is PDF to Markdown converter based on PyMuPDF.

Quick start

Features

More

About

Resources

License

Stars

Watchers

Forks

Releases 1

Sponsor this project

Packages 0

Languages

Packages