Skip to content

Use Python3's `pdfminer` & `ElasticSearch` to index & compute statistics on free e-books found online.

Notifications You must be signed in to change notification settings

ArnoVel/elastic-book-stats

Repository files navigation

PDF Parsing and Analysis using PDFMiner, ElasticSearch

The goal is to download e-books for free from the internet and index them by parsing them using PDFMiner to map them to strings.

Machine learning & visualization on the strings will probably be done at some point in the future ...

PDF Parsing

Using both PDFMiner, and PyPDF2, map each pdf to a list of strings describing its pages, as well as basic metadata about length, author description, the type of book & name.

PDF Analysis

Using ElasticSearch, we create and index and compute various statistics on the aggregated dataset.

Technical Details

Creating the index locally & running tests on your machine

Given the documents in pdf form, one can index them with elk by running basic_parse_store methods such as generate_store_all, which first calls the routine pdf_to_pickle to store extracted pages, author string, theme, etc as rows in a pandas DataFrame. For example the following calls:

if generate : pdf_to_pickle()
books_df = pd.read_pickle('./data/dataframe/books_df.pkl')
j = 0
for i,row in books_df.iterrows():
    book = row.to_dict() ; book['_pages'] = [ page.replace('\t', ' ').lower() for page in book['_pages'] ]
    es.index(index='book-index', body=book, id=i, doc_type='_doc')
    pages, j = book_to_pages(row.to_dict(), j)
    bulk_index(pages)

which are at the core of generate_store_allstores into an index named page-index each page, with its attached book name, book num_pages, theme etc. The books are also stored as books in the index book-index.

The function json_to_index(..) maps the extracted pdf (as a json file) to its index. Only use when the index does not yet exist or has been erased.

Install requirements

This project is small but uses specific libraries. For the python3 dependencies, use the requirements.txt file; I also have jupyter-specific dependencies which I haven't listed, but one is the ipynb package

About

Use Python3's `pdfminer` & `ElasticSearch` to index & compute statistics on free e-books found online.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published