Jibril-Frej/irspdf

Presentation

irspdf is a simple text-based information retrieval system for PDF documents.

Text is extracted from PDF files with pdfplumber.

Standard text preprocessing for information retrieval is applied:

  • Stop word removal
  • Stemming
  • Punctuation removal
  • Lowercase conversion
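
The four steps above can be sketched in plain Python. This is only an illustration, not irspdf's actual code: the README does not say which stop word list or stemmer the package uses, so `STOP_WORDS`, `toy_stem` and `preprocess` are hypothetical names, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import string

# Tiny illustrative stop word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "for", "in", "to", "with"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def toy_stem(token: str) -> str:
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, remove punctuation, drop stop words, then stem each token."""
    lowered = text.lower()
    tokens = lowered.translate(_PUNCT_TO_SPACE).split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The system is ranking PDF documents, quickly!"))
```

The same function would be applied to both documents and queries, so that a query term and a document term reduce to the same token before matching.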

The ranking function used is BM25.
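
For readers unfamiliar with BM25, here is a minimal sketch of one common formulation. It is not taken from irspdf's source: the README does not state which BM25 variant or parameter values the package uses, so the function name `bm25_scores`, the smoothed IDF, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    `docs` is a list of token lists. k1 and b are assumed defaults,
    not values confirmed for irspdf.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [["pdf", "retrieval", "system"], ["pdf", "parser"], ["cooking", "recipe"]]
    print(bm25_scores(["retrieval", "pdf"], corpus))
```

Documents that contain more of the query terms, and rarer ones, score higher; a document with no query term scores zero.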

Installation

Install with pip

pip install irspdf

Or install from GitHub

git clone https://github.com/Jibril-Frej/irspdf.git
cd irspdf && python setup.py install

Usage

Build a collection

from irspdf import build
build(folder_path, collection_path)

folder_path : path of the folder that contains all the PDF files to include in the collection.

collection_path : file where the collection will be saved

Query the collection

from irspdf import query
query(collection_path)

collection_path : file where the collection is saved

Update the collection

from irspdf import update
update(folder_path, collection_path)

folder_path : path of the folder that contains all the PDF files to add to the collection.

collection_path : file where the original collection is saved

Useful links

Documentation: https://irspdf.readthedocs.io/en/latest/.

Source Code: https://github.com/Jibril-Frej/irspdf

Package: https://pypi.org/project/irspdf/
