diff --git a/README.md b/README.md
index 2734f36..4526e7c 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,40 @@
 # Copy Spotter
+
+![PyPI - Version](https://img.shields.io/pypi/v/copy-spotter)
 ![PyPI - License](https://img.shields.io/pypi/l/copy-spotter)
+![Python](https://img.shields.io/badge/python-3.11-blue)
+
+
 ![GIF demo](data/img/example.gif)
 
 ## About
-This program will proccess pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted.
-
-This project was made part of my internship at the "Human Computer Humans Interacting with Computers at University of Primorska" lab (HICUP Lab).
+This program will process pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted.
 
 **Usage**
 ---
+```bash
+$ pip install copy-spotter
+$ copy-spotter [-s] [-o] [-h] input_directory
 ```
-Usage: python -m scripts.main.py input_directory [OPTIONS]
+***Positional Arguments:***
+* `input_directory`: Directory that contains one folder per pdf file (see `data/pdf/plagiarism` for example)
+
+***Optional Arguments:***
+* `-s`, `--block-size`: Set minimum number of consecutive and similar words detected. (Default is 2)
+* `-o`, `--out_dir`: Set the output directory for html files. (Default is creating a new directory called results)
+* `-h`, `--help`: Show this message and exit.
 
- Performs a similarity analysis of all text files available in given input directory.
- Developed by Clément Delteil -> (Github: Wazzabeee)
+**Examples**
+---
+```bash
+# Analyze documents in 'data/pdf/plagiarism', with default settings
+$ copy-spotter data/pdf/plagiarism
 
-Options:
- -block_size, -s Set minimum number of consecutive and similar words detected. (Default is 2)
- -out_dir, -o Set the output directory for html files. (Default is creating a new directory)
- -help, -h Show this message and exit.
+# Analyze with custom block size and specify output directory
+$ copy-spotter data/pdf/plagiarism -s 5 -o results/output
 ```
 
-**How to use**
+**Development Setup:**
 ---
 ```bash
 
@@ -33,42 +46,34 @@ $ cd copy_spotter
 
 # Install requirements
 $ pip install -r requirements.txt
+$ pip install -r requirements_lint.txt
 
-# Run the app
-$ python -m scripts.main.py data/pdf/plagiarism -s 2
-```
-**First run**
----
-On the first run you might get :
-- an ImportError from pdfminer library
-```
-ImportError: cannot import name 'uint_value' from 'pdfminer.pdftypes' (C:/.../pdfminer/pdftypes.py)
-```
-To fix this, please uninstall pdfminer3k and pdfminer.six via
-``` pip uninstall pdfminer3k ```
-``` pip uninstall pdfminer.six ```
-Then install them again via
-``` pip install pdfminer3k ```
-``` pip install pdfminer.six ```
+# Install precommit
+$ pip install pre-commit
+$ pre-commit install
 
+# Run tests
+$ pip install pytest
+$ pytest tests/
 
-- a TypeError from Slate3k library
-```
-TypeError __init__() missing 1 required positional arg 'parser' in "C:/.../slate3k/classes.py
+# Run package locally
+$ python -m scripts.main.py [-s] [-o] [-h] input_directory
 ```
-To fix this you'll need to modify `class PDF(list):` in `C:/.../slate3k/classes.py`. In `def __init__()` change both `if PYTHON 3:` to `if not PYTHON 3:` on lines 58 and 72.
 
 **Recommandations**
 ---
 - Please make sure that all text files are closed before running the program.
 - In order to get the best results please provide text files of the same languages.
 - Pdf files that are made from scanned images won't be processed correctly.
+- Ensure you have writing access when using the package
 - If a specific file is not processed correctly feel free to [contact me](mailto:) so that I can address the issue.
 
 **TODO**
 ---
-- Add more tests
+- Add more tests on existing functions
+- Implement OCR with tesseract for scanned documents
 - Add info in console for timing (tqdm)
-- Add CSS to HTML Template
-- Add support for other folder structures
-- Fix Slate3k by installing custom fork
\ No newline at end of file
+- Add CSS to HTML Template to make the results better looking
+- Add support for other folder structures (right now the package is expecting one pdf files per folder)
+- Add custom naming option for pdf files
+- Fix Slate3k by installing custom fork (check if still relevant)
\ No newline at end of file
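
Note on the `-s`/`--block-size` option documented in the diff above: it is described as the minimum number of consecutive similar words that counts as a detection. The sketch below illustrates one way that idea could work; it is not taken from the package. The helper names `matching_blocks` and `similarity_percent` are hypothetical, and the use of the standard library's `difflib.SequenceMatcher` and the percentage definition (matched words over words in the first text) are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def matching_blocks(text_a: str, text_b: str, block_size: int = 2) -> list[list[str]]:
    """Return runs of at least `block_size` consecutive words shared by both texts.

    Hypothetical helper: an illustration of a "minimum block size" filter,
    not the package's actual implementation.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False keeps frequent words from being ignored on long inputs.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        words_a[m.a : m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= block_size  # drop runs shorter than the threshold
    ]


def similarity_percent(text_a: str, text_b: str, block_size: int = 2) -> float:
    """Share of words in `text_a` covered by matching blocks (assumed definition)."""
    words_a = text_a.lower().split()
    if not words_a:
        return 0.0
    matched = sum(len(block) for block in matching_blocks(text_a, text_b, block_size))
    return 100.0 * matched / len(words_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "a quick brown fox sleeps near the lazy dog"
    print(matching_blocks(a, b, block_size=2))
    print(round(similarity_percent(a, b, block_size=2), 1))
```

Raising `block_size` in this sketch discards short coincidental overlaps, which matches the intent of the `-s 5` example in the diff: fewer, longer matches are reported.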