chore: update readme (#11)

Wazzabeee · Apr 26, 2024 · 9c8312f
1 parent 4fa35a2
Showing 1 changed file with 40 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -1,27 +1,40 @@
 # Copy Spotter
+
+![PyPI - Version](https://img.shields.io/pypi/v/copy-spotter) ![PyPI - License](https://img.shields.io/pypi/l/copy-spotter)
+![Python](https://img.shields.io/badge/python-3.11-blue)
+
+
 ![GIF demo](data/img/example.gif)
 
 ## About
-This program will proccess pdf, txt, docx, and txt files that can be found in the given input directory, find similar sentences, calculate similarity percentage, display a similarity table with links to side by side comparison where similar sentences are highlighted.
-
-This project was made part of my internship at the "Human Computer Humans Interacting with Computers at University of Primorska" lab (HICUP Lab).
+This program processes the pdf, txt, and docx files found in the given input directory, finds similar sentences, calculates a similarity percentage, and displays a similarity table with links to side-by-side comparisons in which similar sentences are highlighted.
 
 **Usage**
 ---
 
+```bash
+$ pip install copy-spotter
+$ copy-spotter [-s] [-o] [-h] input_directory
 ```
-Usage: python -m scripts.main.py input_directory [OPTIONS]
+***Positional Arguments:***
+* `input_directory`: Directory that contains one folder per pdf file (see `data/pdf/plagiarism` for an example)
+
+***Optional Arguments:***
+* `-s`, `--block-size`: Set the minimum number of consecutive similar words that counts as a match. (Default is 2)
+* `-o`, `--out_dir`: Set the output directory for html files. (Default is a new directory called `results`)
+* `-h`, `--help`: Show this message and exit.
 
-  Performs a similarity analysis of all text files available in given input directory.
-  Developed by Clément Delteil -> (Github: Wazzabeee)
+**Examples**
+---
+```bash
+# Analyze documents in 'data/pdf/plagiarism', with default settings
+$ copy-spotter data/pdf/plagiarism
 
-Options:
-  -block_size, -s  Set minimum number of consecutive and similar words detected. (Default is 2)
-  -out_dir, -o     Set the output directory for html files. (Default is creating a new directory)
-  -help, -h        Show this message and exit.
+# Analyze with custom block size and specify output directory
+$ copy-spotter data/pdf/plagiarism -s 5 -o results/output
 ```
 
-**How to use**
+**Development Setup:**
 ---
 
 ```bash
@@ -33,42 +46,34 @@ $ cd copy_spotter
 
 # Install requirements
 $ pip install -r requirements.txt
+$ pip install -r requirements_lint.txt
 
-# Run the app
-$ python -m scripts.main.py data/pdf/plagiarism -s 2
-```
-**First run**
----
-On the first run you might get :
-- an ImportError from pdfminer library 
-``` 
-ImportError: cannot import name 'uint_value' from 'pdfminer.pdftypes' (C:/.../pdfminer/pdftypes.py)
-```
-To fix this, please uninstall pdfminer3k and pdfminer.six via 
-``` pip uninstall pdfminer3k ```
-``` pip uninstall pdfminer.six ```
-Then install them again via 
-``` pip install pdfminer3k ```
-``` pip install pdfminer.six ```
+# Install pre-commit
+$ pip install pre-commit
+$ pre-commit install
 
+# Run tests
+$ pip install pytest
+$ pytest tests/
 
-- a TypeError from Slate3k library 
-```
-TypeError __init__() missing 1 required positional arg 'parser' in "C:/.../slate3k/classes.py
+# Run package locally
+$ python -m scripts.main [-s] [-o] [-h] input_directory
 ```
-To fix this you'll need to modify `class PDF(list):` in `C:/.../slate3k/classes.py`. In `def __init__()` change both `if PYTHON 3:` <br/> to `if not PYTHON 3:` on lines 58 and 72.
 
 **Recommendations**
 ---
 - Please make sure that all text files are closed before running the program.
 - To get the best results, please provide text files in the same language.
 - Pdf files that are made from scanned images won't be processed correctly.
+- Ensure you have write access when using the package.
 - If a specific file is not processed correctly feel free to [contact me](mailto:<[email protected]>) so that I can address the issue.
 
 **TODO**
 ---
-- Add more tests
+- Add more tests on existing functions
+- Implement OCR with tesseract for scanned documents
 - Add info in console for timing (tqdm)
-- Add CSS to HTML Template
-- Add support for other folder structures
-- Fix Slate3k by installing custom fork
+- Add CSS to HTML Template to make the results better looking
+- Add support for other folder structures (the package currently expects one pdf file per folder)
+- Add custom naming option for pdf files
+- Fix Slate3k by installing custom fork (check if still relevant)
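
The updated README describes `-s`/`--block-size` as the minimum number of consecutive similar words detected, and the About section mentions a similarity percentage. The sketch below illustrates one way such a block-based comparison could work. It is not taken from the copy-spotter source: the function names `word_blocks` and `block_similarity` are hypothetical, and the tokenization and the percentage formula are assumptions made only for illustration.

```python
import re


def word_blocks(text: str, block_size: int = 2) -> set[tuple[str, ...]]:
    """Return every run of `block_size` consecutive words found in `text`.

    Hypothetical helper: lowercases the text and splits on non-word characters.
    """
    words = re.findall(r"\w+", text.lower())
    # range() is empty when the text has fewer words than block_size
    return {
        tuple(words[i:i + block_size])
        for i in range(len(words) - block_size + 1)
    }


def block_similarity(text_a: str, text_b: str, block_size: int = 2) -> float:
    """Percentage of word blocks in `text_a` that also appear in `text_b`.

    Assumed formula for illustration; the package may compute it differently.
    """
    blocks_a = word_blocks(text_a, block_size)
    blocks_b = word_blocks(text_b, block_size)
    if not blocks_a:
        return 0.0
    return 100.0 * len(blocks_a & blocks_b) / len(blocks_a)


if __name__ == "__main__":
    a = "the quick brown fox"
    b = "a quick brown dog"
    print(block_similarity(a, b, block_size=2))
    print(block_similarity(a, b, block_size=3))
```

Under these assumptions, a larger block size makes matching stricter: two texts must share longer runs of identical words before any overlap is counted, which is why raising `-s` would tend to reduce reported similarity.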