The Plagiarism Detection System detects similarities between uploaded documents and previously stored reports. It uses Sentence Transformers for text embedding, FAISS for efficient similarity search, and MongoDB for embedding storage.
Users can upload reports (PDF, DOCX, or TXT), generate embeddings, store them in a database, and retrieve similarity scores for new reports against everything already stored.
- Document Parsing: Extract text from PDFs, DOCX, and TXT files.
- Text Embedding: Use Sentence Transformers (`all-MiniLM-L6-v2`) to generate dense vector representations of text.
- Similarity Search: Perform high-speed similarity search using FAISS.
- Database Storage: Store embeddings and metadata in MongoDB for persistent access.
- Similarity Aggregation: Aggregate similarity scores for chunk- and report-level analysis.
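The chunk-level analysis mentioned above implies that extracted text is split into chunks before embedding. The README does not say how the project does this, so the following is only a minimal sketch under assumed settings: the function name `chunk_text`, the word-based windows, and the `chunk_size` and `overlap` defaults are all hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word-window chunks.

    Assumption: chunks are measured in whitespace-separated words.
    The actual project may chunk by sentences, characters, or tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlapping windows are one common way to avoid cutting a copied passage in half at a chunk boundary; each resulting chunk would then be embedded and indexed separately.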
- FastAPI: API framework for backend services.
- Sentence Transformers: For creating embeddings from text.
- FAISS: For efficient similarity search.
- MongoDB: For storing embeddings and metadata.
- PyPDF2: For extracting text from PDF files.
- python-docx: For extracting text from DOCX files.
- dotenv: For managing environment variables.
- Python 3.8+
- MongoDB installed and running locally or on the cloud.
- Virtual environment (optional but recommended).
- Clone the repository: `git clone https://github.com/your-repo/plagiarism-detection-system.git`, then `cd plagiarism-detection-system`.
- Install dependencies: `pip install -r requirements.txt`
- Create a `.env` file in the project root and add the following, one variable per line:
  - `MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2`
  - `MONGO_URI=mongodb://localhost:27017`
  - `MONGO_DB=plagiarism_detection`
- Start the FastAPI server: `uvicorn main:app --reload`
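As a rough illustration of how the three `.env` variables might be consumed, here is a small sketch that reads them from the process environment. It assumes the variables have already been loaded (for example by calling `load_dotenv()` from python-dotenv at startup). The `Settings` class, the `load_settings` function, and the choice to fall back to the sample values are hypothetical and not taken from the project.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    """Hypothetical container for the three variables from the .env file."""
    model_name: str
    mongo_uri: str
    mongo_db: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env`, or from os.environ when `env` is None.

    Assumption: missing variables fall back to the sample values
    shown in the setup steps.
    """
    source = os.environ if env is None else env
    return Settings(
        model_name=source.get("MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2"),
        mongo_uri=source.get("MONGO_URI", "mongodb://localhost:27017"),
        mongo_db=source.get("MONGO_DB", "plagiarism_detection"),
    )
```

Passing an explicit mapping, as in `load_settings({"MONGO_DB": "test_db"})`, makes it easy to point a test run at a separate database without touching the real environment.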
- Upload a Report: Use the `/upload_report/` endpoint to upload a report and store its embeddings in the database.
- Find Similarity: Use the `/find_similarity_with_upload/` endpoint to upload a new report and find similarities with stored reports.
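For readers who want to call these endpoints from a script, the sketch below builds a multipart upload request using only the Python standard library. The form field name `file` is an assumption (the README does not state it), and `build_upload_request` is a hypothetical helper, not part of the project. The optional `top_k` parameter is left out because the README does not say how it is passed.

```python
import uuid
import urllib.request


def build_upload_request(base_url: str, endpoint: str, filename: str,
                         content: bytes, field_name: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file.

    Assumption: the server reads the upload from a form field called "file".
    """
    boundary = uuid.uuid4().hex  # random separator between multipart sections
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=head + content + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Usage (requires the server to be running):
#   req = build_upload_request("http://127.0.0.1:8000", "/upload_report/",
#                              "report.txt", b"some report text")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

The same helper works for `/find_similarity_with_upload/` by changing the `endpoint` argument.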
Base URL: http://127.0.0.1:8000/
- URL: `/`
- Method: `GET`
- Description: Returns a simple welcome message.
- Response:
{ "Hello , Sandesh here just testing route 😅" }
- URL: `/upload_report/`
- Method: `POST`
- Description: Upload a report to generate embeddings and store them in MongoDB.
- Parameters:
  - File: A file in `.pdf`, `.docx`, or `.txt` format.
- Response:
{ "message": "Report 'filename.ext' processed and embeddings stored successfully." }
- URL: `/find_similarity_with_upload/`
- Method: `POST`
- Description: Upload a new report to find its similarity with stored reports.
- Parameters:
  - File: A file in `.pdf`, `.docx`, or `.txt` format.
  - top_k: (Optional) The number of top similar chunks to return. Default: 5.
- Response:
{ "message": "Similarity analysis for 'filename.ext' completed.", "chunk_level_similarity": [ { "new_chunk": "chunk_text", "similar_chunks": [ { "report_id": "report_1", "chunk_id": 0, "distance": 0.25 } ] } ], "report_level_similarity": [ { "report_id": "report_1", "similarity_score": 0.95 } ] }
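The response carries both chunk-level matches and report-level scores. The README does not document how the second is derived from the first, so the sketch below shows only one plausible aggregation: each distance is converted to a similarity with `1 / (1 + distance)` and averaged per report. Both the formula and the function name `aggregate_report_similarity` are assumptions. This formula does not reproduce the sample values above, so the project's actual method differs.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_report_similarity(chunk_level_similarity: List[dict]) -> List[Dict[str, object]]:
    """Collapse chunk-level matches into one score per stored report.

    Assumption: similarity = 1 / (1 + distance), averaged over all
    matched chunks belonging to the same report_id.
    """
    scores = defaultdict(list)
    for entry in chunk_level_similarity:
        for match in entry["similar_chunks"]:
            scores[match["report_id"]].append(1.0 / (1.0 + match["distance"]))

    report_level = [
        {"report_id": report_id, "similarity_score": sum(vals) / len(vals)}
        for report_id, vals in scores.items()
    ]
    # Most similar report first
    report_level.sort(key=lambda r: r["similarity_score"], reverse=True)
    return report_level
```

Averaging treats every matched chunk equally. Taking the maximum per report instead would highlight a single heavily copied passage, which may suit some use cases better.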
.
├── main.py # Main FastAPI application
├── tool.py # Core logic for embeddings and similarity
├── requirements.txt # Python dependencies
├── .env # Environment variables
└── README.md # Project documentation
- Add support for more file types (e.g., HTML).
- Introduce user authentication and role management.
- Optimize embedding storage with vector databases like Pinecone.
- Integrate a frontend for seamless user interaction.