Plagiarism Detection System

Overview

The Plagiarism Detection System is a robust application designed to detect similarities between uploaded documents and stored reports using state-of-the-art machine learning techniques. It leverages Sentence Transformers for text embedding, FAISS for efficient similarity search, and MongoDB for embedding storage.

This project enables users to upload reports (PDF, DOCX, or TXT), generate embeddings, store them in a database, and find similarity scores for new reports, ensuring efficient and scalable detection of plagiarized content.

Features

Document Parsing: Extract text from PDFs, DOCX, and TXT files.
Text Embedding: Use Sentence Transformers (all-MiniLM-L6-v2) to generate dense vector representations of text.
Similarity Search: Perform high-speed similarity search using FAISS.
Database Storage: Store embeddings and metadata in MongoDB for persistent access.
Similarity Aggregation: Aggregate similarity scores for chunk- and report-level analysis.

Technologies Used

FastAPI: API framework for backend services.
Sentence Transformers: For creating embeddings from text.
FAISS: For efficient similarity search.
MongoDB: For storing embeddings and metadata.
PyPDF2: For extracting text from PDF files.
python-docx: For extracting text from DOCX files.
dotenv: For managing environment variables.

Installation

Prerequisites

Python 3.8+
MongoDB installed and running locally or on the cloud.
Virtual environment (optional but recommended).

Steps

Clone the repository:

git clone https://github.com/your-repo/plagiarism-detection-system.git
cd plagiarism-detection-system

Install dependencies:
```
pip install -r requirements.txt
```

Create a .env file in the project root and add the following:

MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2
MONGO_URI=mongodb://localhost:27017
MONGO_DB=plagiarism_detection

Start the FastAPI server:
```
uvicorn main:app --reload
```

Usage

Upload a Report: Use the /upload_report/ endpoint to upload a report and store its embeddings in the database.
Find Similarity: Use the /find_similarity_with_upload/ endpoint to upload a new report and find similarities with stored reports.

API Documentation

Base URL

http://127.0.0.1:8000/

Endpoints

1. Home

URL: /
Method: GET
Description: Returns a simple welcome message.

Response:

{
  "Hello , Sandesh here just testing route 😅"
}

2. Upload Report

URL: /upload_report/
Method: POST
Description: Upload a report to generate embeddings and store them in MongoDB.
Parameters:
- File: A file in .pdf, .docx, or .txt format.

Response:

{
  "message": "Report 'filename.ext' processed and embeddings stored successfully."
}

3. Find Similarity

URL: /find_similarity_with_upload/
Method: POST
Description: Upload a new report to find its similarity with stored reports.
Parameters:
- File: A file in .pdf, .docx, or .txt format.
- top_k: (Optional) The number of top similar chunks to return. Default: 5.

Response:

{
  "message": "Similarity analysis for 'filename.ext' completed.",
  "chunk_level_similarity": [
    {
      "new_chunk": "chunk_text",
      "similar_chunks": [
        {
          "report_id": "report_1",
          "chunk_id": 0,
          "distance": 0.25
        }
      ]
    }
  ],
  "report_level_similarity": [
    {
      "report_id": "report_1",
      "similarity_score": 0.95
    }
  ]
}

Project Structure

.
├── main.py              # Main FastAPI application
├── tool.py              # Core logic for embeddings and similarity
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables
└── README.md            # Project documentation

Future Enhancements

Add support for more file types (e.g., HTML).
Introduce user authentication and role management.
Optimize embedding storage with vector databases like Pinecone.
Integrate a frontend for seamless user interaction.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
__pycache__		__pycache__
uploads		uploads
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
st.json		st.json
tool.py		tool.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Plagiarism Detection System

Overview

Features

Technologies Used

Installation

Prerequisites

Steps

Usage

API Documentation

Base URL

Endpoints

1. Home

2. Upload Report

3. Find Similarity

Project Structure

Future Enhancements

About

Releases

Packages

Languages

sandeshlavshetty/plauge_nlp_V2

Folders and files

Latest commit

History

Repository files navigation

Plagiarism Detection System

Overview

Features

Technologies Used

Installation

Prerequisites

Steps

Usage

API Documentation

Base URL

Endpoints

1. Home

2. Upload Report

3. Find Similarity

Project Structure

Future Enhancements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages