Intell_Quest_Baseline

The Intell Quest hackathon challenges you to develop an innovative plagiarism detection tool for research papers. This repository contains the code for the baseline model for Intell Quest Hackathon 2024. The baseline model considers a limited set of documents and detects plagiarism using Term Frequency-Inverse Document Frequency (TF-IDF).

Getting started

Clone the repository using the command

git clone https://github.com/IEEE-NITK/Intell_Quest_24_Baseline.git

Run python main.py and enter the input text to view the top 5 matching documents.
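
To illustrate the idea behind the baseline, here is a minimal, standard-library-only sketch of TF-IDF ranking with cosine similarity. It is not the code in tfidf.py; the function names (tokenize, build_idf, vectorize, cosine, top_matches) and the smoothed IDF formula are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(tokenized_docs):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothing (an assumption of this sketch) keeps every weight positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Build a sparse TF-IDF vector (a dict); terms missing from idf are skipped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query, docs, k=5):
    """Return up to k (document index, similarity) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vectors = [vectorize(tokens, idf) for tokens in tokenized]
    query_vector = vectorize(tokenize(query), idf)
    scored = [(i, cosine(query_vector, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

Calling top_matches(user_text, abstracts) with a list of abstract strings returns the indices of the closest documents together with their similarity scores.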

Data format

The data is in JSON format, structured as follows:

[ ...
    {
        "abstract": Abstract of the paper
        "articleNumber": Article number
        "articleTitle": Article title
        "authors": [ ...
            {
                "preferredName": Preferred name
                "normalizedName": Normalized name
                "firstName": First name
                "lastName": Last name
                "searchablePreferredName": Name used to identify the author on the internet
                "id": ID
            }
        ...]
        "doi": Digital Object Identifier (DOI)
        "publicationTitle": Publication title
        "publicationYear": Year of publication
        "publicationVolume": Publication volume
        "publicationIssue": Publication issue
        "volume": Volume number
        "issue": Issue number
        "documentLink": Link to the document
        "xml": XML of the document
    }
  ...
]
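
As a hedged illustration of reading this format, the sketch below parses the record array and pulls out the fields most useful for text comparison. The helper names (parse_records, load_records, record_to_text, author_names) are hypothetical, the path passed to load_records is up to you, and the code assumes any field may be missing from a record.

```python
import json


def parse_records(raw_json):
    """Parse a JSON string that should contain a top-level array of paper records."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of paper records")
    return records


def load_records(path):
    """Read a dataset file (path is supplied by the caller) and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_records(handle.read())


def record_to_text(record):
    """Join the title and abstract into one string suitable for indexing."""
    parts = [record.get("articleTitle"), record.get("abstract")]
    return "\n".join(p.strip() for p in parts if p and p.strip())


def author_names(record):
    """Return the preferred names of all authors, skipping entries without one."""
    authors = record.get("authors") or []
    return [a["preferredName"] for a in authors if a.get("preferredName")]
```

The strings returned by record_to_text can then be fed to whatever feature extraction step you choose.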

What we want in the hackathon

This hackathon focuses on creating a user-friendly application that effectively identifies plagiarism in IEEE research papers. You will be tasked with building a comprehensive system encompassing:

Frontend

A user-friendly interface for uploading research papers in common formats (Text file, PDF, docx, etc.).

Backend

A robust algorithm that analyzes uploaded papers against a vast database of similar research papers to identify potential plagiarism. Key functionalities include:
Preprocessing: Clean and tokenize the text data for efficient analysis.
Feature Extraction: Extract meaningful features from the text using techniques like TF-IDF or vectorization with powerful language models from Hugging Face.
Similarity Comparison: Employ appropriate similarity metrics (cosine similarity, Euclidean distance, etc.) to assess the degree of resemblance between the uploaded paper and reference documents.
Database Management: Choose a suitable database solution (relational or vector) to store reference documents and their extracted features efficiently.
Output: Generate a clear and concise plagiarism score for the entire paper.
Consider these additional features for improved analysis:
Chunk-level analysis: Highlight specific sections with suspected plagiarism and their potential sources.
Visualization: Implement a color-coding system inspired by tools like Turnitin to visually represent plagiarism severity.
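
The chunk-level analysis, overall score, and color coding described above can be sketched as follows. This example deliberately swaps TF-IDF for a simpler technique, word 3-gram (shingle) containment, so that it stays short and dependency-free. The sentence splitter is naive, and the function names and the 0.3 / 0.7 color thresholds are assumptions chosen for illustration, not values prescribed by the hackathon.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(chunk_shingles, ref_shingles):
    """Fraction of the chunk's shingles that also occur in the reference."""
    if not chunk_shingles:
        return 0.0
    return len(chunk_shingles & ref_shingles) / len(chunk_shingles)


def split_chunks(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def color_band(score, amber=0.3, red=0.7):
    """Map a similarity score to a severity color (thresholds are assumptions)."""
    if score >= red:
        return "red"
    if score >= amber:
        return "amber"
    return "green"


def analyze(paper, references):
    """Score each chunk against every reference; return (report, overall score).

    references maps a source name to its full text. The overall score is the
    mean of the chunk scores, weighted by each chunk's shingle count.
    """
    ref_shingles = {name: shingles(text) for name, text in references.items()}
    report, weighted_sum, total_weight = [], 0.0, 0
    for chunk in split_chunks(paper):
        chunk_sh = shingles(chunk)
        best_source, best_score = None, 0.0
        for name, ref_sh in ref_shingles.items():
            score = containment(chunk_sh, ref_sh)
            if score > best_score:
                best_source, best_score = name, score
        report.append({"chunk": chunk, "score": best_score,
                       "source": best_source, "color": color_band(best_score)})
        weighted_sum += best_score * len(chunk_sh)
        total_weight += len(chunk_sh)
    overall = weighted_sum / total_weight if total_weight else 0.0
    return report, overall
```

Each report entry carries the chunk text, its best-matching source, and a color, which a frontend could use to highlight suspicious passages.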

Submission Essentials

Core Functionality:

Frontend: User-friendly interface for uploading research papers.
Backend: Robust similarity comparison against a provided dataset of research papers.
Database: Efficient storage and retrieval of data.

Beyond the Basics:

While the core functionalities outlined above provide a solid foundation, we encourage you to push the boundaries of innovation and explore enhancements like:
Accuracy Boosting: Implement advanced text similarity algorithms, leverage external knowledge sources, or incorporate deep learning models for more precise plagiarism detection.
Active Learning: Continuously improve the model by incorporating user feedback and new data. This also means adding newly uploaded papers to your existing database.
Efficiency Optimization: Utilize caching, parallel processing, and optimized data structures to improve processing time for large datasets.
User Experience: Design an intuitive interface with clear result interpretations and filtering options for sources, similarity thresholds, and plagiarism types.
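
Two of the ideas above, adding newly uploaded papers to the existing store and filtering results by a similarity threshold, can be sketched with a small in-memory inverted index. The ReferenceIndex class, its method names, the use of Jaccard similarity over token sets, and the default threshold of 0.2 are all assumptions for illustration; a real submission would likely use a proper database and a stronger similarity measure.

```python
import re
from collections import defaultdict


def token_set(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ReferenceIndex:
    """In-memory inverted index that supports incremental additions."""

    def __init__(self):
        self.doc_tokens = {}               # doc_id -> set of tokens
        self.postings = defaultdict(set)   # token -> set of doc_ids

    def add(self, doc_id, text):
        """Insert a document, replacing any earlier version with the same id."""
        for token in self.doc_tokens.get(doc_id, ()):
            self.postings[token].discard(doc_id)
        tokens = token_set(text)
        self.doc_tokens[doc_id] = tokens
        for token in tokens:
            self.postings[token].add(doc_id)

    def query(self, text, threshold=0.2):
        """Return (doc_id, Jaccard similarity) pairs at or above the threshold."""
        query_tokens = token_set(text)
        # Only documents sharing at least one token can have a non-zero score,
        # so the postings lists avoid scanning the whole collection.
        candidates = set()
        for token in query_tokens:
            candidates |= self.postings.get(token, set())
        results = []
        for doc_id in candidates:
            doc = self.doc_tokens[doc_id]
            score = len(query_tokens & doc) / len(query_tokens | doc)
            if score >= threshold:
                results.append((doc_id, score))
        results.sort(key=lambda pair: (-pair[1], pair[0]))
        return results
```

After a paper has been checked with query, calling add with its identifier and text makes it part of the reference set for later submissions.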

NOTE: This hackathon will provide access to a curated dataset of research papers for training and testing purposes.

