In this work, we introduce an approach for classifying binary sequences with transformer-based architectures and specialized preprocessing. Given a binary input sequence, we split it into two halves and XOR them bitwise to expose differentiating bit patterns. We employ both BERT and GPT-2 models to predict classes from the resulting bitwise-processed sequences, adapting them through parameter-efficient fine-tuning. Together, this preprocessing and fine-tuning reduce computational overhead while maintaining accuracy. Experimental results demonstrate the approach on a binary classification task, although the observed distinguishing advantage is small. Our findings open avenues for applying transformers in low-dimensional, binary-based classification scenarios, offering efficient, adaptable solutions for real-world applications. In future work, we plan to analyze whether the advantage grows with the amount of training data and whether it remains non-negligible in the security parameter (bit size).
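To make the preprocessing concrete, a minimal sketch is shown below; the function name `xor_halves` and the even-length assumption are illustrative and not taken from the notebook.

```python
# Minimal sketch of the XOR-of-halves preprocessing described above.
# The function name and even-length assumption are illustrative, not from the notebook.
def xor_halves(bits: str) -> str:
    """Split a binary string into two equal halves and XOR them bitwise."""
    assert len(bits) % 2 == 0, "expects an even-length bit string"
    half = len(bits) // 2
    left, right = bits[:half], bits[half:]
    return "".join(str(int(a) ^ int(b)) for a, b in zip(left, right))

print(xor_halves("110010"))  # halves "110" and "010" -> "100"
```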
Please see the report for more details: Report - Indistinguishability Adversary under Ciphertext-Only Attack.
The "Digital Sherlock" project focuses on applying state-of-the-art machine learning models, such as BERT and GPT, for sequence classification and inference tasks. The project involves training a BERT-based model and a GPT-based model, both fine-tuned for a specific task related to NLP. The trained models are evaluated on a dataset, and their performance is reported in a PDF file.
This file (`report.pdf`) contains the comprehensive report detailing the training process, results, and analysis of the models used in this project. It includes:
- Model architecture and configuration
- Training procedure and hyperparameters
- Performance metrics (accuracy, loss, etc.)
- Evaluation results and visualizations
This report serves as a detailed analysis of the model's performance and provides insights into areas of improvement.
This Jupyter notebook (`S_Chatterjee_DigitalSherlock.ipynb`) contains the code used for model training, evaluation, and analysis. It includes:
- Data preprocessing steps to prepare the dataset for training.
- Model architecture setup for both BERT and GPT-based models.
- Training loops for fine-tuning the models on the dataset.
- Evaluation of the models on validation data, along with the results.
- Performance visualizations (loss and accuracy curves); see the plotting sketch below.
This notebook serves as the main script for reproducing the results in the report and can be used to train and evaluate the models on your own.
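The loss and accuracy curves mentioned above can be produced with matplotlib; the sketch below is a minimal illustration, and the per-epoch values in it are placeholders rather than results from the report.

```python
# Minimal sketch of the loss/accuracy curves mentioned above.
# The per-epoch values below are placeholders, not results from the report.
import matplotlib.pyplot as plt

train_losses = [0.70, 0.68, 0.66]      # placeholder: collected per epoch during training
val_accuracies = [0.50, 0.52, 0.53]    # placeholder: collected per epoch during evaluation

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(range(1, len(train_losses) + 1), train_losses, marker="o")
ax1.set(title="Training loss", xlabel="Epoch", ylabel="Loss")
ax2.plot(range(1, len(val_accuracies) + 1), val_accuracies, marker="o")
ax2.set(title="Validation accuracy", xlabel="Epoch", ylabel="Accuracy")
plt.tight_layout()
plt.show()
```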
This file (`trained_models.txt`) contains a list of the trained models, links to download them, their configurations, and checkpoints:
- Model type (e.g., BERT, GPT)
- Hyperparameters used during training
- Directory or path to the saved model checkpoint
This file helps you keep track of different trained models and their specific configurations.
This CSV file contains the dataset used for training and evaluating the models. It includes the following columns:
- `text`: The input text (e.g., sentences, paragraphs, or documents) that the models are trained to classify.
- `label`: The target label corresponding to each input text.
The dataset is used as the training and validation data for both the BERT and GPT models.
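A minimal sketch of loading and splitting this dataset is shown below; it assumes the `text` and `label` columns described above, and the 80/20 split ratio and random seed are illustrative choices, not necessarily the notebook's.

```python
# Minimal sketch: load dataset.csv and split it into training and validation sets.
# The 80/20 ratio and random seed are illustrative, not the notebook's settings.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")  # expected columns: text, label
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"train: {len(train_df)} rows, validation: {len(val_df)} rows")
```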
Downloading the Models
To download the models, follow these steps:
This file contains the best-performing BERT model checkpoint, which was fine-tuned for the sequence classification task. The model is saved in a format compatible with Hugging Face's Transformers library and can be loaded for inference or further fine-tuning.
- Model architecture: `BertForSequenceClassification`
- Fine-tuned for a specific downstream task (e.g., text classification)
This file contains the best-performing GPT-based model checkpoint, which was fine-tuned on the dataset for a sequence classification task. Like the BERT model, this model is saved in a format compatible with Hugging Face's GPT architecture and can be used for generating predictions or further fine-tuning.
- Model architecture: `GPT2LMHeadModel` (or another variant, depending on the task)
- Fine-tuned for a specific downstream task
- Python 3.6+
- Required libraries (you can install them via `requirements.txt` or directly using pip):
  - `transformers`
  - `torch`
  - `pandas`
  - `matplotlib`
  - `scikit-learn` (imported as `sklearn`)
You can install the required libraries with the following command:
```bash
pip install -r requirements.txt
```
- Load and Preprocess Data: Load the dataset from `dataset.csv` using pandas and preprocess it (tokenization, padding, etc.) before training the models. The preprocessing steps are implemented in the notebook `S_Chatterjee_DigitalSherlock.ipynb`.
- Train the Models: In the notebook, run the code for training both the BERT and GPT-based models. Ensure you have the correct environment set up and a GPU available for faster training (a minimal fine-tuning sketch follows this list).
- Evaluate the Models: After training, evaluate the models on the validation set and save the best-performing models in `best_model` (BERT) and `gpt_best_model` (GPT).
- Review Results: The final results and performance analysis can be found in `report.pdf`. This document provides a detailed evaluation of each model's performance.
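As referenced in the training step above, the following is a minimal fine-tuning sketch for the BERT classifier. It reuses `train_df` from the dataset-loading sketch earlier; the hyperparameters (sequence length, batch size, learning rate, number of epochs) are illustrative rather than the notebook's actual settings.

```python
# Minimal BERT fine-tuning sketch; hyperparameters are illustrative, not the
# notebook's settings. Assumes train_df from the dataset-loading sketch above.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        # Tokenize once up front; pad/truncate every example to max_len.
        self.enc = tokenizer(list(texts), truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(list(labels))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

train_ds = TextClassificationDataset(train_df["text"], train_df["label"], tokenizer)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # classification loss computed from the labels
        loss.backward()
        optimizer.step()

model.save_pretrained("best_model")      # directory name taken from the repo
tokenizer.save_pretrained("best_model")
```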
You can load the best-performing models (`best_model` for BERT and `gpt_best_model` for GPT) for inference or further tasks:
```python
from transformers import BertForSequenceClassification, GPT2LMHeadModel, BertTokenizer, GPT2Tokenizer

# Load BERT model
bert_model = BertForSequenceClassification.from_pretrained('path_to_best_model')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load GPT model
gpt_model = GPT2LMHeadModel.from_pretrained('path_to_gpt_best_model')
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Example of BERT inference
text = "This is an example input sentence."
inputs = bert_tokenizer(text, return_tensors="pt")
outputs = bert_model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)

# Example of GPT inference
text = "Once upon a time"
inputs = gpt_tokenizer(text, return_tensors="pt")
outputs = gpt_model.generate(**inputs)
generated_text = gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)
```
This repository provides all the necessary tools to replicate and analyze the sequence classification task using BERT and GPT models. It includes the trained models, dataset, evaluation results, and a detailed report. Feel free to experiment with the models and improve the performance by modifying hyperparameters or training configurations.
This project is licensed under the MIT License - see the LICENSE file for details.
- Project Overview: Describes the project and its objectives, focusing on the use of BERT and GPT models for sequence classification.
- Files Overview: Detailed descriptions of the main files (`report.pdf`, `S_Chatterjee_DigitalSherlock.ipynb`, `trained_models.txt`, `dataset.csv`) and the two trained models.
- Getting Started: Instructions for setting up the project environment and running the project, including installing dependencies and running the training code.
- Running the Project: Details on how to load data, train models, and evaluate them.
- Loading Pre-trained Models for Inference: How to load and use the trained models for inference.
- Example Code: Provides code snippets for running inference with both BERT and GPT models.
- License: A section for specifying the project's license.
This `README.md` explains the project and guides users through setting up and running the code.
I invite researchers, developers, and enthusiasts to collaborate with me on this exciting project in cryptology, deep learning, NLP, and reinforcement learning. Feel free to contact me to develop more ideas. Having rigorously studied Cryptology, Advanced Cryptology, Deep Learning, and Natural Language Processing, I have developed an understanding of these fields and their applications. I completed Cryptology as a credit course at ISI and led a project on Multi-Party Computation; I have also worked on homomorphic encryption and its algebraic analysis. If you're interested in exploring innovative solutions and advancing research in these domains, I'd be thrilled to work together and combine our expertise. Let's connect and push the boundaries of what's possible in these rapidly evolving areas.