A Natural Language Processing (NLP) model designed to classify crime details into predefined Categories and Sub-Categories based on input data.
- Python 3.10 or higher
- Libraries: Install the required dependencies listed in requirements.txt
The columns had mismatched record counts:
- crimeaditionalinfo: 93666
- sub_category: 87096
- category: 93687
- Applied `.strip()` to trim leading/trailing whitespace
- Removed blank rows in crimeaditionalinfo
- Deduplicated the cleaned text (a minimal pandas sketch of these steps follows this list)
- Remaining records: 79,399 rows for all columns.
- crimeaditionalinfo: 31230
- sub_category: 28994
- category: 31223
- Remaining records: 27,157 rows for all columns
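As a reference, here is a minimal pandas sketch of the cleaning steps above; the file name and the exact column handling are assumptions, not the repository's actual code.

```python
import pandas as pd

# "train.csv" is an assumed file name; adjust to the actual data path.
df = pd.read_csv("train.csv")

# Drop rows with a missing incident description.
df = df.dropna(subset=["crimeaditionalinfo"])

# Trim leading/trailing whitespace from the free-text column.
df["crimeaditionalinfo"] = df["crimeaditionalinfo"].str.strip()

# Remove rows whose description is blank after trimming.
df = df[df["crimeaditionalinfo"] != ""]

# Deduplicate the cleaned text, keeping the first occurrence.
df = df.drop_duplicates(subset="crimeaditionalinfo", keep="first")

print(len(df))
```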
- Combined similar terms (e.g., Ransomware and Ransomware Attack → Ransomware)
- Split merged terms into distinct categories (e.g., DebitCredit Card FraudSim Swap Fraud → Debit/Credit Card Fraud and SIM Swap Fraud)
Below is the mapping of categories available in the dataset to those mentioned in the document
Old Category (From Dataset) | New Category (From Document) |
---|---|
Cyber Bullying Stalking Sexting | Cyber Bullying/Stalking/Sexting |
Fraud CallVishing | Fraud Call/Vishing |
Online Gambling Betting | Online Gambling/Betting Fraud |
Online Job Fraud | Online Job Fraud |
UPI Related Frauds | UPI-Related Frauds |
Internet Banking Related Fraud | Internet Banking-Related Fraud |
Other | Any Other Cyber Crime |
Profile Hacking Identity Theft | Profile Hacking/Identity Theft |
EWallet Related Fraud | E-Wallet Related Frauds |
Data Breach/Theft | Unauthorized Access/Data Breach |
Denial of Service (DoS)/Distributed Denial of Service (DDOS) attacks | Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks |
FakeImpersonating Profile | Fake/Impersonating Profile |
Cryptocurrency Fraud | Cryptocurrency Crime |
Malware Attack | Malware attacks |
Business Email CompromiseEmail Takeover | Business Email Compromise/Email Takeover |
Email Hacking | Email Hacking |
Cheating by Impersonation | Cheating by Impersonation |
Hacking/Defacement | Defacement/Hacking |
Unauthorised AccessData Breach | Unauthorized Access/Data Breach |
SQL Injection | SQL Injection |
Provocative Speech for unlawful acts | Provocative Speech of Unlawful Acts |
Ransomware Attack | Ransomware |
Cyber Terrorism | Cyber Terrorism |
Tampering with computer source documents | Tampering with computer source documents |
DematDepository Fraud | Demat/Depository Fraud |
Online Trafficking | Online Cyber Trafficking |
Online Matrimonial Fraud | Online Matrimonial Fraud |
Website DefacementHacking | Defacement/Hacking |
Damage to computer computer systems etc | Damage to Computer Systems |
Impersonating Email | Impersonating Email |
EMail Phishing | Email Phishing |
Ransomware | Ransomware |
Intimidating Email | Intimidating Email |
Against Interest of sovereignty or integrity of India | Against Interest of sovereignty or integrity of India |
Computer Generated CSAM/CSEM | Child Pornography/Child Sexual Abuse Material (CSAM) |
Cyber Blackmailing & Threatening | Cyber Bullying/Stalking/Sexting |
Sexual Harassment | Cyber Bullying/Stalking/Sexting |
DebitCredit Card Fraud | Debit/Credit Card Fraud |
Sim Swap Fraud | SIM Swap Fraud |
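To show how the table above could be applied in code, here is a hedged sketch (reusing the `df` from the cleaning sketch earlier); only a few rows are included, and the raw dataset strings may differ slightly in spacing or casing from what is printed here.

```python
# Partial old-name -> new-name mapping taken from the table above; extend it
# with the remaining rows before use.
CATEGORY_RENAME = {
    "Cyber Bullying Stalking Sexting": "Cyber Bullying/Stalking/Sexting",
    "Fraud CallVishing": "Fraud Call/Vishing",
    "UPI Related Frauds": "UPI-Related Frauds",
    "Ransomware Attack": "Ransomware",
    "Website DefacementHacking": "Defacement/Hacking",
}

# `sub_category` is an assumed column name for the labels being normalized.
df["sub_category"] = df["sub_category"].replace(CATEGORY_RENAME)
```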
- Defined 39 subcategories by mapping dataset categories to consistent names
- The dataset contained only 39 subcategories (from both train and test data) compared to the 62 subcategories mentioned in the document.
- Categories in the train-test dataset were actually subcategories
- 3,783 records in the dataset lacked subcategory labels.
- Some subcategories were misclassified.
- Worked with the available data to address inconsistencies.
- Grouped the 62 subcategories into 4 top-level categories for streamlined classification:
- any_other_cyber_crime
- cyberbullying_and_online_harassment
- financial_frauds
- system_hacking_and_damage
- Check this file for the mapping of the 62 subcategories to the four top-level categories.
- Used LLMs to classify the data into these four categories
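A minimal sketch of the grouping step, assuming the complete 62-entry mapping lives in the file referenced above; the handful of assignments shown here are illustrative, and `sub_category`/`top_category` are assumed column names.

```python
# Illustrative fragment of the subcategory -> top-level category mapping;
# the full 62-entry mapping is kept in the file referenced above.
SUBCATEGORY_TO_TOP = {
    "Debit/Credit Card Fraud": "financial_frauds",
    "UPI-Related Frauds": "financial_frauds",
    "Cyber Bullying/Stalking/Sexting": "cyberbullying_and_online_harassment",
    "Ransomware": "system_hacking_and_damage",
    "SQL Injection": "system_hacking_and_damage",
    "Any Other Cyber Crime": "any_other_cyber_crime",
}

df["top_category"] = df["sub_category"].map(SUBCATEGORY_TO_TOP)
print(df["top_category"].value_counts())
```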
After mapping the subcategories to the four top-level categories, the data distribution was as follows:
- any_other_cyber_crime: 1,981 records
- cyberbullying_and_online_harassment: 9,026 records
- financial_frauds: financial_frauds
- system_hacking_and_damage: 1,981 records
Approach to predicting Sub-Category and Category
- Boosting Algorithm: XGBoost
- Neural Networks: FastText
- Transformers: DistilBERT and ALBERT
Confusion matrix and classification report on test data using different algorithms
Model | Records Tested | RAM | Model Size (MB) | GPU Memory (Utilization) | Processing Time (s) | Precision | Recall | Accuracy | F1-Score |
---|---|---|---|---|---|---|---|---|---|
DistilBERT | 27158 | 2.8 GB | 767.3 | 485 MB (60%) | 140.87 | 0.937 | 0.937 | 0.94 | 0.937 |
ALBERT | 27158 | 3.2 GB | 134.6 | 213 MB (65-70%) | 240 | 0.936 | 0.937 | 0.94 | 0.936 |
FastText | 27158 | 2.7 GB | 806.6 | 0 | 9.93 | 0.92 | 0.92 | 0.93 | 0.925 |
Decision: Chose FastText due to its efficiency and competitive performance.
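For reference, a hedged sketch of supervised FastText training for the four-way classifier; the file name, label format, and hyperparameter values below are assumptions, not the tuned settings used in this project.

```python
import fasttext

# FastText's supervised mode expects one example per line in the form
# "__label__<category> <text>", e.g.
#   __label__financial_frauds money was debited through a upi request i never made
# "train_fasttext.txt" is an assumed file name.
model = fasttext.train_supervised(
    input="train_fasttext.txt",
    epoch=25,        # illustrative values only; the project tuned its
    lr=0.5,          # hyperparameters to cope with class imbalance
    wordNgrams=2,    # (see "Handling Imbalanced Data" below)
    loss="softmax",
)

model.save_model("category_model.bin")
print(model.predict("my credit card was charged without my consent"))
```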
- Handling Imbalanced Data: optimized hyperparameters to handle the imbalanced data
- Subcategory Prediction:
  - Primary Classification: the FastText model classifies each record into one of four categories: any_other_cyber_crime, cyberbullying_and_online_harassment, financial_frauds, or system_hacking_and_damage.
  - Keyword Matching: based on the predicted category, keyword-matching techniques (bi-grams, unigrams, and exact phrases) are used to identify a relevant subcategory.
  - Latent Semantic Analysis (LSA): if no suitable subcategory is found via keywords, LSA identifies the most similar subcategory based on the underlying meaning of the incident description. Combining keyword-based and semantic methods keeps categorization and subcategorization robust across a wide range of incident descriptions. A sketch of this keyword-plus-LSA fallback follows the evaluation results below.
  - The predicted subcategories are then grouped back into the top-level categories.
- We evaluated 100 records from the testing data; the results are below. Check this file.
- Accuracy: 0.92
- Precision: 0.91
- Recall: 0.89
- F1 Score: 0.83
- Processing Speed: FastText completed predictions significantly faster compared to transformer-based models.
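Below is a small sketch of the keyword-plus-LSA fallback described above, built with scikit-learn; the keyword lists, reference descriptions, and the two subcategories shown are illustrative placeholders rather than the project's actual resources.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Illustrative keyword/phrase lists per subcategory (the real lists are larger
# and depend on the predicted top-level category).
KEYWORDS = {
    "Debit/Credit Card Fraud": ["credit card", "debit card", "card cloned"],
    "UPI-Related Frauds": ["upi", "payment request", "collect request"],
}

# One short reference description per subcategory for the LSA fallback.
REFERENCE_TEXTS = {
    "Debit/Credit Card Fraud": "unauthorised transaction on my credit or debit card",
    "UPI-Related Frauds": "money debited through a upi payment request i never made",
}

SUBCATS = list(REFERENCE_TEXTS)
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
REF_VECTORS = lsa.fit_transform([REFERENCE_TEXTS[s] for s in SUBCATS])

def keyword_match(text: str) -> str | None:
    """Return the first subcategory whose phrases occur in the text, if any."""
    lowered = text.lower()
    for subcategory, phrases in KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return subcategory
    return None

def lsa_match(text: str) -> str:
    """Fall back to the subcategory whose reference text is closest in LSA space."""
    similarities = cosine_similarity(lsa.transform([text]), REF_VECTORS)[0]
    return SUBCATS[int(similarities.argmax())]

def predict_subcategory(text: str) -> str:
    return keyword_match(text) or lsa_match(text)

print(predict_subcategory("money was debited through a upi payment request"))
```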
- Clone the repository:
git clone https://github.com/ankitvirla/crime_categorization_model.git
cd crime_categorization_model
- Install dependencies:
pip install -r requirements.txt
- Run inference: modify the script for your input data, then run inference.py:
python3 scripts/inference.py
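For a quick programmatic check outside the script, something along these lines should work; the model path is an assumption, and scripts/inference.py remains the authoritative entry point.

```python
import fasttext

# Assumed location of the trained category model; adjust to the actual path.
model = fasttext.load_model("models/category_model.bin")

text = "fraudulent transaction happened through my internet banking account"
labels, scores = model.predict(text, k=1)
print(labels[0].replace("__label__", ""), round(float(scores[0]), 3))
```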
- Jagannathan Arumugam - Project Lead
- Kaushambi Chandel
- Ebin Jose Mathew
- Yakkanti Tulasi Kishore Reddy
- Ankit Birla
- The dataset was skewed heavily toward certain categories (e.g., financial frauds) while other categories had significantly fewer records, leading to poor model performance on underrepresented classes.
- 3,783 records were missing subcategory labels, reducing the quality and coverage of training data.
- The test dataset lacked category information, which led to reliance on manual validation for evaluation.
- Subcategories were inconsistently labeled or merged, requiring significant preprocessing to standardize the data.
- The mapping of subcategories to top-level categories resulted in the loss of finer details in some cases, which could limit the model's usefulness in specific contexts.
- Several records were found to be incorrectly labeled in the dataset, reducing model accuracy and making it harder to generalize predictions.
- For some records, there was no exact matching category for the given details; in such cases, we classified them into the closest matching category.
- Regularly update the dataset to include new types of cybercrimes or frauds to ensure the model stays relevant.
- Train models to handle multilingual datasets, allowing classification of crime details provided in different languages.
- Develop an API or web-based GUI to allow users to classify text in real-time, making the solution more user-friendly.
- Optimize the model for deployment on low-resource devices, such as edge systems, to expand usability.
- Develop integrations with law enforcement tools to automate categorization of cybercrime reports for faster processing.
- Continuously monitor the model’s performance with real-world data to identify biases, misclassifications, or performance drops.