Multilingual text identification with its topic categorization

This project presents an integrated system for automatic language detection and topic identification using natural language processing (NLP) models. By leveraging the MobileBERT model, known for its efficiency, the system performs language detection while minimizing computational resources. It preprocesses the data, tokenizes the text, and fine-tunes MobileBERT for classification, and it adds topic identification through models such as Latent Dirichlet Allocation (LDA) or BERT-based approaches. As a result, the system not only detects the language of the input text but also extracts meaningful topics from it. In addition, it translates the input text into the user's desired language.
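
Below is a minimal sketch of that pipeline, assuming the google/mobilebert-uncased checkpoint, an illustrative label count, and langdetect as the lightweight detection step; the exact model configuration and preprocessing in this repository may differ.

```python
# Sketch: clean the text, detect its language, score it with MobileBERT,
# and translate it into the user's desired language.
import ftfy
from langdetect import detect
from deep_translator import GoogleTranslator
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def process(text: str, target_lang: str = "en"):
    clean = ftfy.fix_text(text)          # repair encoding issues before anything else
    lang = detect(clean)                 # quick language guess (langdetect)

    # Tokenize and classify with MobileBERT (checkpoint and num_labels are
    # illustrative assumptions, not the repository's exact configuration)
    tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "google/mobilebert-uncased", num_labels=10)
    inputs = tokenizer(clean, truncation=True, padding=True, return_tensors="pt")
    logits = model(**inputs).logits      # class scores (meaningful only after fine-tuning)

    # Translate to the requested language
    translated = GoogleTranslator(source="auto", target=target_lang).translate(clean)
    return lang, logits, translated

print(process("Bonjour tout le monde", target_lang="en"))
```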

Requirements

Ensure you have Python 3.6+ installed along with the following libraries:

Python libraries: warnings, pandas, re, torch, numpy, sklearn, transformers, deep_translator, langdetect, ftfy, os, nltk

Deep Learning Framework: Transformers (Hugging Face Transformers library)

Machine Learning Libraries: scikit-learn (sklearn)

Natural Language Processing Libraries: NLTK, deep_translator, langdetect, and ftfy

Install the required packages using:

pip install -r requirements.txt
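
The contents of requirements.txt are not reproduced here; based on the libraries listed above, it would look roughly like the following (standard-library modules such as warnings, re, and os need no entry):

```text
pandas
numpy
torch
scikit-learn
transformers
deep_translator
langdetect
ftfy
nltk
```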

Features

  1. Develop and integrate a BERT-based language detection model capable of accurately identifying multiple languages, including low-resource languages, within a single text input.

  2. Utilize TF-IDF vectorization in conjunction with a fine-tuned BERT classifier to enhance the precision and recall of topic categorization in multilingual texts (see the TF-IDF sketch after this list).

  3. Design the system to effectively manage and categorize text that contains multiple languages within the same document, ensuring seamless language transitions and accurate topic detection.

  4. Implement advanced text preprocessing steps to handle encoding issues, remove noise, and standardize input text, improving the overall quality and consistency of the data fed into the models.

  5. Apply efficient training techniques such as gradient accumulation, mixed precision training, and early stopping to reduce computational resource requirements without sacrificing model performance (see the training sketch after this list).
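
A minimal sketch of the TF-IDF side of topic categorization (feature 2), assuming a small illustrative corpus and using LogisticRegression as a stand-in classifier where the project pairs the features with a fine-tuned BERT classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and topic labels, for illustration only
texts = ["The match ended with a late penalty", "A new smartphone chip was announced"]
topics = ["sports", "technology"]

# TF-IDF features feeding a simple classifier; the repository combines these
# features with a fine-tuned BERT classifier instead
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), max_features=5000),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, topics)
print(clf.predict(["A faster laptop processor was released"]))
```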
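
Below is a sketch of the training-efficiency techniques in feature 5 (gradient accumulation, mixed precision, early stopping) using the Hugging Face Trainer; the toy dataset, label count, and hyperparameters are assumptions for illustration, not the repository's exact setup.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

class TinyDataset(torch.utils.data.Dataset):
    """Hypothetical toy dataset standing in for the project's real data."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2)       # label count is illustrative

train_dataset = TinyDataset(["hello world", "bonjour le monde"], [0, 1], tokenizer)
eval_dataset = TinyDataset(["hola mundo"], [1], tokenizer)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,           # effective batch size of 32
    fp16=torch.cuda.is_available(),          # mixed precision when a GPU is present
    eval_strategy="epoch",                   # older transformers: evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,             # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```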

Usage

  1. Set Up the Environment:
  • Clone the repository.
  • Install dependencies using pip install -r requirements.txt.
  2. Run the Application:
  • Run the main script.
  3. Interpret the Results:
  • Review the detected language, extracted topics, and translated output.

Contributing

Contributions are welcome! If you have suggestions, enhancements, or issues, please submit them via GitHub issues.
