Vietnamese News Classification Model

This project focuses on building a Vietnamese news classification model. The goal is to classify news articles into different categories based on their content.

Overview

The project consists of several steps: data crawling, preprocessing, feature extraction, and model training.

Data Crawling: News data is crawled from a Vietnamese news website. The crawled data includes Category, Sub-category, Title, Description, and Content of each news article.
Preprocessing: The crawled data is preprocessed to clean and prepare it for further analysis. This includes removing irrelevant information, handling missing values, and applying tokenization to the Vietnamese text using the underthesea library.
Feature Extraction: Various feature extraction techniques are employed to convert the text data into numerical representations. The following techniques are used:
- Bag of Words: Represents each document by the frequencies of the words it contains.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weights each word by how often it appears in a document, discounted by how common it is across the entire corpus.
- PhoBERT: A pre-trained language model built specifically for Vietnamese.
Model Training and Evaluation: Several classification models are trained and evaluated using the extracted features. The following models are used:
- Naive Bayes
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Decision Tree
- Random Forest
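
To make the preprocessing and feature extraction steps above concrete, here is a minimal, dependency-free sketch. It is an illustration under stated assumptions, not the project's actual code: the sample articles are hypothetical, `str.split` on pre-segmented text stands in for the underthesea tokenizer, and the smoothed IDF formula is one common variant that the project's notebooks may or may not use.

```python
import math
import re
from collections import Counter


def clean(text):
    """Strip HTML tags, collapse whitespace, lowercase; None becomes ''."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    return re.sub(r"\s+", " ", text).strip().lower()


def bag_of_words(docs, tokenize=str.split):
    """Return one Counter of token frequencies per document.

    `tokenize` is pluggable; a real word segmenter could be passed instead
    of the whitespace split used here (assumption for this sketch).
    """
    return [Counter(tokenize(clean(doc))) for doc in docs]


def tf_idf(bows):
    """Weight raw counts by a smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(bows)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for bow in bows for term in bow)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: count * idf[t] for t, count in bow.items()} for bow in bows]


# Hypothetical articles, already word-segmented (underscores join syllables).
docs = [
    "<p>đội_tuyển thắng trận bóng_đá</p>",
    "ngân_hàng tăng lãi_suất",
    "đội_tuyển   bóng_đá thua trận",
    None,  # a missing Content field
]
bows = bag_of_words(docs)
weights = tf_idf(bows)
```

Words that occur in fewer documents (here `thắng`) end up with a higher TF-IDF weight than words shared across documents (here `bóng_đá`), which is the effect the TF-IDF description refers to.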

Experiments

Experiments are conducted to evaluate the performance of each combination of feature extraction technique and classification model. Evaluation metrics such as accuracy, precision, recall, and F1-score are calculated to assess the classification results.
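
As a rough illustration of how these metrics are computed for a multi-class problem, the sketch below derives accuracy and macro-averaged precision, recall, and F1 from a list of true and predicted labels. The labels are hypothetical, and macro averaging is an assumption; the project may average differently (for example, weighted by class size).

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # `pred` was predicted but is wrong
            fn[true] += 1  # the true class was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical category labels, not results from the project.
y_true = ["sports", "sports", "business", "business", "world"]
y_pred = ["sports", "business", "business", "business", "world"]
scores = macro_scores(y_true, y_pred)
```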

In these experiments, Logistic Regression with Bag of Words and SVM with TF-IDF achieved the best classification results on the Vietnamese news data.
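
The two best-performing combinations could be assembled with scikit-learn roughly as follows. This is a sketch under assumptions, not the project's training code: it assumes scikit-learn is installed, the texts and category labels are hypothetical, the inputs are taken to be word-segmented already, and the linear SVM kernel and `max_iter` value are illustrative choices that the source does not specify.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical, pre-segmented training texts (not the project's data).
texts = [
    "đội_tuyển thắng trận bóng_đá",
    "cầu_thủ ghi bàn trận chung_kết",
    "bóng_đá việt_nam vô_địch",
    "ngân_hàng tăng lãi_suất",
    "thị_trường chứng_khoán giảm điểm",
    "doanh_nghiệp vay vốn ngân_hàng",
]
labels = ["Thể thao"] * 3 + ["Kinh doanh"] * 3

# Split on whitespace so underscore-joined words stay single tokens.
bow_lr = make_pipeline(
    CountVectorizer(tokenizer=str.split, token_pattern=None),
    LogisticRegression(max_iter=1000),
)
tfidf_svm = make_pipeline(
    TfidfVectorizer(tokenizer=str.split, token_pattern=None),
    SVC(kernel="linear"),  # kernel choice is an assumption
)

bow_lr.fit(texts, labels)
tfidf_svm.fit(texts, labels)

new_article = ["đội_tuyển ghi bàn trong trận bóng_đá"]
pred_lr = bow_lr.predict(new_article)[0]
pred_svm = tfidf_svm.predict(new_article)[0]
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on training data only, so the same object can be reused to classify unseen articles.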

Prerequisites

Python [version]
Required libraries and dependencies

Installation

Clone the repository:
```
git clone [repository URL]
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Prepare the dataset: [Provide instructions on how to prepare the dataset for training and evaluation]
Feature extraction: [Describe how to perform feature extraction using the chosen technique]
Model training: [Provide steps to train the chosen models]
Evaluation: [Explain how to evaluate the trained models]

Examples

[Provide example usage or code snippets]

Web Demo

Open a terminal and change into the demo directory:
```
cd ./web_demo
```
Run the demo website:
```
streamlit run web.py
```
Usage: [Screen Capture]

Contributing

Contributions to the project are welcome. If you would like to contribute, please follow the guidelines outlined in [CONTRIBUTING.md].

License

[License information]

Contact

For any inquiries or questions, please contact [EMAIL ADDRESS].

huuhieunguyen/categorize-crawled-vietnamese-news
