Businesses often need to label text data for identification, sorting or strategic purposes. Traditionally, they rely on word matching (building a label dictionary from scratch) or manual labour to label their stored data. Both approaches tend to be resource-heavy and cumbersome to implement.
This project aims to make labelling easier. With this semi-automatic labelling tool, users simply pick from a list of recommended words to form their label dictionary, and the model then expands it into an enriched dictionary. The model uses the enriched dictionary to label the input text dataset.
Data | Use Case |
---|---|
Email messages from suppliers or customers | Archive and store email messages more effectively on the local file system; group suppliers or customers to better understand collaboration partners |
Service tickets for customer complaints | Group customer complaints to identify problematic areas; group service approaches to identify the most effective ones |
Customer feedback for products or services | Identify potential new product categories; group feedback by label to assess the performance of each product label |
You will need the following system setup.
- Install Python 3
- Install pip3 for Windows or Linux
- Install Python virtualenv
- Install git
- Clone the project to your local machine and cd into the project folder
git clone [repository]
- Create a Python virtual environment within the project folder
virtualenv -p python3 env
- Activate your virtual environment
# Linux
source env/bin/activate
# Windows
env\Scripts\activate
- Install python dependencies
pip3 install -r requirements.txt
# or
pip install -r requirements.txt
- Run Jupyter Notebook
jupyter notebook
- Run the following code within Jupyter to install nltk
import nltk
nltk.download('all')
- Walk through the demo in bricks_demo_auto_label.ipynb to gain an intuition of the steps required to operate this auto labelling tool.
- Walk through the sample notebook in bricks_auto_label.ipynb. This notebook allows you to experiment with the labelling tool and evaluate its usefulness for your company.
- bricks_demo_auto_label.ipynb - demo code using the original example, so users can get an intuition for the applications of this auto labelling tool.
- bricks_auto_label.ipynb - base code that lets users experiment with the auto labeller.
It is important to identify your desired keys and labels.
- Manually mix and match keywords to create the dictionary `labels.csv` with the desired categories and a list of keywords for each category.
- The notebook takes in `data/labels.csv` to proceed with the semi-supervised labelling.
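The dictionary file can also be written programmatically. The snippet below is a minimal sketch, assuming `labels.csv` holds one column per category with that category's keywords listed down the column. The categories, keywords and the helper names `write_labels_csv` and `read_labels_csv` are hypothetical and are not part of the notebooks.

```python
import csv
from itertools import zip_longest

# Hypothetical categories and seed keywords, for illustration only.
label_keywords = {
    "finance": ["bank", "stock", "loan"],
    "sports": ["football", "match"],
    "politics": ["election", "minister", "vote"],
}


def write_labels_csv(path, keywords_by_category):
    """Write one column per category, padding shorter columns with blanks."""
    categories = list(keywords_by_category)
    rows = zip_longest(
        *(keywords_by_category[c] for c in categories), fillvalue=""
    )
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(categories)  # header row: category names
        writer.writerows(rows)       # one keyword per cell


def read_labels_csv(path):
    """Read the file back into {category: [keywords]}, skipping blank cells."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        categories = next(reader)
        result = {c: [] for c in categories}
        for row in reader:
            for category, keyword in zip(categories, row):
                if keyword.strip():
                    result[category].append(keyword.strip())
    return result
```

Calling `write_labels_csv("data/labels.csv", label_keywords)` would produce a file in the assumed layout. Check it against the sample `labels.csv` shipped with the project before relying on it.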
The primary function of this model takes an input dataset (news.csv) and labels it using labels.csv. It outputs labelled.csv and, if ground truth is available, score.csv.
- inputs:
  - news.csv - dataset containing the strings to be labelled
    - can contain as many rows as needed (fewer than 10k rows recommended)
    - contains the text to be labelled
  - labels.csv - labels for the identified classifications (e.g. finance, sports, politics)
    - contains one column per category
    - each column contains the keys relating to its category
- outputs:
  - labelled.csv - labelled dataset containing the labels assigned from labels.csv
    - contains the same number of rows as news.csv
  - score.csv - model performance for the labels; only available if you have ground truth for the input
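To make the input/output flow concrete, here is a minimal sketch of dictionary-based labelling. It is not the notebook's implementation: it uses plain keyword counting as a stand-in for the enriched dictionary, and it assumes accuracy as the score, since the actual contents of score.csv are not described here. The function names `label_text` and `accuracy` are hypothetical.

```python
import re
from collections import Counter


def label_text(text, keywords_by_category, default="unlabelled"):
    """Return the category whose keywords occur most often in the text.

    Simplifications: single-word keywords only, case-insensitive matching,
    and ties go to the category listed first in the dictionary.
    """
    if not keywords_by_category:
        return default
    tokens = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    hits = {
        category: sum(tokens[k.lower()] for k in keywords)
        for category, keywords in keywords_by_category.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default


def accuracy(predicted, truth):
    """Fraction of rows where the predicted label equals the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Example usage with made-up rows standing in for news.csv.
labels = {"finance": ["bank", "loan"], "sports": ["football", "match"]}
texts = ["The bank approved the loan", "Football match tonight"]
predicted = [label_text(t, labels) for t in texts]
```

Each input row yields exactly one label, which matches the requirement that labelled.csv has the same number of rows as news.csv.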
Code tested with python 3.5.5 running on Azure Data Science Virtual Machine (Ubuntu 16.04)
Lin Laiyi, Senior AI Apprentice at AI Singapore, NUS MSBA 2017/2018
LinkedIn: https://www.linkedin.com/in/laiyilin/
Portfolio of selected analytics project: https://drive.google.com/file/d/1fVntFEvj6us_6ERzRmbU85EOeZymFxEm/view
Edited by Jway Jin Jun in Aug 2019, AI Engineer at AI Singapore.
Find the original presentation slide [here](https://docs.google.com/presentation/u/1/d/1hQED4ZZqzcwgq6-jgtw3MbRWRPN6CRTOs_zbVQQu_YU/edit#slide=id.p)
This project was edited for the Bricks project to demonstrate and enable AI technologies.
Find the original project [here](https://github.com/lylin17/auto_label)