Chatbots and other NLP applications: Topic Modeling on text from Codechef and OkCupid


Named-entity recognition to extract a person's name and label documents

OCR and document labelling for medical files

NER - auto document labelling

Chatbots using NLP approaches from NLPIA, Lane, et al.

The nlpia package is found at https://github.com/totalgood/nlpia.

codechef-NLP and MORE: OkCupid Dating Profiles Analysis, and Book Summaries

- 9/13/19: Added a production-style development workflow using Docker and Kubernetes.

- I've also added a notebook on OkCupid dating-profile analysis. In it, I apply topic modeling to OkCupid bios, using data obtained through an API. The algorithm clusters men and women into groups based on what they write in their profile descriptions.

- There's also a notebook applying topic modeling to book summaries of works of fiction.

Natural language processing and machine learning experiments on problem statements and solutions from the Codechef competitive programming site.

How might we train a model to predict difficulty level using the question statement?

We start with EDA and cleaning. We have tables of questions, solutions, and submitted code snippets. From the code submissions, we count the number of correct and incorrect answers and compute the success ratio grouped by user and by question.
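Conceptually, the success ratios are simple groupby aggregations. Here is a minimal sketch, assuming a flat submissions table with hypothetical file and column names (`solutions.csv`, `Status`, `UserID`, `QCode`):

```python
import pandas as pd

# Hypothetical file and column names -- the real submissions table may differ.
subs = pd.read_csv("solutions.csv")                  # one row per submitted solution
subs["is_correct"] = (subs["Status"] == "accepted").astype(int)

# Success ratio per question: correct submissions / total submissions.
question_stats = (
    subs.groupby("QCode")["is_correct"]
        .agg(total="count", correct="sum")
        .assign(question_success_ratio=lambda d: d["correct"] / d["total"])
)

# The same aggregation grouped by user.
user_stats = (
    subs.groupby("UserID")["is_correct"]
        .agg(total="count", correct="sum")
        .assign(user_success_ratio=lambda d: d["correct"] / d["total"])
)
```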

The first idea to explore is whether the golden labels provided by the site ("easy", "medium", "hard", etc.) reflect the actual success ratios computed from users' submissions. For an interesting visualization, we apply K-Means clustering to the questions based on their question_success_ratio, with some influence from their assigned labels (I added random noise to the numeric difficulty labels). Practically, this clustering could indicate whether questions should have their difficulty levels re-labeled to reflect actual success-ratio statistics.
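A minimal sketch of the jittered-label clustering, assuming a `questions` frame with a numeric `difficulty_code` column (a hypothetical encoding of the site's labels) and the `question_success_ratio` from the previous step:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical setup: difficulty_code is 0 = easy, 1 = medium, 2 = hard, etc.
rng = np.random.default_rng(0)
noisy_label = questions["difficulty_code"] + rng.normal(0, 0.25, len(questions))

# Two features per question: observed success ratio and the jittered golden label.
X = np.column_stack([questions["question_success_ratio"], noisy_label])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
questions["cluster"] = km.fit_predict(X)
```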

Points were clustered into 3 groups:

- The largest cluster consists of questions with a high success ratio, irrespective of the golden labels from the dataset; these might be considered the new "easy" questions.
- The second-largest cluster contains questions labeled easy that nevertheless have a low success ratio; these might be considered the new "medium".
- The smallest cluster contains questions labeled hard that have a low success ratio; these are the true "hard" ones.

For more difficulty tiers, we could increase the number of centroids. We demonstrate how to compute the silhouette score to choose a different value for k, and then cluster the questions into 6 clusters. This level scheme could help users with a more gradual progression in difficulty.
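The silhouette-based choice of k can be sketched as follows, reusing the feature matrix `X` from the clustering sketch above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try several candidate values of k and keep the one with the best silhouette score.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "silhouette:", scores[best_k])
```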

Next, we load the datasets containing code snippets of submitted solutions and try to classify each snippet as either Python or C++. We train a classification model on the solutions dataset: given a sample solution, assign the label "C++11" or "Python" based on the text content of the solution. This is a standard binary classification task. We use a bag-of-words representation: tokenization, where we split each document into words; vocabulary building, where we collect a vocabulary of all words that appear in any of the documents and number them; and encoding, where we count how often each word in the vocabulary appears in each document. The shape of X_train, the bag-of-words representation of the training data, is 31772 x 38672, indicating that the vocabulary contains 38,672 entries. The data is stored in a SciPy sparse matrix. We train a logistic regression model for document classification and use cross-validation to estimate accuracy.

On the training data, with default parameters: mean cross-validation accuracy: 0.9959705463307194

Best cross-validation score: 0.9964112573191463
Best parameters:  {'C': 0.1}
We used a simple grid search with cross-validation to tune the regularization parameter C.

On the test data: mean cross-validation accuracy: 0.9950892610105836. Hence, it was easy to predict whether a code snippet is Python or C++ using CountVectorizer and LogisticRegression. The result is as expected, since each language has distinctive keywords that essentially determine the label.
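The classification pipeline is the standard scikit-learn one; a minimal sketch, assuming `train_texts`/`test_texts` are lists of solution source strings and `y_train`/`y_test` are the "C++11"/"Python" labels (the variable names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Bag-of-words encoding of the solution text.
vect = CountVectorizer()
X_train = vect.fit_transform(train_texts)        # SciPy sparse matrix
X_test = vect.transform(test_texts)

# Cross-validated accuracy with default parameters.
logreg = LogisticRegression(max_iter=1000)
print("Mean CV accuracy:", cross_val_score(logreg, X_train, y_train, cv=5).mean())

# Small grid search over the regularization strength C.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.001, 0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print("Best CV score:", grid.best_score_)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```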

Can we cluster question statements into discrete topics?
Next, we experiment with some topic modeling. Unfortunately, our dataset does not include question category labels, e.g. "trees", "graphs", "strings", etc. Fortunately, we can apply topic modeling to the question statements and obtain question categories with an unsupervised machine learning algorithm. We start with topic modeling via Non-negative Matrix Factorization (NMF). We use the CountVectorizer from scikit-learn to turn the content of the question statements into the document-term matrix A, apply NMF with SVD-based initialization to A to generate topics, and then get the W and H factors from the resulting model. We get the top-ranked terms for each topic by sorting the values in the rows of the H factor. We do the same with TF-IDF, whose results I thought were the best. If we think of topics as groups of words that commonly appear together in the same statement, the NMF model generates results closest to what we'd expect. Here's a subsection of the 15 topics:

...

  • Topic 2: string, strings, substring, length, characters, character, substrings, letters, palindrome, latin
  • Topic 3: cell, cells, grid, row, column, color, rows, columns, board, right
  • Topic 4: node, nodes, tree, path, trees, edge, root, distance, associated, cost ...

These clusters help us understand how questions fall into categories. For instance, questions about trees (Topic 4) tend to have an intuitive set of associated words. We count the number of question statements primarily belonging to each of the above topics and plot a histogram. We also use LatentDirichletAllocation, which resulted in more specific keywords.
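A sketch of the TF-IDF + NMF pipeline described above, assuming `statements` is a list of question-statement strings (and scikit-learn >= 1.0 for `get_feature_names_out`):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Document-term matrix A built from the question statements.
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
A = tfidf.fit_transform(statements)

# NMF with SVD-based ("nndsvd") initialization: A is approximated by W @ H.
nmf = NMF(n_components=15, init="nndsvd", random_state=0)
W = nmf.fit_transform(A)          # document-topic weights
H = nmf.components_               # topic-term weights

# Top-ranked terms per topic, from the rows of H.
terms = tfidf.get_feature_names_out()
for t, row in enumerate(H):
    top = row.argsort()[::-1][:10]
    print(f"Topic {t}:", ", ".join(terms[i] for i in top))

# Primary topic per statement, used for the histogram of topic sizes.
primary_topic = W.argmax(axis=1)
```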

As an experiment, let's create topic models on the code snippets themselves. Could we guess which question the code is answering? Since code has more structure than English and we're relying on a bag-of-words assumption, the results would probably be nonsensical, but let's try it. We'll use NMF with TF-IDF, since it gave the most impressive results above.

...

  • Topic 2: rectangles, got, present, size, light, recognition, www, y1, st, _gaq
  • Topic 3: submit, special, removed, girl, light, got, replace, present, www, similarly ...

I thought the results were nonsensical but it was an interesting experiment.

Next, we try to predict, given the statement, whether a question is "easy" or "medium", using Naïve Bayes as the classifier. The classifier uses word counts as features and makes a classification decision about the level of difficulty.

Results: There are 20,267 words in the vocabulary. 'tree' appears 18,822 times. The accuracy on the test data is 45.76%

The accuracy is below baseline, which suggests, as we suspected, that CountVectorizer features built from the question statements are not strongly correlated with difficulty level. Nevertheless, we calculate accuracy, recall, precision, and F1 score, and show a confusion matrix; this evaluation gave 63% accuracy on the test data. Next we perform 10-fold cross-validation on the training and test data; the average accuracy is about 40%, which is worse than baseline.
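A sketch of the Naïve Bayes classifier and its evaluation, assuming `em_statements`/`em_labels` hold only the statements labeled "easy" or "medium" (hypothetical variable names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

text_train, text_test, y_train, y_test = train_test_split(
    em_statements, em_labels, random_state=0)

# Word-count features.
vect = CountVectorizer()
X_train = vect.fit_transform(text_train)
X_test = vect.transform(text_test)

nb = MultinomialNB().fit(X_train, y_train)
pred = nb.predict(X_test)
print(classification_report(y_test, pred))       # accuracy, precision, recall, F1
print(confusion_matrix(y_test, pred))

# 10-fold cross-validation over the full easy/medium subset.
X_all = CountVectorizer().fit_transform(em_statements)
print(cross_val_score(MultinomialNB(), X_all, em_labels, cv=10).mean())
```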

Next, we run TextBlob's pretrained sentiment tool on the entire corpus of question statements. We got Sentiment(polarity=0.0662212856306699, subjectivity=0.4231108107510163), so the question statements overall have roughly neutral sentiment. As a reference point, the string "have a great day" has a sentiment polarity of 0.8.
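For reference, the TextBlob call is essentially a one-liner; a sketch, reusing the `statements` list from the earlier sketches:

```python
from textblob import TextBlob

# Sentiment of the whole corpus, treated as one long document.
corpus = " ".join(statements)
print(TextBlob(corpus).sentiment)                 # Sentiment(polarity=..., subjectivity=...)

# Reference point mentioned above.
print(TextBlob("have a great day").sentiment.polarity)
```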

Finally, we train various classifiers to predict success (whether a submitted solution is correct or incorrect). As an exercise, we try to predict Success based on TimeTaken, MemTaken, and question_success_ratio. We do some data cleaning and manipulation to prepare the data for the task. First, we use kernelized support vector machines, an extension of SVMs that allows for more complex decision boundaries than simple hyperplanes in the input space.

Keep in mind that we're using each question's success ratio to predict success. The only other features we include are TimeTaken and MemorySpace.

Hence, we're using past/aggregate performance and adding in a few extra variables to predict success.

Since the numbers of successful and unsuccessful submissions are about even, we don't need to correct for class imbalance. It also means that an accuracy score of 75% should be compared to a 50% baseline.
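A sketch of these models, assuming a prepared frame `df` with the hypothetical columns TimeTaken, MemTaken, question_success_ratio, and the binary target Success:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

features = df[["TimeTaken", "MemTaken", "question_success_ratio"]]
X_train, X_test, y_train, y_test = train_test_split(
    features, df["Success"], random_state=0)

# Kernelized (RBF) SVM; feature scaling matters for SVMs.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf").fit(scaler.transform(X_train), y_train)
print("SVM test score:", svm.score(scaler.transform(X_test), y_test))

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Random forest test score:", rf.score(X_test, y_test))

gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("Gradient boosting test score:", gb.score(X_test, y_test))

# Initial grid search over C and gamma for the SVM.
grid = GridSearchCV(SVC(), {"C": [0.01, 1, 100], "gamma": [0.01, 1, 100]}, cv=5)
grid.fit(scaler.transform(X_train), y_train)
print("Best parameters:", grid.best_params_)
print("Test score with best parameters:", grid.score(scaler.transform(X_test), y_test))
```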

Here were the results:

- SVM: train score 0.75, test score 0.75
- Random Forest: training accuracy 0.895, test accuracy 0.853
- Gradient Boosting: training accuracy 0.802, test accuracy 0.807
- SVM with initial grid search: best validation score 0.82, best parameters {'C': 100, 'gamma': 100}, test score with best parameters 0.82

Using the parameters we tried, the Random Forest Classifier outperformed the other models.
