
RepoLingo - 'Do You Read Me?'

Table of Contents

  • Overview
  • Dataset
  • Setup
  • Data Preprocessing
  • Model Selection and Training
  • Results
  • Future Work
  • Acknowledgements

Overview

Project Description:

Using web-scraping techniques, we collect the READMEs of GitHub NLP repositories whose code is predominantly Python or HTML, then build a classification model that predicts the predominant coding language of each repository. This matters because it tests whether patterns of vocabulary usage in a README tend to signal the repository's predominant coding language.

Project Goals:

Identify vocabulary patterns within README files from GitHub NLP repositories that distinguish Python from HTML, and build classification models to determine whether there is a pattern of vocabulary usage unique to each language.

Initial Hypothesis:

Since HTML and Python use distinct syntax and keywords, README files should contain unique terminology that can identify the predominant coding language of a repository.

Initial Questions:

  1. Is there unique terminology used more frequently for Python and HTML?
  2. Does one programming language have a higher sentiment score on average than the other?
  3. Are there 2-word combinations that are used more in Python than HTML and vice versa?
  4. Do hyperlinks (text beginning with 'http') occur more frequently in Python or in HTML READMEs?
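Question 3 above can be explored with simple bigram counts. A minimal standard-library sketch, using hypothetical token lists in place of the actual cleaned README data:

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent 2-word combinations from a token list."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical cleaned README tokens for each language class
python_tokens = "pip install package import module pip install".split()
html_tokens = "web page style sheet web page".split()

python_counts = Counter(bigrams(python_tokens))
html_counts = Counter(bigrams(html_tokens))

python_counts.most_common(1)  # -> [(("pip", "install"), 2)]
```

Comparing the most common bigrams of each class side by side highlights 2-word combinations that appear in one language's READMEs but not the other's.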

Dataset

Description:

Web-scraped README contents from 500 NLP-related GitHub repositories, along with a variable describing the predominant coding language (Python or HTML) of each repository. That predominant coding language is the target variable.

Data Dictionary:

| Feature Name | Data Type | Description | Example |
| --- | --- | --- | --- |
| repo | object | Name of repository | 'huggingface/transformers' |
| language | object | Predominant coding language of repository | 'Python' |
| readme_contents | object | Contents of the repository's README file | 'Transformers provides thousands of pretrained...' |
| cleaned_readme_contents | object | Cleaned contents of the repository's README file | 'transformers provides thousands pretrained...' |

Setup

Instructions to Reproduce:

  • IF YOU WANT TO SCRAPE YOUR OWN DATA...
  1. Clone this repository

  2. Generate a Github Token

  3. Create 'env.py' file with:

    • github_username = YOUR GITHUB USERNAME
    • github_token = TOKEN URL
  4. Run desired files

  • IF YOU WANT TO USE OUR DATA
  1. Clone this repository
  2. Run desired files
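For the scraping path, the env.py credentials are typically passed to the GitHub REST API, whose readme endpoint returns base64-encoded contents. A hedged sketch of that call (the function names and placeholder credentials are illustrative, not the project's actual code):

```python
import base64

import requests

# Placeholders standing in for the values imported from env.py
github_username = "YOUR_GITHUB_USERNAME"
github_token = "YOUR_GITHUB_TOKEN"

def readme_url(repo):
    """Build the GitHub API endpoint for a repository's README."""
    return f"https://api.github.com/repos/{repo}/readme"

def auth_headers(username, token):
    """GitHub accepts token auth via the Authorization header."""
    return {"Authorization": f"token {token}", "User-Agent": username}

def fetch_readme(repo):
    """Download and decode a repository's README via the GitHub API."""
    resp = requests.get(readme_url(repo),
                        headers=auth_headers(github_username, github_token))
    resp.raise_for_status()
    return base64.b64decode(resp.json()["content"]).decode("utf-8")
```

Calling `fetch_readme('huggingface/transformers')` with valid credentials would return that repository's README text.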
| Python Library | Version | Usage |
| --- | --- | --- |
| numpy | 1.21.5 | Vectorization |
| pandas | 1.4.4 | Dataframing |
| matplotlib | 3.5.2 | Visualization |
| seaborn | 0.11.2 | Visualization |
| wordcloud | 1.9.1.1 | Visualization |
| bs4 | 4.11.1 | NLP |
| requests | 2.28.1 | NLP |
| regex | 2022.7.9 | NLP |
| nltk | 3.7 | NLP |
| unicodedata | standard library | NLP |
| sklearn | 1.0.2 | Stats, Metrics, Modeling |

Data Preprocessing

Missing Value Handling:

No significant missing values were found.

NLP Methodology:

  1. Clean text of contents
  2. Tokenize cleaned text
  3. Lemmatize tokenized data
  4. Remove stop-words (including the predominant coding language names) from the lemmatized data
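The steps above can be sketched as follows. To stay self-contained, this version uses only the standard library (re and unicodedata) and a tiny illustrative stop-word list; the project itself uses nltk for tokenizing, lemmatizing, and its English stop-word list, so treat this as a simplified stand-in that omits the lemmatization step:

```python
import re
import unicodedata

# Small illustrative stop-word list; an assumption standing in for nltk's
# English list plus the predominant coding language names.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "python", "html"}

def basic_clean(text):
    """Lowercase, normalize unicode to ASCII, and strip special characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9'\s]", "", text)

def tokenize(text):
    """Naive whitespace tokenizer (the project uses nltk's tokenizer)."""
    return text.split()

def remove_stopwords(tokens, extra=()):
    """Drop stop-words, plus any extra words such as the language names."""
    drop = STOPWORDS | set(extra)
    return [t for t in tokens if t not in drop]

tokens = remove_stopwords(tokenize(basic_clean("The Python API, in short!")))
# tokens -> ["api", "short"]
```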

Modeling Specific:

  1. Count Vectorizer (CV)

    • With or without ngram_range=(#, #)
  2. Term Frequency - Inverse Document Frequency (TF-IDF)

    • With or without ngram_range=(#, #)

Model Selection and Training

Classification Models:

  • DecisionTreeClassifier()
  • RandomForestClassifier()
  • KNeighborsClassifier()
  • LogisticRegression()

Training Procedure:

  1. Split into train, validate, and test sets
  2. Using features selected, fit and transform on training data
  3. Evaluate train and validate scores to determine best model
  4. Run best model on the test set and evaluate results
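A condensed sketch of the procedure using one candidate model (the toy documents, split ratios, and random_state are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned README contents and labels standing in for the scraped data
docs = ["import numpy pandas model"] * 6 + ["div span css page layout"] * 6
labels = ["Python"] * 6 + ["HTML"] * 6

# 1. Split into train, validate, and test sets (60/20/20 here, an assumption)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

# 2. Fit the vectorizer on training data only, then transform each split
tfidf = TfidfVectorizer()
X_train_v = tfidf.fit_transform(X_train)
X_val_v = tfidf.transform(X_val)
X_test_v = tfidf.transform(X_test)

# 3. Fit a candidate model and compare train vs. validate accuracy
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_v, y_train)
train_acc = model.score(X_train_v, y_train)
val_acc = model.score(X_val_v, y_val)

# 4. Only the best model by validate score is then run on the test set
```

Fitting the vectorizer on the training split alone keeps vocabulary from the validate and test sets from leaking into the features.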

Model Evaluation Metric:

  • Accuracy
  • Since we care about overall performance rather than correctly predicting one specific class (Python or HTML), we evaluate models on their accuracy scores

Results

  • Our best model was a Decision Tree model with an accuracy of 80% on our test dataset, 20 percentage points higher than our baseline model.

Future Work

  • If given more time, we would filter out more HTML tags from the README text as part of the preparation steps
  • We would also rerun our exploration of the data using an equal number of Python and HTML repositories to see whether the findings change

Acknowledgements

  • The README files used for this project were gathered from https://github.com/topics/nlp. Use of READMEs in this project does not imply ownership of any of the repository information used.
  • This project was created by Rob Casey, Adam Harris and Jared Wood as a part of the Codeup curriculum.