
RepoLingo - 'Do You Read Me?'

Table of Contents

  • Overview
  • Dataset
  • Setup
  • Data Preprocessing
  • Model Selection and Training
  • Results
  • Future Work
  • Acknowledgements

Overview

Project Description:

Using web-scraping techniques, we collect the READMEs of GitHub NLP repositories whose code is predominantly Python or HTML, then build a classification model that predicts the predominant coding language of each repository. This matters because it tests whether patterns of vocabulary usage in a README tend to signal the repository's predominant coding language.

Project Goals:

Identify vocabulary patterns within README files from GitHub NLP repositories that distinguish Python from HTML, and build classification models to determine whether there is a pattern of vocabulary usage unique to each language.

Initial Hypothesis:

Since HTML and Python use distinct syntax and keywords, README files should contain unique terminology that can identify the predominant coding language of a repository.

Initial Questions:

  1. Is there unique terminology used more frequently for Python and HTML?
  2. Does one programming language have a higher sentiment score on average than the other?
  3. Are there 2-word combinations that are used more in Python than HTML and vice versa?
  4. Do hyperlinks (text beginning with 'http') occur more frequently in Python or in HTML READMEs?
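Question 3 above can be explored with simple bigram counts. A minimal standard-library sketch, using hypothetical token lists in place of the actual cleaned README data:

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent 2-word combinations from a token list."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical cleaned README tokens for each language class
python_tokens = "pip install package import module pip install".split()
html_tokens = "web page style sheet web page".split()

python_counts = Counter(bigrams(python_tokens))
html_counts = Counter(bigrams(html_tokens))

python_counts.most_common(1)  # -> [(("pip", "install"), 2)]
```

Comparing the most common bigrams of each class side by side highlights 2-word combinations that appear in one language's READMEs but not the other's.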

Dataset

Description:

Web-scraped README contents from 500 NLP-related GitHub repositories, along with a variable describing the predominant coding language (Python or HTML) of each repository. That predominant coding language is the target variable.

Data Dictionary:

| Feature Name | Data Type | Description | Example |
| --- | --- | --- | --- |
| repo | object | Name of repository | 'huggingface/transformers' |
| language | object | Predominant coding language of repository | 'Python' |
| readme_contents | object | Contents of the repository's README file | 'Transformers provides thousands of pretrained...' |
| cleaned_readme_contents | object | Cleaned contents of the repository's README file | 'transformers provides thousands pretrained...' |

Setup

Instructions to Reproduce:

  • IF YOU WANT TO SCRAPE YOUR OWN DATA...
  1. Clone this repository

  2. Generate a Github Token

  3. Create 'env.py' file with:

    • github_username = YOUR GITHUB USERNAME
    • github_token = TOKEN URL
  4. Run desired files

  • IF YOU WANT TO USE OUR DATA
  1. Clone this repository
  2. Run desired files
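For the scraping path, the env.py credentials are typically passed to the GitHub REST API, whose readme endpoint returns base64-encoded contents. A hedged sketch of that call (the function names and placeholder credentials are illustrative, not the project's actual code):

```python
import base64

import requests

# Placeholders standing in for the values imported from env.py
github_username = "YOUR_GITHUB_USERNAME"
github_token = "YOUR_GITHUB_TOKEN"

def readme_url(repo):
    """Build the GitHub API endpoint for a repository's README."""
    return f"https://api.github.com/repos/{repo}/readme"

def auth_headers(username, token):
    """GitHub accepts token auth via the Authorization header."""
    return {"Authorization": f"token {token}", "User-Agent": username}

def fetch_readme(repo):
    """Download and decode a repository's README via the GitHub API."""
    resp = requests.get(readme_url(repo),
                        headers=auth_headers(github_username, github_token))
    resp.raise_for_status()
    return base64.b64decode(resp.json()["content"]).decode("utf-8")
```

Calling `fetch_readme('huggingface/transformers')` with valid credentials would return that repository's README text.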
| Python Library | Version | Usage |
| --- | --- | --- |
| numpy | 1.21.5 | Vectorization |
| pandas | 1.4.4 | Dataframing |
| matplotlib | 3.5.2 | Visualization |
| seaborn | 0.11.2 | Visualization |
| wordcloud | 1.9.1.1 | Visualization |
| bs4 | 4.11.1 | NLP |
| requests | 2.28.1 | NLP |
| regex | 2022.7.9 | NLP |
| nltk | 3.7 | NLP |
| unicodedata | standard library | NLP |
| sklearn | 1.0.2 | Stats, Metrics, Modeling |

Data Preprocessing

Missing Value Handling:

No significant missing values were found.

NLP Methodology:

  1. Clean text of contents
  2. Tokenize cleaned text
  3. Lemmatize tokenized data
  4. Remove stop-words (including the predominant coding language names) from the lemmatized data
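The steps above can be sketched as follows. To stay self-contained, this version uses only the standard library (re and unicodedata) and a tiny illustrative stop-word list; the project itself uses nltk for tokenizing, lemmatizing, and its English stop-word list, so treat this as a simplified stand-in that omits the lemmatization step:

```python
import re
import unicodedata

# Small illustrative stop-word list; an assumption standing in for nltk's
# English list plus the predominant coding language names.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "python", "html"}

def basic_clean(text):
    """Lowercase, normalize unicode to ASCII, and strip special characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9'\s]", "", text)

def tokenize(text):
    """Naive whitespace tokenizer (the project uses nltk's tokenizer)."""
    return text.split()

def remove_stopwords(tokens, extra=()):
    """Drop stop-words, plus any extra words such as the language names."""
    drop = STOPWORDS | set(extra)
    return [t for t in tokens if t not in drop]

tokens = remove_stopwords(tokenize(basic_clean("The Python API, in short!")))
# tokens -> ["api", "short"]
```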

Modeling Specific:

  1. Count Vectorizer (CV)

    • With or without ngram_range=(#, #)
  2. Term Frequency - Inverse Document Frequency (TF-IDF)

    • With or without ngram_range=(#, #)

Model Selection and Training

Classification Models:

  • DecisionTreeClassifier()
  • RandomForestClassifier()
  • KNeighborsClassifier()
  • LogisticRegression()

Training Procedure:

  1. Split into train, validate, and test sets
  2. Using features selected, fit and transform on training data
  3. Evaluate train and validate scores to determine best model
  4. Run best model on the test set and evaluate results
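A condensed sketch of the procedure using one candidate model (the toy documents, split ratios, and random_state are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned README contents and labels standing in for the scraped data
docs = ["import numpy pandas model"] * 6 + ["div span css page layout"] * 6
labels = ["Python"] * 6 + ["HTML"] * 6

# 1. Split into train, validate, and test sets (60/20/20 here, an assumption)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

# 2. Fit the vectorizer on training data only, then transform each split
tfidf = TfidfVectorizer()
X_train_v = tfidf.fit_transform(X_train)
X_val_v = tfidf.transform(X_val)
X_test_v = tfidf.transform(X_test)

# 3. Fit a candidate model and compare train vs. validate accuracy
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_v, y_train)
train_acc = model.score(X_train_v, y_train)
val_acc = model.score(X_val_v, y_val)

# 4. Only the best model by validate score is then run on the test set
```

Fitting the vectorizer on the training split alone keeps vocabulary from the validate and test sets from leaking into the features.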

Model Evaluation Metric:

  • Accuracy
  • Since we care about overall performance rather than correctly predicting one specific class (Python or HTML), we evaluate models on their accuracy scores

Results

  • Our best model was a Decision Tree model with an accuracy of 80% on our test dataset, 20 percentage points higher than our baseline model.

Future Work

  • If given more time, we would filter out more HTML tags from the README text as part of the preparation steps
  • We would also rerun our exploration of the data using an equal number of Python and HTML repositories to see whether the findings change

Acknowledgements

  • The README files used for this project were gathered from https://github.com/topics/nlp. Use of READMEs in this project does not imply ownership of any of the repository information used.
  • This project was created by Rob Casey, Adam Harris and Jared Wood as a part of the Codeup curriculum.