- Overview
- Dataset
- Setup
- Data Preprocessing
- Model Selection and Training
- Results
- Future Work
- Acknowledgements
## Overview

**Project Description:**

Using web-scraping techniques, we collect the README files of GitHub NLP repositories whose code is predominantly Python or HTML. From these, we build a classification model that accurately predicts the predominant coding language used within each repository. The aim is to determine whether patterns of vocabulary usage in a README signal the repository's predominant coding language.
**Project Goals:**

Identify vocabulary patterns within README files that distinguish Python repositories from HTML repositories among GitHub NLP repositories, and build classification models to determine whether there is a pattern of vocabulary usage unique to Python and unique to HTML.
**Initial Hypothesis:**

Because HTML and Python use distinct syntax and conventions, README files should contain terminology unique to each language that can identify a repository's predominant coding language.
**Initial Questions:**

- Is there unique terminology used more frequently in Python READMEs than in HTML READMEs, and vice versa?
- Does one programming language have a higher average sentiment score than the other?
- Are there two-word combinations (bigrams) used more in Python than in HTML, and vice versa?
- Do hyperlinks (text beginning with 'http') occur more frequently in Python or in HTML READMEs?
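As a rough illustration of how the bigram and hyperlink questions can be explored, here is a minimal sketch. The DataFrame `df` follows the data dictionary below; the helper variables are hypothetical, not names from the project source.

```python
# Exploratory sketch for the bigram and hyperlink questions.
import pandas as pd
import nltk

# Pool all cleaned README words for each language
python_words = ' '.join(df[df.language == 'Python'].cleaned_readme_contents).split()
html_words = ' '.join(df[df.language == 'HTML'].cleaned_readme_contents).split()

# Most common two-word combinations for each language
python_bigrams = pd.Series(list(nltk.bigrams(python_words))).value_counts().head(10)
html_bigrams = pd.Series(list(nltk.bigrams(html_words))).value_counts().head(10)

# How often tokens beginning with 'http' appear in each language's READMEs
python_links = sum(w.startswith('http') for w in python_words)
html_links = sum(w.startswith('http') for w in html_words)
```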
## Dataset

**Description:**

Web-scraped README contents from 500 NLP-related GitHub repositories, along with a variable describing the predominant coding language (Python or HTML) of each repository. The predominant coding language is the target variable.
**Data Dictionary:**

| Feature Name | Data Type | Description | Example |
|---|---|---|---|
| repo | object | Name of repository | 'huggingface/transformers' |
| language | object | Predominant coding language of repository | 'Python' |
| readme_contents | object | Contents of repository's README file | 'Transformers provides thousands of pretrained...' |
| cleaned_readme_contents | object | Cleaned contents of repository's README file | 'transformers provides thousands pretrained...' |
## Setup

**Instructions to Reproduce:**

If you want to scrape your own data:

1. Clone this repository.
2. Generate a GitHub token:
   - Go here: https://github.com/settings/tokens
   - Click 'Generate new token (classic)'
   - Do NOT check any of the scope boxes
   - Copy the generated token
3. Create an `env.py` file containing your GitHub username and token (see the sketch after this list).
4. Run the desired files.

If you want to use our data:

1. Clone this repository.
2. Run the desired files.
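A minimal sketch of what `env.py` might look like; the values shown are placeholders, not real credentials:

```python
# env.py -- placeholder values; substitute your own GitHub credentials
github_username = 'your-github-username'
github_token = 'ghp_your_generated_token_here'
```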
**Required Libraries:**

| Python Library | Version | Usage |
|---|---|---|
| numpy | 1.21.5 | Vectorized numerical operations |
| pandas | 1.4.4 | DataFrame manipulation |
| matplotlib | 3.5.2 | Visualization |
| seaborn | 0.11.2 | Visualization |
| wordcloud | 1.9.1.1 | Visualization |
| bs4 | 4.11.1 | Web scraping (HTML parsing) |
| requests | 2.28.1 | Web scraping (HTTP requests) |
| regex | 2022.7.9 | Text cleaning |
| nltk | 3.7 | NLP (tokenizing, lemmatizing, stop words) |
| unicodedata | standard library | Text normalization |
| sklearn | 1.0.2 | Stats, metrics, modeling |
## Data Preprocessing

**Missing Value Handling:**

No missing values of significance were found.
**NLP Methodology:**

- Clean the text of the README contents
- Tokenize the cleaned text
- Lemmatize the tokenized data
- Remove stop words from the lemmatized data, including the names of the predominant coding languages themselves so they cannot leak the target (see the sketch after this list)
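A minimal sketch of this cleaning pipeline, assuming a single README string as input. The function name is illustrative, and the choice of `ToktokTokenizer` is an assumption; any nltk tokenizer would fit the same pattern.

```python
# Requires: nltk.download('stopwords'); nltk.download('wordnet')
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

def clean_readme(text):
    # Normalize unicode, drop non-ASCII, lowercase, strip special characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    text = re.sub(r"[^a-z0-9'\s]", '', text.lower())
    # Tokenize the cleaned text
    tokens = ToktokTokenizer().tokenize(text)
    # Lemmatize each token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    # Remove stop words, plus the target language names so they cannot leak
    stops = set(stopwords.words('english')) | {'python', 'html'}
    return ' '.join(t for t in lemmas if t not in stops)
```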
## Model Selection and Training

**Vectorization:**

- Count Vectorizer (CV)
  - With or without `ngram_range=(#, #)`
- Term Frequency-Inverse Document Frequency (TF-IDF)
  - With or without `ngram_range=(#, #)`
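An illustrative setup for both vectorizers; the `ngram_range` values here are assumptions, not the exact settings used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer(ngram_range=(1, 2))       # raw counts, unigrams + bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2))    # counts reweighted by rarity

# docs is a hypothetical list of cleaned README strings
docs = ['transformers provides thousands pretrained', 'simple html page template']
X_counts = cv.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)
```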
**Classification Models:**

- DecisionTreeClassifier()
- RandomForestClassifier()
- KNeighborsClassifier()
- LogisticRegression()
**Training Procedure:**

- Split the data into train, validate, and test sets
- Using the selected features, fit the vectorizer on the training data and transform each split
- Compare train and validate scores to determine the best model
- Run the best model on the test set and evaluate the results (see the sketch after this list)
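A condensed sketch of this procedure, assuming a pandas DataFrame `df` with the columns from the data dictionary. The split sizes, random states, and model hyperparameters are illustrative, not the project's exact settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split into train (60%), validate (20%), and test (20%) sets
train_val, test = train_test_split(df, test_size=0.2, stratify=df.language, random_state=123)
train, validate = train_test_split(train_val, test_size=0.25, stratify=train_val.language, random_state=123)

# Fit the vectorizer on training data only, then transform each split
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_train = tfidf.fit_transform(train.cleaned_readme_contents)
X_val = tfidf.transform(validate.cleaned_readme_contents)
X_test = tfidf.transform(test.cleaned_readme_contents)

# Fit a candidate model and compare train/validate accuracy
tree = DecisionTreeClassifier(max_depth=5, random_state=123)
tree.fit(X_train, train.language)
print('train accuracy:   ', accuracy_score(train.language, tree.predict(X_train)))
print('validate accuracy:', accuracy_score(validate.language, tree.predict(X_val)))

# Only the best-performing model is evaluated on the test set
print('test accuracy:    ', accuracy_score(test.language, tree.predict(X_test)))
```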
## Results

**Model Evaluation Metric:**

- Accuracy
  - Since we do not specifically care about Python or HTML predictions, but rather the overall correctness of the model, we evaluate models on their accuracy scores.

Our best model was a Decision Tree with an accuracy of 80% on the test dataset, which is 20% higher than our baseline model.
## Future Work

- Given more time, we would filter out more HTML tags from the README text as part of the preparation steps.
- We would also rerun our exploration using an equal number of Python and HTML repositories to see whether the findings change.
## Acknowledgements

- The README files used in this project were gathered from https://github.com/topics/nlp. Use of these READMEs does not imply ownership of any of the repository information used.
- This project was created by Rob Casey, Adam Harris, and Jared Wood as part of the Codeup curriculum.