This repository contains the folders and files for the programming projects I have completed thus far
Analysis_of_Tornadoes_Rip_Currents_and_Hurricanes_in_the_US_in_2020 - folder that contains four things:
- Analysis_of_Tornadoes_Rip_Currents_and_Hurricanes_in_the_US_in_2020_Lat_Long_Viz.py - a .py file (Python file) containing a visualization of the latitudes and longitudes of tornadoes in the United States in 2020 (libraries used: Pandas, NumPy, Matplotlib); a minimal sketch of this kind of plot appears after this project's description
- Analysis_of_Tornadoes_Rip_Currents_and_Hurricanes_in_the_US_in_2020_Statistical_Tests.Rmd - a .Rmd file (R Markdown file) that includes data cleaning steps (e.g. creating subsets, omitting null values, performing log transformations) and a range of statistical tests (e.g. Shapiro-Wilk tests, Kruskal-Wallis tests, Mann-Whitney U tests, Spearman rank correlation tests, G-tests) conducted based on the questions posed for the project
- Analysis_of_Tornadoes_Rip_Currents_and_Hurricanes_in_the_US_in_2020_detail_graphing.py - a .py file that holds visualizations illustrating the number of storm events by state in the US, the frequency of storm events by month, and the property damage per event type (libraries used: Pandas, CSV, Matplotlib, Random)
- DATA Club Snowball Project - Luke Abbatessa & Ciara Malamug.pptx - a .pptx file that represents a slideshow presentation detailing the specifics of the project
This project involved the collection, analysis, visualization, and interpretation of data on the locations, meteorological patterns, and economic damage rates of tornadoes, rip currents, and hurricanes in the US in 2020 to aid preparation for these events in the future
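The latitude/longitude visualization described above boils down to a scatter plot of tornado coordinates. Below is a minimal sketch of that idea, assuming a NOAA storm-events export with columns named EVENT_TYPE, BEGIN_LAT, and BEGIN_LON; the file name and column names are assumptions, not necessarily those used in the project:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name and column names; adjust to the actual 2020 storm-events export
events = pd.read_csv("StormEvents_2020.csv")
tornadoes = events[events["EVENT_TYPE"] == "Tornado"].dropna(subset=["BEGIN_LAT", "BEGIN_LON"])

# Scatter each tornado's starting longitude/latitude
plt.scatter(tornadoes["BEGIN_LON"], tornadoes["BEGIN_LAT"], s=10, alpha=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Tornado starting locations in the US, 2020")
plt.show()
```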
Analyzing_the_Relationships_Between_Median_Housing_Value_and_its_Possible_Influences_in_Boston_in_the_Late_1900s - folder that contains two things:
- Analyzing_the_Relationships_Between_Median_Housing_Value_and_its_Possible_Influences_in_Boston_in_the_Late_1900s.py - a .py file that includes data cleaning steps (e.g. dropping columns of a dataframe, changing the magnitudes of column values, creating a new column), the creation of a heatmap, and multiple machine learning models (e.g. linear regression to predict median home value from influencing factors, k-NN classification) (libraries used: Sklearn, SciPy, Seaborn, Matplotlib); a minimal sketch of the heatmap-and-regression workflow appears after this project's description
- Project Slides - Project #2 - DS2500 - Luke Abbatessa & Andy Babb - Northeastern University.pptx - a .pptx file that represents a slideshow presentation detailing the specifics of the project
This project analyzed the relationships between median housing value and its possible influences in Boston in the late 1900s, looking at factors such as crime rate, whether an area was primarily residential, and average home size
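A minimal sketch of the heatmap-and-regression workflow mentioned above is shown below. It assumes a CSV named boston_housing.csv with a MEDV (median home value) column; the file and column names are illustrative, not the project's exact ones:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assumed file and column names; MEDV stands in for the median home value target
boston = pd.read_csv("boston_housing.csv")

# Correlation heatmap across all numeric columns
sns.heatmap(boston.corr(), cmap="coolwarm")
plt.show()

# Linear regression predicting MEDV from the remaining columns
X = boston.drop(columns=["MEDV"])
y = boston["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```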
CDC_Diabetes_Prediction_Based_on_Health_Indicators - folder that contains four things:
- CDC Diabetes Prediction Based on Health Indicators - Snowball Project - DATA Club - Luke Abbatessa & Rajendra Goparaju - Northeastern University.pptx - a .pptx file that represents a slideshow presentation detailing the specifics of the project
- CDCDiabetesPredictionBasedonHealthIndicators.ipynb - a .ipynb file (Jupyter Notebook) containing helper functions; data cleaning steps (e.g. reading in the data as a Pandas DataFrame, changing variable data types, removing outliers, dropping duplicate rows); data shuffling; Exploratory Data Analysis; standardization of non-binary features; a train-test split of the data; cross validation; two types of feature selection (tree-based feature selection, feature selection as part of a pipeline); hyperparameter tuning using GridSearchCV (a minimal sketch appears after this project's description); and ten machine learning models (SciKit-Learn's Decision Tree Classifier with all features, SciKit-Learn's Random Forest Classifier with all features, Keras' Neural Networks Classifier with all features, XGBoost Classifier with all features, SciKit-Learn's Random Forest Classifier with selected features, Keras' Neural Networks Classifier with selected features, XGBoost Classifier with selected features, SciKit-Learn's Random Forest Classifier with all features and optimal hyperparameters, Keras' Neural Networks Classifier with all features and optimal hyperparameters, XGBoost Classifier with all features and optimal hyperparameters) (libraries used: XGBoost, Keras, mltools, Sklearn, Pandas, NumPy, Seaborn, Matplotlib)
- diabetes_012_health_indicators_BRFSS2015.csv - a .csv file containing the dataset used for the project
- mltools.py - a .py file containing helper functions for all things machine learning-related for the project (libraries used: Sklearn, NumPy, Math, Matplotlib, Collections)
This project depicted correlations between a variety of health indicators and the onset of diabetes, in an effort to help health care professionals better infer the leading causes of diabetes from those factors
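As a sketch of the GridSearchCV-based hyperparameter tuning referenced above, assuming the features and labels have already been cleaned and standardized as in the notebook; the placeholder data and parameter grid here are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data standing in for the cleaned BRFSS 2015 features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameter grid; the notebook's actual grid may differ
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))
```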
FaunaDB_Database_Evaluation - folder that contains five things:
- FaunaDBReadCSV.py - a .py file that initializes a connection to Fauna, reads a .csv file, and adds the data to the FaunaDB database (libraries used: FaunaDB, CSV, Datetime, config (a .py file containing a secret passed to the FaunaClient class to initialize the connection to Fauna)); a minimal sketch of this flow appears after this project's description
- Presentation - HW6 - DS4300.pdf - a .pdf file that represents a presentation detailing the specifics of the project
- Report - HW6 - DS4300.pdf - a .pdf file containing the final report for the project
- Viz.py - a .py file containing code to visualize the number of people per age group and the top 10 most popular job titles, based on a .csv file containing information for 1,000 people (libraries used: FaunaDB, config, Matplotlib, Collections)
- people-1000.csv - a .csv file containing the "User Id", "First Name", "Last Name", "Sex", "Email", "Phone", "Date of birth", and "Job Title" for 1,000 different people
This project covered key principles, use cases, and a basic coding tutorial conveying the essential ideas behind FaunaDB, a distributed multi-model NoSQL database
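A minimal sketch of the connect-and-insert flow in FaunaDBReadCSV.py, using the faunadb Python driver; the secret, collection name, and CSV handling below are assumptions for illustration:

```python
import csv
from faunadb import query as q
from faunadb.client import FaunaClient

# In the project, the secret lives in config.py and is passed to FaunaClient
client = FaunaClient(secret="YOUR_FAUNA_SECRET")  # hypothetical secret

# Assumed collection name; one document is created per CSV row
with open("people-1000.csv", newline="") as f:
    for row in csv.DictReader(f):
        client.query(q.create(q.collection("people"), {"data": dict(row)}))
```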
Investigating_the_Relationship_Between_Spreads_of_NFL_Games_and_NFL_Game_Type - folder that contains two things:
- Final Project Presentation - DS2001 - Luke Abbatessa & John McCarthy - Northeastern University.pptx - a .pptx file that represents a slideshow presentation detailing the specifics of the project
- Investigating_the_Relationship_Between_Spreads_of_NFL_Games_and_NFL_Game_Type.ipynb - a .ipynb file (Jupyter Notebook) that includes data cleaning steps (e.g. removing null values, converting values to floats), the implementation of a t-test, and visualizations (e.g. bar chart, histograms) (libraries used: Statistics, Math, CSV, Matplotlib, Google); a compact sketch of the t-test appears after this project's description
This project investigated the relationship between the spreads of NFL games and NFL game type, rejecting the null hypothesis that the average spread for regular season games equals the average spread for playoff games
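The notebook builds its t-test from the standard-library Statistics and Math modules; the sketch below swaps in scipy.stats.ttest_ind as a compact stand-in, with placeholder spreads rather than the project's actual data:

```python
from scipy import stats

# Placeholder spreads; the project reads these from the NFL dataset
regular_season_spreads = [3.0, 7.5, 2.5, 10.0, 6.5, 4.0, 1.5, 9.0]
playoff_spreads = [2.0, 3.5, 1.0, 6.0, 4.5, 2.5]

# Two-sample t-test of the null hypothesis that the mean spreads are equal
t_stat, p_value = stats.ttest_ind(regular_season_spreads, playoff_spreads, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```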
Ocean_acidification_in_west_central_Florida - folder that contains two things:
- Ocean_acidification_in_west_central_Florida.py - a .py file that includes data cleaning steps (e.g. adding a column to a dataframe, filtering values from a dataframe, grouping multiple columns by a single separate column), the calculation of correlation coefficients, the implementation of linear regression, the visualization of regression plots as subplots, and the prediction of a response variable over time (libraries used: Pandas, NumPy, SciPy, Seaborn, Matplotlib); a minimal regression-subplot sketch appears after this project's description
- Project Slides - Project #1 - DS2500 - Luke Abbatessa - Northeastern University.pptx - a .pptx file that represents a slideshow presentation detailing the specifics of the project
This project modeled the relationships between atmospheric CO2 and pH, and between pH and alkalinity, for five coastal springs in west-central Florida in an effort to analyze ocean acidification
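A minimal sketch of the regression-subplot idea described above, assuming a CSV with columns named co2_ppm, ph, and alkalinity; the file and column names are illustrative, not the project's exact ones:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Assumed file and column names for the spring water-chemistry data
springs = pd.read_csv("florida_springs.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# One regression plot per relationship: pH vs. CO2, and alkalinity vs. pH
for ax, (x_col, y_col) in zip(axes, [("co2_ppm", "ph"), ("ph", "alkalinity")]):
    sns.regplot(data=springs, x=x_col, y=y_col, ax=ax)
    slope, intercept, r, p, se = stats.linregress(springs[x_col], springs[y_col])
    ax.set_title(f"r = {r:.2f}, p = {p:.3f}")

plt.tight_layout()
plt.show()
```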
Predicting_ECommerce_Shoppers_Purchases - folder that contains six things:
- DS4400 Final Project Paper.docx - a .docx file containing the final report for the project
- Final Poster.pptx - a .pptx file containing the final poster for the project
- dtree.py - a .py file containing helper functions for the decision tree models implemented for the project (libraries used: mltools, Pandas, Math, Collections)
- mltools.py - a .py file containing helper functions for all things machine learning-related for the project (libraries used: Sklearn, NumPy, Math, Matplotlib, Collections)
- online_shoppers_intention.csv - a .csv file containing the dataset used for the project
- shoppers-purchase-intention.ipynb - a .ipynb file (Jupyter Notebook) containing helper functions; data cleaning steps (e.g. reading in the data as a Pandas DataFrame, changing variable data types, dropping duplicate rows, casting string and boolean column values to integers); data shuffling; Exploratory Data Analysis; a train-test split of the data; 10-fold cross validation; two types of feature selection (tree-based feature selection, feature selection as part of a pipeline); decision tree hyperparameter tuning using two methods (GridSearchCV, self-developed tuning); and nine machine learning models (a self-developed perceptron model with all features, SciKit-Learn's perceptron model with all features, a self-developed decision tree model with all features, SciKit-Learn's decision tree model with all features, a self-developed perceptron model with selected features, SciKit-Learn's perceptron model with selected features, a self-developed decision tree model with selected features and optimal hyperparameters, SciKit-Learn's decision tree model with selected features and optimal hyperparameters, and Keras' neural networks for classification) (libraries used: TensorFlow, Keras, Operator, mltools, dtree, Sklearn, Pandas, NumPy, Six, IPython, PyDotPlus, Seaborn, Matplotlib, Random, Collections)
This project compared self-developed machine learning models against SciKit-Learn's implementations of the perceptron, decision tree, and neural network algorithms to classify shoppers as buying or not buying
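A minimal sketch of the self-developed-versus-library comparison for the perceptron, using placeholder data and a deliberately simple training loop; this is not the project's actual implementation in shoppers-purchase-intention.ipynb:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned shopper features and purchase labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_perceptron(X, y, epochs=20, lr=0.1):
    """Train a simple perceptron on labels in {0, 1}; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    targets = np.where(y == 1, 1, -1)
    for _ in range(epochs):
        for xi, ti in zip(X, targets):
            if ti * (xi @ w + b) <= 0:  # misclassified: nudge the decision boundary
                w += lr * ti * xi
                b += lr * ti
    return w, b

w, b = fit_perceptron(X_train, y_train)
own_acc = np.mean((X_test @ w + b > 0).astype(int) == y_test)
skl_acc = Perceptron(random_state=0).fit(X_train, y_train).score(X_test, y_test)
print(f"Self-developed perceptron accuracy: {own_acc:.3f}")
print(f"SciKit-Learn Perceptron accuracy:   {skl_acc:.3f}")
```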
Public_Statement_Analysis - folder that contains ten things:
- data_files - a subfolder containing 11 corporate apologies for data breaches, both as .txt files and as .json files
- stock_data_files - a subfolder containing stock data for the companies involved in the apologies
- Final Report - DS3500 Final Project.pdf - a .pdf file containing the final report for the project
- data_prep.py - a .py file that provides a foundation for gathering VADER sentiment scores for a group of files (libraries used: textquisite (a .py file implementing a custom reusable, extensible NLP framework), textquisite_parsers (a .py file implementing a custom .json parser))
- project_dashboard.py - a .py file that builds a dashboard comparing corporate apologies for data breaches (libraries used: dash_bootstrap_components, dash_bootstrap_templates, Dash, textquisite, sentiment_stock_plot (a .py file that graphs average sentiment score vs. stock price percentage), stock_price_plot (a .py file that plots stock prices over time), Matplotlib)
- sentiment_nltk.py - a .py file that tokenizes texts and performs NLTK VADER sentiment analysis on them (libraries used: NLTK, Pandas); a minimal sketch of VADER scoring appears after this project's description
- sentiment_stock_plot.py - a .py file that provides a foundation for creating an average sentiment score vs. stock price percentage plot (libraries used: Plotly, data_prep, Pandas, Collections)
- stock_price_plot.py - a .py file that plots stock prices over time (libraries used: Plotly, Pandas)
- textquisite.py - a .py file that establishes a reusable NLP library (libraries used: sentiment_nltk, NLTK, Plotly, Collections)
- textquisite_parsers.py - a .py file that establishes a custom .json parser for the user to implement (libraries used: JSON, textquisite, sentiment_nltk, Collections)
This project let users explore the sentiment scores of 11 corporate apologies for data breaches and the corresponding fluctuations in the companies' stock prices
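The VADER scoring that sentiment_nltk.py performs comes down to running NLTK's SentimentIntensityAnalyzer over each text. A minimal sketch, with a placeholder apology string rather than the project's actual data files:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()

# Placeholder apology text; the project reads the real texts from data_files
apology = "We deeply regret the breach and are committed to protecting your data."
print(analyzer.polarity_scores(apology))  # neg/neu/pos/compound scores
```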
Weather_Disaster_Prediction - folder that contains three things:
- DS3000FinalProjectCodeWalkthrough.mp4 - a .mp4 file containing a walkthrough of the code for the project
- DS3000_final_poster.pdf - a .pdf file containing the final poster for the project
- weather_disasters.ipynb - a .ipynb file that includes data cleaning steps (e.g. merging and concatenating dataframes, removing duplicate rows from a dataframe, filtering a dataframe to columns of interest, changing column data types, modifying column values, deleting rows with missing values) and the implementation of three machine learning algorithms (random forest regression, k-NN regression, and multiple linear regression) (libraries used: Sklearn, Pandas, NumPy, Seaborn, Matplotlib); a minimal random forest regression sketch appears after this project's description
This project predicted storm property damage from storm event properties, using data from the NOAA National Centers for Environmental Information Storm Events Database for 2012-2022
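A minimal sketch of the random forest regression step described above, using placeholder data in place of the cleaned 2012-2022 storm-event features and property-damage target:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned storm-event features and damage target
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("Held-out R^2:", r2_score(y_test, model.predict(X_test)))
```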