jnopareboateng/ml-library

A curated list of projects leveraging multiple approaches to Data Science, Machine Learning and Deep Learning.

Project 1: Crude Oil Forecast

Project Overview

Goals

  • Forecast Brent Crude Oil prices in Ghana using various models, including ARIMA, SSA, Prophet, Random Forest, and XGBoost.

Problem Statement

  • Predicting future crude oil prices to aid in economic planning and decision-making.

Data

  • Source: Historical crude oil price data from Bank of Ghana
  • Description: The dataset contains monthly crude oil prices in USD per barrel.
  • Features: Date, Price.

Methodology

  • Data preprocessing, feature engineering, and model selection, including ARIMA, SSA, Prophet, XGBoost, and Random Forest.

Results

  • ARIMA, SSA, Prophet, Random Forest, and XGBoost models were evaluated, with Random Forest showing the best performance.

Conclusion

  • The Random Forest model provided the most accurate forecasts, suggesting its suitability for this task.

Code Structure

Directory Structure

crude-oil-forecast/
├── data/
│   ├── Commodity Prices Monthly.csv
│   ├── Modified_Data.csv
├── notebooks/
│   ├── arimav3.ipynb
│   ├── rf.ipynb
├── scripts/
│   ├── arima.py
│   ├── arimav2.py
│   ├── arima_full.py
├── tests/
│   ├── arima.py
├── readme.md

Code Explanation

  • scripts/arima.py: Implements ARIMA model for forecasting.
  • notebooks/rf.ipynb: Contains the implementation of the Random Forest model.
  • tests/arima.py: Unit tests for ARIMA model functions.
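
A minimal sketch of the ARIMA workflow in the spirit of scripts/arima.py, using the statsmodels release pinned below (the Date and Price columns come from the data description above; the (1, 1, 1) order is illustrative, not the tuned configuration):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load the monthly price series; Date and Price are the features listed above.
data = pd.read_csv("data/Modified_Data.csv", parse_dates=["Date"], index_col="Date")

# Fit an ARIMA model; the (p, d, q) order here is a placeholder for the tuned one.
model = ARIMA(data["Price"], order=(1, 1, 1))
fitted = model.fit()

# Forecast the next 12 months.
print(fitted.forecast(steps=12))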

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4
    • statsmodels: 0.12.2

Model Details

Architecture

  • ARIMA and Random Forest models.

Training

  • Data split into training and test sets, hyperparameter tuning using GridSearchCV.

Evaluation

  • Metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE).
  • Results: Random Forest outperformed ARIMA, achieving lower MAE and RMSE (see the sketch below).
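
A hedged sketch of the training and evaluation steps above for the Random Forest model (the lag features, split ratio, and parameter grid are assumptions; the actual implementation lives in notebooks/rf.ipynb):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

data = pd.read_csv("data/Modified_Data.csv", parse_dates=["Date"], index_col="Date")

# Illustrative feature engineering: use the previous three months as predictors.
for lag in (1, 2, 3):
    data[f"lag_{lag}"] = data["Price"].shift(lag)
data = data.dropna()
X, y = data[["lag_1", "lag_2", "lag_3"]], data["Price"]

# Chronological split: hold out the last 20% of the series for testing.
split = int(len(data) * 0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# GridSearchCV with time-series-aware folds; the grid is illustrative.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits=5),
)
search.fit(X_train, y_train)

# Report the two metrics used above.
pred = search.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))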

Deployment

Deployment Strategy

  • Not applicable for this project.

Infrastructure

  • Not applicable for this project.

Future Work

  • Explore additional models like LSTM for time series forecasting.
  • Incorporate more features such as economic indicators.

Project 2: [Telecel Churn Analysis]("/Telecel Churn Analysis")

Project Overview

Goals

  • Predict customer churn on the Telecel network.

Problem Statement

  • Identifying customers likely to churn to improve retention strategies.

Data

  • Source: Customer data from Telecel network.
  • Description: The dataset contains customer information and churn labels.
  • Features: Customer demographics, usage patterns, churn labels.

Methodology

  • Data preprocessing, feature engineering, and model selection including Logistic Regression, Random Forest, and Gradient Boosting.

Results

  • Random Forest and Gradient Boosting models showed high accuracy in predicting churn.

Conclusion

  • The models can effectively predict churn, aiding in targeted retention efforts.

Code Structure

Directory Structure

telecel/
├── data/
│   ├── telecel_data.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── data_preprocessing.ipynb
│   ├── model_development.ipynb
├── models/
├── requirements.txt
├── readme.md

Code Explanation

  • notebooks/data_exploration.ipynb: Exploratory data analysis.
  • notebooks/data_preprocessing.ipynb: Data cleaning and preprocessing.
  • notebooks/model_development.ipynb: Model training and evaluation.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4
    • seaborn: 0.11.1

Model Details

Architecture

  • Logistic Regression, Random Forest, Gradient Boosting.

Training

  • Data split into training and test sets, hyperparameter tuning using RandomizedSearchCV.

Evaluation

  • Metrics: Accuracy, Precision, Recall, F1-Score.
  • Results: Random Forest and Gradient Boosting achieved high accuracy and F1-Score.
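
A sketch of the training and evaluation loop described above (the churn target column is an assumption, and the features are assumed to be numerically encoded by the preprocessing notebook):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data = pd.read_csv("data/telecel_data.csv")

# Assumed target column name; the real label column may differ.
X = data.drop(columns=["churn"])
y = data["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative parameter distributions for the randomized search.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [100, 200, 500], "max_depth": [None, 10, 20]},
    n_iter=5,
    scoring="f1",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 in a single report.
print(classification_report(y_test, search.predict(X_test)))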

Deployment

Deployment Strategy

  • Streamlit app for model deployment.
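
A minimal sketch of what the Streamlit app could look like (the model path, feature names, and input widgets are all hypothetical):

import joblib
import pandas as pd
import streamlit as st

# Hypothetical path to a model exported from model_development.ipynb.
model = joblib.load("models/churn_model.pkl")

st.title("Telecel Churn Predictor")

# Illustrative inputs; the real app would expose the full feature set.
tenure = st.number_input("Tenure (months)", min_value=0)
monthly_usage = st.number_input("Monthly usage", min_value=0.0)

if st.button("Predict"):
    features = pd.DataFrame(
        [[tenure, monthly_usage]], columns=["tenure", "monthly_usage"]
    )
    st.write("Churn probability:", model.predict_proba(features)[0, 1])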

Infrastructure

  • Hosted on a cloud platform with Streamlit.

Future Work

  • Implement additional models like XGBoost.
  • Enhance feature engineering with more customer behavior data.

Project 3: [Monetary Policy Rate Analysis]("/Monetary Policy Rate Analysis")

Project Overview

Goals

  • Predict the average policy rate of Ghana based on economic factors.

Problem Statement

  • Forecasting policy rates to aid in economic planning and decision-making.

Data

  • Source: Economic indicators such as GDP, unemployment, and inflation.
  • Description: The dataset contains historical economic data.
  • Features: GDP, Unemployment Rate, Inflation Rate, Policy Rate.

Methodology

  • Data preprocessing, feature engineering, and model selection including Linear Regression and Random Forest.

Results

  • Random Forest model provided accurate predictions of policy rates.

Conclusion

  • The model can be used for economic forecasting and policymaking.

Code Structure

Directory Structure

mpr/
├── data/
│   ├── economic_data.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── preprocessing.py
│   ├── models.ipynb
├── requirements.txt
├── readme.md

Code Explanation

  • notebooks/data_exploration.ipynb: Exploratory data analysis.
  • notebooks/preprocessing.py: Data cleaning and preprocessing.
  • notebooks/models.ipynb: Model training and evaluation.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4
    • seaborn: 0.11.1

Model Details

Architecture

  • Linear Regression, Random Forest.

Training

  • Data split into training and test sets, hyperparameter tuning using GridSearchCV.

Evaluation

  • Metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE).
  • Results: Random Forest outperformed Linear Regression, achieving lower MAE and RMSE (see the sketch below).
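
A sketch comparing the two models under the metrics above (the column names follow the feature list, but their exact spellings in economic_data.csv are assumptions):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/economic_data.csv")

# Assumed column names, taken from the feature list above.
X = data[["GDP", "Unemployment Rate", "Inflation Rate"]]
y = data["Policy Rate"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(random_state=42)),
]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: MAE={mae:.3f}, RMSE={rmse:.3f}")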

Deployment

Deployment Strategy

  • Not applicable for this project.

Infrastructure

  • Not applicable for this project.

Future Work

  • Explore additional models like XGBoost.
  • Incorporate more features such as international economic indicators.

Documentation Notes

  • Target Audience: Data scientists.
  • Style Guide: Use consistent formatting and clear headings for readability.
  • Code Snippets: Include relevant code snippets to illustrate key points.
  • Visualizations: Use visualizations to enhance understanding.
  • Version Control: Use version control for the documentation to track changes and updates.




Project 4: msft-mlflow

Project Overview

Goals

  • Implement and track machine learning experiments using MLflow.

Problem Statement

  • Efficiently manage and track machine learning experiments to ensure reproducibility and scalability.

Data

  • Source: Various datasets used for different machine learning tasks.
  • Description: The datasets vary depending on the specific experiment being tracked.
  • Features: Varies by dataset.

Methodology

  • Use MLflow to track experiments, log parameters, metrics, and artifacts.
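
A minimal example of the tracking pattern described above (the experiment name and the iris dataset are illustrative stand-ins for the project's actual experiments):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=42
)

mlflow.set_experiment("example-experiment")  # illustrative name

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Log parameters, metrics, and the trained model as an artifact.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")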

Results

  • Improved experiment tracking and reproducibility.

Conclusion

  • MLflow provides a robust framework for managing machine learning experiments.

Code Structure

Directory Structure

msft-mlflow/
├── data/
├── notebooks/
├── scripts/
├── models/
├── requirements.txt
├── readme.md

Code Explanation

  • notebooks/: Contains Jupyter Notebooks for different experiments.
  • scripts/: Python scripts for data preprocessing, model training, and evaluation.
  • models/: Directory to save trained models.
  • requirements.txt: Lists the required Python packages.

Dependencies

  • Python: 3.8
  • Libraries:
    • mlflow: 1.14.0
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4

Model Details

Architecture

  • Various models depending on the experiment.

Training

  • Data split into training and test sets, hyperparameter tuning using GridSearchCV.

Evaluation

  • Metrics: Varies by experiment.
  • Results: Logged in MLflow.

Deployment

Deployment Strategy

  • Not applicable for this project.

Infrastructure

  • Not applicable for this project.

Future Work

  • Integrate with other tools like TensorBoard for enhanced visualization.
  • Explore automated hyperparameter tuning.

Project 5: MTN Churn Analysis

Project Overview

Goals

  • Predict customer churn on the MTN network.

Problem Statement

  • Identifying customers likely to churn to improve retention strategies.

Data

  • Source: Customer data from MTN network.
  • Description: The dataset contains customer information and churn labels.
  • Features: Customer demographics, usage patterns, churn labels.

Methodology

  • Data preprocessing, feature engineering, and model selection including Logistic Regression, Random Forest, and Gradient Boosting.

Results

  • Random Forest and Gradient Boosting models showed high accuracy in predicting churn.

Conclusion

  • The models can effectively predict churn, aiding in targeted retention efforts.

Code Structure

Directory Structure

mtn/
├── data/
│   ├── mtn_data.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── data_preprocessing.ipynb
│   ├── model_development.ipynb
├── models/
├── requirements.txt
├── readme.md

Code Explanation

  • notebooks/data_exploration.ipynb: Exploratory data analysis.
  • notebooks/data_preprocessing.ipynb: Data cleaning and preprocessing.
  • notebooks/model_development.ipynb: Model training and evaluation.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4
    • seaborn: 0.11.1

Model Details

Architecture

  • Logistic Regression, Random Forest, Gradient Boosting.

Training

  • Data split into training and test sets, hyperparameter tuning using RandomizedSearchCV.

Evaluation

  • Metrics: Accuracy, Precision, Recall, F1-Score.
  • Results: Random Forest and Gradient Boosting achieved high accuracy and F1-Score.

Deployment

Deployment Strategy

  • Streamlit app for model deployment.

Infrastructure

  • Hosted on a cloud platform with Streamlit.

Future Work

  • Implement additional models like XGBoost.
  • Enhance feature engineering with more customer behavior data.

Project 6: [Non-Performing Loans Analysis]("/Non Performing Loans")

Project Overview

Goals

  • Predict non-performing loans (NPL) for financial institutions.

Problem Statement

  • Identifying loans likely to default to mitigate financial risk.

Data

  • Source: Loan data from financial institutions.
  • Description: The dataset contains loan information and default labels.
  • Features: Loan amount, interest rate, borrower information, default labels.

Methodology

  • Data preprocessing, feature engineering, and model selection including Logistic Regression, Random Forest, and Gradient Boosting.

Results

  • Random Forest and Gradient Boosting models showed high accuracy in predicting NPLs.

Conclusion

  • The models can effectively predict NPLs, aiding in risk management.

Code Structure

Directory Structure

NPL/
├── data/
│   ├── npl_data.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── data_preprocessing.ipynb
│   ├── model_development.ipynb
├── models/
├── requirements.txt
├── readme.md

Code Explanation

  • notebooks/data_exploration.ipynb: Exploratory data analysis.
  • notebooks/data_preprocessing.ipynb: Data cleaning and preprocessing.
  • notebooks/model_development.ipynb: Model training and evaluation.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • matplotlib: 3.3.4
    • seaborn: 0.11.1

Model Details

Architecture

  • Logistic Regression, Random Forest, Gradient Boosting.

Training

  • Data split into training and test sets, hyperparameter tuning using RandomizedSearchCV.

Evaluation

  • Metrics: Accuracy, Precision, Recall, F1-Score.
  • Results: Random Forest and Gradient Boosting achieved high accuracy and F1-Score.

Deployment

Deployment Strategy

  • Not applicable for this project.

Infrastructure

  • Not applicable for this project.

Future Work

  • Implement additional models like XGBoost.
  • Enhance feature engineering with more borrower information.

Project 7: [Music Recommender System]("/Music Recommender System")

Project Overview

Goals

  • Develop a recommendation system to suggest items to users based on their preferences and behavior.

Problem Statement

  • Providing personalized recommendations to enhance user experience and engagement.

Data

  • Source: User interaction data.
  • Description: The dataset contains user-item interactions, item metadata, and user profiles.
  • Features: User ID, Item ID, Interaction Type, Timestamp.

Methodology

  • Data preprocessing, feature engineering, and model selection including collaborative filtering, content-based filtering, and reinforcement learning.

Results

  • The recommendation system provided accurate and relevant recommendations.

Conclusion

  • The system can be used to improve user engagement and satisfaction by providing personalized recommendations.

Code Structure

Directory Structure

recommender system/
├── data/
│   ├── dataset.csv
├── notebooks/
│   ├── data_exploration.ipynb
│   ├── model_development.ipynb
├── scripts/
│   ├── cbf_recommendations.py
│   ├── combined_recommendations.py
│   ├── final_recommendations.py
│   ├── get_recommendations.py
├── tests/
│   ├── app.py
│   ├── streamlit/
│   │   ├── core_functions.py
│   │   ├── streamlit_interfaces.py
│   ├── gradio/
│   │   ├── core_functions.py
│   │   ├── gradio_interfaces.py
├── readme.md

Code Explanation

  • scripts/cbf_recommendations.py: Implements content-based filtering recommendations.
  • scripts/combined_recommendations.py: Combines multiple recommendation strategies.
  • scripts/final_recommendations.py: Final recommendation logic.
  • scripts/get_recommendations.py: Core recommendation functions.
  • tests/app.py: Unit tests for recommendation functions.
  • tests/streamlit/core_functions.py: Streamlit interface core functions.
  • tests/gradio/core_functions.py: Gradio interface core functions.
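
A sketch of the content-based filtering idea behind scripts/cbf_recommendations.py (the description metadata column is a hypothetical stand-in for whatever item features dataset.csv actually carries):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = pd.read_csv("data/dataset.csv")

# Vectorize an assumed free-text metadata column (e.g. genre/artist tags).
tfidf = TfidfVectorizer(stop_words="english")
item_vectors = tfidf.fit_transform(items["description"].fillna(""))

# Pairwise similarity between all items.
similarity = cosine_similarity(item_vectors)

def recommend(item_index, n=5):
    # Return the n most similar items, skipping the item itself.
    nearest = similarity[item_index].argsort()[::-1][1 : n + 1]
    return items.iloc[nearest]

print(recommend(0))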

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • scikit-learn: 0.24.1
    • numpy: 1.19.5
    • streamlit: 0.79.0
    • gradio: 2.3.0

Model Details

Architecture

  • Collaborative filtering, content-based filtering, reinforcement learning.

Training

  • Data split into training and test sets, hyperparameter tuning using GridSearchCV.

Evaluation

  • Metrics: Precision, Recall, F1-Score.
  • Results: The combined recommendation strategy achieved high precision and recall.

Deployment

Deployment Strategy

  • Streamlit and Gradio apps for model deployment.
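
A minimal Gradio wrapper in the spirit of tests/gradio/gradio_interfaces.py (recommend_for_user is a hypothetical hook into the recommendation scripts):

import gradio as gr

def recommend_for_user(user_id):
    # Placeholder: the real app would call the trained recommender here.
    return f"Top picks for user {user_id}"

demo = gr.Interface(
    fn=recommend_for_user,
    inputs="text",
    outputs="text",
    title="Music Recommender",
)
demo.launch()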

Infrastructure

  • Hosted on a cloud platform with Streamlit and Gradio.

Future Work

  • Implement additional models like matrix factorization.
  • Enhance feature engineering with more user behavior data.



Project 8: Banks

Project Overview

Goals

  • Analyze banking data to derive insights and trends.

Problem Statement

  • Understanding customer behavior and improving banking services.

Data

  • Source: Banking transaction records.
  • Description: The dataset contains transaction details, customer demographics, and account information.
  • Features: Transaction ID, Customer ID, Transaction Amount, Transaction Type, Timestamp.

Methodology

  • Data preprocessing, exploratory data analysis (EDA), and statistical modeling.
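
An illustrative slice of the EDA step (the CSV file name is hypothetical, since only analysis.html is published; column names follow the feature list above):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical source file for the transaction records.
transactions = pd.read_csv("bank_transactions.csv", parse_dates=["Timestamp"])

# Monthly transaction volume by transaction type.
monthly = (
    transactions
    .groupby([pd.Grouper(key="Timestamp", freq="M"), "Transaction Type"])
    ["Transaction Amount"]
    .sum()
    .reset_index()
)

sns.lineplot(data=monthly, x="Timestamp", y="Transaction Amount", hue="Transaction Type")
plt.title("Monthly transaction volume by type")
plt.show()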

Results

  • Identified key trends and patterns in customer transactions.

Conclusion

  • The analysis provides actionable insights for improving customer service and operational efficiency.

Code Structure

Directory Structure

Banks/
├── analysis.html

Code Explanation

  • analysis.html: Contains the results of the data analysis, including visualizations and statistical summaries.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • matplotlib: 3.3.4
    • seaborn: 0.11.1

Analysis Details

Data Preprocessing

  • Cleaning and transforming the data for analysis.

Exploratory Data Analysis (EDA)

  • Visualizing transaction patterns and customer behavior.

Statistical Modeling

  • Applying statistical techniques to identify significant trends.

Future Work

  • Implement machine learning models for predictive analysis.
  • Enhance data visualization with interactive dashboards.

Project 9: [Burnout Analysis in Medical Imaging Students]("/Burn out AnalysiS In Medical Imaging Students")

Project Overview

Goals

  • Assess and analyze burnout levels among students.

Problem Statement

  • Understanding the factors contributing to student burnout and its impact on academic performance.

Data

  • Source: Survey data from students.
  • Description: The dataset contains responses to burnout-related questions.
  • Features: Timestamp, Consent, Gender, Age, University, Level, Religion, Location, Ethnicity, Study's financing, Medications intake, Desired profession, Burnout-related questions.

Methodology

  • Data preprocessing, feature engineering, and statistical analysis.

Results

  • Calculated burnout scores and identified key factors contributing to burnout.

Conclusion

  • The analysis provides insights into student burnout and potential interventions.

Code Structure

Directory Structure

Burn out/
├── analysis_v2.ipynb
├── analysis.ipynb
├── analysis.py
├── Burn_Out_Data Set_Adepa.sav
├── Burnout calculated scores
├── Burnout calculated scores.csv
├── burnout_regression_summary.csv
├── Burnout.csv
├── Copy of Burn_Out_Data_Set_Adepa(1)(AutoRecovered).xlsx

Code Explanation

  • analysis_v2.ipynb: Contains the updated analysis with refined methodologies.
  • analysis.ipynb: Contains the initial analysis and exploratory data analysis (EDA).
  • analysis.py: Python script for data preprocessing and analysis.
  • Burnout.csv: Raw survey data.
  • Burnout calculated scores.csv: Calculated burnout scores.
  • burnout_regression_summary.csv: Summary of regression analysis results.

Dependencies

  • Python: 3.8
  • Libraries:
    • pandas: 1.2.3
    • numpy: 1.19.5
    • matplotlib: 3.3.4
    • seaborn: 0.11.1
    • scikit-learn: 0.24.1

Analysis Details

Data Preprocessing

  • Cleaning and transforming the survey data.

Feature Engineering

  • Defining burnout score categories: Emotional Exhaustion (EE), Cynicism (CY), and Academic Efficacy (AE).
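
A sketch of how the three subscale scores could be computed (the EE1/CY1/AE1-style item columns are hypothetical placeholders for the actual questionnaire columns in Burnout.csv):

import pandas as pd

survey = pd.read_csv("Burnout.csv")

# Hypothetical groupings of questionnaire items into the three subscales.
subscales = {
    "EE": ["EE1", "EE2", "EE3"],  # Emotional Exhaustion items
    "CY": ["CY1", "CY2", "CY3"],  # Cynicism items
    "AE": ["AE1", "AE2", "AE3"],  # Academic Efficacy items
}

# Score each subscale as the mean of its items.
for name, items in subscales.items():
    survey[name] = survey[items].mean(axis=1)

survey[["EE", "CY", "AE"]].to_csv("Burnout calculated scores.csv", index=False)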

Statistical Analysis

  • Calculating burnout scores and performing regression analysis.

Future Work

  • Implement machine learning models to predict burnout.
  • Conduct longitudinal studies to track burnout over time.