A repository consisting of some short data science projects I worked on

ChiWang03/Various-Data-Science-Projects

Please feel free to contact me at [email protected] about potential opportunities, collaboration, or any questions about the repositories. Thanks!

  • Unsupervised Learning 1
    • Done for data 573: Unsupervised Learning
    • Hierarchical clustering using different linkage methods on the bank data set (continuous variables)
    • Used Mclust and k-means clustering to find clusters in the lots data
    • Compared a hand-coded k-means with the built-in Mclust to decide which is the better clustering method for this data, based on the nature of each algorithm and the Rand Index
  • Unsupervised Learning 2
    • Hierarchical clustering using different linkages on the HouseVotes84 (mlbench) data, with the categorical/binary variables handled through Gower's distance
    • Performed Factor Analysis on the ability.cov data
    • Used PCA and NMF on handwritten digits data to decide how many components to keep for digit recognition
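
The Rand Index used in Unsupervised Learning 1 measures how well two clusterings agree: it is the fraction of point pairs that both clusterings treat the same way (together in both, or apart in both). Below is a minimal sketch using only the Python standard library. The function name `rand_index` and the toy labels are illustrative assumptions and are not taken from the project notebooks.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    # A pair "agrees" when both clusterings put the two points together,
    # or both clusterings keep them apart.
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Toy labels (hypothetical): same partition with the label names swapped.
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy labels (hypothetical): two partitions that mostly disagree.
different_partition = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index only looks at pairs, it does not depend on how the clusters are numbered, which is what makes it usable for comparing the output of two different clustering methods.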

Extracting key skills from Data Scientist and Machine Learning Engineer job posts

  • Scraped data from Indeed using BeautifulSoup

  • Conducted topic modelling using ldamallet from gensim, and Non-negative Matrix Factorization and LDA from sklearn.

  • Interestingly, this project requires a solid understanding of the data itself. When we think of the key skills of data scientists, we think of natural language processing, machine vision, machine learning, etc. These are not unigrams but bigrams or trigrams, which is why preprocessing the data and running topic modelling methods on n-grams is so important for key skills extraction. Enjoy some interesting findings!
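
To illustrate why n-grams matter here, the following minimal sketch counts bigrams and trigrams in a couple of job-post sentences using only the Python standard library. The helper names (`tokenize`, `ngrams`, `count_ngrams`) and the two example posts are made up for illustration. The project itself uses gensim and sklearn for topic modelling, and this sketch only shows the n-gram preprocessing idea, not that pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs, n):
    """Count n-grams per document so phrases never span two posts."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(tokenize(doc), n))
    return counts


# Hypothetical job-post snippets, not taken from the scraped data.
posts = [
    "Experience with machine learning and natural language processing",
    "Machine learning engineer with computer vision background",
]

bigram_counts = count_ngrams(posts, 2)
trigram_counts = count_ngrams(posts, 3)
```

With unigrams alone, "machine" and "learning" would be counted as separate terms. Counting bigrams and trigrams keeps skills such as "machine learning" and "natural language processing" intact as single features.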

  • Create Multiple Metrics to optimize customs wait time efficiency
  • EDA
    • Exploratory data analysis on customs wait time
    • Suggest ways to optimize efficiency for customs border patrol agents
  • Multivariate Outlier Identification
    • Using the PyOD library in Python to identify outliers
    • Two methods used: Cluster-Based Local Outlier Factor (CBLOF, with k-means) and K Nearest Neighbors
  • Note: the comparison part of the outlier removal notebook is slightly messy and still needs to be updated.
  • Visualized Zillow's property dataset and the housing dataset from Kaggle (Boston property information)
  • Created a 4-tab dashboard using Plotly Dash for the visualizations
  • Includes: Geolocation plots of property location, Interactive Volume bar plots, Lasso Coefficients slider plots
  • This short notebook explores the OkCupid data set by mining association rules and finding latent information about dating profiles
  • Multiple notebooks that use PyTorch to explore neural networks.
  • Lab work done for data 586: Advanced Machine Learning
    • lab1: Multilayer Perceptron
    • lab2: Convolutional Neural Networks
    • lab3: Recurrent Neural Networks
    • lab4: Stochastic Gradient Descent and Regularization
  • Seaborn and Plotly visualizations for the SierraLeoneAIMS data set.
  • The notebook cannot render the interactive Plotly graphs (links to the interactive versions are provided in the notebook)
  • A short script that pulls Twitter data from the Twitter API for a given user.
  • In this case, Elon Musk's tweets were pulled.
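
For the Multivariate Outlier Identification project listed above, one common form of the K Nearest Neighbors outlier score is the distance from each point to its k-th nearest neighbor: isolated points have a large score. The sketch below uses only the Python standard library (3.8+ for `math.dist`). The function name `knn_outlier_scores` and the toy points are illustrative assumptions. It is not the PyOD API and not the code from the notebook.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores


# Toy data (hypothetical): a tight cluster plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)

# Index of the point with the largest score, i.e. the most outlying one.
most_outlying = max(range(len(points)), key=scores.__getitem__)
```

This brute-force version compares every pair of points, which is fine for a small example but would be slow on a large data set.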

This was one of the first pandas EDAs I ever did (done in 2016)