This repository contains assorted projects in the Data Science and Machine Learning domain. The intent is to explore a few under-valued Python packages and libraries, as well as methodologies that can create real business impact. The projects vary in depth, and where relevant each one is accompanied by a descriptive walk-through article on my Medium blog. Down the line, I also hope to showcase this repo to Data Science aspirants to help them envision the kinds of projects they can undertake to display their skills.
If you're a learner, this repository can guide you in building a portfolio of your own, where you get to practice various Machine Learning concepts along with Deep Learning applications. Feel free to Star it for reference, or Fork it if you're willing to contribute. All of us share a responsibility to add to our open-source community in whichever way we can. :)
-
Map of WiFi networks around us, built with WiGLE and Python. The selected geographical coordinates are of Pune (India). The static visual is produced with Geoplotlib, and interactivity is added with Folium. Both the static output and the interactive output are available for reference, along with a step-by-step guide on my blog for learners.
-
OCR (Optical Character Recognition) to parse invoice screenshots containing complex table structures, applying Deep Learning for text recognition & extraction with pytesseract v0.2.6, Pillow v5.4.1 and OpenCV-4. Images are fed forward to an LSTM (RNN) network for text recognition. For testing purposes, we experiment on Walmart receipts as well as transaction tables.
-
Exploratory Frequency Analysis of IMDB movie review text to gain insight into buzzwords in viewer comments. It acts as a foundation for further algorithmic processing! Natural language processing is carried out using NLTK v3.4, and the visual representation for word tagging is implemented using Word Cloud.
-
Coloring Black & White Images by applying a Convolutional Neural Network. The selected architecture follows Zhang et al.’s ECCV paper, Colorful Image Colorization, in which the authors train on the ImageNet dataset converted from RGB to the Lab color space and use an annealed mean for the predicted colors. The implementation uses OpenCV-4 with a pre-trained .caffemodel file, which contains the weights for the actual layers, and a .prototxt file, which defines the model architecture.
-
Manipulation of PDF Files using core Python for varied use cases. It covers splitting a multi-page PDF file into individual page files, merging individual files into a single PDF file, slicing out selected pages from a multi-page PDF file based on page index numbers, and rotating all pages of a PDF file clockwise or anti-clockwise as needed. It uses the built-in os and glob modules, along with the PyPDF2 package.
-
Face Detection in static Images by applying the face detector capability of the OpenCV-4 dnn module, using a pre-trained Caffe Deep Learning model based on the Single Shot Detector (SSD) framework with a ResNet network. The architecture comes from the .prototxt files in the samples/dnn/face_detector/ directory of the OpenCV-4 GitHub repository, together with the associated weights for the actual layers.
-
Marketing Campaign Subscription Analysis & Classification, applying Logistic Regression with k-Fold cross-validation using conventional Scikit-Learn tools to identify potential customers of a Portuguese bank who would subscribe to a Term Deposit. The dataset used for this assignment is available at the UCI repository. A few additional external Python libraries, such as Seaborn and Yellowbrick, have also been used for EDA.
-
Fake News detection with NLP techniques, exploring a dataset scraped from various websites during the run-up to the 2016 US Presidential election (October 2016) and categorized by bias. Access is currently limited to the Fake news dataset alone, without a supplementary Real news dataset, so the attached worksheet covers exploratory data cleansing, preprocessing and bigram analysis using NLP tools like NLTK and Wordcloud. With access to the other set of data, we shall enhance the project with tools like SpaCy v2.0.18 and Gensim v3.7 to perform Deep Learning for classifying the flavour of news source and content.
-
Traversing Airbnb London Calendar & Listings data, as of February 5th, 2019, from the unofficial Inside Data application for hostings, through extensive analysis, segmentation and visualization of various aspects. The data is then modelled and summarized with conventional Scikit-Learn, LightGBM v2.2.3 and a Keras v2.2.4 deep neural network architecture for regression.
-
Segmentation and Object Recognition using Holistically-Nested Edge Detection (HED), a state-of-the-art computer vision technique. Deployment is based on the algorithm introduced in Xie and Tu’s HED research paper, and utilizes a pre-trained .caffemodel file for the weights along with the associated .prototxt layer architecture. The demonstration only covers deployment on images, using OpenCV-4 as our tool. A common use case is Asset tracking for features like position and landmarks, which can be further geo-mapped.
-
Netflix Movie Recommendation algorithm using Singular Value Decomposition and Cosine-Similarity, along with KFold Cross-Validation, to predict the ratings given by customers to movies. The dataset is sourced from the Netflix account. The primary tools used are XGBoost v0.82 and Surprise v0.1. To understand the underlying statistical concepts, refer to Article-1 & Article-2 on my blog.
-
Fashion MNIST Classification algorithm using a 3-layer fully connected (dense) neural network. The dataset is sourced from Kaggle and accessed via the built-in Keras module in TensorFlow. The purpose was to run and test the newly released TensorFlow 2.0.0 on a baseline multi-class problem. Pre-trained model weights and architecture are also attached for reuse.
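
Two of the projects above, the marketing campaign classifier and the Netflix recommender, lean on k-fold cross-validation. For learners, here is a minimal sketch of how the folds are formed, written in plain Python rather than with Scikit-Learn or Surprise. The function name `k_fold_indices` is my own, chosen for illustration, and does not come from either library. The sketch also skips shuffling, which you would usually want on real data.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs over range(n_samples).

    Illustrative helper only: not the Scikit-Learn or Surprise API.
    Samples are taken in order, without shuffling.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample,
    # so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)

    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        # Everything outside the validation slice is used for training.
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, 3):
        print("train:", train, "validation:", validation)
```

Each sample lands in exactly one validation fold, so a model trained k times this way is scored once on every row. In the actual projects the library implementations handle this step, along with shuffling and stratification where needed.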