- Framed the task as binary classification with probabilistic outputs, using log loss as the performance metric and a binary confusion matrix to analyze errors.
- Performed exploratory data analysis (EDA) using statistical methods to understand the data; implemented basic feature engineering to increase interpretability.
- Applied text preprocessing to remove HTML tags, punctuation, and stop words and to stem the remaining words; performed advanced feature extraction on the split questions to analyze question similarity.
- Visualized 15-D data in 2-D using t-SNE to gain more detailed insights; applied NLP techniques to convert words to vectors.
- Trained Logistic Regression, Linear SVM, and XGBoost models to predict the outcome, comparing the models and minimizing their errors.
Technologies: pandas, NumPy, Matplotlib, seaborn, Plotly, scikit-learn, NLTK, SQLite, tqdm, WordCloud, KNN Classifier, Gaussian Naive Bayes, Logistic Regression, Random Forest Classifier, Linear SVM, XGBoost
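
Illustrative sketch: the evaluation approach described above (log loss on predicted probabilities plus a binary confusion matrix) might look roughly like the following in scikit-learn. The synthetic 15-feature data, the split ratio, and all variable names are assumptions made for illustration; they are not the project's actual data or code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (assumption: 15 numeric features).
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit a probabilistic binary classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Log loss scores the predicted probabilities, not just the hard labels.
proba = model.predict_proba(X_test)
loss = log_loss(y_test, proba)

# Baseline for comparison: always predict the training class frequencies.
prior = np.bincount(y_train) / len(y_train)
baseline = log_loss(y_test, np.tile(prior, (len(y_test), 1)))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

print(f"model log loss:    {loss:.3f}")
print(f"baseline log loss: {baseline:.3f}")
print(cm)
```

Comparing against a class-prior baseline gives the log loss a reference point: a model is only useful if its loss falls below what constant predictions would achieve.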