
Commit

Merge pull request #214 from amosproj/dev
Release sprint 11
ur-tech authored Jan 24, 2024
2 parents 7c45cda + 583a1bb commit 621c43b
Showing 37 changed files with 3,868 additions and 254 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -53,6 +53,8 @@ bin/
!**/data/merged_geo.geojson
**/data/reviews/*.json
**/data/gpt-results/*.json
**/data/models/*
**/data/classification_reports/*

# Env files
*.env
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -32,3 +32,8 @@ repos:
rev: v2.1.0
hooks:
- id: reuse

- repo: https://github.com/kynan/nbstripout
rev: 0.6.1
hooks:
- id: nbstripout
Binary file not shown.
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <[email protected]>
Binary file added Deliverables/sprint-11/feature-board.png
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/feature-board.png.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <[email protected]>
Binary file added Deliverables/sprint-11/imp-squared-backlog.jpg
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/imp-squared-backlog.jpg.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <[email protected]>
Binary file added Deliverables/sprint-11/planning-documents.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/planning-documents.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <[email protected]>
106 changes: 106 additions & 0 deletions Documentation/Classifier-Comparison.md
@@ -0,0 +1,106 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Classifier Comparison

This document compares the results of the following classifiers on the enriched and
preprocessed data set from 22.01.2024.

- Quadratic Discriminant Analysis (QDA)
- Ridge Classifier
- Random Forest
- Support Vector Machine (SVM)
- Fully Connected Neural Networks Classifier Model (FCNNC)
- Fully Connected Neural Networks Regression Model (FCNNR)
- XGBoost Classifier Model
- K Nearest Neighbor Classifier (KNN)
- Bernoulli Naive Bayes Classifier

Each model type was tested on two splits of the data set. The data set has five
classes for prediction, corresponding to different merchant sizes: XS, S, M, L, and XL.
The first split used exactly these classes, i.e. the exact classes given by SumUp. The
second split grouped the classes S, M, and L into one new class, resulting in three
classes of the form {XS}, {S, M, L}, and {XL}. While this does not exactly correspond
to the classes given by SumUp, this simplification of the prediction task generally
resulted in a better F1-score across models.
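
As an illustrative sketch of this grouping (not the project's actual preprocessing
code), assuming pandas and string labels:

```python
import pandas as pd

# Illustrative labels; the real data set carries XS-XL merchant sizes.
labels = pd.Series(["XS", "S", "M", "L", "XL", "S"])

# Group S, M and L into a single class for the 3-class split.
grouped = labels.replace({"S": "S,M,L", "M": "S,M,L", "L": "S,M,L"})
print(grouped.tolist())  # ['XS', 'S,M,L', 'S,M,L', 'S,M,L', 'XL', 'S,M,L']
```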

## Experimental Attempts

According to the no-free-lunch theorem, there is no universal model or methodology that performs best on every problem or data set, so trying multiple approaches is crucial. In this section we document the experiments we ran and their corresponding performance and outputs.

## Models not performing well

### Support Vector Machine Classifier Model

Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them, given our data.

### Fully Connected Neural Networks Classifier Model

The Fully Connected Neural Network (FCNN) achieved overall lower performance than the Random Forest Classifier: it reached an F1-score of 0.84 on the XS class but 0.00 on every other class, i.e. it learned only the XS class. The FCNN consisted of 4 layers overall, with a ReLU activation function in each layer except the logits layer, which used Softmax. The loss functions investigated were Cross-Entropy and L2 loss; the optimizers were Adam and Stochastic Gradient Descent. Moreover, skip connections, L1 and L2 regularization techniques, and class weights were investigated as well. Unfortunately, we have not found any FCNN that outperforms the simpler ML models.
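
A minimal sketch of such an architecture, assuming PyTorch (which the project pins in
its Pipfile); the layer widths and input dimension are illustrative, not the values
used in our experiments:

```python
import torch
import torch.nn as nn

class FCNNClassifier(nn.Module):
    """4-layer fully connected classifier as described above."""

    def __init__(self, n_features: int = 100, n_classes: int = 5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),  # logits layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax on the logits layer, as in the setup above. With
        # nn.CrossEntropyLoss one would instead train on raw logits
        # and apply Softmax only at prediction time.
        return torch.softmax(self.layers(x), dim=-1)

model = FCNNClassifier()
probs = model(torch.randn(8, 100))  # 8 illustrative samples
print(probs.shape)  # torch.Size([8, 5])
```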

### Fully Connected Neural Networks Regression Model

The scientific paper "Inter-species cell detection - datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. proposes using a regression model for a classification task. The idea is to train the regression model on the class values; the model predicts a continuous value and thereby learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for classes XS, S, M, L, XL respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
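
A minimal sketch of the threshold step, assuming NumPy; `to_class` is a hypothetical
helper, not code from the paper or our pipeline:

```python
import numpy as np

CLASSES = np.array(["XS", "S", "M", "L", "XL"])
BOUNDARIES = [0.5, 1.5, 2.5, 3.5]  # upper edges of XS, S, M and L

def to_class(y_pred: np.ndarray) -> np.ndarray:
    # np.digitize maps values below 0.5 to index 0 (XS), values in
    # [0.5, 1.5) to index 1 (S), and so on up to XL.
    return CLASSES[np.digitize(y_pred, BOUNDARIES)]

print(to_class(np.array([0.2, 1.7, 3.9])))  # ['XS' 'M' 'XL']
```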

### QDA & Ridge Classifier

Neither of these classifiers produced satisfactory performance on either data set
split. While the prediction on the XS class was satisfactory (F1-score of ~0.84), all
other classes had F1-scores of ~0.00-0.15, resulting in an overall F1-score of ~0.11,
which is significantly outperformed by the other tested models. For this reason we do
not consider these predictors in future experiments.

## Well-performing models

### Random Forest Classifier

The Random Forest Classifier with 100 estimators achieved an overall F1-score of 0.62, with scores of 0.81, 0.13, 0.09, 0.08 and 0.15 for classes XS, S, M, L and XL respectively.
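
A hedged sketch of this setup using scikit-learn; the synthetic data stands in for the
enriched, preprocessed lead features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real pipeline uses the enriched,
# preprocessed lead features with XS-XL merchant-size labels.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=5,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```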

### Overall Results

Note:
The Random Forest Classifier used 100 estimators.
The KNN classifier used distance-based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors in the 3-class split.
The XGBoost classifier was trained for 10,000 rounds.
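
For reference, a sketch of these configurations with scikit-learn and the xgboost
Python package (all other hyperparameters are left at their defaults here, which may
differ from our runs):

```python
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# KNN: distance-based weighting; 10 neighbors for the 5-class split,
# 19 neighbors for the 3-class split.
knn_5class = KNeighborsClassifier(n_neighbors=10, weights="distance")
knn_3class = KNeighborsClassifier(n_neighbors=19, weights="distance")

# XGBoost: 10,000 boosting rounds (n_estimators in the sklearn wrapper).
xgb = XGBClassifier(n_estimators=10000)
```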

The following table shows each model's overall weighted F1-score on the 3-class and
5-class data set splits.

| | KNN | Naive Bayes | Random Forest | XGBoost |
| ------- | ------ | ----------- | ------------- | ------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | 0.6442 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | 0.6967 |

We can see that all classifiers perform better on the 3-class data set split, and that the XGBoost classifier performs best on both splits.

### Results for each class

#### 5-class split

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.82 | 0.83 | 0.81 | 0.84 |
| S | 0.15 | 0.02 | 0.13 | 0.13 |
| M | 0.08 | 0.02 | 0.09 | 0.08 |
| L | 0.06 | 0.00 | 0.08 | 0.06 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 |

For every model we can see that the predictions on the XS class are significantly better than those on every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, with S and XL as their second-best classes and M and L as their worst. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second-best class.

#### 3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.83 | 0.82 | 0.81 | 0.84 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 |

For the 3-class split we observe similar performance on the XS and {S, M, L} classes for each model, with the XGBoost model slightly outperforming the others. The KNN classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, the performance of the models on the XS class was barely affected by merging the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
34 changes: 34 additions & 0 deletions Documentation/OpenLLm-Business-Type-Analysis.md
@@ -0,0 +1,34 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Ruchita Nathani <[email protected]>
-->

# Business Type Analysis: Research and Proposed Solution

## Research

**1. Open-source LLM Model:**
I explored an open-source LLM model named CrystalChat available on Hugging Face (https://huggingface.co/LLM360/CrystalChat). Despite its capabilities, it has some limitations:

- **Computational Intensity:** CrystalChat is computationally heavy and cannot be run efficiently on local machines.

- **Infrastructure Constraints:** Running the model on Colab, although feasible, faces GPU limitations.

**2. OpenAI as an Alternative:**
Given the challenges with the open LLM model, OpenAI's GPT models provide a viable solution. While GPT is known for its computational costs, it offers unparalleled language understanding and generation capabilities.

## Proposed Solution

Considering the limitations of CrystalChat and the potential infrastructure costs associated with running an open LLM model on local machines, I propose the following solution:

1. **Utilize OpenAI Models:** Leverage OpenAI models, which are known for their robust language capabilities.

2. **Manage Costs:** Acknowledge the computational costs associated with GPT models and explore efficient usage options, such as optimizing queries or using cost-effective computing environments.

3. **Experiment with CrystalChat on AWS SageMaker:** As part of due diligence, consider executing CrystalChat on AWS SageMaker to evaluate its performance and potential integration.

4. **Decision Making:** After the experimentation phase, evaluate the performance, costs, and feasibility of both OpenAI and CrystalChat. Make an informed decision based on the achieved results.

## Conclusion

Leveraging OpenAI's GPT models offers advanced language understanding. To explore the potential of open-source LLM models, an experiment with CrystalChat on AWS SageMaker is suggested before making a final decision.
File renamed without changes.
12 changes: 6 additions & 6 deletions Documentation/data_fields.csv
@@ -49,9 +49,9 @@ regional_atlas_gdp_p_workhours,GDP per workhours,Regional Atlas
regional_atlas_pop_avg_age_zensus,Average population age (from zensus),Regional Atlas
regional_atlas_regional_score,Regional score,calculated
address_ver_1,?,?
review_avg_grammatical_score,,calculated
review_polarization_type,,calculated
review_polarization_score,,calculated
review_highest_rating_ratio,,calculated
review_lowest_rating_ratio,,calculated
review_rating_trend,,calculated
review_avg_grammatical_score,Average grammatical score of reviews,calculated
review_polarization_type,Polarization type of review ratings,calculated
review_polarization_score,Polarization score of review ratings,calculated
review_highest_rating_ratio,Ratio of the highest review ratings,calculated
review_lowest_rating_ratio,Ratio of the lowest review ratings,calculated
review_rating_trend,Value indicating the trend of ratings,calculated
8 changes: 8 additions & 0 deletions Pipfile
@@ -44,6 +44,14 @@ textblob = "==0.17.1"
deep-translator = "==1.11.4"
fsspec = "2023.12.2"
s3fs = "2023.12.2"
imblearn = "*"
sagemaker = "*"
joblib = "1.3.2"
xgboost = "*"
colorama = "*"
torch = "2.1.2"
deutschland = "0.4.0"
bs4 = "0.0.2"

[requires]
python_version = "3.10"
