
Commit

Merge pull request #214 from amosproj/dev
Release sprint 11
ur-tech authored Jan 24, 2024
2 parents 7c45cda + 583a1bb commit 621c43b
Showing 37 changed files with 3,868 additions and 254 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -53,6 +53,8 @@ bin/
!**/data/merged_geo.geojson
**/data/reviews/*.json
**/data/gpt-results/*.json
**/data/models/*
**/data/classification_reports/*

# Env files
*.env
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -32,3 +32,8 @@ repos:
rev: v2.1.0
hooks:
- id: reuse

- repo: https://github.com/kynan/nbstripout
rev: 0.6.1
hooks:
- id: nbstripout
Binary file not shown.
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <[email protected]>
Binary file added Deliverables/sprint-11/feature-board.png
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/feature-board.png.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <[email protected]>
Binary file added Deliverables/sprint-11/imp-squared-backlog.jpg
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/imp-squared-backlog.jpg.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <[email protected]>
Binary file added Deliverables/sprint-11/planning-documents.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-11/planning-documents.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <[email protected]>
106 changes: 106 additions & 0 deletions Documentation/Classifier-Comparison.md
@@ -0,0 +1,106 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Classifier Comparison

This document compares the results of the following classifiers on the enriched and
preprocessed data set from 22.01.2024.

- Quadratic Discriminant Analysis (QDA)
- Ridge Classifier
- Random Forest
- Support Vector Machine (SVM)
- Fully Connected Neural Networks Classifier Model (FCNNC)
- Fully Connected Neural Networks Regression Model (FCNNR)
- XGBoost Classifier Model
- K Nearest Neighbor Classifier (KNN)
- Bernoulli Naive Bayes Classifier

Each model type was tested on two splits of the data set. The data set has five
classes for prediction, corresponding to different merchant sizes: XS, S, M, L, and XL.
The first split used exactly these classes, i.e. the exact classes given by SumUp. The
second split grouped the classes S, M, and L into one new class, resulting in three
classes of the form {XS}, {S, M, L}, and {XL}. While this does not exactly correspond
to the classes given by SumUp, this simplification of the prediction task generally
resulted in a better F1-score across models.
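
As an illustrative sketch of this grouping (not the project's actual preprocessing
code), assuming pandas and string labels:

```python
import pandas as pd

# Illustrative labels; the real data set carries XS-XL merchant sizes.
labels = pd.Series(["XS", "S", "M", "L", "XL", "S"])

# Group S, M and L into a single class for the 3-class split.
grouped = labels.replace({"S": "S,M,L", "M": "S,M,L", "L": "S,M,L"})
print(grouped.tolist())  # ['XS', 'S,M,L', 'S,M,L', 'S,M,L', 'XL', 'S,M,L']
```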

## Experimental Attempts

According to the no-free-lunch theorem, there is no universal model or methodology that performs best on every problem or data set, so trying multiple approaches is crucial. In this section we document the experiments we ran and their corresponding performance and outputs.

## Models not performing well

### Support Vector Machine Classifier Model

Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them, given our data.

### Fully Connected Neural Networks Classifier Model

The Fully Connected Neural Network (FCNN) achieved overall lower performance than the Random Forest Classifier: it reached an F1-score of 0.84 on the XS class but 0.00 on every other class, i.e. it learned only the XS class. The FCNN consisted of 4 layers overall, with a ReLU activation function in each layer except the logits layer, which used Softmax. The loss functions investigated were Cross-Entropy and L2 loss; the optimizers were Adam and Stochastic Gradient Descent. Moreover, skip connections, L1 and L2 regularization techniques, and class weights were investigated as well. Unfortunately, we have not found any FCNN that outperforms the simpler ML models.
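
A minimal sketch of such an architecture, assuming PyTorch (which the project pins in
its Pipfile); the layer widths and input dimension are illustrative, not the values
used in our experiments:

```python
import torch
import torch.nn as nn

class FCNNClassifier(nn.Module):
    """4-layer fully connected classifier as described above."""

    def __init__(self, n_features: int = 100, n_classes: int = 5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),  # logits layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax on the logits layer, as in the setup above. With
        # nn.CrossEntropyLoss one would instead train on raw logits
        # and apply Softmax only at prediction time.
        return torch.softmax(self.layers(x), dim=-1)

model = FCNNClassifier()
probs = model(torch.randn(8, 100))  # 8 illustrative samples
print(probs.shape)  # torch.Size([8, 5])
```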

### Fully Connected Neural Networks Regression Model

The scientific paper "Inter-species cell detection - datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. proposes using a regression model for a classification task. The idea is to train the regression model on the class values; the model predicts a continuous value and thereby learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for classes XS, S, M, L, XL respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
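
A minimal sketch of the threshold step, assuming NumPy; `to_class` is a hypothetical
helper, not code from the paper or our pipeline:

```python
import numpy as np

CLASSES = np.array(["XS", "S", "M", "L", "XL"])
BOUNDARIES = [0.5, 1.5, 2.5, 3.5]  # upper edges of XS, S, M and L

def to_class(y_pred: np.ndarray) -> np.ndarray:
    # np.digitize maps values below 0.5 to index 0 (XS), values in
    # [0.5, 1.5) to index 1 (S), and so on up to XL.
    return CLASSES[np.digitize(y_pred, BOUNDARIES)]

print(to_class(np.array([0.2, 1.7, 3.9])))  # ['XS' 'M' 'XL']
```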

### QDA & Ridge Classifier

Neither of these classifiers produced satisfactory performance on either data set
split. While the prediction on the XS class was satisfactory (F1-score of ~0.84), all
other classes had F1-scores of ~0.00-0.15, resulting in an overall F1-score of ~0.11,
which is significantly outperformed by the other tested models. For this reason we do
not consider these predictors in future experiments.

## Well-performing models

### Random Forest Classifier

The Random Forest Classifier with 100 estimators achieved an overall F1-score of 0.62, with scores of 0.81, 0.13, 0.09, 0.08 and 0.15 for classes XS, S, M, L and XL respectively.
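
A hedged sketch of this setup using scikit-learn; the synthetic data stands in for the
enriched, preprocessed lead features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real pipeline uses the enriched,
# preprocessed lead features with XS-XL merchant-size labels.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=5,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```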

### Overall Results

Note:
The Random Forest Classifier used 100 estimators.
The KNN classifier used distance-based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors in the 3-class split.
The XGBoost classifier was trained for 10,000 rounds.
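
For reference, a sketch of these configurations with scikit-learn and the xgboost
Python package (all other hyperparameters are left at their defaults here, which may
differ from our runs):

```python
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# KNN: distance-based weighting; 10 neighbors for the 5-class split,
# 19 neighbors for the 3-class split.
knn_5class = KNeighborsClassifier(n_neighbors=10, weights="distance")
knn_3class = KNeighborsClassifier(n_neighbors=19, weights="distance")

# XGBoost: 10,000 boosting rounds (n_estimators in the sklearn wrapper).
xgb = XGBClassifier(n_estimators=10000)
```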

The following table shows each model's overall weighted F1-score on the 3-class and
5-class data set splits.

| | KNN | Naive Bayes | Random Forest | XGBoost |
| ------- | ------ | ----------- | ------------- | ------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | 0.6442 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | 0.6967 |

We can see that all classifiers perform better on the 3-class data set split, and that the XGBoost classifier performs best on both splits.

### Results for each class

#### 5-class split

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.82 | 0.83 | 0.81 | 0.84 |
| S | 0.15 | 0.02 | 0.13 | 0.13 |
| M | 0.08 | 0.02 | 0.09 | 0.08 |
| L | 0.06 | 0.00 | 0.08 | 0.06 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 |

For every model we can see that the predictions on the XS class are significantly better than those on every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, with S and XL as their second-best classes and M and L as their worst. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second-best class.

#### 3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.83 | 0.82 | 0.81 | 0.84 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 |

For the 3-class split we observe similar performance on the XS and {S, M, L} classes for each model, with the XGBoost model slightly outperforming the others. The KNN classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, the performance of the models on the XS class was barely affected by merging the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
34 changes: 34 additions & 0 deletions Documentation/OpenLLm-Business-Type-Analysis.md
@@ -0,0 +1,34 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Ruchita Nathani <[email protected]>
-->

# Business Type Analysis: Research and Proposed Solution

## Research

**1. Open-source LLM Model:**
I explored an open-source LLM model named CrystalChat available on Hugging Face (https://huggingface.co/LLM360/CrystalChat). Despite its capabilities, it has some limitations:

- **Computational Intensity:** CrystalChat is computationally heavy and cannot be run efficiently on local machines.

- **Infrastructure Constraints:** Running the model on Colab, although feasible, faces GPU limitations.

**2. OpenAI as an Alternative:**
Given the challenges with the open LLM model, OpenAI's GPT models provide a viable solution. While GPT is known for its computational costs, it offers unparalleled language understanding and generation capabilities.

## Proposed Solution

Considering the limitations of CrystalChat and the potential infrastructure costs associated with running an open LLM model on local machines, I propose the following solution:

1. **Utilize OpenAI Models:** Leverage OpenAI models, which are known for their robust language capabilities.

2. **Manage Costs:** Acknowledge the computational costs associated with GPT models and explore efficient usage options, such as optimizing queries or using cost-effective computing environments.

3. **Experiment with CrystalChat on AWS SageMaker:** As part of due diligence, consider executing CrystalChat on AWS SageMaker to evaluate its performance and potential integration.

4. **Decision Making:** After the experimentation phase, evaluate the performance, costs, and feasibility of both OpenAI and CrystalChat. Make an informed decision based on the achieved results.

## Conclusion

Leveraging OpenAI's GPT models offers advanced language understanding. To explore the potential of open-source LLM models, an experiment with CrystalChat on AWS SageMaker is suggested before making a final decision.
File renamed without changes.
12 changes: 6 additions & 6 deletions Documentation/data_fields.csv
@@ -49,9 +49,9 @@ regional_atlas_gdp_p_workhours,GDP per workhours,Regional Atlas
regional_atlas_pop_avg_age_zensus,Average population age (from zensus),Regional Atlas
regional_atlas_regional_score,Regional score,calculated
address_ver_1,?,?
review_avg_grammatical_score,,calculated
review_polarization_type,,calculated
review_polarization_score,,calculated
review_highest_rating_ratio,,calculated
review_lowest_rating_ratio,,calculated
review_rating_trend,,calculated
review_avg_grammatical_score,Average grammatical score of reviews,calculated
review_polarization_type,Polarization type of review ratings,calculated
review_polarization_score,Polarization score of review ratings,calculated
review_highest_rating_ratio,Ratio of the highest review ratings,calculated
review_lowest_rating_ratio,Ratio of the lowest review ratings,calculated
review_rating_trend,Value indicating the trend of ratings,calculated
8 changes: 8 additions & 0 deletions Pipfile
@@ -44,6 +44,14 @@ textblob = "==0.17.1"
deep-translator = "==1.11.4"
fsspec = "2023.12.2"
s3fs = "2023.12.2"
imblearn = "*"
sagemaker = "*"
joblib = "1.3.2"
xgboost = "*"
colorama = "*"
torch = "2.1.2"
deutschland = "0.4.0"
bs4 = "0.0.2"

[requires]
python_version = "3.10"
