161 add gradient boosting, logistic regression and random forests #210
Conversation
@nialov could you take a look at this PR at some point? Also @zelioluca, let me know if you have any comments since you're now also working with Sklearn and modelling stuff. I might be unaware of some best practices in this domain, and I am not sure if the parameter optimization is good like this.
Will take a look next week hopefully!
Perfect man!!!! Thanks
Hey, very small stuff. Fix if you think it is worth the time. I am not an expert on the functionality of these sklearn functions, so it's difficult to say anything about the business logic itself, but it looks very solid to me!
**kwargs,
)

# Training and optionally tuning the model
Based on this comment, I understand that the fitting should be done regardless of whether tuning is used or not. The comment should be updated if this is not the case, and the else-clause removed if it is.
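As a purely illustrative sketch (not the PR's actual code; the tune_parameters flag and the grid values are made up here), the distinction this comment points at could look like the following. If tuning already fits the model internally, the else-clause is justified and only the code comment needs rewording; otherwise the fit should be unconditional.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


def train(X_train, y_train, tune_parameters: bool = False):
    """Hypothetical helper showing why fitting and tuning may be mutually exclusive."""
    model = GradientBoostingClassifier(random_state=42)
    if tune_parameters:
        # GridSearchCV fits the model internally; an unconditional model.fit()
        # afterwards would silently retrain with the default parameters.
        search = GridSearchCV(model, {"n_estimators": [50, 100]}, cv=3)
        search.fit(X_train, y_train)
        return search.best_estimator_
    # Without tuning, a plain fit is all that is needed (the else-branch).
    model.fit(X_train, y_train)
    return model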
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

metrics = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
Nitpicky stuff, but I would refactor the strings into module variables, e.g.
MAE = "MAE"
MSE = "MSE"
...
These could then be imported in the tests instead of using the strings manually there.
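A minimal sketch of that refactor; the constant names mirror the dict keys shown above, and the module they would live in (model_utils is one option) is an assumption:

# Module-level constants for the metric names (names mirror the dict keys above).
MAE = "MAE"
MSE = "MSE"
RMSE = "RMSE"
R2 = "R2"


def regression_metrics(mae: float, mse: float, rmse: float, r2: float) -> dict:
    """Build the metrics dict from the constants instead of string literals."""
    return {MAE: mae, MSE: mse, RMSE: rmse, R2: r2}


# The tests could then import the same constants instead of re-typing strings,
# e.g. from eis_toolkit.prediction.model_utils import MAE, R2 (assumed location).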
Hey guys, I can help if you need, just let me know.
Quoting RichardScottOZ's review comment in eis_toolkit/prediction/gradient_boosting.py:
> +from typing import Literal, Optional, Tuple, Union
+
+import numpy as np
+import pandas as pd
+from beartype import beartype
+from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
+from sklearn.metrics import classification_report
+from sklearn.model_selection import train_test_split
+
+from eis_toolkit import exceptions
+from eis_toolkit.prediction.model_utils import evaluate_regression_model, tune_model_parameters
+
+
+@beartype
+def gradient_boosting_classifier_train(
+ X: Union[np.ndarray, pd.DataFrame],
On your user side, lots of people don't understand arrays - so having it
magically take dataframes/tables and transform them in the background helps
some people.
Very good job guys! I like it!
@msmiyels I have now pushed the reworked versions of these three ML models. The tests should cover most cases, but I might have missed something.
Hi Niko,
thank you, this is very clean and well structured code 🐱🏍
Regarding the methods, I have some points to discuss/to check on your side (a small sketch of the first two points follows below):
- Personally, I would like to have an option to control the visibility of the training progress (verbose), since training can take quite a while to finish. Most sklearn functions support this.
- I would consider adding the shuffle parameter for the train-test splits. If the data are ordered in a certain way, it makes sense to shuffle the dataset before splitting into train and test data.
- Another point is something we already started to discuss in the meeting last week: it would be nice to have the option to split before the actual training and test the model against data that it has never seen before. Currently, we have
  1. training without any split
  2. training with some split (simple, cv),
  but the test portion of the data is always used by the model to optimize a respective score. I think it would be useful to have a real test, even if it only becomes relevant when a sufficient amount of data is available.
- There are some comments regarding the test functions within the code review.
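A minimal sketch of the first two points in plain scikit-learn terms; the estimator, split fraction and random state here are only examples, not the PR's defaults:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# shuffle=True mixes a possibly ordered dataset before the train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# verbose > 0 prints per-iteration progress, useful when training takes a while
model = GradientBoostingClassifier(verbose=1, random_state=42)
model.fit(X_train, y_train)
print("held-out score:", model.score(X_test, y_test))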
# Test that all predicted labels have perfect metric scores since we are predicting with the test data
for metric in metrics:
    np.testing.assert_equal(out_metrics[metric], 1.0)
This is quite dangerous. For whatever reason, the returned score for accuracy here is 1.0. However, there are some misclassifications (3) when using the same parameters as provided in the test. The correct score should be 0.98 with max_iter = 100.
What I did to check:
# load tool stuff
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from eis_toolkit.prediction.logistic_regression import logistic_regression_train
# get data
X_IRIS, Y_IRIS = load_iris(return_X_y=True)
# define data to use
X = X_IRIS
Y = Y_IRIS
# run training
metrics = ["accuracy"]
model, out_metrics = logistic_regression_train(X, Y, random_state=42, max_iter=100)
# run prediction
y_pred = model.predict(X)
# get number of false classifications and compare
count_false = np.count_nonzero(y_pred - Y)
print(f"Returned metrics: {out_metrics}\nSum of FALSE classifications: {count_false}")
# Returned metrics: {'accuracy': 1.0}
# Sum of FALSE classifications: 3
# calculate and compare accuracy manually
score = accuracy_score(Y, y_pred)
print(f"Manual score: {score}\nDifference to returned score from EIS: {score - out_metrics['accuracy']}")
# Manual score: 0.98
# Difference to returned score from EIS: -0.020000000000000018
I found out the reason why accuracy is 1.0 in out_metrics but 0.98 when it is calculated like in your snippet. The out_metrics are produced during training, and in this case a simple 20%-80% split is used. The model trained with 80% of the data manages to classify everything in the 20% correctly, therefore accuracy is 1.0. However, after training, the whole data is fit to the model. So both Y and y_pred in score = accuracy_score(Y, y_pred) are different in your code from what they are in the training function, and the model trained with the whole data apparently doesn't manage to predict all of its labels correctly.
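To make the difference concrete, here is a standalone comparison. The 80/20 split, random state and solver settings are assumptions about what happens inside logistic_regression_train, based on the explanation above, not the function's actual internals:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Validation-style score, as reported in out_metrics: fit on 80 % of the data,
# score on the held-out 20 % (can be a perfect 1.0 on a dataset as easy as iris).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
valid_model = LogisticRegression(max_iter=100, random_state=42).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, valid_model.predict(X_valid)))

# Score of a model refitted on *all* data and evaluated on that same data,
# as in the snippet above; this is where the ~0.98 figure comes from.
full_model = LogisticRegression(max_iter=100, random_state=42).fit(X, y)
print("full-data accuracy:", accuracy_score(y, full_model.predict(X)))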
Hi Micha, thanks for the thorough review and good comments! I'll add both verbose and shuffle. About the option to split the data beforehand, do you think running a model with …
Hi Niko, sorry for my delay. Of course 😉 What's currently implemented uses either (1) training without any split or (2) training with a split (simple or cross-validation), where the held-out portion is still used to optimize a respective score.
What's missing is the option to train with either the first or second approach, but to keep another independent part of the input data away from the modelling. This third part can be used for a "real" test scenario, as the data were neither seen nor incorporated in the modelling part. It's only useful if enough data are available, which is, especially in geo-space, often not the case, but in principle it would look like this: split the data into train, validation and test parts, train and validate (or tune) with the first two, and score the final model once on the untouched test part (see the sketch below).
Does this make more sense now? Maybe it's already possible to build the workflow like this by calling the functions with different parameters, but if so, it would be good to have this kind of workflow integrated in the Plugin. Cheers 😉
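In plain scikit-learn terms, the missing third option could look roughly like this; the proportions and the estimator are illustrative only:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 1) set aside an untouched test portion that never enters the modelling
X_model, X_test, y_model, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# 2) model with the remainder using approach 1 or 2 (here: a simple split)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_model, y_model, test_size=0.25, shuffle=True, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("validation score:", model.score(X_valid, y_valid))

# 3) score once, at the very end, against data the model has never seen
print("test score:", model.score(X_test, y_test))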
Hey, no problem. Okay, I understand now. Do you think we should parameterize the training functions so that this whole workflow could be accomplished with one function call? What the user can now do is just as you described: first split their data (using …), then train and validate, and finally test with the held-out part. I guess we could add a parameter …
Ahoi Niko 🏴‍☠️, hope you had a nice start into the new week! I think this train/valid/test split is quite common in the ML world. However, I agree on the point of keeping things simple and not overloading it with complexity. So, even if it would be nice to have something like this in the core functionality of the toolkit, we could assume that advanced users who use the toolkit in code rather than the plugin are able to do this split themselves.
For the non-advanced users, it would be great to have this integration in the plugin, but it's totally okay if it uses the "workflow" above instead of calling a single complex toolkit function. We just have to take care that we describe the different approaches in the guidelines, for both user types. But to keep the terms consistent, I would suggest calling everything in the training functions "validation" and reserving "test" for the truly unseen data.
Would you agree on this?
- Renamed test -> validation where validation was meant
- Renamed simple_split -> split
- Renamed simple_split_size -> split_size
Hi! I definitely agree with term consistency – good that you pointed it out! The terms used were quite unclear. Now "validation" is used where validation was meant and "test" is reserved for unseen data (see the renames above).
Regarding the workflow for testing/scoring the model with unseen data, I think this 3-step workflow is good for the Toolkit side. However, I ended up creating simple EIS Toolkit functions for it as well: …
# Approach 2: Validation with splitting data once
elif validation_method == SPLIT:
    X_train, X_valid, y_train, y_valid = split_data(
        X, y, split_size=split_size, random_state=random_state, shuffle=True
Since the split_data function allows toggling shuffle on/off, it could be added to this function as a parameter, too (but kept with a True default). Also not crucial and most likely for advanced users only.
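A rough sketch of forwarding the flag instead of hardcoding it; everything here apart from the split_data idea and the split_size/random_state/shuffle parameters visible in the diff is hypothetical:

from typing import Optional, Tuple

import numpy as np
from sklearn.model_selection import train_test_split


def split_data_sketch(
    X: np.ndarray,
    y: np.ndarray,
    split_size: float = 0.2,
    random_state: Optional[int] = None,
    shuffle: bool = True,  # exposed with a True default instead of hardcoded
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Hypothetical wrapper showing shuffle passed through to the split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=split_size, random_state=random_state, shuffle=shuffle
    )
    return X_train, X_valid, y_train, y_valid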
I added shuffle as a parameter to the inner function (_train_and_validate_sklearn_model). I didn't add it yet to the public ones, but can do that too if you think it should be possible.
Ahoi Niko, 🏴‍☠️
nice work! 🐱🏍
I've only got these two points:
- random_seed defaulting to None would be better
- no "hardcoding" for the shuffle parameter

Except for these, ready to merge 🚀
Basic implementations of gradient boosting, logistic regression and random forests using Sklearn.
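For context, a hedged usage sketch of one of the new functions; the exact signature and returned metrics of gradient_boosting_classifier_train may differ from this, which simply mirrors the logistic_regression_train call shown earlier in the thread:

from sklearn.datasets import load_iris

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_classifier_train

X, y = load_iris(return_X_y=True)

# Assumed to return the fitted sklearn model plus a dict of validation metrics,
# analogous to logistic_regression_train(X, Y, random_state=42, max_iter=100) above.
model, out_metrics = gradient_boosting_classifier_train(X, y, random_state=42)

print(out_metrics)
print(model.predict(X[:5]))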