161 add gradient boosting, logistic regression and random forests #210
Conversation
@nialov could you take a look at this PR at some point? Also @zelioluca, let me know if you have any comments since you're now also working with Sklearn and modelling stuff. I might be unaware of some best practices in this domain, and I am not sure if the parameter optimization is good like this.
Will take a look next week hopefully!
Perfect man!!!! Thanks
Hey, very small stuff. Fix if you think it is worth the time. I am not an expert on the functionality of these sklearn functions, so it's difficult to say anything about the business logic itself, but it looks very solid to me!
**kwargs,
)

# Training and optionally tuning the model
Based on this comment, I understand that the fitting should be done regardless of whether tuning is used or not. The comment should be updated if this is not the case, and the else-clause removed if it is.
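As a purely illustrative sketch (not the PR's actual code; the tune_parameters flag and the grid values are made up here), the distinction this comment points at could look like the following. If tuning already fits the model internally, the else-clause is justified and only the code comment needs rewording; otherwise the fit should be unconditional.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


def train(X_train, y_train, tune_parameters: bool = False):
    """Hypothetical helper showing why fitting and tuning may be mutually exclusive."""
    model = GradientBoostingClassifier(random_state=42)
    if tune_parameters:
        # GridSearchCV fits the model internally; an unconditional model.fit()
        # afterwards would silently retrain with the default parameters.
        search = GridSearchCV(model, {"n_estimators": [50, 100]}, cv=3)
        search.fit(X_train, y_train)
        return search.best_estimator_
    # Without tuning, a plain fit is all that is needed (the else-branch).
    model.fit(X_train, y_train)
    return model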
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

metrics = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
Nitpicky stuff, but I would refactor the strings into module variables, e.g.
MAE = "MAE"
MSE = "MSE"
...
These could then be imported in the tests instead of using the strings manually there.
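A minimal sketch of that refactor; the constant names mirror the dict keys shown above, and the module they would live in (model_utils is one option) is an assumption:

# Module-level constants for the metric names (names mirror the dict keys above).
MAE = "MAE"
MSE = "MSE"
RMSE = "RMSE"
R2 = "R2"


def regression_metrics(mae: float, mse: float, rmse: float, r2: float) -> dict:
    """Build the metrics dict from the constants instead of string literals."""
    return {MAE: mae, MSE: mse, RMSE: rmse, R2: r2}


# The tests could then import the same constants instead of re-typing strings,
# e.g. from eis_toolkit.prediction.model_utils import MAE, R2 (assumed location).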
Hey guys, I can help if you need, just let me know.
Quoting RichardScottOZ's review comment in eis_toolkit/prediction/gradient_boosting.py:
> +from typing import Literal, Optional, Tuple, Union
+
+import numpy as np
+import pandas as pd
+from beartype import beartype
+from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
+from sklearn.metrics import classification_report
+from sklearn.model_selection import train_test_split
+
+from eis_toolkit import exceptions
+from eis_toolkit.prediction.model_utils import evaluate_regression_model, tune_model_parameters
+
+
+@beartype
+def gradient_boosting_classifier_train(
+ X: Union[np.ndarray, pd.DataFrame],
On your user side, lots of people don't understand arrays - so having it
magically take dataframes/tables and transform them in the background helps
some people.
Very good job guys! I like it!
@msmiyels I have now pushed the reworked versions of these three ML models. The tests should cover most cases, but I might have missed something.
Hi Niko,
thank you, this is very clean and well structured code 🐱🏍
Regarding the methods, I have some points to discuss/to check on your side (a small sketch of the first two points follows below):
- Personally, I would like to have an option to control the visibility of the training progress (verbose), since training can take quite a while to finish. Most sklearn functions support this.
- I would consider adding the shuffle parameter for the train-test splits. If the data are ordered in a certain way, it makes sense to shuffle the dataset before splitting into train and test data.
- Another point is something we already started to discuss in the meeting last week: it would be nice to have the option to split before the actual training and test the model against data that it has never seen before. Currently, we have
  1. training without any split
  2. training with some split (simple, cv),
  but the test portion of the data is always used by the model to optimize a respective score. I think it would be useful to have a real test, even if it only becomes relevant when a sufficient amount of data is available.
- There are some comments regarding the test functions within the code review.
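A minimal sketch of the first two points in plain scikit-learn terms; the estimator, split fraction and random state here are only examples, not the PR's defaults:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# shuffle=True mixes a possibly ordered dataset before the train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# verbose > 0 prints per-iteration progress, useful when training takes a while
model = GradientBoostingClassifier(verbose=1, random_state=42)
model.fit(X_train, y_train)
print("held-out score:", model.score(X_test, y_test))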
# Test that all predicted labels have perfect metric scores since we are predicting with the test data
for metric in metrics:
    np.testing.assert_equal(out_metrics[metric], 1.0)
This is quite dangerous. For whatever reason, the returned score for accuracy here is 1.0. However, there are some misclassifications (3) when using the same parameters as provided in the test. The correct score should be 0.98 with max_iter = 100.
What I did to check:
# load tool stuff
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from eis_toolkit.prediction.logistic_regression import logistic_regression_train
# get data
X_IRIS, Y_IRIS = load_iris(return_X_y=True)
# define data to use
X = X_IRIS
Y = Y_IRIS
# run training
metrics = ["accuracy"]
model, out_metrics = logistic_regression_train(X, Y, random_state=42, max_iter=100)
# run prediction
y_pred = model.predict(X)
# get number of false classifications and compare
count_false = np.count_nonzero(y_pred - Y)
print(f"Returned metrics: {out_metrics}\nSum of FALSE classifications: {count_false}")
# Returned metrics: {'accuracy': 1.0}
# Sum of FALSE classifications: 3
# calculate and compare accuracy manually
score = accuracy_score(Y, y_pred)
print(f"Manual score: {score}\nDifference to returned score from EIS: {score - out_metrics['accuracy']}")
# Manual score: 0.98
# Difference to returned score from EIS: -0.020000000000000018
I found out the reason why accuracy is 1.0 in out_metrics but 0.98 when it is calculated like in your snippet. The out_metrics are produced during training, and in this case a simple 20%-80% split is used. The model trained with 80% of the data manages to classify everything in the 20% correctly, therefore accuracy is 1.0. However, after training, the whole data is fit to the model. So both Y and y_pred in score = accuracy_score(Y, y_pred) are different in your code from what they are in the training function, and the model trained with the whole data apparently doesn't manage to predict all of its labels correctly.
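To make the difference concrete, here is a standalone comparison. The 80/20 split, random state and solver settings are assumptions about what happens inside logistic_regression_train, based on the explanation above, not the function's actual internals:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Validation-style score, as reported in out_metrics: fit on 80 % of the data,
# score on the held-out 20 % (can be a perfect 1.0 on a dataset as easy as iris).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
valid_model = LogisticRegression(max_iter=100, random_state=42).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, valid_model.predict(X_valid)))

# Score of a model refitted on *all* data and evaluated on that same data,
# as in the snippet above; this is where the ~0.98 figure comes from.
full_model = LogisticRegression(max_iter=100, random_state=42).fit(X, y)
print("full-data accuracy:", accuracy_score(y, full_model.predict(X)))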
Hi Micha, thanks for the thorough review and good comments! I'll add both verbose and shuffle. About the option to split the data beforehand, do you think running a model with …
Hi Niko, sorry for my delay. Of course 😉 What's currently implemented uses either (1) training without any split or (2) training with a split (simple or cross-validation), where the held-out portion is still used to optimize a respective score.
What's missing is the option to train with either the first or second approach, but to keep another independent part of the input data away from the modelling. This third part can be used for a "real" test scenario, as the data were neither seen nor incorporated in the modelling part. It's only useful if enough data are available, which is, especially in geo-space, often not the case, but in principle it would look like this: split the data into train, validation and test parts, train and validate (or tune) with the first two, and score the final model once on the untouched test part (see the sketch below).
Does this make more sense now? Maybe it's already possible to build the workflow like this by calling the functions with different parameters, but if so, it would be good to have this kind of workflow integrated in the Plugin. Cheers 😉
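In plain scikit-learn terms, the missing third option could look roughly like this; the proportions and the estimator are illustrative only:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 1) set aside an untouched test portion that never enters the modelling
X_model, X_test, y_model, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# 2) model with the remainder using approach 1 or 2 (here: a simple split)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_model, y_model, test_size=0.25, shuffle=True, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("validation score:", model.score(X_valid, y_valid))

# 3) score once, at the very end, against data the model has never seen
print("test score:", model.score(X_test, y_test))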
Hey, no problem. Okay, I understand now. Do you think we should parameterize the training functions so that this whole workflow could be accomplished with one function call? What the user can now do is just as you described: first split their data (using …), then train and validate, and finally test with the held-out part. I guess we could add a parameter …
Ahoi Niko 🏴‍☠️, hope you had a nice start into the new week! I think this train/valid/test split is quite common in the ML world. However, I agree on the point of keeping things simple and not overloading it with complexity. So, even if it would be nice to have something like this in the core functionality of the toolkit, we could assume that advanced users who use the toolkit in code rather than the plugin are able to do this split themselves.
For the non-advanced users, it would be great to have this integration in the plugin, but it's totally okay if it uses the "workflow" above instead of calling a single complex toolkit function. We just have to take care that we describe the different approaches in the guidelines, for both user types. But to keep the terms consistent, I would suggest calling everything in the training functions "validation" and reserving "test" for the truly unseen data.
Would you agree on this?
- Renamed test -> validation where validation was meant
- Renamed simple_split -> split
- Renamed simple_split_size -> split_size
Hi! I definitely agree with term consistency – good that you pointed it out! The terms used were quite unclear. Now "validation" is used where validation was meant and "test" is reserved for unseen data (see the renames above).
Regarding the workflow for testing/scoring the model with unseen data, I think this 3-step workflow is good for the Toolkit side. However, I ended up creating simple EIS Toolkit functions for it as well: …
# Approach 2: Validation with splitting data once
elif validation_method == SPLIT:
    X_train, X_valid, y_train, y_valid = split_data(
        X, y, split_size=split_size, random_state=random_state, shuffle=True
Since the split_data function allows toggling shuffle on/off, it could be added to this function as a parameter, too (but kept with a True default). Also not crucial and most likely for advanced users only.
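A rough sketch of forwarding the flag instead of hardcoding it; everything here apart from the split_data idea and the split_size/random_state/shuffle parameters visible in the diff is hypothetical:

from typing import Optional, Tuple

import numpy as np
from sklearn.model_selection import train_test_split


def split_data_sketch(
    X: np.ndarray,
    y: np.ndarray,
    split_size: float = 0.2,
    random_state: Optional[int] = None,
    shuffle: bool = True,  # exposed with a True default instead of hardcoded
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Hypothetical wrapper showing shuffle passed through to the split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=split_size, random_state=random_state, shuffle=shuffle
    )
    return X_train, X_valid, y_train, y_valid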
I added shuffle as a parameter to the inner function (_train_and_validate_sklearn_model). I didn't add it yet to the public ones, but can do that too if you think it should be possible.
Ahoi Niko, 🏴‍☠️
nice work! 🐱🏍
I've only got these two points:
- random_seed defaulting to None would be better
- no "hardcoding" for the shuffle parameter

Except for these, ready to merge 🚀
Basic implementations of gradient boosting, logistic regression and random forests using Sklearn.
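For context, a hedged usage sketch of one of the new functions; the exact signature and returned metrics of gradient_boosting_classifier_train may differ from this, which simply mirrors the logistic_regression_train call shown earlier in the thread:

from sklearn.datasets import load_iris

from eis_toolkit.prediction.gradient_boosting import gradient_boosting_classifier_train

X, y = load_iris(return_X_y=True)

# Assumed to return the fitted sklearn model plus a dict of validation metrics,
# analogous to logistic_regression_train(X, Y, random_state=42, max_iter=100) above.
model, out_metrics = gradient_boosting_classifier_train(X, y, random_state=42)

print(out_metrics)
print(model.predict(X[:5]))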