Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of Time In space split is wrong on space_time_split_dataset function #91

Closed
marcelogdeandrade opened this issue Aug 22, 2019 · 3 comments · Fixed by #92
Closed

Out of Time In space split is wrong on space_time_split_dataset function #91

marcelogdeandrade opened this issue Aug 22, 2019 · 3 comments · Fixed by #92
Labels
bug Something isn't working good first issue Good for newcomers
Milestone

Comments

@marcelogdeandrade
Copy link
Contributor

Issue location

src/fklearn/preprocessing/splitting.py

Function: space_time_split_dataset

Problem description

space_time_split_dataset splits the input DataFrame into 4 parts:

  • train_set
  • intime_outspace_hdout
  • outime_inspace_hdout
  • outime_outspace_hdout

The outime_inspace_hdout split is defined wrongly on the function. It is only filtered by time, but not in space

outime_inspace_hdout = dataset[
        (dataset[time_column] >= train_end_date) & (dataset[time_column] < holdout_end_date)]

Expected behavior

The outime_inspace_hdout should split the DataFrame out of space and in time, not only out of space

Possible solutions

We should rename this variable to outime_hdout, and define the right split with the other 3.

E.g.

train_set = train_period[~train_period[space_column].isin(holdout_space)]
intime_outspace_hdout = train_period[train_period[space_column].isin(holdout_space)]
outime_outspace_hdout = outime_hdout[outime_hdout[space_column].isin(holdout_space)]
outime_inspace_hdout = outime_hdout[~outime_hdout[space_column].isin(holdout_space)]
@marcelogdeandrade marcelogdeandrade added the bug Something isn't working label Aug 22, 2019
@caique-lima caique-lima added the good first issue Good for newcomers label Aug 22, 2019
@caique-lima caique-lima added this to the 1.16.x milestone Aug 22, 2019
@caique-lima
Copy link
Contributor

Is a bit hard to understand this issue because I think we have some messed up stuff in the documentation.

First of all seems that we have an issue with the docstring:

intime_outspace_hdout : pandas.DataFrame
    The out of ID sample and in time hold out set.

Is equal to

outime_inspace_hdout : pandas.DataFrame
    The out of ID sample and in time hold out set.

This last one should be The in ID sample and out time hold out set

And just to clarify, the expected behaviour is that in this test:

https://github.com/nubank/fklearn/blob/master/tests/preprocessing/test_splitting.py#L62

This value should be:

'space': ['space2'],

Just like the training dataset

@marcelogdeandrade can you solve this bug? Your solution seems fine to me, but don't forget to update the docstring

@marcelogdeandrade
Copy link
Contributor Author

@caique-lima Sure, I'll solve it

@caique-lima
Copy link
Contributor

Related to #62

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants