min_bin_size and max_bin_size not working when using sample_weight in ContinuousOptimalBinning #323

Open
diegodebrito opened this issue Jul 9, 2024 · 1 comment
Labels
bug Something isn't working

diegodebrito commented Jul 9, 2024

The parameters min_bin_size and max_bin_size don't seem to take effect when sample_weight is passed to fit. The example below produces only one bin, regardless of the values of those parameters.

Removing sample_weight from the fit call makes the binning work as expected (just comment out that argument and rerun the example below).

Please let me know if it's my lack of understanding or if I'm using the tool incorrectly.

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
 'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},
 'num_obs': {0: 166252,
  1: 305567,
  2: 245220,
  3: 182303,
  4: 137543,
  5: 113468,
  6: 99369,
  7: 92211,
  8: 87613,
  9: 76431}})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.1,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
         sample_weight=df['num_obs']
        )

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
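
A quick check, independent of optbinning, suggests the size constraint alone should not force a single bin here. The sketch below assumes that min_bin_size is meant as a minimum fraction of the total sample weight when sample_weight is given; that reading is an assumption and is not confirmed by the library. It only computes each value's share of the total weight.

```python
# Weights copied from the 'num_obs' column of the example above.
num_obs = [166252, 305567, 245220, 182303, 137543,
           113468, 99369, 92211, 87613, 76431]

total = sum(num_obs)
# Share of the total weight carried by each distinct value 0..9.
shares = [w / total for w in num_obs]

for value, share in enumerate(shares):
    print(f"value={value}: weighted share = {share:.4f}")

# ASSUMPTION: min_bin_size is a minimum fraction of the total weight.
min_bin_size = 0.1
# Values that would satisfy the minimum size as a bin on their own.
singletons = [v for v, s in enumerate(shares) if s >= min_bin_size]
print("values meeting min_bin_size alone:", singletons)
```

Under that assumption, values 0 through 3 each carry at least 10% of the weight on their own, so a split into several bins looks feasible as far as the size limits are concerned.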

diegodebrito (Author) commented

Hi @guillermo-navas-palencia, wondering if you could check on this. I'm adding a more comprehensive example below:

The baseline, without sample_weight, works fine and finds the bins:

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'value': {0: 0.0,
  1: 1.0,
  2: 2.0,
  3: 3.0,
  4: 4.0,
  5: 5.0,
  6: 6.0,
  7: 7.0,
  8: 8.0,
  9: 9.0},
'target': {0: 7.747250464922968,
  1: 6.527567693419396,
  2: 5.951775031334447,
  3: 5.4739748791420855,
  4: 5.635028933057227,
  5: 5.177333709759795,
  6: 5.242660923463983,
  7: 4.681195578721209,
  8: 4.921130922493046,
  9: 4.698432205030768},                   
})

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()

Giving every observation a weight of 10 makes the problem infeasible, which is odd, since uniform weights amount to a simple rescaling:

df['num_obs'] = [10] * 10

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target'], 
         sample_weight=df['num_obs']
        )

print(optb.status)
binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
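
The "simple scaling" expectation can be illustrated without the library. The sketch below assumes bin size is measured as a fraction of the total weight (an assumption about the intended semantics, not the library's documented behavior). The helper bin_fractions and the split points are hypothetical and only serve the illustration.

```python
def bin_fractions(values, weights, splits):
    """Fraction of the total weight falling in each bin defined by the split points."""
    totals = [0.0] * (len(splits) + 1)
    for v, w in zip(values, weights):
        idx = sum(v >= s for s in splits)  # index of the bin containing v
        totals[idx] += w
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

values = list(range(10))  # same 'value' column as in the example
splits = [3, 7]           # hypothetical split points, for illustration only

unweighted = bin_fractions(values, [1] * 10, splits)
weighted = bin_fractions(values, [10] * 10, splits)

print("no weights:       ", unweighted)
print("weights all = 10: ", weighted)
```

Both calls give the same fractions because the uniform weight cancels in the ratio, so under this reading a min_bin_size that is satisfiable without weights should remain satisfiable with uniform weights.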

Repeating observations 10 times instead of using weights works fine:

df = df.loc[df.index.repeat(10)]  # repeat every row 10 times

variable = "target"
optb = ContinuousOptimalBinning(dtype="numerical",
                                min_bin_size=0.3,
                                max_bin_size=1.0,
                               )
optb.fit(df['value'], 
         df['target']
        )

print(optb.status)

binning_table = optb.binning_table
binning_table.build()
binning_table.plot()
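
Since row repetition works, a possible stopgap for integer weights is to expand the frame before fitting. This is a minimal sketch: expand_by_weight is a hypothetical helper (not part of optbinning), it assumes a unique index and non-negative integer weights, and the demo data is made up.

```python
import pandas as pd

def expand_by_weight(df, weight_col):
    """Repeat each row as many times as its integer weight, then drop the weight column."""
    counts = df[weight_col].astype(int).to_numpy()
    # Assumes df has a unique index, so .loc returns one row per repeated label.
    expanded = df.loc[df.index.repeat(counts)].drop(columns=weight_col)
    return expanded.reset_index(drop=True)

# Toy data for illustration only (not the values from the issue).
demo = pd.DataFrame({"value": [0.0, 1.0, 2.0],
                     "target": [7.7, 6.5, 5.9],
                     "num_obs": [2, 1, 3]})

expanded = expand_by_weight(demo, "num_obs")
print(expanded)
```

The expanded frame can then be passed to fit without sample_weight, as in the snippet above. With weights as large as in the first example (about 1.5 million in total) the expanded frame grows accordingly, and non-integer weights are not handled by this approach.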

Thanks for your work on this great tool!

guillermo-navas-palencia added the bug label Oct 30, 2024
guillermo-navas-palencia added this to the v0.21.0 milestone Oct 30, 2024