Summary statistics could be incorrect when using #324

Open
diegodebrito opened this issue Jul 9, 2024 · 0 comments

I created a simple dataframe with age, salary, and num_obs:

import pandas as pd
from optbinning import ContinuousOptimalBinning

df = pd.DataFrame({'age': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'salary': {0: 0.7739560485559633,
  1: 0.4388784397520523,
  2: 0.8585979199113825,
  3: 0.6973680290593639,
  4: 0.09417734788764953,
  5: 0.9756223516367559,
  6: 0.761139701990353,
  7: 0.7860643052769538,
  8: 0.12811363267554587,
  9: 0.45038593789556713},
 'num_obs': {0: 5, 1: 4, 2: 3, 3: 7, 4: 6, 5: 6, 6: 5, 7: 7, 8: 5, 9: 5}})

Better displayed as:

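| age | salary | num_obs |
|---|---|---|
| 1 | 0.773956 | 5 |
| 2 | 0.438878 | 4 |
| 3 | 0.858598 | 3 |
| 4 | 0.697368 | 7 |
| 5 | 0.0941773 | 6 |
| 6 | 0.975622 | 6 |
| 7 | 0.76114 | 5 |
| 8 | 0.786064 | 7 |
| 9 | 0.128114 | 5 |
| 10 | 0.450386 | 5 |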

I then generated optimal bins using num_obs as sample weights:

optb = ContinuousOptimalBinning(dtype="numerical")
optb.fit(df['age'], df['salary'], sample_weight=df['num_obs'])

binning_table = optb.binning_table
binning_table.build()

Which results in:

| | Bin | Count | Count (%) | Sum | Std | Mean | Min | Max | Zeros count | WoE | IV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-inf, 1.50) | 5 | 0.0943396 | 3.86978 | 1.5479120971119267 | 0.773956 | 3.86978 | 3.86978 | 0 | 0.175803 | 0.0165852 |
| 1 | [1.50, 4.50) | 14 | 0.264151 | 9.21288 | 1.401113568517211 | 0.658063 | 1.75551 | 4.88158 | 0 | 0.0599101 | 0.0158253 |
| 2 | [4.50, 8.50) | 24 | 0.45283 | 15.7269 | 1.6960750859575562 | 0.655289 | 0.565064 | 5.85373 | 0 | 0.0571365 | 0.0258731 |
| 3 | [8.50, 9.50) | 5 | 0.0943396 | 0.640568 | 0.25622726535109175 | 0.128114 | 0.640568 | 0.640568 | 0 | -0.470039 | 0.0443433 |
| 4 | [9.50, inf) | 5 | 0.0943396 | 2.25193 | 0.9007718757911344 | 0.450386 | 2.25193 | 2.25193 | 0 | -0.147767 | 0.0139403 |
| 5 | Special | 0 | 0 | 0 | nan | 0 | nan | nan | 0 | -0.598153 | 0 |
| 6 | Missing | 0 | 0 | 0 | nan | 0 | nan | nan | 0 | -0.598153 | 0 |
| Totals | | 53 | 1 | 31.7021 | | 0.598153 | 0.565064 | 5.85373 | 0 | 2.10696 | 0.116567 |

Notice that row 3 (bin [8.50, 9.50)) has a nonzero Std. The only age that falls in that bin is 9, so I would expect the Std to be 0. The other statistics also look wrong: in that same row, Min and Max both equal the Sum (0.640568) rather than the single salary value in the bin (0.128114).
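For reference, here is a minimal sketch of the values I would expect, computed with plain NumPy and treating `num_obs` as frequency weights. The split points are read off the Bin column of the table above; `weighted_stats` and `suspect_std` are hypothetical names of my own, not part of optbinning. The last calculation is only a guess at a formula that reproduces the reported Std. I have not confirmed it against the library source.

```python
import numpy as np

# Data copied from the dataframe in this issue.
age = np.arange(1, 11)
salary = np.array([
    0.7739560485559633, 0.4388784397520523, 0.8585979199113825,
    0.6973680290593639, 0.09417734788764953, 0.9756223516367559,
    0.761139701990353, 0.7860643052769538, 0.12811363267554587,
    0.45038593789556713,
])
num_obs = np.array([5, 4, 3, 7, 6, 6, 5, 7, 5, 5])

# Split points read off the Bin column of the reported table.
splits = [1.5, 4.5, 8.5, 9.5]
bin_idx = np.digitize(age, splits)  # 0 for (-inf, 1.50), ..., 4 for [9.50, inf)


def weighted_stats(y, w):
    """Frequency-weighted summary statistics (hypothetical helper)."""
    mean = np.average(y, weights=w)
    var = np.average((y - mean) ** 2, weights=w)
    return {
        "count": int(w.sum()),
        "sum": float((w * y).sum()),
        "mean": float(mean),
        "std": float(np.sqrt(var)),
        "min": float(y.min()),  # min/max of the raw values, not of w * y
        "max": float(y.max()),
    }


expected = {
    i: weighted_stats(salary[bin_idx == i], num_obs[bin_idx == i])
    for i in range(len(splits) + 1)
}

# Assumption (unverified): squaring w * y instead of weighting y ** 2
# would give a nonzero Std even for a bin holding a single distinct value.
y, w = salary[bin_idx == 3], num_obs[bin_idx == 3]
suspect_std = float(
    np.sqrt(((w * y) ** 2).sum() / w.sum() - expected[3]["mean"] ** 2)
)

print(expected[3])
print(suspect_std)
```

Under these assumptions, bin 3 has Std 0 (up to floating-point rounding) and Min = Max = 0.128114, while `suspect_std` comes out at about 0.256227, the same value reported in the table. The reported Min and Max in every row likewise match the smallest and largest `w * y` in the bin rather than the raw salaries.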

Please let me know whether there is an issue when using weights or whether I'm misreading the results.

Thanks!

PS: this might be related to issue #323.
