Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError in notebook "1. Intro..." #54

Open
ansgar-t opened this issue May 28, 2022 · 4 comments
Open

KeyError in notebook "1. Intro..." #54

ansgar-t opened this issue May 28, 2022 · 4 comments

Comments

@ansgar-t
Copy link

Depending on the random train/test split this code can give a key error:

for i in X_test_features:
    predicted_values.append(joint_prob[i[0], i[1]].idxmax())

Here's a possible alternative:

for i in X_test_features:
    key = (i[0], i[1])
    conditional_prob = joint_prob[key].idxmax() if key in joint_prob else 0.0
    predicted_values.append(conditional_prob)
@ankurankan
Copy link
Member

ankurankan commented Jun 3, 2022

@ansgar-t Yes, this seems to be an issue. Thanks for reporting it. But I think it would be better to assign None/Nan for keys that don't exist. As 0 is an actual class value which would lead to mis-classification. Something like: joint_prob[key].idxmax() if key in joint_prob else None. What do you think? and would you like to create a PR for it?

@ansgar-t
Copy link
Author

ansgar-t commented Jun 6, 2022

you're right about the else case, of course. "0.0" doesn't make sense there.

looks like I thought of separating 2 steps:

  • Looking up estimated conditional probabilities
  • Maximizing them.

... and then I didn't. :)

@ansgar-t
Copy link
Author

ansgar-t commented Jun 6, 2022

having said that...

I think adding the following code to the preparation of joint_prob would be my preferred solution now:

# making sure, that the estimated joint probability is defined over the full domain, 
# using 0.0 for value combinations not seen in the data:

length_domain = range(16) # assuming length cannot exceed 15
width_domain = range(11) # assuming width cannot exceed 10
type_domain = range(3) 

full_index = pd.MultiIndex.from_product([length_domain, width_domain, type_domain])

joint_prob = joint_prob.reindex(full_index).fillna(0.0)

@ankurankan
Copy link
Member

@ansgar-t Sorry for the super late reply. Yes, this solution also looks good. Would be great if you would open a PR with the fix :). Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants