Constant columns labeled as continuous #5

Open
amueller opened this issue Dec 23, 2020 · 3 comments

Comments

@amueller

The training set contains several constant columns labeled as continuous:

train_merged[(train_merged.y_act == 0) & (train_merged.num_of_dist_val == 1)]
     Record_id             Attribute_name  y_act  total_vals  num_nans  \
142         51                      count      0       29421         0   
587        101  M1_CURRENT_PROGRAM_NUMBER      0         605         0   
627        102  M1_CURRENT_PROGRAM_NUMBER      0         740         0   
640        102           S1_SystemInertia      0         740         0   

     %_nans  num_of_dist_val  %_dist_val  mean  std_dev  min_val  max_val  \
142     0.0                1    0.003399   1.0      0.0      1.0      1.0   
587     0.0                1    0.165289   1.0      0.0      1.0      1.0   
627     0.0                1    0.135135   1.0      0.0      1.0      1.0   
640     0.0                1    0.135135  12.0      0.0     12.0     12.0   

    sample_1 sample_2 sample_3 sample_4 sample_5  \
142        1        1        1        1        1   
587        1        1        1        1        1   
627        1        1        1        1        1   
640       12       12       12       12       12   

                                name  \
142  aac_shelter_cat_outcome_eng.csv   
587                experiment_08.csv   
627                experiment_09.csv   
640                experiment_09.csv   

                                                  link  
142  https://www.kaggle.com/aaronschlegel/austin-an...  
587  https://www.kaggle.com/shasun/tool-wear-detect...  
627  https://www.kaggle.com/shasun/tool-wear-detect...  
640  https://www.kaggle.com/shasun/tool-wear-detect...  

I think these should be labeled as not-generalizable.
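For anyone wanting to reproduce the check on a raw file rather than on `train_merged`, here is a minimal sketch of how a constant column can be detected directly. The toy frame is made up for illustration: `count` and `S1_SystemInertia` echo attribute names from the output above, while `feedrate` and all of the values are hypothetical.

```python
import pandas as pd

# Toy stand-in for one raw CSV. Values are invented for illustration;
# "feedrate" is a hypothetical non-constant column.
raw = pd.DataFrame({
    "count": [1, 1, 1, 1],
    "S1_SystemInertia": [12, 12, 12, 12],
    "feedrate": [3, 6, 20, 6],
})

# Number of distinct non-null values per column, analogous to num_of_dist_val.
distinct = raw.nunique(dropna=True)

# A column with exactly one distinct value is constant.
constant_columns = distinct[distinct == 1].index.tolist()
print(constant_columns)
```

Columns found this way carry no signal for a downstream model, which is the argument for labeling them not-generalizable instead of continuous.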

@amueller
Author

Actually, they are labeled as all kinds of things, mostly lists:

train_merged[(train_merged.y_act != 6) & (train_merged.num_of_dist_val == 1)].y_act.value_counts()
7    327
2     12
4      5
0      4
3      3
1      1

@pvn25
Owner

pvn25 commented Jan 28, 2021

Thank you for pointing this out. I found 25 examples overall in our labeled data that should have been Not-Generalizable. I will update both the dataset and the benchmark with these changes.

To clarify the encoding: 6 is decoded as List and 7 as Not-Generalizable. I apologize for this encoding; it is a bit confusing. I will change the codes to the actual type names.
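A small sketch of what decoding the counts could look like, using the numbers from the `value_counts()` output earlier in the thread. Only the names given here (6 and 7) are filled in, plus 0 as Continuous, which is inferred from the issue title and the `y_act == 0` filter. The other codes are left undecoded because the thread does not name them. The helper `decode` is hypothetical and not part of the project.

```python
# Counts copied from the value_counts() output earlier in the thread
# (constant columns only; rows with y_act == 6 were excluded by that filter).
counts_by_code = {7: 327, 2: 12, 4: 5, 0: 4, 3: 3, 1: 1}

# Only names stated or implied in the thread; 0 -> "Continuous" is an inference.
CODE_TO_TYPE = {0: "Continuous", 6: "List", 7: "Not-Generalizable"}


def decode(code):
    """Return the type name for a code, or a placeholder if the thread gives none."""
    return CODE_TO_TYPE.get(code, f"code {code}")


decoded_counts = {decode(code): n for code, n in counts_by_code.items()}

# Constant columns whose label is something other than Not-Generalizable.
mislabeled = sum(n for code, n in counts_by_code.items() if code != 7)

print(decoded_counts)
print(mislabeled)
```

With these counts, `mislabeled` comes out to 12 + 5 + 4 + 3 + 1 = 25, which agrees with the figure of 25 quoted above, though the thread does not say how that figure was derived.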

@amueller
Author

Oh sorry for the 6/7 confusion, I guess it wasn't as bad as I made it look :) Thanks for checking up on it!
