
Dataset issue #2

Open
Kannadasa opened this issue May 12, 2020 · 20 comments

Comments

@Kannadasa

Hi,

I am testing your model, but I am not getting the desired output. I think I am not distributing the data properly into the train and valid folders.

Please let me know how you are creating the folder structure and loading the images for the train and valid datasets. This is for binary classification.

@LelisThanos

Hello,
Same issue here: I am having trouble reproducing your code when it comes to loading and distributing the images into train and validation datasets.

@muhammedtalo
Owner

You may use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
for splitting the datasets.
I have also provided our results for three classes. Please see the COVID-19 main repository.
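
For reference, a minimal sketch of how scikit-learn's StratifiedKFold produces fold indices; the file names and labels below are placeholders, not the repository's actual data:

from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the real file list and labels (replace with your own data).
image_paths = [f"img_{i}.png" for i in range(25)]
labels = ["Covid-19"] * 5 + ["No_findings"] * 20

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(image_paths, labels)):
    print(f"fold {fold}: {len(train_idx)} train / {len(valid_idx)} valid")

Each fold keeps the 1:4 class ratio, so the small Covid-19 class is represented in every validation split.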

@Kannadasa
Author

Do we know the actual results of the X-ray images? Or can I assume that all 125 X-ray images inside the Covid-19 folder are COVID-19 positive?

Thanks
Kannadasan

@muhammedtalo
Owner

Do we know the actual results of the X-ray images? Or can I assume that all 125 X-ray images inside the Covid-19 folder are COVID-19 positive?

Thanks
Kannadasan
Yes, the X-ray images inside the Covid-19 folder are COVID-19 positive. The folder names correspond to the diagnosis results.

@Kannadasa
Author

Hi Muhammed,

Have you got the code to implement K-Fold on the datasets?

Thanks
Kannadasan

@Kannadasa
Author

Hi Muhammed,
With 125 images in the Covid-19 folder and 500 images in the No_findings folder, are we not dealing with an imbalanced dataset?

The reason I am asking is that I trained your model using K-Fold splits, but I am getting only 58% accuracy. I am printing below the output of one of the iterations, and I think something is wrong somewhere in my code.

epoch | train_loss | valid_loss | accuracy | time

0 | 0.003996 | 0.006837 | 1.000000 | 02:26
1 | 0.002480 | 0.004546 | 1.000000 | 02:10
2 | 0.001856 | 0.002552 | 1.000000 | 02:06
3 | 0.001408 | 0.001160 | 1.000000 | 02:04
4 | 0.001097 | 0.000621 | 1.000000 | 02:06
5 | 0.000911 | 0.000315 | 1.000000 | 02:11
6 | 0.000743 | 0.000152 | 1.000000 | 02:10
7 | 0.000620 | 0.000084 | 1.000000 | 02:12
8 | 0.000522 | 0.000066 | 1.000000 | 02:10
9 | 0.000442 | 0.000044 | 1.000000 | 02:09
10 | 0.000372 | 0.000033 | 1.000000 | 02:10
11 | 0.000316 | 0.000022 | 1.000000 | 02:09
12 | 0.000272 | 0.000018 | 1.000000 | 02:10
13 | 0.000233 | 0.000018 | 1.000000 | 02:10
14 | 0.000201 | 0.000017 | 1.000000 | 02:08
15 | 0.000173 | 0.000017 | 1.000000 | 02:10
16 | 0.000149 | 0.000015 | 1.000000 | 02:08
17 | 0.000129 | 0.000014 | 1.000000 | 02:10
18 | 0.000112 | 0.000014 | 1.000000 | 02:06
19 | 0.000097 | 0.000014 | 1.000000 | 02:07
20 | 0.000084 | 0.000015 | 1.000000 | 02:05
21 | 0.000074 | 0.000014 | 1.000000 | 02:07
22 | 0.000064 | 0.000014 | 1.000000 | 02:07
23 | 0.000056 | 0.000011 | 1.000000 | 02:07
24 | 0.000049 | 0.000010 | 1.000000 | 02:07
25 | 0.000043 | 0.000009 | 1.000000 | 02:07
26 | 0.000038 | 0.000009 | 1.000000 | 02:09
27 | 0.000034 | 0.000008 | 1.000000 | 02:10
28 | 0.000030 | 0.000007 | 1.000000 | 02:10
29 | 0.000026 | 0.000007 | 1.000000 | 02:10
30 | 0.000023 | 0.000007 | 1.000000 | 02:10
31 | 0.000020 | 0.000007 | 1.000000 | 02:11
32 | 0.000018 | 0.000007 | 1.000000 | 02:06
33 | 0.000016 | 0.000006 | 1.000000 | 02:07
34 | 0.000014 | 0.000006 | 1.000000 | 02:06
35 | 0.000012 | 0.000006 | 1.000000 | 02:08
36 | 0.000011 | 0.000005 | 1.000000 | 02:07
37 | 0.000010 | 0.000005 | 1.000000 | 02:07
38 | 0.000009 | 0.000005 | 1.000000 | 02:07
39 | 0.000008 | 0.000005 | 1.000000 | 02:10
40 | 0.000007 | 0.000005 | 1.000000 | 02:09
41 | 0.000006 | 0.000005 | 1.000000 | 02:10
42 | 0.000006 | 0.000005 | 1.000000 | 02:08
43 | 0.000005 | 0.000005 | 1.000000 | 02:10
44 | 0.000005 | 0.000004 | 1.000000 | 02:11
45 | 0.000004 | 0.000004 | 1.000000 | 02:12
46 | 0.000004 | 0.000004 | 1.000000 | 02:10
47 | 0.000003 | 0.000004 | 1.000000 | 02:13
48 | 0.000003 | 0.000004 | 1.000000 | 02:06
49 | 0.000003 | 0.000004 | 1.000000 | 02:07
50 | 0.000003 | 0.000004 | 1.000000 | 02:12
51 | 0.000002 | 0.000004 | 1.000000 | 02:10
52 | 0.000002 | 0.000004 | 1.000000 | 02:12
53 | 0.000002 | 0.000004 | 1.000000 | 02:10
54 | 0.000002 | 0.000004 | 1.000000 | 02:10
55 | 0.000002 | 0.000004 | 1.000000 | 02:07
56 | 0.000002 | 0.000003 | 1.000000 | 02:09
57 | 0.000001 | 0.000003 | 1.000000 | 02:10
58 | 0.000001 | 0.000004 | 1.000000 | 02:08
59 | 0.000001 | 0.000004 | 1.000000 | 02:09
60 | 0.000001 | 0.000004 | 1.000000 | 02:10
61 | 0.000001 | 0.000004 | 1.000000 | 02:12
62 | 0.000001 | 0.000004 | 1.000000 | 02:09
63 | 0.000001 | 0.000004 | 1.000000 | 02:09
64 | 0.000001 | 0.000003 | 1.000000 | 02:08
65 | 0.000001 | 0.000003 | 1.000000 | 02:09
66 | 0.000001 | 0.000003 | 1.000000 | 02:10
67 | 0.000001 | 0.000004 | 1.000000 | 02:09
68 | 0.000001 | 0.000004 | 1.000000 | 02:11
69 | 0.000001 | 0.000003 | 1.000000 | 02:08
70 | 0.000001 | 0.000003 | 1.000000 | 02:09
71 | 0.000001 | 0.000003 | 1.000000 | 02:08
72 | 0.000001 | 0.000003 | 1.000000 | 02:08
73 | 0.000001 | 0.000003 | 1.000000 | 02:07
74 | 0.000001 | 0.000003 | 1.000000 | 02:06
75 | 0.000001 | 0.000003 | 1.000000 | 02:08
76 | 0.000001 | 0.000003 | 1.000000 | 02:07
77 | 0.000001 | 0.000003 | 1.000000 | 02:07
78 | 0.000001 | 0.000003 | 1.000000 | 02:07
79 | 0.000001 | 0.000003 | 1.000000 | 02:07
80 | 0.000001 | 0.000003 | 1.000000 | 02:06
81 | 0.000000 | 0.000004 | 1.000000 | 02:08
82 | 0.000000 | 0.000003 | 1.000000 | 02:11
83 | 0.000000 | 0.000003 | 1.000000 | 02:08
84 | 0.000000 | 0.000003 | 1.000000 | 02:10
85 | 0.000000 | 0.000003 | 1.000000 | 02:09
86 | 0.000000 | 0.000003 | 1.000000 | 02:08
87 | 0.000000 | 0.000003 | 1.000000 | 02:08
88 | 0.000000 | 0.000003 | 1.000000 | 02:09
89 | 0.000000 | 0.000003 | 1.000000 | 02:09
90 | 0.000000 | 0.000003 | 1.000000 | 02:08
91 | 0.000000 | 0.000003 | 1.000000 | 02:08
92 | 0.000000 | 0.000003 | 1.000000 | 02:09
93 | 0.000000 | 0.000003 | 1.000000 | 02:09
94 | 0.000000 | 0.000003 | 1.000000 | 02:09
95 | 0.000000 | 0.000003 | 1.000000 | 02:09
96 | 0.000000 | 0.000003 | 1.000000 | 02:11
97 | 0.000000 | 0.000003 | 1.000000 | 02:11
98 | 0.000000 | 0.000003 | 1.000000 | 02:09
99 | 0.000000 | 0.000003 | 1.000000 | 02:08
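
As an editorial side note on the imbalance question above: with 125 Covid-19 images against 500 No_findings images, one common mitigation is to weight the loss by inverse class frequency. The snippet below is a generic sketch (PyTorch loss plus scikit-learn class weights), not something the repository itself is shown to do in this thread:

import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label array: 125 Covid-19 images (class 0) and 500 No_findings images (class 1).
y = np.array([0] * 125 + [1] * 500)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
loss_func = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
# With fastai v1 a weighted loss can be handed to the learner, e.g. Learner(data, model, loss_func=loss_func).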

@muhammedtalo
Owner

(quoting the previous comment and its training log in full, as shown above)

It seems you are using the test set during the training.
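
One common way to avoid that kind of leakage is to hold out a test set before any cross-validation. A minimal sketch, assuming a dataframe df with one row per image and the label in column 'y' (as produced by fastai's data.to_df() later in this thread):

from sklearn.model_selection import train_test_split, StratifiedKFold

# Hold out 15% of the images as a test set that is never touched during training.
trainval_df, test_df = train_test_split(df, test_size=0.15, stratify=df["y"], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(trainval_df.index, trainval_df["y"]):
    pass  # train on train_idx, validate on valid_idx; evaluate on test_df only at the very end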

@Kannadasa
Author

I have not created any test set during the training.

All I did was split the data using StratifiedKFold with a 20% split, i.e. KFold with n_splits=5.
Then I ran 5 iterations during training, with 100 epochs each.

In each iteration, 20% of my entire dataset acts as the test set.

I used StratifiedKFold to split the data so that each fold contains a proportional share of every class.

For example, this is how the data is split during training:

[ 25 26 27 28 ... 621 622 623 624] [ 0 1 2 3 ... 221 222 223 224]
[ 0 1 2 3 ... 621 622 623 624] [ 25 26 27 28 ... 321 322 323 324]
[ 0 1 2 3 ... 621 622 623 624] [ 50 51 52 53 ... 421 422 423 424]
[ 0 1 2 3 ... 621 622 623 624] [ 75 76 77 78 ... 521 522 523 524]
[ 0 1 2 3 ... 521 522 523 524] [100 101 102 103 ... 621 622 623 624]

@Kannadasa
Author

Also, whatever dataset I am using is only the training set and validation set. My test set consists of completely unseen X-ray images, and the accuracy I am getting on it is 67%.
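
For reference, fastai v1 lets you export a trained learner and run it on completely unseen images; a minimal sketch (the file names are placeholders):

from fastai.vision import load_learner, open_image

# After training: learn.export()  -> writes export.pkl into the data path.
learn = load_learner(path)                 # loads path/'export.pkl' by default
img = open_image('unseen_xray.png')        # placeholder file name
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs)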

@AmiZya

AmiZya commented May 23, 2020

@Kannadasa can you please provide the code you used for KFolds?

@Kannadasa
Author

Please find below my code for KFolds.

from fastai.vision import *   # fastai v1; provides ImageList, imagenet_stats, etc.
from sklearn.model_selection import KFold, StratifiedKFold

kf = KFold(n_splits=5)
skf = StratifiedKFold(n_splits=5)

# Load everything without a split, just to get a dataframe of file names and labels.
data = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(size=(256, 256))
        .databunch()).normalize(imagenet_stats)

df = data.to_df()

for train_index, test_index in skf.split(df.index, df['y']):
    print(len(train_index), len(test_index))
    print(train_index, test_index)

    # Rebuild the databunch for this fold from the split indices.
    d = (ImageList.from_folder(path)
         .split_by_idxs(train_index, test_index)
         .label_from_folder()
         .transform(size=(256, 256))
         .databunch(num_workers=0)).normalize(imagenet_stats)

@AmiZya

AmiZya commented May 23, 2020

Thanks, much appreciated.

On a side note, did you manage to get higher accuracy? I'm running the model now and it sits around 78% for the three-class model.

@Kannadasa
Author

Hi,

I did not test 3 classes; I tested only 2 classes. My KFold code is also for 2 classes.

With the training set and validation set, the model works fine, but I am not getting good accuracy on new data that the model has not seen before.

Thanks
Kannadasan

@Kannadasa
Author

Are you using KFold to split the data for the 3-class prediction?

Is my KFold split code working for you with 3 classes?

Thanks
Kannadasan

@Shambhujii

Hi,

I am testing your model, but I am not getting the desired output. I think I am not distributing the data properly into the train and valid folders.

Please let me know how you are creating the folder structure and loading the images for the train and valid datasets. This is for binary classification.

I am also facing the same issue. I hope you have fixed this problem now. Please let me know how you are creating the folder structure and loading the images for the train and valid datasets.

@Kannadasa
Author

Hi,

First of all, you need to have directories called train and valid, because fastai will look for these names while running the code. I am using K-Fold cross-validation to split the data into training and validation sets.

Please find below my code for KFolds.

from fastai.vision import *   # fastai v1; provides ImageList, imagenet_stats, etc.
from sklearn.model_selection import KFold, StratifiedKFold

kf = KFold(n_splits=5)
skf = StratifiedKFold(n_splits=5)

# Load everything without a split, just to get a dataframe of file names and labels.
data = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(size=(256, 256))
        .databunch()).normalize(imagenet_stats)

df = data.to_df()

for train_index, test_index in skf.split(df.index, df['y']):
    print(len(train_index), len(test_index))
    print(train_index, test_index)

    # Rebuild the databunch for this fold from the split indices.
    d = (ImageList.from_folder(path)
         .split_by_idxs(train_index, test_index)
         .label_from_folder()
         .transform(size=(256, 256))
         .databunch(num_workers=0)).normalize(imagenet_stats)

@Shambhujii

(quoting the previous comment and KFold code in full, as shown above)

Thank you so much, my friend, for this valuable comment. I will try to split the train and validation sets as per your guidance. Thank you again. Let's collaborate to fight against this pandemic.

@Kannadasa
Author

It works fine for me on the training set and validation set. If I show some unseen X-ray images to my model, the model does not predict well. I don't know how to fix this problem.
If you get any solution, please let me know.

Thanks

@rahuls321

(quoting the previous comment and KFold code in full, as shown above)

Hey @Kannadasa, I successfully ran the code with a normal split: 80% for training, 10% for validation, and 10% for testing. But I'm still facing issues with K-Fold cross-validation. After creating the train and valid directories, this code didn't produce anything. Could you please give a brief explanation of this code?

Thanks in advance
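
For what it's worth, the loop in the snippet above only builds a databunch for each fold and prints the indices; nothing is trained inside it, which is why it appears to produce nothing. A minimal sketch of training inside the fold loop with a fastai v1 learner, assuming skf, df and path from the earlier snippet; resnet18 is a placeholder architecture, not the model from this repository:

from fastai.vision import ImageList, cnn_learner, models, accuracy, imagenet_stats

fold_scores = []
for train_index, valid_index in skf.split(df.index, df['y']):
    d = (ImageList.from_folder(path)
         .split_by_idxs(train_index, valid_index)
         .label_from_folder()
         .transform(size=(256, 256))
         .databunch(num_workers=0)).normalize(imagenet_stats)

    # Placeholder architecture; swap in the model actually used by the repository.
    learn = cnn_learner(d, models.resnet18, metrics=accuracy)
    learn.fit_one_cycle(10)

    # learn.validate() returns [valid_loss, metric_1, ...] on the current validation fold.
    fold_scores.append(float(learn.validate()[1]))

print(fold_scores)
print(sum(fold_scores) / len(fold_scores))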

@aliranic

Hello! Why did you use the validation dataset as the test dataset? Or did you do something different that I don't understand?
