How to interpret the samples? #3

Open
ghost opened this issue Sep 10, 2017 · 11 comments

Comments

@ghost

ghost commented Sep 10, 2017

Hi Ed,

thank you very much for adding the process_mimic.py script :)

It all worked fairly painlessly, following your clear instructions (I used "counts"), and now I'm the very proud owner of 10000 synthetic EHRs - woohoo !!!

So I loaded samples, but I'm not sure how to interpret them?

>>> import numpy as np
>>> X = np.load('/home/ajay/PythonProjects/medgan-master/samples/samples.npy')
>>> X
array([[ 0.42479137,  0.38992843,  0.3843686 , ...,  0.48570082,
         0.44278869,  0.4656629 ],
       [ 0.28643027,  0.45749718,  0.23394403, ...,  0.47090551,
         0.41072363,  0.43643555],
       [ 0.29359645,  0.46955556,  0.22549649, ...,  0.48150307,
         0.41780272,  0.45492986],
       ..., 
       [ 0.56480783,  0.66771448,  0.54325938, ...,  0.47483209,
         0.43128845,  0.45304856],
       [ 0.68514657,  0.79574692,  0.73424697, ...,  0.47857872,
         0.43853614,  0.44970644],
       [ 0.17376943,  0.19806506,  0.27509841, ...,  0.47925362,
         0.44123808,  0.46058744]], dtype=float32)
>>> X.shape
(10000, 1071)
>>> synthetic_ehr = X[0,:]
>>> synthetic_ehr
array([ 0.42479137,  0.38992843,  0.3843686 , ...,  0.48570082,
        0.44278869,  0.4656629 ], dtype=float32)

I just realized I'm not sure what synthetic_ehr is? Does it look right to you?

I thought it would be like a row of a table where the columns are the 1071 ICD-9 codes, and the counts are the number of times those codes appear in the patient's EHR? So the counts should be whole numbers, and would give some idea of co-morbidities? For example, cardiovascular and metabolic disorders would frequently co-occur?

So would one way of analysis be a correlation matrix?

Thanks very much 👍

@mp2893
Owner

mp2893 commented Sep 10, 2017

First of all, you should round your X to the nearest integers.

Your X is of shape (10000, 1071).
Each row corresponds to a single synthetic patient.
Each column corresponds to a specific ICD9 diagnosis code.
You can use ".types" file created by process_mimic.py to map each column to a specific ICD9 diagnosis code. (Read the beginning part of the source code of process_mimic.py for more information about ".types" file)
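A minimal sketch of the rounding and column mapping described above. It assumes the ".types" file is a pickled Python dictionary that maps each ICD9 code string to its column index (check the header of process_mimic.py to confirm). The dictionary and codes below are made-up stand-ins, and `invert_types` and `codes_for_patient` are hypothetical helper names, not part of medGAN.

```python
import numpy as np

def invert_types(types):
    """Turn {code: column_index} into a list where position i holds the code for column i."""
    codes = [None] * len(types)
    for code, idx in types.items():
        codes[idx] = code
    return codes

def codes_for_patient(row, codes):
    """Return {code: count} for the non-zero entries of one rounded synthetic record."""
    counts = np.rint(row).astype(int)  # round to the nearest integer
    return {codes[i]: int(counts[i]) for i in np.flatnonzero(counts)}

# Assumed stand-in for the dictionary loaded from the ".types" file; codes are illustrative only.
types = {"D_401": 0, "D_250": 1, "D_428": 2}
codes = invert_types(types)

# Toy "samples" matrix: 2 synthetic patients x 3 codes.
X = np.array([[1.8, 0.2, 0.0],
              [0.4, 2.6, 1.1]], dtype=np.float32)
for patient in X:
    print(codes_for_patient(patient, codes))
```

With a real run, `X` would be the loaded samples array and `types` the unpickled dictionary, so each printed record lists only the codes that the synthetic patient actually has.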

If you want to do a correlation analysis, you can create a correlation matrix using the Pearson correlation coefficient (or any correlation metric you choose, e.g. Spearman correlation). The resulting correlation matrix will be of shape (1071, 1071), of course.
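As an illustration of the correlation analysis, here is a small sketch using NumPy's `np.corrcoef` on toy data. The matrix is randomly generated as a stand-in for the rounded samples, and the step that drops constant columns is an assumption about what a real run might need, since a code that never varies has an undefined Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the rounded samples: 500 patients x 6 codes.
X = rng.poisson(lam=1.0, size=(500, 6)).astype(np.float64)
# Make codes 0 and 1 co-occur so the matrix has some visible structure.
X[:, 1] = X[:, 0] + rng.poisson(lam=0.2, size=500)

# Columns with zero variance have an undefined correlation, so drop them first.
keep = X.std(axis=0) > 0
corr = np.corrcoef(X[:, keep], rowvar=False)  # rowvar=False: columns are the variables

print(corr.shape)   # one row and one column per kept code
print(corr[0, 1])   # correlation between the two co-occurring codes
```

On the full samples array the same call would yield one row and one column per retained ICD9 code, which can then be compared against the correlation matrix of the real data.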

Actually, it would be interesting to provide the correlation information to the discriminator while training. This can be the next version of minibatch averaging I used in the paper. I am not sure if it will be as effective as minibatch averaging because it will be way harder for the generator to figure out how to satisfy the correlation statistics, than to figure out how to satisfy the average statistics. But it will still be worth a try.

@ghost
Author

ghost commented Sep 10, 2017

Ah, that makes sense. I just re-read the paper, though, and it says for dataset B, which is the MIMIC dataset:

From dataset B, we extracted ICD9 codes only and grouped them by generalizing up to their first 3 digits. Finally, we aggregate a patient’s longitudinal record into a single fixed-size vector x ∈ Z^|C|, where |C| equals 615, 1071 and 569 for dataset A, B and C respectively. Note that datasets A and B are binarized for experiments regarding binary variables while dataset C is used for experiments regarding count variables.

So does that mean I've made a mistake using "counts" for the MIMIC dataset?

I'll retrain anyway using "binary"; I'm curious to see the differences.

@mp2893
Owner

mp2893 commented Sep 11, 2017

No, in the paper I just chose to use the binary matrix. It is totally up to you which matrix you want to synthesize.

Of course, the performance of medGAN will be better with a binary matrix (especially since MIMIC-III has only 45k samples), but count matrices will be more informative.
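To make the difference between the two matrices concrete, here is a tiny sketch of how a count matrix relates to its binarized form. The data are invented, and this only illustrates the relationship; it is not a claim about how process_mimic.py builds either matrix internally.

```python
import numpy as np

# Toy count matrix: 2 patients x 3 codes, each entry is how often the code appears.
counts = np.array([[0, 3, 1],
                   [2, 0, 0]], dtype=np.float32)

# Binarized version: 1 if the code ever appears in the record, else 0.
binary = (counts > 0).astype(np.float32)
print(binary)
```

The binary matrix keeps only presence or absence, which is why the count matrix carries more information.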

@arielpeterson

Hi Edward,

I have read the responses above, but just want to check that I understand correctly. Once we have the resulting numpy array, I should round the array to the nearest integer and map each column to its corresponding ICD9 code for the patient.

Another question: what do the values in the resulting array mean before they are rounded?

Thank you!

@mp2893
Owner

mp2893 commented Jul 28, 2018

Hi Ariel,

Yes that is correct.
The float values before rounding are just numbers that medGAN thinks are statistically similar to the real data. (Remember that the integer values are converted to float values when they are input to medGAN.)

@RacingTadpole

Hi,
Thanks for making this code available. To try it out, I started by running it on randomly generated integer count data:

>>> import numpy as np
>>> np.random.seed(1)
>>> data = np.floor(np.random.exponential(4, size=(10000, 120))).astype(np.float32)
>>> np.save('datafile1.npy', data)
>>> exit()
$ python medgan.py datafile1.npy output_run1/output --data_type=count
...
$ python medgan.py datafile1.npy output_run1/synthetic --model_file=output_run1/output-999 --generate_data=True
$ python
>>> import numpy as np
>>> syn = np.load('output_run1/synthetic.npy')
>>> np.min(syn), np.max(syn)
(1.5497208e-06, 0.99999994)

As the final line shows, the output synthetic data are all floats between 0 and 1. Is that what you expect (in which case, how should I interpret them?), or should they be counts? Thanks!

@mp2893
Owner

mp2893 commented Mar 31, 2019

Hi RacingTadpole,

If the provided training data are proper count data, then the output should also be count data.

Thanks,
Ed

@canoninzaz

canoninzaz commented May 18, 2019

Hi ED,

I hope it's not too late to post my findings here.
I've run into the same problem as RacingTadpole.
I ran the following:
$ python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv count "count"
$ python medgan.py count.matrix model_count --data_type="count"
$ python medgan.py count.matrix sdg_count --model_file=model_count-999 --generate_data=True
sdg_count should then be synthetic count-valued data generated by the GAN, but the values in sdg_count range from 0 to 1. This is based on MIMIC3.

@aauss

aauss commented Jul 24, 2019

Hello ED,
I am also confused that the generated data range from 0 to 1, since the counts in the matrix file generated from MIMIC3 reach values above 50. Rounding the generated values does not reflect the pattern found in the real data. Can you help me understand why this might be so? I ran the same code as canoninzaz.

Thank you for your help!
All the best
Auss

@aauss

aauss commented Jul 25, 2019

I found the mistake in
$ python medgan.py count.matrix sdg_count --model_file=model_count-999 --generate_data=True
We also need to specify --data_type=count on this command.
The reason is that the script instantiates a new medgan object whose default data_type is binary. That is why the output ranges from 0 to 1 instead of from 0 to n, where n can be larger than 1.

All the best,
Auss
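A quick sanity check along these lines can catch the problem before any analysis. The sketch below compares the value ranges of the training matrix and the generated matrix. `range_report` is a hypothetical helper and both arrays are toy data; in practice they would be loaded from the training matrix and the generated samples file.

```python
import numpy as np

def range_report(real, synthetic):
    """Compare value ranges of the training matrix and the generated matrix."""
    report = {
        "real_max": float(real.max()),
        "synthetic_max": float(synthetic.max()),
    }
    # Generated values confined to [0, 1] while the real counts exceed 1 suggest
    # that the samples were generated with the binary setting.
    report["looks_binary"] = bool(
        synthetic.min() >= 0.0 and synthetic.max() <= 1.0 and real.max() > 1.0
    )
    return report

rng = np.random.default_rng(1)
# Count-like toy training data, similar in spirit to the exponential example above.
real = np.floor(rng.exponential(4, size=(1000, 20))).astype(np.float32)
# Toy "generated" data where every value lies in [0, 1].
suspect = rng.uniform(0.0, 1.0, size=(1000, 20)).astype(np.float32)

print(range_report(real, suspect))
```

If `looks_binary` comes back true for a model trained on counts, regenerating the samples with --data_type=count is the first thing to try.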

@Myshgithub

So, will the synthetic records generated by medGAN always be identified only by a row number? That is not a patient ID, unless the patient ID in our dataset happens to be the same as the row number.
Let's say we want to generate another feature, such as gender, in addition to the Dx codes. Would the shape of X then be (10000, 1071, 2), where the 2 corresponds to the male and female features respectively, or (10000, 1071, 1), where the 1 is a single binary gender feature (1: female, 0: male)?

Thanks for your consideration in advance
