How to interpret the samples? #3

ghost · 2017-09-10T17:09:09Z

Hi Ed,

thank you very much for adding the process_mimic.py script :)

It all worked fairly painlessly, following your clear instructions (I used "counts"), and now I'm the very proud owner of 10000 synthetic EHRs - woohoo !!!

So I loaded samples, but I'm not sure how to interpret them?

>>> import numpy as np
>>> X = np.load('/home/ajay/PythonProjects/medgan-master/samples/samples.npy')
>>> X
array([[ 0.42479137,  0.38992843,  0.3843686 , ...,  0.48570082,
         0.44278869,  0.4656629 ],
       [ 0.28643027,  0.45749718,  0.23394403, ...,  0.47090551,
         0.41072363,  0.43643555],
       [ 0.29359645,  0.46955556,  0.22549649, ...,  0.48150307,
         0.41780272,  0.45492986],
       ..., 
       [ 0.56480783,  0.66771448,  0.54325938, ...,  0.47483209,
         0.43128845,  0.45304856],
       [ 0.68514657,  0.79574692,  0.73424697, ...,  0.47857872,
         0.43853614,  0.44970644],
       [ 0.17376943,  0.19806506,  0.27509841, ...,  0.47925362,
         0.44123808,  0.46058744]], dtype=float32)
>>> X.shape
(10000, 1071)
>>> synthetic_ehr = X[0,:]
>>> synthetic_ehr
array([ 0.42479137,  0.38992843,  0.3843686 , ...,  0.48570082,
        0.44278869,  0.4656629 ], dtype=float32)

I just realized I'm not sure what synthetic_ehr is? Does it look right to you?

I thought it would be like a row of a table where the columns are the 1071 ICD-9 codes, and the counts are the number of times those codes appear in the patient's EHR? So the counts should be whole numbers, and would give some idea of co-morbidities? For example, cardiovascular and metabolic disorders would frequently co-occur?

So would one way of analysis be a correlation matrix?

Thanks very much 👍


mp2893 · 2017-09-10T21:00:47Z

First of all, you should round your X to the nearest integer so that its entries are integers.

Your X is of shape (10000, 1071).
Each row corresponds to a single synthetic patient.
Each column corresponds to a specific ICD9 diagnosis code.
You can use the ".types" file created by process_mimic.py to map each column to a specific ICD9 diagnosis code. (Read the beginning of the source code of process_mimic.py for more information about the ".types" file.)
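A minimal sketch of the rounding and column-mapping steps described above. It assumes the ".types" file is a pickled dict that maps each code string to its column index; check process_mimic.py to confirm this for your version. The toy `types` dict, its code strings, and the helper name `codes_for_patient` are hypothetical stand-ins.

```python
import numpy as np

def codes_for_patient(row, types):
    """Return {code: count} for the non-zero entries of one synthetic patient.

    `types` is assumed to map code string -> column index.
    """
    index_to_code = {idx: code for code, idx in types.items()}
    counts = np.rint(row).astype(int)   # round to the nearest integer
    counts = np.clip(counts, 0, None)   # guard against negative values
    return {index_to_code[i]: int(counts[i]) for i in np.flatnonzero(counts)}

# Toy stand-ins (hypothetical). In practice `types` would be loaded from
# the ".types" file and `row` would be one row of the generated matrix.
types = {'D_401': 0, 'D_250': 1, 'D_428': 2}
row = np.array([1.7, 0.2, 0.6], dtype=np.float32)

result = codes_for_patient(row, types)
print(result)
```

Only codes whose rounded count is above zero are returned, so each synthetic patient reads as a short list of diagnoses rather than a 1071-wide vector.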

If you want to do a correlation analysis, you can create a correlation matrix using the Pearson correlation coefficient (or any correlation metric you choose, e.g. Spearman correlation). The resulting correlation matrix will be of shape (1071, 1071), of course.
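A short sketch of the Pearson version using NumPy, with a small random matrix standing in for the rounded synthetic data (the toy sizes and the duplicated column are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the rounded synthetic matrix: 200 patients x 5 codes.
X = rng.poisson(1.0, size=(200, 5)).astype(np.float32)
X[:, 1] = X[:, 0]  # make two codes co-occur perfectly

# rowvar=False treats columns (codes) as variables and rows (patients)
# as observations, so the result is (n_codes, n_codes).
corr = np.corrcoef(X, rowvar=False)
print(corr.shape)
```

With the real data the same call gives a (1071, 1071) matrix. Columns that are constant (for example, a code that never appears) have zero variance and produce NaN entries, so it may be worth dropping them first.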

Actually, it would be interesting to provide the correlation information to the discriminator while training. This can be the next version of minibatch averaging I used in the paper. I am not sure if it will be as effective as minibatch averaging because it will be way harder for the generator to figure out how to satisfy the correlation statistics, than to figure out how to satisfy the average statistics. But it will still be worth a try.

ghost · 2017-09-10T23:29:56Z

Ah, that makes sense. I just re-read the paper, though, and it says this about dataset B, which is the MIMIC dataset:

> From dataset B, we extracted ICD9 codes only and grouped them by generalizing up to their first 3 digits. Finally, we aggregate a patient’s longitudinal record into a single fixed-size vector x ∈ Z^|C|, where |C| equals 615, 1071 and 569 for dataset A, B and C respectively. Note that datasets A and B are binarized for experiments regarding binary variables while dataset C is used for experiments regarding count variables.

So does that mean I've made a mistake using "counts" for the MIMIC dataset?

I'll retrain anyway using "binary"; I'm curious to see the differences.

mp2893 · 2017-09-11T00:22:35Z

No, in the paper I just chose to use the binary matrix. It is totally up to you which matrix you want to synthesize.

Of course, the performance of medGAN will be better with a binary matrix (especially since MIMIC-III has only 45k samples), but count matrices will be more informative.
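To make the difference between the two matrices concrete, here is a small sketch of how a count matrix relates to its binary counterpart. The toy array is an assumption for illustration; it is not output from process_mimic.py.

```python
import numpy as np

# Toy count matrix: 2 patients x 3 codes.
counts = np.array([[0, 3, 1],
                   [2, 0, 0]], dtype=np.float32)

# A code is "present" if it was recorded at least once.
binary = (counts > 0).astype(np.float32)
print(binary)
```

The binary matrix keeps which codes a patient has but discards how often they occur, which is the information the count matrix retains.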

arielpeterson · 2018-07-27T22:31:26Z

Hi Edward,

I have read the responses above, but just want to check that I understand correctly. Once we have the resulting numpy array, I should round the array to the nearest integer and map each column to its corresponding ICD9 code for the patient.

Another question: what do the values in the resulting array mean before they are rounded?

Thank you!

mp2893 · 2018-07-28T01:26:49Z

Hi Ariel,

Yes that is correct.
The float values before rounding are just numbers that medGAN thinks are statistically similar to the real data. (Remember that the integer values are converted to float values when they are input to medGAN.)

RacingTadpole · 2019-03-27T04:38:31Z

Hi,
Thanks for making this code available. To try it out, I started by running it on randomly generated integer count data:

>>> import numpy as np
>>> np.random.seed(1)
>>> data = np.floor(np.random.exponential(4, size=(10000, 120))).astype(np.float32)
>>> np.save('datafile1.npy', data)
>>> exit()
$ python medgan.py datafile1.npy output_run1/output --data_type=count
...
$ python medgan.py datafile1.npy output_run1/synthetic --model_file=output_run1/output-999 --generate_data=True
$ python
>>> import numpy as np
>>> syn = np.load('output_run1/synthetic.npy')
>>> np.min(syn), np.max(syn)
(1.5497208e-06, 0.99999994)

As the final line shows, the output synthetic data are all floats between 0 and 1. Is that what you expect (in which case, how should I interpret them?), or should they be counts? Thanks!

mp2893 · 2019-03-31T16:48:25Z

Hi RacingTadpole,

If the provided training data are proper count data, then the output should also be count data.

Thanks,
Ed

canoninzaz · 2019-05-18T00:06:26Z

> Hi RacingTadpole,
>
> If the provided training data are proper count data, then the output should also be count data.
>
> Thanks,
> Ed

Hi ED,

I hope it's not too late to post my finding here.
I've run into the same question as RacingTadpole.
I ran the following:
$ python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv count "count"
$ python medgan.py count.matrix model_count --data_type="count"
$ python medgan.py count.matrix sdg_count --model_file=model_count-999 --generate_data=True
The sdg_count output should be synthetic count data generated by the GAN, but the values in sdg_count range from 0 to 1. It's based on MIMIC3.

aauss · 2019-07-24T14:34:08Z

> Hi ED,
>
> I hope it's not too late to post my finding here.
> I've run into the same question as RacingTadpole.
> I ran the following:
> $ python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv count "count"
> $ python medgan.py count.matrix model_count --data_type="count"
> $ python medgan.py count.matrix sdg_count --model_file=model_count-999 --generate_data=True
> The sdg_count output should be synthetic count data generated by the GAN, but the values in sdg_count range from 0 to 1. It's based on MIMIC3.

Hello ED,
I am also confused that the generated data ranges from 0 to 1, since the counts in the matrix file generated from MIMIC3 reach values above 50. Rounding the generated values does not reflect the pattern found in the real data. Can you help me understand why this might be so? I ran the same code as canoninzaz.

Thank you for your help!
All the best
Auss

aauss · 2019-07-25T08:28:11Z

I found the mistake in
>$ python medgan.py count.matrix sdg_count --model_file=model_count-999 --generate_data=True
You also need to specify --data_type=count in this generation step.
The reason is that the script instantiates a new medgan object, whose default data_type is binary. That is why the output ranges from 0 to 1 rather than from 0 to n, where n can be larger than 1.

All the best,
Auss
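A quick sanity check along these lines can catch the problem before any analysis. This is only a sketch: the helper name `looks_like_binary_output` and the toy arrays are hypothetical, and the check only inspects the value range of a generated matrix.

```python
import numpy as np

def looks_like_binary_output(syn, tol=1e-6):
    """True if every value lies in [0, 1], as a bounded, probability-like output would."""
    return bool(syn.min() >= -tol and syn.max() <= 1.0 + tol)

# Toy stand-ins (hypothetical): one matrix bounded in [0, 1], one with larger values.
syn_wrong = np.array([[1.5e-06, 0.99999994], [0.3, 0.7]], dtype=np.float32)
syn_count = np.array([[0.0, 12.4], [3.1, 51.8]], dtype=np.float32)

print(looks_like_binary_output(syn_wrong))
print(looks_like_binary_output(syn_count))
```

If a model trained with --data_type=count produces a matrix for which this returns True, it is worth checking that --data_type=count was also passed to the generation command.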

Myshgithub · 2019-10-29T06:47:54Z

> First of all, you should round your X to the nearest integer so that its entries are integers.
>
> Your X is of shape (10000, 1071).
> Each row corresponds to a single synthetic patient.
> Each column corresponds to a specific ICD9 diagnosis code.
> You can use the ".types" file created by process_mimic.py to map each column to a specific ICD9 diagnosis code. (Read the beginning of the source code of process_mimic.py for more information about the ".types" file.)
>
> If you want to do a correlation analysis, you can create a correlation matrix using the Pearson correlation coefficient (or any correlation metric you choose, e.g. Spearman correlation). The resulting correlation matrix will be of shape (1071, 1071), of course.
>
> Actually, it would be interesting to provide the correlation information to the discriminator while training. This can be the next version of minibatch averaging I used in the paper. I am not sure if it will be as effective as minibatch averaging because it will be way harder for the generator to figure out how to satisfy the correlation statistics, than to figure out how to satisfy the average statistics. But it will still be worth a try.

So the synthetic records generated by medGAN will always be identified only by their row number, which is not a patient ID (unless the patient ID in our dataset happens to be the same as the row number)?
Let's say we want to generate another feature, such as gender, in addition to the Dx codes. Would the shape of X then be (10000, 1071, 2), where the 2 corresponds to Male/Female respectively, or (10000, 1071, 1), where the 1 is a single binary gender feature (1: Female, 0: Male)?

Thanks for your consideration in advance
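On the shape question, one common convention is to keep the matrix two-dimensional and append the extra feature as one more column, which would give (10000, 1072) rather than a third axis. This is an assumption for illustration, not something confirmed in this thread. The toy sizes and the 1 = Female / 0 = Male encoding below are hypothetical.

```python
import numpy as np

n_patients, n_codes = 4, 3
# Toy diagnosis matrix: one row per patient, one column per code.
dx = np.zeros((n_patients, n_codes), dtype=np.float32)
# Hypothetical binary gender feature: 1 = Female, 0 = Male.
gender = np.array([1, 0, 0, 1], dtype=np.float32)

# Append gender as one extra column; the matrix stays 2-D.
X = np.hstack([dx, gender[:, None]])
print(X.shape)
```

In this layout the row index is only a position in the matrix, and the last column would be interpreted like any other binary column after generation.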

Myshgithub mentioned this issue Oct 29, 2019

Other fields in data generation #10

Closed
