Other fields in data generation #10

Ankit267 · 2018-09-10T12:53:38Z

Hi Edward,

Thanks for the code, it really helps in understanding the paper better.
Currently your Python code generates patient IDs and ICD-9 diagnosis codes.
What changes would I have to make to your process_mimic and medGAN code to generate synthetic data that also includes fields such as age, gender, and procedures?

Do I only need to include the desired fields in the process_mimic file, or do I need to change medGAN.py as well?

Secondly, will the generated data (i.e. pid, ICD-9 codes, age, gender, procedures) take the influence of the other variables into account, or will it depend solely on the patient ID?

Thanks

mp2893 · 2018-09-10T19:28:06Z

Hi Ankit267,

In order to generate features (age, gender, procedure codes) other than Dx codes, you need training data that includes them in the first place.
So, yes, you need to modify process_mimic so that you can extract those features from MIMIC-III.
(Actually, you don't even need to bother with process_mimic. You can just write your own script to extract the desired features from MIMIC-III.)
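
A minimal sketch of such an extraction script, using only the standard library. It assumes the column names of the MIMIC-III PATIENTS, ADMISSIONS, and PROCEDURES_ICD tables (SUBJECT_ID, GENDER, DOB, ADMITTIME, ICD9_CODE). The function name `build_feature_rows`, the 0/1 gender encoding, and "age in whole years at first admission" are illustrative choices, not part of process_mimic. Check the MIMIC-III documentation for how dates of birth are shifted for the oldest patients before trusting the age column.

```python
import csv
from datetime import datetime


def build_feature_rows(patients_csv, admissions_csv, procedures_csv):
    """Return (header, rows) with one row per patient.

    Each row holds age, a gender flag, and binary procedure-code indicators.
    The three arguments are open text file objects (hypothetical helper).
    """
    fmt = "%Y-%m-%d %H:%M:%S"

    # SUBJECT_ID -> (date of birth, 1 if male else 0)
    patients = {}
    for r in csv.DictReader(patients_csv):
        patients[r["SUBJECT_ID"]] = (
            datetime.strptime(r["DOB"], fmt),
            1 if r["GENDER"] == "M" else 0,
        )

    # SUBJECT_ID -> earliest admission time, used to compute age
    first_admit = {}
    for r in csv.DictReader(admissions_csv):
        t = datetime.strptime(r["ADMITTIME"], fmt)
        pid = r["SUBJECT_ID"]
        if pid not in first_admit or t < first_admit[pid]:
            first_admit[pid] = t

    # SUBJECT_ID -> set of procedure codes seen for that patient
    procs = {}
    for r in csv.DictReader(procedures_csv):
        procs.setdefault(r["SUBJECT_ID"], set()).add(r["ICD9_CODE"])

    codes = sorted({c for s in procs.values() for c in s})
    header = ["age", "gender_is_male"] + ["P_" + c for c in codes]

    rows = []
    for pid in sorted(first_admit):
        if pid not in patients:
            continue  # admission without a matching patient record
        dob, gender = patients[pid]
        age = (first_admit[pid] - dob).days // 365  # approximate whole years
        have = procs.get(pid, set())
        rows.append([age, gender] + [1 if c in have else 0 for c in codes])
    return header, rows
```

The resulting matrix mixes a count-like column (age) with binary columns, which is exactly the mixed-type case discussed next for medGAN.py.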

As for medGAN.py:
If you are generating a mix of features such as age (non-negative integer) and presence of Dx codes (binary), you need to modify medGAN accordingly. For binary features, a sigmoid activation function is used, but for non-negative integers, ReLU should be used.
So, for each output neuron of the generator, different activation functions should be used, and I'm not sure if that's doable in TensorFlow. (Maybe in PyTorch?)
There are also other feature types, such as ethnicity, which is neither a non-negative integer nor binary.
Ethnicity is one-hot among multiple classes (e.g. you can't be both "Asian" and "Hispanic"; there is "Other" instead). For this feature type, you should use softmax instead. But that's another technical issue that I'm not sure TensorFlow can handle.
So in summary, with my implementation of medGAN, you should stick to generating only one feature type.
If you want to generate non-negative integers and binary values at the same time, I think you can try using only ReLU, but I'm not sure how well that will work.
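
One way to express "a different activation per output column" is to slice the pre-activation output, apply an activation to each slice, and concatenate the results. The sketch below shows the idea in NumPy; it is not taken from medGAN.py, and the function name and the column layout (counts first, then binary features, then a single one-hot group) are assumptions for illustration. TensorFlow and PyTorch both provide slicing and concatenation (`tf.concat`, `torch.cat`), so the same pattern should carry over, though it has not been tried against this code base.

```python
import numpy as np


def mixed_output_activation(logits, n_count, n_binary, n_onehot):
    """Apply a different activation to each block of output columns.

    Assumed column layout (hypothetical):
    [count features | binary features | one categorical feature, one-hot].
    All three blocks are assumed to be non-empty.
    """
    assert logits.shape[1] == n_count + n_binary + n_onehot
    a, b = n_count, n_count + n_binary

    counts = np.maximum(logits[:, :a], 0.0)         # ReLU: non-negative values
    binary = 1.0 / (1.0 + np.exp(-logits[:, a:b]))  # sigmoid: values in (0, 1)

    z = logits[:, b:]
    z = z - z.max(axis=1, keepdims=True)            # stabilize the exponentials
    e = np.exp(z)
    onehot = e / e.sum(axis=1, keepdims=True)       # softmax: each row sums to 1

    return np.concatenate([counts, binary, onehot], axis=1)
```

With several categorical features, each one-hot group would need its own softmax slice, so the layout has to be tracked explicitly.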

For the second question: medGAN will generate synthetic samples that closely follow the distribution of the real data samples. So the dependencies among all features will come into play. Since Patient ID is not being generated (unless you include it in the training data), no features will depend on Patient ID.

Hope this helps.

Best,
Ed

Ankit267 · 2018-09-11T14:47:10Z

Thanks Ed, that really helps.
Will get back to you in case of any concerns.

Myshgithub · 2019-10-28T14:45:35Z

Dear Ed,
As you mentioned above, Patient ID is not being generated (unless we include it in the training data). So my question is: does medGAN generate synthetic samples with the same Patient IDs, in the same order, when we run it at different times? Thank you.

mp2893 · 2019-10-28T23:35:59Z

Hi Myshgithub,

The synthetic records generated by MedGAN will have no relationship whatsoever with the original Patient IDs. The generated records will be purely synthetic, and every time you generate a new batch, you get a fresh batch of synthetic records. Sometimes you might get duplicate records by chance, but that doesn't mean they are the same patient. MedGAN does not understand the concept of Patient ID, unless you modify it somehow.
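
To make this concrete, here is a toy sketch (not medGAN itself). `toy_generator` is a hypothetical stand-in for a trained generator, and `label_rows` shows that any ID attached to a generated row is just an arbitrary label for that row in that batch. It carries no link to a real patient or to rows in other batches.

```python
import numpy as np


def toy_generator(noise):
    """Hypothetical stand-in for a trained generator: noise -> binary record."""
    return (noise > 0.5).astype(int)


def generate_batch(n_records, n_features, seed=None):
    """Draw fresh noise and map it to synthetic records.

    With seed=None, every call draws new noise, so every batch is new.
    """
    rng = np.random.default_rng(seed)
    noise = rng.random((n_records, n_features))
    return toy_generator(noise)


def label_rows(batch, prefix="synthetic"):
    """Attach arbitrary row labels; they identify rows in this batch only."""
    return [(f"{prefix}-{i:05d}", row) for i, row in enumerate(batch)]


batch = generate_batch(n_records=5, n_features=8)
labeled = label_rows(batch)
```

Fixing the seed reproduces the same batch, but that reflects the random number generator, not any notion of patient identity.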

Best,
Ed

Myshgithub · 2019-10-29T06:40:43Z

Thank you so much, Ed!
But in the first comment above, Ankit267 mentioned that
"Currently your python code generates patient id and ICD9 diagnosis codes." and you replied: "Since Patient ID is not being generated (unless you include them in the training data) no features will depend on Patient ID."

So my question is: what changes do you think I would have to make in order to get Patient IDs along with the synthetic generated records?
1- One way, as discussed, seems to be to include the desired field (Patient ID) in the process_mimic file only. Is that correct?
2- Are you aware of any other way to include Patient IDs in the training data?
3- In #3, on interpreting the generated samples, it is mentioned that "X is of shape (10000, 1071), and each row corresponds to a single synthetic patient". So is that just a row-number ID? And are the rows ordered?

I would appreciate your responses.

Ankit267 closed this as completed Sep 11, 2018

Myshgithub mentioned this issue Oct 28, 2019

Generating different features #13

Open
