Other fields in data generation #10

Closed
Ankit267 opened this issue Sep 10, 2018 · 5 comments

Comments

@Ankit267

Hi Edward,

Thanks for the code, it really helps in understanding the paper better.
Currently your Python code generates patient ID and ICD9 diagnosis codes.
What changes would I need to make to your process_mimic and medGAN code to generate synthetic data that also includes fields such as age, gender, procedures, etc.?

Do I only need to add the desired fields in the process_mimic file, or do I also need to change medGAN.py?

Secondly, will the generated data (i.e. pid, icd9, age, gender, procedures) take the other variables into account, or will it depend solely on the patient ID?

Thanks

@mp2893
Owner

mp2893 commented Sep 10, 2018

Hi Ankit267,

In order to generate features (age, gender, procedure codes) other than Dx codes, you need training data that includes them in the first place.
So, yes, you need to modify process_mimic so that you can extract those features from MIMIC-III.
(Actually, you don't even need to bother with process_mimic. You can just write your own script to extract desired features from MIMIC-III)
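
As a rough illustration of the "write your own script" route, the sketch below turns already-extracted patient records into a patient-by-feature matrix. The record layout, the field names (`age`, `gender`, `dx`), and the helper `build_matrix` are hypothetical and are not part of process_mimic.py or the MIMIC-III schema; a real script would read these values from the MIMIC-III tables first.

```python
# Hypothetical, already-extracted records: one dict per patient.
# Field names and values are made up for illustration only.
records = [
    {"age": 65, "gender": "M", "dx": ["401", "250"]},
    {"age": 42, "gender": "F", "dx": ["250"]},
]

def build_matrix(records):
    """Return (matrix, columns): one row per patient, with no patient ID column."""
    # One binary column per diagnosis code seen in the data.
    codes = sorted({code for r in records for code in r["dx"]})
    columns = ["age", "gender_F"] + ["dx_" + c for c in codes]
    matrix = []
    for r in records:
        # Age stays a non-negative number; gender becomes a binary flag.
        row = [float(r["age"]), 1.0 if r["gender"] == "F" else 0.0]
        row += [1.0 if c in r["dx"] else 0.0 for c in codes]
        matrix.append(row)
    return matrix, columns

matrix, columns = build_matrix(records)
```

Note that this matrix mixes a count-like column (age) with binary columns, which is exactly the mixed-type situation discussed for medGAN.py.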

As for medGAN.py:
If you are generating a mix of features such as age (non-negative integer) and presence of Dx codes (binary), you need to modify medGAN accordingly. For binary features, a sigmoid activation function is used, but for non-negative integers, ReLU should be used.
So, for each output neuron of the generator, different activation functions should be used, and I'm not sure if that's doable in TensorFlow. (Maybe in PyTorch?)
There are other types of features, such as ethnicity, which is neither a non-negative integer nor binary.
Ethnicity is one-hot among multiple classes (e.g. you can't be both "Asian" and "Hispanic"; there is "Other" instead). For this feature type, you should use softmax instead. But that's another technical issue I'm not sure TensorFlow can handle.
So in summary, with my implementation of medGAN, you should stick to generating only one feature type.
If you want to generate non-negative integers and binary values at the same time, I think you can try using only ReLU, but I'm not sure if that will work out great.
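
The idea of using a different activation per group of output neurons can be sketched without any framework, as below. This is only an illustration of the slicing logic, not a change to medGAN.py: the function `mixed_output`, the column layout (counts first, then binary, then one one-hot block), and the example values are all assumptions. In a deep learning framework the same slicing would be applied to the final layer's output; whether it trains well inside medGAN is not something this sketch shows.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def softmax(xs):
    # Subtract the max for numerical stability; xs must be non-empty.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(pre_activations, n_count, n_binary):
    """Assumed layout: [count features | binary features | one one-hot block]."""
    counts = [relu(x) for x in pre_activations[:n_count]]
    binary = [sigmoid(x) for x in pre_activations[n_count:n_count + n_binary]]
    onehot = softmax(pre_activations[n_count + n_binary:])
    return counts + binary + onehot

# One count column, two binary columns, three mutually exclusive classes.
out = mixed_output([-3.0, 0.0, 2.0, 1.0, 1.0, 1.0], n_count=1, n_binary=2)
```

Here the first value is clipped at zero by ReLU, the next two fall in (0, 1), and the last three sum to 1 because they share one softmax.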

For the second question: medGAN will generate synthetic samples that closely follow the distribution of the real data samples. So the dependency among all features will come into play. Since Patient ID is not being generated (unless you include them in the training data) no features will depend on Patient ID.

Hope this helps.

Best,
Ed

@Ankit267
Author

Thanks Ed, that really helps.
Will get back to you in case of any concerns.

@Myshgithub

Dear Ed
As you mentioned above, Patient ID is not being generated (unless we include it in the training data). So my question is: does medGAN generate synthetic samples with the same Patient IDs, in the same order, each time we run it? Thank you

@mp2893
Owner

mp2893 commented Oct 28, 2019

Hi Myshgithub,

The synthetic records generated by MedGAN will have no relationship whatsoever with the original Patient IDs. The generated records will be purely synthetic, and every time you generate a new batch, you get a fresh batch of synthetic records. Sometimes you might get duplicate records by chance, but that doesn't mean they are the same patient. MedGAN does not understand the concept of Patient ID, unless you modify it somehow.
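
To make the point about identifiers concrete, here is a small sketch in which a random stand-in plays the role of a trained generator. `fake_generate` is hypothetical and is not medGAN's sampling code; it only mimics "each call returns a fresh batch of rows". Any label attached afterwards is just the row's position in that batch and has no link to a real patient.

```python
import random

def fake_generate(n_rows, n_features, rng):
    """Hypothetical stand-in for a trained generator: random binary rows."""
    return [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]

rng = random.Random(0)
batch_a = fake_generate(3, 5, rng)  # first batch of synthetic rows
batch_b = fake_generate(3, 5, rng)  # a second, independently drawn batch

# The only "ID" available is the row position within a batch.
# These labels are arbitrary and do not correspond to original Patient IDs.
labeled = [("synthetic_%d" % i, row) for i, row in enumerate(batch_a)]
```

Row 0 of `batch_a` and row 0 of `batch_b` are unrelated draws, so the ordering within a batch carries no meaning.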

Best,
Ed

@Myshgithub

Thank you so much Ed!
But in the first comment above, Ankit267 mentioned that "Currently your python code generates patient id and ICD9 diagnosis codes.", and then you replied: "Since Patient ID is not being generated (unless you include them in the training data) no features will depend on Patient ID."

So my question is: what changes or modifications do you think I would need to make in order to have Patient IDs alongside the synthetic generated records?
1- One way, as discussed, seems to be to include the desired field (Patient ID) in the process_mimic file only. Is that correct?
2- Are you aware of any other way to include them (Patient IDs) in the training data?
3- In #3, on interpreting the generated samples, it is mentioned that “X is of shape (10000, 1071), and Each row corresponds to a single synthetic patient”. So is that just a row-number ID? And are the rows ordered?

I would appreciate your responses...
