How to interpret the samples? #3
Comments
First of all, you should round your X so that it contains integers. Your X has shape (10000, 1071). If you want to do a correlation analysis, you can create a correlation matrix using the Pearson correlation coefficient (or any other correlation metric you choose, e.g. Spearman correlation). The resulting correlation matrix will, of course, have shape (1071, 1071). Actually, it would be interesting to provide the correlation information to the discriminator during training. This could be the next version of the minibatch averaging I used in the paper. I am not sure it would be as effective as minibatch averaging, because it would be much harder for the generator to figure out how to satisfy the correlation statistics than to satisfy the average statistics. But it would still be worth a try. |
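For anyone wanting to try the rounding and correlation analysis described above, here is a minimal sketch. It uses a small random array as a stand-in for the generated samples (the real `X` would have shape (10000, 1071)); the variable names and the choice to drop constant columns are my assumptions, not part of medGAN.

```python
import numpy as np

# Stand-in for the generated samples; the real X would have shape (10000, 1071).
rng = np.random.default_rng(0)
X = rng.gamma(shape=0.5, scale=1.0, size=(1000, 20))

# Round to the nearest integer and clip negatives so every entry is a valid count.
X_counts = np.clip(np.rint(X), 0, None).astype(int)

# Pearson correlation is undefined for zero-variance columns, so drop them first.
keep = X_counts.std(axis=0) > 0

# rowvar=False treats each column (code) as a variable and each row as a sample.
corr = np.corrcoef(X_counts[:, keep], rowvar=False)

print(corr.shape)
```

With all 1071 columns kept, `corr` would have shape (1071, 1071). For Spearman correlation, `scipy.stats.spearmanr` could be swapped in, but that is not shown here.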
Ah, that makes sense. I just re-read the paper, though, and it says for dataset
So does that mean I made a mistake by using "counts" for the MIMIC dataset? I'll retrain anyway using "binary"; I'm curious to see the differences. |
No, in the paper I simply chose to use the binary matrix. It is entirely up to you which matrix you want to synthesize. Of course, medGAN will perform better with the binary matrix (especially since MIMIC-III has only 45k samples), but count matrices are more informative. |
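To make the binary versus count distinction concrete, here is a small sketch with a toy count matrix of my own. Whether `process_mimic.py` builds its binary matrix in exactly this way is an assumption; the point is only that the binary matrix records presence or absence of a code, while the count matrix records how often it occurs.

```python
import numpy as np

# Toy count matrix: rows are patients, columns are codes (made-up values).
counts = np.array([[0, 2, 1],
                   [3, 0, 0]])

# Binary view: 1 if the code appears at least once for the patient, else 0.
binary = (counts > 0).astype(int)

print(binary)
```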
Hi Edward, I have read the responses above, but just want to check that I understand correctly. Once we have the resulting numpy array, I should round the array to the nearest integer and map each column to its corresponding ICD9 code for the patient. Another question about the resulting array: what do the values mean before they are rounded? Thank you! |
Hi Ariel, Yes that is correct. |
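A rough sketch of the rounding and column-to-code mapping confirmed above. The `types` dictionary here is hypothetical (three made-up labels mapped to column indices); in practice you would use whatever code-to-column mapping was produced when the training matrix was built.

```python
import numpy as np

# Hypothetical code-to-column mapping; the labels are for illustration only.
types = {"D_401": 0, "D_250": 1, "D_428": 2}
index_to_code = {idx: code for code, idx in types.items()}

# Toy stand-in for two generated rows before rounding.
synthetic = np.array([[0.2, 1.7, 0.0],
                      [2.6, 0.1, 0.9]])

# Round to the nearest integer and clip negatives to get valid counts.
rounded = np.clip(np.rint(synthetic), 0, None).astype(int)

# For each synthetic patient, keep only the codes with a non-zero count.
records = [
    {index_to_code[int(j)]: int(row[j]) for j in np.flatnonzero(row)}
    for row in rounded
]

print(records)
```

Each entry of `records` is then a dictionary from code to count for one synthetic patient.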
Hi,
As the final line shows, the output synthetic data are all floats between 0 and 1. Is that what you expect (in which case, how should I interpret them?), or should they be counts? Thanks! |
Hi RacingTadpole, If the provided training data are proper count data, then the output should also be count data. Thanks, |
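One thing that may be worth checking in this situation is whether the training matrix really contains counts or is binary. The helper below is a sketch of such a check; `summarize` is a hypothetical name of mine, not something medGAN provides, and the arrays are toy data.

```python
import numpy as np

def summarize(matrix):
    """Report the value range and whether entries look binary or count-like."""
    m = np.asarray(matrix, dtype=float)
    integer_valued = bool(np.all(m == np.rint(m)))
    return {
        "min": float(m.min()),
        "max": float(m.max()),
        "integer_valued": integer_valued,
        "binary": integer_valued and set(np.unique(m)) <= {0.0, 1.0},
    }

# Toy training matrices: one with counts, one binarized.
train_counts = np.array([[0, 3, 1], [2, 0, 0]])
train_binary = (train_counts > 0).astype(int)

print(summarize(train_counts))
print(summarize(train_binary))
```

Running the same function on the generated samples shows at a glance whether they stay within 0 and 1 or span a wider range.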
Hi ED, Hoping not too late to post my finding here. |
Hello ED, Thank you for your help! |
I found the mistake in All the best, |
So will the synthetic records generated by medGAN always be identified only by a row number, which is not a patient ID (unless the patient ID in our dataset happens to be the same as the row number)? Thanks in advance for your consideration. |
Hi Ed,
thank you very much for adding the `process_mimic.py` script :)
It all worked fairly painlessly, following your clear instructions (I used "counts"), and now I'm the very proud owner of `10000` synthetic EHRs - woohoo!!!
So I loaded the samples, but I'm not sure how to interpret them.
I just realized I'm not sure what `synthetic_ehr` is. Does it look right to you?
I thought it would be like a row of a table where the columns are the `1071` ICD-9 codes, and the counts are the number of times those entities appear in the patient's EHR. So the counts should be whole numbers, and would give some idea of co-morbidities? For example, cardiovascular and metabolic disorders would frequently co-occur?
So would one way of analysis be a correlation matrix?
Thanks very much 👍