possible issue with random sampling #15
Comments
The average distance between molecules is ~20. The molecules are distributed as you said, close to the SMILES of the training set molecules. My guess is that setting the noise to a very large value makes it difficult to decode valid SMILES at all. Bumping @beangoben to see if he has any ideas.
Hi Abrusan, I think the problem might be that the noise level is too high. Searching for molecules from random vectors that are 50-200 STD away (z-distance wise) is a huge jump. Each dimension is assumed to be Gaussian distributed, so the actual probability mass outside of 2-4 STD should be quite small (https://arxiv.org/abs/1609.04468). If I had to guess, I would think the RNN is just decoding whatever it can make sense of from the first molecule in the batch. I think you will find more subtle differences and a larger variety if you sample with 0.5 (local neighborhood) or 1.0-2.0 (random molecules).
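To put rough numbers on the scale argument above, here is a minimal sketch. It assumes a standard normal prior in each of the 196 latent dimensions (the dimension comes from the `(1, 196)` shape in the output of the original post; the unit-variance prior is an assumption, not something confirmed from the model code). The helper name `gaussian_norm_stats` is made up for this illustration.

```python
import numpy as np

def gaussian_norm_stats(dim=196, n_samples=10000, seed=0):
    """Draw points from an assumed N(0, I) prior and summarise their norms."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, dim))
    norms = np.linalg.norm(z, axis=1)
    return norms.mean(), norms.std(), norms.max()

mean_norm, std_norm, max_norm = gaussian_norm_stats()
print(f"mean norm: {mean_norm:.2f}")
print(f"std of norm: {std_norm:.2f}")
print(f"largest norm seen: {max_norm:.2f}")
```

Under that assumption the norms cluster tightly around sqrt(196) = 14, so an offset of length 50-200 would land well outside the region where essentially all of the prior mass sits.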
Hi guys, Thanks for the comments. I have to admit I still think something critical is missing. I used several noise levels: 3, 6, 12, 25, 50, 100 (but also tried 0.1, and even 200; [noise=N, df = vae.z_to_smiles(z_1,decode_attempts=100,noise_norm=noise)]). My aim was to have a gradient from noise levels that sample the close neighborhood of a SMILES to noise levels that effectively pick SMILES randomly from the entire latent space. To my surprise, dramatic increases in the specified noise level lead to rather modest increases in the diversity of the returned molecules - I have not reached a noise level that effectively returns SMILES that are structurally unrelated to the input. (In other words - what noise level should I use to sample the entire latent space, essentially randomly?) So it seems that in practice the relationship between the noise level and the structural diversity of the returned SMILES is rather nontrivial, which is surprising given the perturb_z function (vae_utils.py), but I am not a Python programmer. It would be great if you could clarify this. Best wishes,
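For readers trying to reason about how a noise norm could translate into distance from the input, here is a minimal sketch of two common perturbation schemes. The function `perturb_sketch` and its `constant_norm` flag are hypothetical and written only for this illustration; whether `perturb_z` in vae_utils.py behaves like either variant is not confirmed here and should be checked against the source.

```python
import numpy as np

def perturb_sketch(z, noise_norm, constant_norm=False, rng=None):
    """Hypothetical helper: shift each row of z along a random unit direction."""
    rng = np.random.default_rng() if rng is None else rng
    if noise_norm <= 0.0:
        return z.copy()
    # Random direction per row, normalised to unit length.
    direction = rng.standard_normal(z.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    if constant_norm:
        # Every sample lands exactly noise_norm away from its input.
        amplitude = noise_norm
    else:
        # Step length drawn uniformly from [0, noise_norm) per row.
        amplitude = rng.uniform(0.0, noise_norm, size=(z.shape[0], 1))
    return z + amplitude * direction

rng = np.random.default_rng(0)
z = np.tile(rng.standard_normal((1, 196)), (100, 1))  # 100 copies of one point
shifts = np.linalg.norm(perturb_sketch(z, 50.0, rng=rng) - z, axis=1)
fixed = np.linalg.norm(perturb_sketch(z, 50.0, constant_norm=True, rng=rng) - z, axis=1)
print("uniform amplitude: min/max shift", shifts.min(), shifts.max())
print("constant norm: min/max shift", fixed.min(), fixed.max())
```

In the uniform-amplitude variant, the requested noise level is only an upper bound on the step length, so a fraction of the samples always stays close to the input regardless of how large the bound is. Comparing the actual displacement norms, as done above, is one way to check which regime a given sampler is in.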
Hi guys,
I have encountered the following "weird" behaviour when I sample the latent space near a SMILES: the output molecules change very little with the specified noise level. My installation seems to be okay, as it reproduces the examples (I'm using a CPU-based installation), so I wonder whether I am missing something. I provide some examples below, but the same holds for many other molecules. (For the cases here I take only 100 samples, but for "production" work I take tens of thousands, and the pattern remains.)
Noise 200:
$ python get_vae_smiles.py "CSCC(=O)NNC(=O)c1c(C)oc(C)c1C" 2>/dev/null
Using standarized functions? True
Standarization: estimating mu and std values ...done!
Input : CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
Reconstruction : CSCC(=O)N(C(=O)c1c(C)oc(C)c1C
Z representation : (1, 196) with norm 10.705
Searching molecules randomly sampled from 200.00 std (z-distance) from the point
Found 10 unique mols, out of 30
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
2 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
3 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
4 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1Cl
7 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
8 COCC(=O)NC(=O)c1cc(O)nc(C)c1C
9 C#COC(=N)NC(=O)c1ccccc(Cl)cc1Cl
Name: smiles, dtype: object
Noise 2:
Searching molecules randomly sampled from 2.00 std (z-distance) from the point
Found 13 unique mols, out of 75
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
2 CSCC(=O)NC(C=O)c1c(C)oc(C)c1C
3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 COC(C=O)NNC(=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
7 CSCC(=O)NCC(=O)c1c(O)oc(C)c1C
8 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
9 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
10 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
11 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
Name: smiles, dtype: object
Noise 50:
Searching molecules randomly sampled from 50.00 std (z-distance) from the point
Found 14 unique mols, out of 65
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 COCC(=O)NNC(=O)c1c(C)oc(C)c1C
2 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 CSC(C=O)NC(C=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
7 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
8 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
9 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
10 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
11 ClC(C=O)NCC(=O)c1c(C)oc(C)c1C
12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
13 ClCC(=O)NC(C=O)c1c(C)oc(C)c1C
Name: smiles, dtype: object
So it seems that for large Z distances the SMILES are not much different from those for small distances. What is the distribution of the random sampling? I would expect this behaviour if the random sampling is not uniform and is heavily biased towards the coordinates of the input SMILES, so that the specified noise level affects only the periphery, and most molecules of the output still originate from the close neighbourhood of the SMILES.
I would greatly appreciate any help with this issue.
Best wishes,
Gyorgy Abrusan