About bioactivity annotations #11

Open
velocirraptor23 opened this issue Oct 4, 2024 · 4 comments · May be fixed by #12

@velocirraptor23

I am following the tutorial for inference with some of the pretrained models, but I am struggling to find the test dataset. Where is this file, and what type of information is needed for inference?
In the other notebook, for data splitting, the KinodataDocked dataset is loaded, which I understand has already been curated by RMSD. However, when I look at the structure of the dataset, I cannot see where the bioactivity annotation is; the only thing there is the bioactivity type, which is pIC50. I am wondering whether the actual value is hidden in the nodes or edges. I am just trying to understand the whole dataset.
Finally, I was wondering whether those datasets have any information about where the pocket comes from, i.e. which protein it belongs to.

Thanks a lot

Cesar

@mbackenkoehler
Collaborator

Hi Cesar,

thanks for your questions!

> I am following the tutorial for inference with some of the pretrained models, but I am struggling to find the test dataset. Where is this file, and what type of information is needed for inference?

The preprocessed models and datasets are located on Zenodo. Let me know if you are interested in a script that lets you apply the models to a pocket and ligand file (e.g. mol2).

> In the other notebook, for data splitting, the KinodataDocked dataset is loaded, which I understand has already been curated by RMSD. However, when I look at the structure of the dataset, I cannot see where the bioactivity annotation is; the only thing there is the bioactivity type, which is pIC50. I am wondering whether the actual value is hidden in the nodes or edges. I am just trying to understand the whole dataset.

The activity value for a HeteroData object is located in its y field. In data_splits.ipynb these objects are batched, which is why you see 128 such values there.
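
For readers wondering what the number in y represents: assuming the stored label is a pIC50 (the activity type mentioned in the question), it is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below converts between the two scales. The function names are illustrative only and are not part of the kinodata code base.

```python
import math


def pic50_to_ic50_nm(pic50: float) -> float:
    """Convert a pIC50 value to an IC50 in nanomolar.

    pIC50 = -log10(IC50 in mol/L), and 1 mol/L = 1e9 nM,
    so IC50 in nM = 10 ** (9 - pIC50).
    """
    return 10.0 ** (9.0 - pic50)


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar back to the pIC50 scale."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# Illustrative values only, not taken from the dataset.
example_ic50_nm = pic50_to_ic50_nm(6.0)
example_pic50 = ic50_nm_to_pic50(1.0)
print(example_ic50_nm, example_pic50)
```

A higher pIC50 therefore corresponds to a lower IC50, i.e. a more potent compound.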

> Finally, I was wondering whether those datasets have any information about where the pocket comes from, i.e. which protein it belongs to.

In principle, you need to map the PyTorch Geometric objects to the corresponding entries in the kinodata-3D data frame. I am not quite sure how the idents relate, though. Maybe @joschka-gross can help out with this point.

Best,
Michael

@velocirraptor23
Author

velocirraptor23 commented Oct 7, 2024

Hi @mbackenkoehler,

Thanks a lot for the kind response. About the script: yes, please. I would like to run inference with other molecules and/or binding sites. As for the rest of my questions, I think your answers clarified them. It would be nice to have a direct way to track the native proteins and ligands in the database.

Best wishes,

Cesar

joschka-gross linked a pull request Oct 11, 2024 that will close this issue
@joschka-gross
Collaborator

Hi Cesar! I changed the dataset processing such that the dataset will now include the chembl_activity ID and the KLIFS structure ID. If you wish to add this information to an older processed version of the dataset, you can follow the new example in examples/patch_dataset_with_chembl_ids.ipynb. PR #12 will add this fix on the main branch.
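
Once each processed complex carries these IDs, tracing it back to its source record is a plain key lookup. The sketch below illustrates the idea with ordinary dictionaries standing in for rows of the data frame. The key names chembl_activity_id and klifs_structure_id, the helper index_rows, and all values are hypothetical placeholders, not the actual attribute names or data of the dataset.

```python
# Hypothetical stand-ins for rows of the kinodata-3D data frame.
# All IDs and values below are made-up placeholders.
rows = [
    {"chembl_activity_id": 101, "klifs_structure_id": 5001, "pIC50": 7.2},
    {"chembl_activity_id": 102, "klifs_structure_id": 5002, "pIC50": 5.9},
]


def index_rows(rows: list, key: str) -> dict:
    """Build a lookup table from an ID column to the full row."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate {key}: {row[key]}")
        index[row[key]] = row
    return index


by_activity = index_rows(rows, "chembl_activity_id")

# Given the ID stored on a processed complex, recover its source record.
record = by_activity[102]
print(record["klifs_structure_id"])
```

With the real data frame, the same lookup could be done by filtering on the corresponding ID column.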

Best,
Joschka

@velocirraptor23
Author

velocirraptor23 commented Oct 13, 2024

Hi @joschka-gross, thanks for this. Now it should be easier to track everything. I tried the example notebook after replacing only the dataset.py and patch_with_data_source.py scripts, but I got the error below, and I am not sure whether I did it correctly or where the error comes from. It occurs in the step where I pull the data frame: df = dataset.df. Any help would be much appreciated, and thanks again for looking into my request.

#######

Reading data frame from /kinodata-3D-affinity-prediction/data/raw/kinodata_docked_v2.sdf.gz...
Deduping data frame (current size: 121913)...
119713 complexes remain after deduplication.
Checking for missing pocket mol2 files...

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3244/3244 [00:08<00:00, 387.83it/s]


ValueError Traceback (most recent call last)
Cell In[3], line 2
1 # during dataset creation, this data frame is the main data source
----> 2 df = dataset.df

File /miniconda3/envs/kinodata/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
979 val = cache.get(self.attrname, _NOT_FOUND)
980 if val is _NOT_FOUND:
--> 981 val = self.func(instance)
982 try:
983 cache[self.attrname] = val

File /kinodata-3D-affinity-prediction/kinodata/data/dataset.py:343, in KinodataDocked.df(self)
341 @cached_property
342 def df(self) -> pd.DataFrame:
--> 343 return process_raw_data(
344 Path(self.raw_dir),
345 self.raw_file_names[0],
346 self.remove_hydrogen,
347 self.pocket_dir,
348 self.pocket_sequence_file,
349 activity_type_subset=["pIC50"],
350 )

File /kinodata/data/dataset.py:149, in process_raw_data(raw_dir, file_name, remove_hydrogen, pocket_dir, pocket_sequence_file, activity_type_subset)
146 resp.raise_for_status()
147 fp.write_bytes(resp.content)
--> 149 pocket_mol2_files = {
150 int(fp.stem.split("_")[0]): fp for fp in (pocket_dir).iterdir()
151 }
152 df["pocket_mol2_file"] = [
153 pocket_mol2_files[row["similar.klifs_structure_id"]] for _, row in df.iterrows()
154 ]
156 # backwards compatability

File /kinodata-3D-affinity-prediction/kinodata/data/dataset.py:150, in <dictcomp>(.0)
146 resp.raise_for_status()
147 fp.write_bytes(resp.content)
149 pocket_mol2_files = {
--> 150 int(fp.stem.split("_")[0]): fp for fp in (pocket_dir).iterdir()
151 }
152 df["pocket_mol2_file"] = [
153 pocket_mol2_files[row["similar.klifs_structure_id"]] for _, row in df.iterrows()
154 ]
156 # backwards compatability

ValueError: invalid literal for int() with base 10: '.ipynb'

Best wishes,
Cesar
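
The '.ipynb' in the error message is consistent with a hidden .ipynb_checkpoints directory, which Jupyter creates, sitting inside the pocket directory. Its stem is '.ipynb_checkpoints', so split("_")[0] yields '.ipynb', which int() cannot parse. Deleting that directory is one possible workaround. Another is to skip every entry that does not look like a pocket file, as in the minimal sketch below. The helper name collect_pocket_files and the demo file name are hypothetical, and this is not necessarily how the repository handles it.

```python
import tempfile
from pathlib import Path


def collect_pocket_files(pocket_dir: Path) -> dict[int, Path]:
    """Map the integer ID prefix of each pocket mol2 file to its path.

    Entries that are not .mol2 files, or whose name does not start with
    an integer ID followed by "_", are skipped instead of raising.
    """
    pocket_files = {}
    for fp in pocket_dir.iterdir():
        if not fp.is_file() or fp.suffix != ".mol2":
            continue  # e.g. the .ipynb_checkpoints directory
        try:
            structure_id = int(fp.stem.split("_")[0])
        except ValueError:
            continue  # file name without a leading integer ID
        pocket_files[structure_id] = fp
    return pocket_files


# Demo in a throwaway directory with a hypothetical pocket file name.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "1234_pocket.mol2").touch()
    (demo_dir / ".ipynb_checkpoints").mkdir()
    demo_ids = sorted(collect_pocket_files(demo_dir))

print(demo_ids)
```

Filtering on the file suffix keeps the lookup robust against other stray entries too, such as editor backup files.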
