About bioactivity annotations #11

velocirraptor23 · 2024-10-04T00:04:04Z

I am following the tutorial for inference with some pretrained models, but I am struggling to find the test dataset. Where is this file, and what type of information is needed for inference?
In the other notebook, for data splitting, the KinodataDocked dataset is loaded, which I understand has already been curated by RMSD. However, when I look at the structure of the dataset, I cannot see where the bioactivity annotation is; the only thing there is the class of bioactivity, which is pIC50. I am wondering if the actual value is hidden in the nodes or edges. I am just trying to understand the whole dataset.
Finally, I was wondering if those datasets have some information about where the pocket comes from, i.e. which protein it belongs to.

Thanks a lot

Cesar

mbackenkoehler · 2024-10-07T11:09:47Z

Hi Cesar,

thanks for your questions!

> I am following the tutorial for inference with some pretrained models, but I am struggling to find the test dataset. Where is this file, and what type of information is needed for inference?

The preprocessed models and datasets are located on Zenodo. Let me know if you are interested in a script that lets you apply the models to a file (e.g. mol2) of pocket and ligand.

> In the other notebook, for data splitting, the KinodataDocked dataset is loaded, which I understand has already been curated by RMSD. However, when I look at the structure of the dataset, I cannot see where the bioactivity annotation is; the only thing there is the class of bioactivity, which is pIC50. I am wondering if the actual value is hidden in the nodes or edges. I am just trying to understand the whole dataset.

The activity value for a HeteroData object is located in the y field. In data_splits.ipynb there are batches of these objects and that's why there are 128 such values.

> Finally, I was wondering if those datasets have some information about where the pocket comes from, i.e. which protein it belongs to.

In principle, you need to map the PyTorch Geometric objects to the corresponding entries in the kinodata-3D data frame. I am not quite sure how the identifiers relate, though. Maybe @joschka-gross can help out with this point.

Best,
Michael

velocirraptor23 · 2024-10-07T12:24:21Z

Hi @mbackenkoehler,

Thanks a lot for the kind response. About the script: yes, please. I would like to run inference with other molecules and/or binding sites. As for the rest of my questions, I think your answers clarified them. It would be nice to have a direct way to track the native protein and ligands in the database.

Best wishes,

Cesar

joschka-gross · 2024-10-11T13:36:10Z

Hi Cesar! I changed the dataset processing so that the dataset now includes the chembl_activity ID and the KLIFS structure ID. If you wish to add this information to an older processed version of the dataset, you can follow the new example in examples/patch_dataset_with_chembl_ids.ipynb. PR #12 will add this fix to the main branch.

Best,
Joschka

velocirraptor23 · 2024-10-13T17:27:51Z

Hi @joschka-gross, thanks for this. Now it should be easier to track everything. I tried the example notebook, but I am not sure I did it correctly: I just replaced the dataset.py and patch_with_data_source.py scripts. However, I got the error below, and I am not sure where it comes from. It occurs in the part where I pull the data frame: df = dataset.df. Any help would be much appreciated, and again, thanks for looking into my request.

#######

Reading data frame from /kinodata-3D-affinity-prediction/data/raw/kinodata_docked_v2.sdf.gz...
Deduping data frame (current size: 121913)...
119713 complexes remain after deduplication.
Checking for missing pocket mol2 files...

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3244/3244 [00:08<00:00, 387.83it/s]

ValueError                                Traceback (most recent call last)
Cell In[3], line 2
      1 # during dataset creation, this data frame is the main data source
----> 2 df = dataset.df

File /miniconda3/envs/kinodata/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File /kinodata-3D-affinity-prediction/kinodata/data/dataset.py:343, in KinodataDocked.df(self)
    341 @cached_property
    342 def df(self) -> pd.DataFrame:
--> 343     return process_raw_data(
    344         Path(self.raw_dir),
    345         self.raw_file_names[0],
    346         self.remove_hydrogen,
    347         self.pocket_dir,
    348         self.pocket_sequence_file,
    349         activity_type_subset=["pIC50"],
    350     )

File /kinodata/data/dataset.py:149, in process_raw_data(raw_dir, file_name, remove_hydrogen, pocket_dir, pocket_sequence_file, activity_type_subset)
    146 resp.raise_for_status()
    147 fp.write_bytes(resp.content)
--> 149 pocket_mol2_files = {
    150     int(fp.stem.split("_")[0]): fp for fp in (pocket_dir).iterdir()
    151 }
    152 df["pocket_mol2_file"] = [
    153     pocket_mol2_files[row["similar.klifs_structure_id"]] for _, row in df.iterrows()
    154 ]
    156 # backwards compatability

File /kinodata-3D-affinity-prediction/kinodata/data/dataset.py:150, in <dictcomp>(.0)
    146 resp.raise_for_status()
    147 fp.write_bytes(resp.content)
    149 pocket_mol2_files = {
--> 150     int(fp.stem.split("_")[0]): fp for fp in (pocket_dir).iterdir()
    151 }
    152 df["pocket_mol2_file"] = [
    153     pocket_mol2_files[row["similar.klifs_structure_id"]] for _, row in df.iterrows()
    154 ]
    156 # backwards compatability

ValueError: invalid literal for int() with base 10: '.ipynb'

Best wishes,
Cesar
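
One possible reading of the traceback above, offered as a sketch rather than a confirmed diagnosis: the string '.ipynb' is exactly what `fp.stem.split("_")[0]` yields for an entry named `.ipynb_checkpoints`, the hidden directory Jupyter creates next to saved notebooks. If such an entry sits inside the pocket directory, `int()` fails as shown. The helper `index_pocket_files` below is hypothetical (it is not part of kinodata), and the file name `1234_pocket.mol2` is made up for illustration. It shows how the lookup could skip entries that lack a numeric prefix.

```python
from pathlib import Path
import tempfile


def index_pocket_files(pocket_dir: Path) -> dict[int, Path]:
    """Map the integer prefix of each pocket file name to its path.

    Hypothetical helper, not part of kinodata: entries that are not
    regular files or whose prefix is not a number are skipped instead
    of being passed to int().
    """
    index: dict[int, Path] = {}
    for fp in pocket_dir.iterdir():
        prefix = fp.stem.split("_")[0]
        if not (fp.is_file() and prefix.isdecimal()):
            # e.g. '.ipynb_checkpoints' has the prefix '.ipynb'
            continue
        index[int(prefix)] = fp
    return index


def demo() -> dict[int, Path]:
    """Build a throwaway pocket directory containing a stray checkpoint folder."""
    with tempfile.TemporaryDirectory() as tmp:
        pocket_dir = Path(tmp)
        (pocket_dir / "1234_pocket.mol2").touch()   # illustrative file name
        (pocket_dir / ".ipynb_checkpoints").mkdir()  # stray Jupyter directory
        return index_pocket_files(pocket_dir)
```

If this is indeed the cause, removing the stray `.ipynb_checkpoints` directory from the pocket directory should also let the original dictionary comprehension run unchanged.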

joschka-gross linked a pull request Oct 11, 2024 that will close this issue

Improvements to fix missing data sources #12

Open
