
Protein function prediction with GO - Part 3 #64


Merged: 78 commits into dev on Apr 2, 2025

Conversation

aditya0by0
Member

@aditya0by0 aditya0by0 commented Nov 4, 2024

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

From comment #36 (comment)

  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation for comparison with DeepGO; a sketch of the computation follows below)
  • DeepGO-SE (paper): use these results as a baseline and integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
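
For orientation, here is a minimal sketch of the protein-centric Fmax as defined in the CAFA challenges (which DeepGO follows). This is an illustration only, not the evaluation code added in this PR, and it assumes every protein has at least one true label:

import numpy as np

def fmax(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    # y_true:   (n_proteins, n_labels) binary ground truth
    # y_scores: (n_proteins, n_labels) predicted scores in [0, 1]
    best = 0.0
    for t in np.linspace(0.01, 1.0, 100):
        pred = y_scores >= t
        # precision only counts proteins with at least one prediction above t
        covered = pred.any(axis=1)
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        precision = (tp[covered] / pred[covered].sum(axis=1)).mean()
        recall = (tp / y_true.sum(axis=1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best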

@aditya0by0 aditya0by0 self-assigned this Nov 4, 2024
@aditya0by0 aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
@aditya0by0 aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15
@aditya0by0
Member Author

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
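
For context, the reader_kwargs: {n_gram: 3} setting means amino-acid sequences are tokenized into overlapping 3-grams before being embedded. A minimal sketch of that idea (illustrative only; the actual tokenization in chebai's reader may differ in detail):

def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    # e.g. "MKTAY" with n=3 -> ["MKT", "KTA", "TAY"]
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

With 20 standard amino acids there are 20^3 = 8000 possible 3-grams, which is consistent with the vocabulary size of 8500 mentioned later in this thread.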

@aditya0by0
Member Author

@sfluegel05, I have made the suggested changes for SCOPe. Please check.

> I have generated a new SCOPe50 dataset, but there still seem to be labels which have 0 protein sequences assigned to them. Could you have a look at that?

@sfluegel05, I have resolved this issue. Please check.

@sfluegel05
Collaborator

Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.

This should be the other way round:

  1. Find the sequences and all labels that can be applied to each sequence
  2. Based on that information, construct the graph
  3. Based on the graph, select the labels that pass the threshold

I hope this helps!
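
A minimal sketch of the suggested order, with hypothetical helper data (this illustrates the idea only; the graph construction is elided here, with label counting standing in for it, and this is not the actual chebai code):

from collections import Counter

def build_raw_dataset(seq_to_labels: dict[str, list[str]], threshold: int = 50) -> dict[str, list[str]]:
    # 1. seq_to_labels already maps each sequence to all labels that apply to it
    # 2. count, per label, how many sequences carry it
    counts = Counter(label for labels in seq_to_labels.values() for label in labels)
    # 3. keep only labels that pass the instance threshold
    selected = {label for label, count in counts.items() if count >= threshold}
    return {seq: [l for l in labels if l in selected] for seq, labels in seq_to_labels.items()}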

@aditya0by0
Member Author

> Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.
>
> This should be the other way round:
>
>   1. Find the sequences and all labels that can be applied to each sequence
>   2. Based on that information, construct the graph
>   3. Based on the graph, select the labels that pass the threshold
>
> I hope this helps!

Thanks for the suggestion. I have fixed the issue, and all labels now have at least 50 true instances for SCOPe50.
I have also started a training run for it, but I am facing an error related to ELECTRA. Please check here. Please let me know if you have any suggestions on how to resolve this.

Also, I have made the suggested changes to the SCOPe notebook.

Please check.

@aditya0by0 aditya0by0 mentioned this pull request Mar 17, 2025
@sfluegel05
Collaborator

My first guess is that you have to change model.config.max_position_embeddings. It is set to 1,800 at the moment; apparently you need ~2,500 instead.
Thanks for making the changes to the notebook.
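
For illustration, with HuggingFace transformers that change looks roughly like this (assuming chebai's ELECTRA model wraps transformers' ElectraModel; in practice the value belongs in the model config file):

from transformers import ElectraConfig, ElectraModel

# the positional embedding table must cover the longest tokenized sequence
config = ElectraConfig(max_position_embeddings=2500)
model = ElectraModel(config)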

@aditya0by0
Member Author

@sfluegel05, I increased max_position_embeddings of ELECTRA to 3000 in 2b0ed0a, since it was still throwing the same error at 2500.

I have already started the training, but only 5 epochs have completed in 17 hours.

@aditya0by0
Member Author

Please check the results here: after 24 hours of training, only 6 epochs were completed. The batch file has a maximum timeout of 24 hours.

@sfluegel05
Collaborator

Sorry for the late reply. This is indeed strange. Comparing it to other runs, I don't see a reason why your run should be this slow. At least, it seems to speed up towards the end:
[image: training speed over the course of the run]

But even the final speed of 1 epoch per hour is too slow. For comparison: chebi50 has 1,524 classes and ~1,400 steps per epoch, yet my latest run on chebi50 finished 200 epochs in 14 hours (wandb run). The model parameters and batch size should be the same for both.

A few things you can check / try:

  • which GPU you are using - some are faster than others. The wandb run says the slow run was on hpc3-53 which should have an A100 GPU. That is the one I use as well.
  • try different nodes: -w hpc3-52 or --nodelist=hpc3-52 (gpu nodes are 52, 53 and 54)
  • change some of the memory / cpu configurations in the batch file
  • check if you are repeatedly doing some expensive preprocessing

These are the sbatch parameters I use as defaults, but I have not checked if they are optimal for the SCOPe task:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --threads-per-core=1
#SBATCH --mem=256000
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1

@aditya0by0
Member Author

@sfluegel05
In all other runs, such as those with the ChEBI data, we use a pre-trained ELECTRA, which was trained with a vocabulary size of 1400 and a maximum position embedding of 1800.

However, for SCOPe we can't use the same pre-trained ELECTRA model, because we have increased the vocabulary size to 8500 and the max position embedding to 3000. As a result, we are training ELECTRA from scratch, without any pre-trained weights. I strongly suspect that the increased vocabulary size, the larger max position embedding, and training without pre-trained weights are contributing to the slower training.

@aditya0by0
Member Author

All partitions have a 2-day (48-hour) time limit:

/home/staff/a/akhedekar/python-chebai$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
workq*       up 2-00:00:00        50/0/1/51 hpc3-[1-51]
gpu          up 2-00:00:00          4/0/0/4 hpc3-[52-54],klpsy-1
klab-cpu     up 2-00:00:00          0/3/0/3 klab-[2,5-6]
klab-gpu     up 2-00:00:00          1/0/0/1 klab-1
klab-l40s    up 2-00:00:00          1/1/0/2 klab-[3-4]

The GPU partition (gpu) has no explicit memory limits, because:
MaxMemPerNode=UNLIMITED
DefMemPerNode=UNLIMITED
The reported memory (sinfo -o "%P %m") shows 950,000+ MB (~950GB per node).

/home/staff/a/akhedekar/python-chebai$ sinfo -o "%P %m"
PARTITION MEMORY
workq* 950000
gpu 950000+
klab-cpu 950000+
klab-gpu 1950000
klab-l40s 950000
/home/staff/a/akhedekar/python-chebai$ scontrol show partition gpu
PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc3-[52-54],klpsy-1
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

@aditya0by0
Member Author

@sfluegel05,
As observed after the unit-test failures in commit ef4bc0b, simply commenting out or removing the protein-related dependencies (such as esm) will not work. This leads to an ImportError for users, since imports like esm are used at module level in files like reader.py.

To resolve this, we can either:

  1. Move the imports into the constructors of the respective classes, so that they are only imported when a class is instantiated (see the sketch below).
  2. Alternatively, consider lazy-loading the dependencies in a more controlled manner, depending on the specific use case.

This will prevent unnecessary imports and avoid breaking functionality for users who don't require these dependencies.
Please let me know your suggestions.
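
A minimal sketch of option 1, using a hypothetical reader class (fair-esm is the PyPI name of the esm package):

class ESMReader:
    # hypothetical reader illustrating option 1: the heavyweight `esm`
    # import happens only when the class is instantiated, so importing
    # reader.py itself stays cheap for users without the protein extras
    def __init__(self):
        try:
            import esm  # protein-only optional dependency
        except ImportError as e:
            raise ImportError(
                "ESMReader requires the optional esm package (pip install fair-esm)."
            ) from e
        self._esm = esm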

@sfluegel05
Collaborator

Regarding the memory limits: You are right, there is a lot of memory per node. However, only 80GB are available per GPU. You can specify the number of GPUs with sbatch --gres=gpu:2 (2 being the number of GPUs), but this might prolong the wait time on the cluster. In any case, I would suggest modifying the dataset so that a lower max_position_embeddings suffices (i.e., filtering out input sequences above a certain length).
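
A sketch of that filtering step (the cutoff value is illustrative, not a recommendation from this thread):

def filter_by_length(records: list[tuple[str, list[str]]], max_len: int = 1000) -> list[tuple[str, list[str]]]:
    # drop over-long protein sequences so that a smaller
    # max_position_embeddings suffices
    return [(seq, labels) for seq, labels in records if len(seq) <= max_len]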

Regarding the imports: Here the plan is: first get this branch merged (including dependencies), then, on a new branch, remove all protein-related code (and put it in python-chebai-proteins). Then, we can remove the imports without any issues.

- the vocab size was increased for proteins in commit a12354b
- as we are going to move protein-related code to the new repo, revert this to the original value
@aditya0by0
Member Author

> Regarding the imports: Here the plan is: first get this branch merged (including dependencies), then, on a new branch, remove all protein-related code (and put it in python-chebai-proteins). Then, we can remove the imports without any issues.

@sfluegel05, I have reverted the relevant commits. Can you please review and merge this branch/PR so we can proceed with the plan?

@sfluegel05 sfluegel05 marked this pull request as ready for review April 2, 2025 16:08
@sfluegel05 sfluegel05 merged commit 052677e into dev Apr 2, 2025
8 checks passed
@sfluegel05 sfluegel05 deleted the protein_prediction branch April 2, 2025 16:10
Merging this pull request may close these issues: Add SCOPe dataset to our pipeline; Protein function prediction with GO.