
Protein function prediction with GO - Part 3 #64


Merged: 78 commits into dev on Apr 2, 2025

Conversation

aditya0by0
Member

@aditya0by0 aditya0by0 commented Nov 4, 2024

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

From comment #36 (comment)

  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation for comparison with DeepGO; a sketch of the computation follows below)
  • DeepGO-SE (paper): use these results as a baseline and integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
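
For orientation, here is a minimal sketch of the protein-centric Fmax as defined in the CAFA challenges (which DeepGO follows). This is an illustration only, not the evaluation code added in this PR, and it assumes every protein has at least one true label:

import numpy as np

def fmax(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    # y_true:   (n_proteins, n_labels) binary ground truth
    # y_scores: (n_proteins, n_labels) predicted scores in [0, 1]
    best = 0.0
    for t in np.linspace(0.01, 1.0, 100):
        pred = y_scores >= t
        # precision only counts proteins with at least one prediction above t
        covered = pred.any(axis=1)
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        precision = (tp[covered] / pred[covered].sum(axis=1)).mean()
        recall = (tp / y_true.sum(axis=1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best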

@aditya0by0 aditya0by0 self-assigned this Nov 4, 2024
@aditya0by0 aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
@aditya0by0 aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15
@aditya0by0
Member Author

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
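
For context, the reader_kwargs: {n_gram: 3} setting means amino-acid sequences are tokenized into overlapping 3-grams before being embedded. A minimal sketch of that idea (illustrative only; the actual tokenization in chebai's reader may differ in detail):

def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    # e.g. "MKTAY" with n=3 -> ["MKT", "KTA", "TAY"]
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

With 20 standard amino acids there are 20^3 = 8000 possible 3-grams, which is consistent with the vocabulary size of 8500 mentioned later in this thread.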

@aditya0by0
Member Author

@sfluegel05, I have made the suggested changes for SCOPe. Please check.

> I have generated a new SCOPe50 dataset, but there still seem to be labels which have 0 protein sequences assigned to them. Could you have a look at that?

@sfluegel05, I have resolved this issue. Please check.

@sfluegel05
Collaborator

Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.

This should be the other way round:

  1. Find the sequences and all labels that can be applied to each sequence
  2. Based on that information, construct the graph
  3. Based on the graph, select the labels that pass the threshold

I hope this helps!
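
A minimal sketch of the suggested order, with hypothetical helper data (this illustrates the idea only; the graph construction is elided here, with label counting standing in for it, and this is not the actual chebai code):

from collections import Counter

def build_raw_dataset(seq_to_labels: dict[str, list[str]], threshold: int = 50) -> dict[str, list[str]]:
    # 1. seq_to_labels already maps each sequence to all labels that apply to it
    # 2. count, per label, how many sequences carry it
    counts = Counter(label for labels in seq_to_labels.values() for label in labels)
    # 3. keep only labels that pass the instance threshold
    selected = {label for label, count in counts.items() if count >= threshold}
    return {seq: [l for l in labels if l in selected] for seq, labels in seq_to_labels.items()}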

@aditya0by0
Member Author

> Now, the number of instances per label is at least 1, but still less than 50 in many cases. The main issue seems to be that the threshold is applied before most of the processing. In the function graph_to_raw_dataset(), the graph is given as an input. Based on that graph, the threshold is applied and only after that, you do all the resolving from domains to class labels and sequences.
>
> This should be the other way round:
>
>   1. Find the sequences and all labels that can be applied to each sequence
>   2. Based on that information, construct the graph
>   3. Based on the graph, select the labels that pass the threshold
>
> I hope this helps!

Thanks for the suggestion. I have fixed the issue, and all labels now have at least 50 true instances for SCOPe50.
I have also started a training run for it, but I am facing an error related to ELECTRA. Please check here. Please let me know if you have any suggestions on how to resolve this.

Also, I have made the suggested changes to the SCOPe notebook.

Please check.

@aditya0by0 aditya0by0 mentioned this pull request Mar 17, 2025
@sfluegel05
Collaborator

My first guess is that you have to change model.config.max_position_embeddings. It is set to 1,800 at the moment; apparently you need ~2,500 instead.
Thanks for making the changes to the notebook.
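
For illustration, with HuggingFace transformers that change looks roughly like this (assuming chebai's ELECTRA model wraps transformers' ElectraModel; in practice the value belongs in the model config file):

from transformers import ElectraConfig, ElectraModel

# the positional embedding table must cover the longest tokenized sequence
config = ElectraConfig(max_position_embeddings=2500)
model = ElectraModel(config)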

@aditya0by0
Member Author

@sfluegel05, I increased max_position_embeddings of ELECTRA to 3000 in 2b0ed0a, since it was still throwing the same error at 2500.

I have already started the training, but only 5 epochs have completed in 17 hours.

@aditya0by0
Member Author

Please check the results here: after 24 hours of training, only 6 epochs were completed. The batch file has a maximum timeout of 24 hours.

@sfluegel05
Collaborator

Sorry for the late reply. This is indeed strange. Comparing it to other runs, I don't see a reason why your run should be this slow. At least, it seems to speed up towards the end:
[image: training speed over the course of the run]

But even the final speed of 1 epoch per hour is too slow. For comparison: chebi50 has 1,524 classes and ~1,400 steps per epoch, yet my latest run on chebi50 finished 200 epochs in 14 hours (wandb run). The model parameters and batch size should be the same for both.

A few things you can check / try:

  • which GPU you are using - some are faster than others. The wandb run says the slow run was on hpc3-53 which should have an A100 GPU. That is the one I use as well.
  • try different nodes: -w hpc3-52 or --nodelist=hpc3-52 (gpu nodes are 52, 53 and 54)
  • change some of the memory / cpu configurations in the batch file
  • check if you are repeatedly doing some expensive preprocessing

These are the sbatch parameters I use as defaults, but I have not checked if they are optimal for the SCOPe task:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --threads-per-core=1
#SBATCH --mem=256000
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1

@aditya0by0
Member Author

@sfluegel05
In all other runs, such as those with the ChEBI data, we use a pre-trained ELECTRA, which was trained with a vocabulary size of 1400 and a maximum position embedding of 1800.

However, for SCOPe we can't use the same pre-trained ELECTRA model, because we have increased the vocabulary size to 8500 and the max position embedding to 3000. As a result, we are training ELECTRA from scratch, without any pre-trained weights. I strongly suspect that the increased vocabulary size, the larger max position embedding, and training without pre-trained weights are contributing to the slower training.

@aditya0by0
Member Author

All partitions have a 2-day (48-hour) time limit:

/home/staff/a/akhedekar/python-chebai$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
workq*       up 2-00:00:00        50/0/1/51 hpc3-[1-51]
gpu          up 2-00:00:00          4/0/0/4 hpc3-[52-54],klpsy-1
klab-cpu     up 2-00:00:00          0/3/0/3 klab-[2,5-6]
klab-gpu     up 2-00:00:00          1/0/0/1 klab-1
klab-l40s    up 2-00:00:00          1/1/0/2 klab-[3-4]

The GPU partition (gpu) has no explicit memory limits, because:
MaxMemPerNode=UNLIMITED
DefMemPerNode=UNLIMITED
The reported memory (sinfo -o "%P %m") shows 950,000+ MB (~950GB per node).

/home/staff/a/akhedekar/python-chebai$ sinfo -o "%P %m"
PARTITION MEMORY
workq* 950000
gpu 950000+
klab-cpu 950000+
klab-gpu 1950000
klab-l40s 950000
/home/staff/a/akhedekar/python-chebai$ scontrol show partition gpu
PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=hpc3-[52-54],klpsy-1
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

@aditya0by0
Member Author

@sfluegel05,
As observed after the unit-test failures in commit ef4bc0b, simply commenting out or removing the protein-related dependencies (such as esm) will not work. This leads to an ImportError for users, since imports like esm are used at module level in files like reader.py.

To resolve this, we can either:

  1. Move the imports into the constructors of the respective classes, so that they are only imported when a class is instantiated (see the sketch below).
  2. Alternatively, consider lazy-loading the dependencies in a more controlled manner, depending on the specific use case.

This will prevent unnecessary imports and avoid breaking functionality for users who don't require these dependencies.
Please let me know your suggestions.
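
A minimal sketch of option 1, using a hypothetical reader class (fair-esm is the PyPI name of the esm package):

class ESMReader:
    # hypothetical reader illustrating option 1: the heavyweight `esm`
    # import happens only when the class is instantiated, so importing
    # reader.py itself stays cheap for users without the protein extras
    def __init__(self):
        try:
            import esm  # protein-only optional dependency
        except ImportError as e:
            raise ImportError(
                "ESMReader requires the optional esm package (pip install fair-esm)."
            ) from e
        self._esm = esm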

@sfluegel05
Collaborator

Regarding the memory limits: You are right, there is a lot of memory per node. However, only 80GB are available per GPU. You can specify the number of GPUs with sbatch --gres=gpu:2 (2 being the number of GPUs), but this might prolong the wait time on the cluster. In any case, I would suggest modifying the dataset so that a lower max_position_embeddings suffices (i.e., filtering out input sequences above a certain length).
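
A sketch of that filtering step (the cutoff value is illustrative, not a recommendation from this thread):

def filter_by_length(records: list[tuple[str, list[str]]], max_len: int = 1000) -> list[tuple[str, list[str]]]:
    # drop over-long protein sequences so that a smaller
    # max_position_embeddings suffices
    return [(seq, labels) for seq, labels in records if len(seq) <= max_len]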

Regarding the imports: Here the plan is: first get this branch merged (including dependencies), then, on a new branch, remove all protein-related code (and put it in python-chebai-proteins). Then, we can remove the imports without any issues.

- the vocab size was increased for proteins in commit a12354b
- as we are going to move protein-related code to the new repo, revert this to the original value
@aditya0by0
Member Author

> Regarding the imports: Here the plan is: first get this branch merged (including dependencies), then, on a new branch, remove all protein-related code (and put it in python-chebai-proteins). Then, we can remove the imports without any issues.

@sfluegel05, I have reverted the relevant commits. Can you please review and merge this branch/PR so we can proceed with the plan?

@sfluegel05 sfluegel05 marked this pull request as ready for review April 2, 2025 16:08
@sfluegel05 sfluegel05 merged commit 052677e into dev Apr 2, 2025
8 checks passed
@sfluegel05 sfluegel05 deleted the protein_prediction branch April 2, 2025 16:10
Merging this pull request may close these issues: Add SCOPe dataset to our pipeline; Protein function prediction with GO.