
Protein function prediction with GO - Part 3 #64

Draft, wants to merge 48 commits into base: dev

Conversation

@aditya0by0 (Collaborator) commented Nov 4, 2024

Note: The above issue will be implemented in 3 PRs:

Changes to be done in this PR

evaluation: Evaluate using the same metrics as DeepGO for comparing the models

From comment #36 (comment)

  • on a new branch: metrics for evaluation (I talked to Martin about the Fmax score: although it has some methodological issues, we should include it in our evaluation for a comparison with DeepGO; see the sketch after this list)
  • DeepGO-SE (paper): use these results as a baseline, integrate their data into our pipeline (there is a link to the dataset on their GitHub page)
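
For reference, a minimal NumPy sketch of the protein-centric Fmax score mentioned above, following the CAFA/DeepGO definition (the function name and the threshold grid are illustrative, not the chebai implementation):

import numpy as np

def fmax_score(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    """Protein-centric Fmax; y_true and y_scores are (n_proteins, n_terms) arrays."""
    best = 0.0
    for t in np.arange(0.01, 1.0, 0.01):
        y_pred = (y_scores >= t).astype(int)
        tp = (y_pred * y_true).sum(axis=1)            # true positives per protein
        n_pred = y_pred.sum(axis=1)                   # predicted terms per protein
        n_true = y_true.sum(axis=1)                   # annotated terms per protein
        covered = n_pred > 0                          # proteins with >= 1 prediction at t
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp / np.maximum(n_true, 1)).mean()  # averaged over all proteins
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best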

@aditya0by0 self-assigned this Nov 4, 2024
@aditya0by0 linked an issue Nov 4, 2024 that may be closed by this pull request
@aditya0by0 requested a review from sfluegel05 November 7, 2024 10:15
@aditya0by0 (Collaborator, Author) commented:

I have made the suggested changes for migration. Please check.

Config for DeepGO1:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:

class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
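
For context, reader_kwargs: {n_gram: 3} corresponds to splitting an amino-acid sequence into overlapping 3-grams before tokenization, roughly as in the following sketch (illustrative only, not the chebai reader code):

def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    """Split an amino-acid sequence into overlapping n-grams."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(to_ngrams("MKTAYIAK"))  # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']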

@aditya0by0 (Collaborator, Author) commented:

I have made the suggested changes (1b8b270 and 85c47a0). Please review.

aditya0by0 and others added 7 commits December 4, 2024 16:41
- modify deepgo2 migration script to migrate the esm2 embeddings too
- modify migration class to use esm2 embeddings or reader features, based on input
- this will help to identify methods that need to be implemented during coding and not at runtime
@aditya0by0 linked an issue Jan 5, 2025 that may be closed by this pull request
@aditya0by0 (Collaborator, Author) commented Jan 8, 2025

Commit that adds ESM2 embeddings to the migration and training process: e7b3d80

This commit also includes the ESM2 embeddings in the DeepGO2 migration process.
Please run chebai/preprocessing/migration/deep_go/migrate_deep_go_2_data.py (the migration script) before starting the training process.

@sfluegel05, please find the config to train the model on the migrated DeepGO2 data with ESM2 embeddings.

class_path: chebai.preprocessing.datasets.deepGO.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  use_esm2_embeddings: True
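
For reference, per-protein ESM2 embeddings of the kind consumed here can be computed with the fair-esm package roughly as follows (a sketch; in this PR the embeddings come from the DeepGO2 data via the migration script, and the checkpoint and layer below are only illustrative):

import torch
import esm  # fair-esm package

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # illustrative checkpoint
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, batch_tokens = batch_converter(data)

with torch.no_grad():
    out = model(batch_tokens, repr_layers=[33], return_contacts=False)
token_reprs = out["representations"][33]

# mean-pool over residue positions (excluding BOS/EOS) to get one vector per protein
seq_len = len(data[0][1])
protein_embedding = token_reprs[0, 1 : seq_len + 1].mean(dim=0)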

@sfluegel05 (Collaborator) commented:

I added the ESM2 config as well as a simple feed forward network.

ESM2 with Electra does not work out of the box, since ESM2 uses real values that may be negative. Electra expects positive values only. Also, I'm not sure how sensible using Electra here would be - Electra expects a sequence, but ESM2 provides an embedding vector that is not directly related to the sequence (tell me if I'm wrong here).

@aditya0by0 (Collaborator, Author) commented Jan 12, 2025

> I added the ESM2 config as well as a simple feed forward network.
>
> ESM2 with Electra does not work out of the box, since ESM2 uses real values that may be negative. Electra expects positive values only. Also, I'm not sure how sensible using Electra here would be - Electra expects a sequence, but ESM2 provides an embedding vector that is not directly related to the sequence (tell me if I'm wrong here).

Yes, you're absolutely right about the compatibility issues between ESM2 and Electra. ESM2 embeddings can have negative values, as they are activation values from a specific layer within the ESM2 network.
Since Electra expects positive values and there is no direct token-level alignment in ESM2 embeddings, Electra (specifically its discriminator) would likely not work well with them.

We could apply ReLU or normalization as a pre-processing step to shift the ESM2 embeddings into the positive domain before feeding them to Electra, but I doubt it would work well, for the reasons stated above.
Using a plain feed-forward network instead of Electra seems like a more suitable approach.
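
To illustrate the pre-processing options mentioned above (a sketch only; as noted, this would not fix the missing token-level alignment):

import torch

embeddings = torch.randn(8, 1280)  # dummy batch of embedding vectors (dimension illustrative)

relu_embeddings = torch.relu(embeddings)  # clamp negative values to zero

# per-feature min-max normalization into [0, 1]
mins = embeddings.min(dim=0, keepdim=True).values
maxs = embeddings.max(dim=0, keepdim=True).values
normalized = (embeddings - mins) / (maxs - mins + 1e-8)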

# build the feed-forward stack: input layer, hidden layers, output layer
self.layers.append(torch.nn.Linear(input_size, hidden_size))
for _ in range(num_hidden_layers):
    self.layers.append(torch.nn.Linear(hidden_size, hidden_size))
self.layers.append(torch.nn.Linear(hidden_size, self.out_dim))
@aditya0by0 (Collaborator, Author) commented:

Hi @sfluegel05, I have a question regarding the FFN code. Could you help me understand how self.out_dim gets a value other than None? In the superclass implementation, this attribute is initialized as None, and I don't see any relevant parameters for it in the model's configuration file. Is this value being set somewhere else in the workflow?

I tried running your code with the following command:

G:\anaconda3\envs\env_chebai\python.exe G:\github-aditya0by0\python-chebai\chebai fit --trainer=configs/training/default_trainer.yml --model=configs/model/ffn.yml --data=configs/data/deepGO/deepgo_2_migrated_data.yml

However, I encountered the following error:

  File "G:\github-aditya0by0\python-chebai\chebai\models\ffn.py", line 25, in __init__
    self.layers.append(torch.nn.Linear(hidden_size, self.out_dim))
  File "G:\anaconda3\envs\env_chebai\lib\site-packages\torch\nn\modules\linear.py", line 106, in __init__
    torch.empty((out_features, in_features), **factory_kwargs)
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)
 * (tuple of ints size, *, torch.memory_format memory_format = None, Tensor out = None, torch.dtype dtype = None, torch.layout layout = None, torch.device device = None, bool pin_memory = False, bool requires_grad = False)

The error suggests that self.out_dim is still None. Could you point out where this attribute should be assigned or how it's handled?

@sfluegel05 (Collaborator) commented:

Hi, there is a simple trick for that: out_dim is expected as part of the command. So adding --model.out_dim=997 (with 997 being the number of classes, which depends on the dataset you use) to your command should fix the error.

That is the way it is handled for Electra at the moment as well. The reason why this parameter is not part of the config is that it changes more frequently - most model parameters only change if you want to experiment with them specifically. out_dim depends on chebi_version and the dataset used (e.g. ChEBI100/ChEBI50). Therefore, it does not have a default value.
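
For example, appending the flag to the command from the previous comment (997 is only the example class count here; use the number of GO classes in your dataset):

python chebai fit --trainer=configs/training/default_trainer.yml --model=configs/model/ffn.yml --model.out_dim=997 --data=configs/data/deepGO/deepgo_2_migrated_data.yml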

@aditya0by0 (Collaborator, Author) commented:

@sfluegel05, I was wondering if we could explore dynamically setting the out_dim parameter in the setup method of the LightningModule, leveraging the fact that it inherits from DataHooks. This way, the model could fetch the number of labels (num_of_labels) directly from the data module after it is loaded.

By doing this, we could avoid requiring the out_dim parameter to be explicitly passed in the command line every time. Instead, the model could dynamically adapt based on the dataset being used. This would simplify the workflow and reduce potential user errors.
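
A rough sketch of this idea, assuming the data module exposes the label count (the attribute name num_of_labels and the lazy layer construction are assumptions for illustration, not existing chebai code):

import torch
from lightning.pytorch import LightningModule  # or pytorch_lightning, depending on the environment


class FFNWithDynamicOutDim(LightningModule):
    def __init__(self, input_size: int, hidden_size: int, num_hidden_layers: int):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.out_dim = None
        self.layers = None

    def setup(self, stage: str) -> None:
        # setup() runs after the datamodule is attached, so the label count is available here
        if self.out_dim is None:
            self.out_dim = self.trainer.datamodule.num_of_labels  # assumed datamodule attribute
        if self.layers is None:
            layers = [torch.nn.Linear(self.input_size, self.hidden_size)]
            for _ in range(self.num_hidden_layers):
                layers.append(torch.nn.Linear(self.hidden_size, self.hidden_size))
            layers.append(torch.nn.Linear(self.hidden_size, self.out_dim))
            self.layers = torch.nn.ModuleList(layers)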

Would this be feasible, or are there any constraints that might prevent this approach?

@aditya0by0 (Collaborator, Author) commented:

@sfluegel05, I have completed the implementation of the SCOPe dataset. Please check.

Successfully merging this pull request may close these issues:

  • Add SCOPe dataset to our pipeline
  • Protein function prediction with GO