Adding simulation stuff #41

Merged: 73 commits, Feb 10, 2025
Changes from 56 commits
Commits
dd80123
adding preprocessing for the spectrum simulation subset
adamoyoung May 27, 2024
dc757e9
initial dataset/transform changes, still WIP
adamoyoung May 27, 2024
532272e
adding dataset stuff for simulation models
adamoyoung May 28, 2024
8c6514c
fixing minor bugs
adamoyoung May 28, 2024
0b8d346
temporary commit, WIP
adamoyoung May 29, 2024
5921920
attempting merge
adamoyoung May 29, 2024
71ad27c
updating dataset stuff to work with new data
adamoyoung May 29, 2024
da0bcfe
adding model stuff, WIP
adamoyoung May 30, 2024
4fe1d32
more model stuff, still WIP
adamoyoung May 30, 2024
002bb15
more pl model stuff, still WIP
adamoyoung May 30, 2024
f846ac3
first model runs, still need to debug training
adamoyoung May 31, 2024
d473f57
minor dataset rework, eval metrics still not functional
adamoyoung May 31, 2024
cbb1ad0
NEIMS model now works in notebook, does not perform well on validatio…
adamoyoung May 31, 2024
8006a2c
adding runner.py script, proper epoch metric accumulation
adamoyoung Jun 1, 2024
5d081e0
adding precursor only baseline, fixing some bugs
adamoyoung Jun 1, 2024
52666d5
refactoring into runner/config format, fixing intensity bug, other sm…
adamoyoung Jun 2, 2024
a6dfbea
integrating newer dataset (v4), misc small changes
adamoyoung Jun 2, 2024
e3eb328
implemented GNN models
adamoyoung Jun 2, 2024
18751cd
changing old dataset filters to checks
adamoyoung Jun 3, 2024
5cf673d
final commit before original submission
adamoyoung Jul 23, 2024
4153a2f
finished merge
adamoyoung Jul 23, 2024
1123bad
major updates
adamoyoung Aug 4, 2024
950245a
removing cache_feats
adamoyoung Aug 8, 2024
368978f
merging changes from main
adamoyoung Aug 8, 2024
0f5adf1
fixing import bugs, inheritance bugs
adamoyoung Aug 9, 2024
3f13624
fixing metric calculation and logging functions to comply with parent…
adamoyoung Aug 9, 2024
3304b64
reworking models to not use save_hyperparameters
adamoyoung Aug 11, 2024
163c37b
reworking SimulationDataset
adamoyoung Aug 11, 2024
8c1cc68
initial simulation retrieval implementation, support for dataset subs…
adamoyoung Aug 14, 2024
0ccea3f
more retrieval changes, fixing some bugs
adamoyoung Aug 14, 2024
95a6299
fixing cos_sim bug
adamoyoung Aug 14, 2024
c3e3f7a
removing pdb statements
adamoyoung Aug 14, 2024
0e3b681
adding random noise to break ties
adamoyoung Aug 15, 2024
50d55a5
merging changes
adamoyoung Aug 15, 2024
b364076
properly integrating new changes (retrieval, hparams, bootstrapping)
adamoyoung Aug 15, 2024
ab9c56e
fixing GNN model, refactoring cos sim calculations to work with boots…
adamoyoung Aug 15, 2024
f9f5ad1
adding checkpoint loading for test evaluation
adamoyoung Aug 15, 2024
12718e7
fixing improperly processed mol data
adamoyoung Aug 15, 2024
e0d4749
updating retrieval configs
adamoyoung Aug 16, 2024
0c547df
adding mz_bin_res configs
adamoyoung Aug 16, 2024
63e634a
adding additional metrics
adamoyoung Aug 16, 2024
3c8be7b
reducing max_epochs back down to 20
adamoyoung Aug 16, 2024
c2ec360
returning max_epochs to 100
adamoyoung Aug 16, 2024
8db29c8
minor config changes
adamoyoung Aug 16, 2024
bd67369
adding df_test and ckpt uploading
adamoyoung Aug 16, 2024
81cb98a
adding notebook to calculate bootstrap
adamoyoung Aug 16, 2024
054750e
updating notebook
adamoyoung Aug 16, 2024
54e5782
merged with new changes
adamoyoung Aug 26, 2024
82ea88c
config cleanup
adamoyoung Aug 26, 2024
2777f8a
fixing bootstrap sig figs
adamoyoung Aug 26, 2024
de016ce
minor bug fix
adamoyoung Aug 26, 2024
f227f52
reorganizing configs
adamoyoung Aug 26, 2024
1ccc233
Merge branch 'main' into adamo5
adamoyoung Oct 28, 2024
bd2e486
removing wandb information
adamoyoung Oct 28, 2024
2db94b8
removing bin res ablations
adamoyoung Oct 28, 2024
988d343
reworking data stuff
adamoyoung Oct 29, 2024
1daf6f0
merging with more recent changes
adamoyoung Jan 6, 2025
ee31e86
removing old notebook
adamoyoung Jan 6, 2025
73c3adc
fixing jss metric
adamoyoung Jan 10, 2025
b70d544
reorganizing run files
adamoyoung Jan 10, 2025
d77b965
minor updates to SpecToMzsInts transform
adamoyoung Jan 10, 2025
73328f4
Merge branch 'main' into adamo5
adamoyoung Jan 12, 2025
443fb07
redefining jss such that it is between 0 and 1
adamoyoung Jan 12, 2025
9fdb5c8
minor fix to datasets
adamoyoung Jan 17, 2025
7ccbc91
adding updated version of demo notebook
adamoyoung Jan 17, 2025
2dc5bf3
reworking configs, removing old notebooks
adamoyoung Feb 4, 2025
aaafca6
adding script for fixing JSS from original version
adamoyoung Feb 4, 2025
059d0fe
removing user-specific info from template config
adamoyoung Feb 4, 2025
4b7589f
updating demo stuff
adamoyoung Feb 4, 2025
7a03118
updating boostrap notebook
adamoyoung Feb 4, 2025
5a68754
fixing some notebook stuff
adamoyoung Feb 4, 2025
01508d2
removing debug configs
adamoyoung Feb 10, 2025
1c5fdd2
Merge branch 'main' into adamo5
roman-bushuiev Feb 10, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -136,3 +136,6 @@ dmypy.json

# VSCode
*.vscode

# W&B
wandb/
16 changes: 16 additions & 0 deletions config/simulation/fp.yml
@@ -0,0 +1,16 @@
# wandb
wandb_name: "sim_fp"
# data
pth: "MassSpecGym.tsv"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "fp"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
do_retrieval: False
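The `mz_max` and `mz_bin_res` keys above suggest that spectra are discretized into fixed-width m/z bins. The PR's actual binning code is not shown on this page, so the following is only a minimal sketch under that assumption; `num_bins` and `bin_spectrum` are hypothetical helper names, not functions from the repository.

```python
import math


def num_bins(mz_max: float, mz_bin_res: float) -> int:
    """Number of fixed-width bins covering [0, mz_max)."""
    return int(round(mz_max / mz_bin_res))


def bin_spectrum(mzs, ints, mz_max=1005.0, mz_bin_res=0.01):
    """Sum peak intensities into fixed-width m/z bins.

    Returns a sparse dict {bin_index: summed_intensity}; peaks outside
    [0, mz_max) are dropped. This is an illustrative assumption about how
    the two config keys are used, not the PR's implementation.
    """
    n = num_bins(mz_max, mz_bin_res)
    binned = {}
    for mz, inten in zip(mzs, ints):
        idx = int(math.floor(mz / mz_bin_res))
        if 0 <= idx < n:
            binned[idx] = binned.get(idx, 0.0) + inten
    return binned
```

With `mz_max: 1005.` and `mz_bin_res: 0.01` this gives 100500 bins per spectrum, versus 10050 at a resolution of 0.1, which is one reason a sparse representation may be preferable at the finer setting.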
16 changes: 16 additions & 0 deletions config/simulation/gnn.yml
@@ -0,0 +1,16 @@
# wandb
wandb_name: "sim_gnn"
# data
pth: "MassSpecGym.tsv"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "gnn"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
do_retrieval: False
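The `ints_transform` key takes the values `"sqrt"` and `"none"` in these configs. A plausible reading is that peak intensities are square-rooted before training so that a few dominant peaks do not swamp the rest. The sketch below illustrates that reading only; `transform_ints` is a hypothetical name and the PR may apply the transform differently.

```python
import math


def transform_ints(ints, ints_transform="sqrt"):
    """Apply an intensity transform to a list of non-negative intensities.

    "none" returns the intensities unchanged; "sqrt" takes the square
    root of each one. Only these two values appear in the configs.
    """
    if ints_transform == "none":
        return list(ints)
    if ints_transform == "sqrt":
        return [math.sqrt(x) for x in ints]
    raise ValueError(f"unknown ints_transform: {ints_transform!r}")
```

For example, two peaks with intensities 100 and 1 differ by a factor of 100 before the transform and by a factor of 10 after it.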
16 changes: 16 additions & 0 deletions config/simulation/preconly.yml
@@ -0,0 +1,16 @@
# wandb
wandb_name: "sim_preconly"
# data
pth: "MassSpecGym.tsv"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "none"
# model
model_type: "prec_only"
# optimization
max_epochs: 1
# other
accelerator: "gpu"
do_retrieval: False
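The `prec_only` model type corresponds to the "precursor only baseline" named in the commit list, and it runs for a single epoch with no intensity transform. Its implementation is not visible here. One natural interpretation, shown below purely as an assumption, is a baseline that predicts a single peak in the bin containing the precursor m/z; `precursor_only_spectrum` is a hypothetical name.

```python
def precursor_only_spectrum(precursor_mz, mz_max=1005.0, mz_bin_res=0.01):
    """Predict a spectrum with one unit-intensity peak at the precursor bin.

    Returns a sparse dict {bin_index: intensity}, or an empty dict when
    the precursor falls outside [0, mz_max). Illustrative assumption only.
    """
    n_bins = int(round(mz_max / mz_bin_res))
    idx = int(precursor_mz / mz_bin_res)
    if not 0 <= idx < n_bins:
        return {}
    return {idx: 1.0}
```

A baseline of this kind has nothing to learn about fragmentation, which would be consistent with `max_epochs: 1`.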
17 changes: 17 additions & 0 deletions config/simulation_retrieval/fp_formula.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "fp_formula"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_formula.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "fp"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
save_ckpt: True
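The experiment config above sets only a handful of keys, all of which also appear in `config/template.yml`. That suggests the runner layers an experiment file on top of the template defaults, although the merging code is not shown on this page. A minimal sketch of such a merge, using plain dicts in place of parsed YAML and a hypothetical `merge_config` helper:

```python
def merge_config(template: dict, overrides: dict) -> dict:
    """Return template defaults updated with experiment-specific overrides.

    Keys that do not exist in the template are rejected, so a misspelled
    key fails loudly instead of being silently ignored. This strictness is
    a design choice of the sketch, not necessarily of the PR.
    """
    unknown = sorted(set(overrides) - set(template))
    if unknown:
        raise KeyError(f"unknown config keys: {unknown}")
    merged = dict(template)
    merged.update(overrides)
    return merged


# A few defaults copied from config/template.yml.
template = {
    "model_type": "fp",
    "mz_bin_res": 0.1,
    "ints_transform": "none",
    "accelerator": "cpu",
    "save_ckpt": False,
}
# The corresponding overrides from config/simulation_retrieval/fp_formula.yml.
overrides = {
    "mz_bin_res": 0.01,
    "ints_transform": "sqrt",
    "accelerator": "gpu",
    "save_ckpt": True,
}
merged = merge_config(template, overrides)
```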
17 changes: 17 additions & 0 deletions config/simulation_retrieval/fp_mass.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "fp_mass"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_mass.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "fp"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
save_ckpt: True
17 changes: 17 additions & 0 deletions config/simulation_retrieval/gnn_formula.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "gnn_formula"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_formula.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "gnn"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
save_ckpt: True
17 changes: 17 additions & 0 deletions config/simulation_retrieval/gnn_mass.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "gnn_mass"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_mass.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "sqrt"
# model
model_type: "gnn"
# optimization
max_epochs: 100
# other
accelerator: "gpu"
save_ckpt: True
17 changes: 17 additions & 0 deletions config/simulation_retrieval/preconly_formula.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "preconly_formula"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_formula.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "none"
# model
model_type: "prec_only"
# optimization
max_epochs: 1
# other
accelerator: "gpu"
save_ckpt: True
17 changes: 17 additions & 0 deletions config/simulation_retrieval/preconly_mass.yml
@@ -0,0 +1,17 @@
# wandb
wandb_name: "preconly_mass"
# data
pth: "MassSpecGym.tsv"
candidates_pth: "molecules/MassSpecGym_retrieval_candidates_mass.json"
split_type: "benchmark"
# output
mz_max: 1005.
mz_bin_res: 0.01
ints_transform: "none"
# model
model_type: "prec_only"
# optimization
max_epochs: 1
# other
accelerator: "gpu"
save_ckpt: True
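The retrieval configs pair each model with a candidate list (`candidates_pth`), and the commit list mentions "adding random noise to break ties". The sketch below shows one way candidates could be ranked by a similarity score, with a small random perturbation so that tied candidates are not ordered by their position in the list, followed by a hit-rate-at-k check in the spirit of `at_ks` from the template. The names `rank_candidates` and `hit_at_k`, and the noise scale, are assumptions.

```python
import random


def rank_candidates(scores, rng, noise_scale=1e-9):
    """Return candidate indices sorted by descending score.

    A tiny uniform noise term is added to each score so that exact ties
    are broken at random rather than by input order. noise_scale must be
    much smaller than any meaningful score difference.
    """
    noisy = [s + rng.uniform(0.0, noise_scale) for s in scores]
    return sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)


def hit_at_k(ranking, true_idx, k):
    """1.0 if the true candidate is among the top k ranked, else 0.0."""
    return float(true_idx in ranking[:k])


rng = random.Random(420)  # seeded for reproducibility
scores = [0.2, 0.9, 0.5, 0.9]  # candidates 1 and 3 are tied
ranking = rank_candidates(scores, rng)
```

Without the noise, a true candidate that always sits first in its list would win every tie, which could inflate the reported hit rates.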
74 changes: 74 additions & 0 deletions config/template.yml
@@ -0,0 +1,74 @@
# wandb
wandb_entity: "your_wandb_entity" # change this
wandb_project: "your_wandb_project" # change this
wandb_name: "template"
# data
pth: # add path
candidates_pth: # add path
meta_keys: ["adduct","precursor_mz","instrument_type","collision_energy"]
fp_types: ["morgan","maccs","rdkit"]
adducts: ["[M+H]+"]
instrument_types: ["QTOF","QFT","Orbitrap","ITFT"]
max_collision_energy: 200.
mz_from: 10.
mz_to: 1000.
split_type: "benchmark"
subsample_frac:
# input
metadata_insert_location: "mlp"
collision_energy_insert_size: 16
adduct_insert_size: 16
instrument_type_insert_size: 16
# output
mz_max: 1005.
mz_bin_res: 0.1
ints_transform: "none"
# model
model_type: "fp"
mlp_hidden_size: 1024
mlp_dropout: 0.1
mlp_num_layers: 4
mlp_use_residuals: True
ff_prec_mz_offset: 5
ff_bidirectional: True
ff_output_map_size: 256
mol_hidden_size: 256
mol_num_layers: 4
mol_gnn_type: GINE
mol_normalization: batch
mol_dropout: 0.2
mol_pool_type: mean
# optimization
lr: 0.0003
lr_schedule: False
lr_decay_rate: 0.0
lr_warmup_steps: 1000
lr_decay_steps: 5000
weight_decay: 0.0000001
train_sample_weight: False #True
eval_sample_weight: False #True
batch_size: 128
max_epochs: 100
drop_last: False
gradient_clip_val: 0.0
gradient_clip_algorithm:
optimizer_type: "adam"
# other
num_workers: 8
accelerator: "cpu"
log_every_n_steps: 1
seed: 420
cache_feats: False
mp_sharing_strategy: "file_system"
do_retrieval: True
retrieval_batch_size: 8
at_ks: [1, 5, 20]
pin_memory: True
persistent_workers: True
sim_metrics:
- cos_sim
- js_sim
- cos_sim_sqrt
- cos_sim_obj
save_df_test: True
save_ckpt: False
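The `sim_metrics` list names `cos_sim` and `js_sim`, and one commit describes "redefining jss such that it is between 0 and 1". The exact definitions used in the PR are not shown. The sketch below gives standard versions of both on sparse binned spectra: cosine similarity, and a Jensen-Shannon similarity defined as one minus the base-2 Jensen-Shannon divergence, which lies in [0, 1]. Treat the choice of log base and the `1 - JSD` form as assumptions.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two sparse spectra {bin: intensity}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def js_sim(a, b):
    """Jensen-Shannon similarity: 1 - JSD(p, q) with log base 2.

    Spectra are normalized to sum to 1 first. With base-2 logarithms the
    divergence is bounded by 1, so the similarity lies in [0, 1].
    Intensities are assumed non-negative with a positive total.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a <= 0.0 or total_b <= 0.0:
        raise ValueError("spectra must have positive total intensity")
    jsd = 0.0
    for k in set(a) | set(b):
        p = a.get(k, 0.0) / total_a
        q = b.get(k, 0.0) / total_b
        m = 0.5 * (p + q)
        if p > 0.0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0.0:
            jsd += 0.5 * q * math.log2(q / m)
    return 1.0 - jsd
```

Identical spectra score 1 under both metrics, and spectra with no shared bins score 0.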