#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Created on 05 October 2020
# @authors: Niklas Siedhoff, Alexander-Maurice Illig
# @contact: <[email protected]>
# PyPEF - Pythonic Protein Engineering Framework
# https://github.com/niklases/PyPEF
# Licensed under Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)
# For more information about the license see https://creativecommons.org/licenses/by-nc/4.0/legalcode
# PyPEF – An Integrated Framework for Data-Driven Protein Engineering
# Journal of Chemical Information and Modeling, 2021, 61, 3463-3476
# https://doi.org/10.1021/acs.jcim.1c00099
# docstring for argument parsing using docopt
"""
PyPEF - Pythonic Protein Engineering Framework
written by Niklas Siedhoff and Alexander-Maurice Illig.
Modeling options
----------------
I. Pure ML modeling
-------------------
PyPEF provides three encoding options for training machine learning models, i.e.
regression models trained by supervised learning:
1. DCA: Direct coupling analysis (DCA) based on evolutionary couplings (input:
coupling parameter file generated by the C framework plmc) or on parameters
generated using TensorFlow-based GREMLIN (input: MSA).
2. AAidx: Based on AAindex descriptors (566 amino acid descriptor files
taken from the AAindex database).
3. OneHot: One-hot encoding representing the occurrence of an
amino acid at a sequence position as a single 1 and 19 0's.
Any encoding technique enables pure ML-based modeling, see
https://doi.org/10.1021/acs.jcim.1c00099
and DCA-based sequence encoding enables a hybrid modeling approach, see
https://doi.org/10.1101/2022.06.07.495081
If an MSA can be constructed for the target sequence, e.g. using Jackhmmer,
encoding option 1 will likely outperform encoding option 2.
However, encoding option 2 provides a static encoding technique that is
independent of the evolutionary history of a target sequence and does not
require MSA construction.
Here, the AAidx encodings used for modeling are compared, i.e. validated, with
respect to their performance on the test set (comparable to a hyperparameter
search for finding the best static encoding set for model inference).
Further, one-hot encoding (encoding option 3) provides a simple, fast, and often
well-performing encoding option that will likely outperform the AAindex-based
technique in terms of model generalization.
II. Hybrid modeling
-------------------
Constructing a hybrid model that combines pure statistical DCA-based prediction (a
variant's 'evolutionary energy' relative to the wild type) with DCA-encoding-based
training of an ML model, similar to pure ML modeling option I.1.
Features are generated from the direct coupling analysis (a .params file output by
the plmc framework, or a provided MSA processed with GREMLIN).
Individual model contributions are optimized based only on Spearman's correlation
coefficient; thus, only variant fitness ranks should be considered when evaluating
model performance, not the exact predicted fitness values. For regression, only
L2-regularized linear regression (ridge regression) is currently provided as a
modeling option.
Running example of training, testing, and using a pure ML model for prediction
------------------------------------------------------------------------------
Exemplary run of PyPEF for training a pure ML model using encoding option 2,
i.e., features generated from 566 amino acid descriptor indices taken from the
AAindex database.
1. Create files for training and testing from variant-fitness CSV data:
pypef mklsts -i variant_and_fitness.csv -w wt_sequence.fasta
2. Train and validate models:
pypef ml -e onehot -l LS.fasta -t TS.fasta --regressor pls
3. Plot the test set entries against test set predictions (creates PNG figure, MODEL is
the chosen AAindex, the ML-DCA model, or here the ONEHOT model):
pypef ml -e onehot -m ONEHOT -t TS.fasta
4. Create files for prediction:
- Single file:
pypef mkps -w wt_sequence.fasta -i variant_fitness.csv
- Recombinant/diverse prediction files:
pypef mkps -w wt_sequence.fasta -i variant_fitness.csv
[--drecomb] [--trecomb] [--qarecomb] [--qirecomb]
[--ddiverse] [--tdiverse] [--qdiverse]
5. Predict (unknown/new) variants:
- Single file:
pypef ml -e aaidx -m MODEL -p Prediction_Set.fasta
- Recombinant/diverse prediction files in created prediction set folders:
pypef ml -e aaidx -m MODEL --pmult [--drecomb] [...] [--qdiverse]
- Directed evolution - for performing and plotting in silico evolution trajectories:
pypef ml -e aaidx directevo -m MODEL [...]
Note: The commands for hybrid modeling are very similar to the commands for pure ML modeling,
see pypef -h for possible commands.
For generating DCA parameters using GREMLIN, you have to provide an MSA in FASTA or A2M format:
pypef param_inference --msa MSA_FILE --wt WT_FASTA [--opt_iter 100]
Helpful commands for data conversion
------------------------------------
Creation of learning and test sets - splitting CSV variant-fitness data:
pypef mklsts --wt WT_FASTA --input CSV_FILE
[--drop THRESHOLD] [--numrnd NUMBER]
Creation of prediction sets from CSV data (using single-substituted variant data):
pypef mkps --wt WT_FASTA --input CSV_FILE
[--drop THRESHOLD] [--drecomb] [--trecomb] [--qarecomb] [--qirecomb]
[--ddiverse] [--tdiverse] [--qdiverse]
Encoding a CSV file (for further performance studies such as "low N" or
"mutational extrapolation" engineering tasks:
pypef encode --input CSV_FILE --encoding ENCODING_TECHNIQUE --wt WT_FASTA
[--params PARAM_FILE] [--y_wt WT_FITNESS] [--model MODEL] [--nofft]
[--threads THREADS] [--sep CSV_COLUMN_SEPARATOR] [--fitness_key FITNESS_KEY]
Converting a STO alignment file to A2M format:
pypef sto2a2m --sto STO_MSA_FILE
[--inter_gap INTER_GAP] [--intra_gap INTRA_GAP]
Usage:
pypef mklsts --wt WT_FASTA --input CSV_FILE
[--drop THRESHOLD] [--sep CSV_COLUMN_SEPARATOR] [--mutation_sep MUTATION_SEPARATOR] [--numrnd NUMBER]
pypef mkps --wt WT_FASTA [--input CSV_FILE]
[--drop THRESHOLD] [--ssm] [--drecomb] [--trecomb] [--qarecomb] [--qirecomb]
[--ddiverse] [--tdiverse] [--qdiverse]
pypef param_inference
[--msa MSA_FILE] [--params PARAM_FILE]
[--wt WT_FASTA] [--opt_iter N_ITER]
pypef save_msa_info --msa MSA_FILE --wt WT_FASTA
[--opt_iter N_ITER]
pypef encode --input CSV_FILE --encoding ENCODING_TECHNIQUE --wt WT_FASTA
[--params PARAM_FILE] [--y_wt WT_FITNESS] [--model MODEL] [--nofft]
[--threads THREADS]
[--sep CSV_COLUMN_SEPARATOR] [--fitness_key FITNESS_KEY]
pypef reformat_csv --input CSV_FILE
[--sep CSV_COLUMN_SEPARATOR] [--mutation_sep MUTATION_SEPARATOR] [--fitness_key FITNESS_KEY]
pypef shift_pos --input CSV_FILE --offset OFFSET
[--sep CSV_COLUMN_SEPARATOR] [--mutation_sep MUTATION_SEPARATOR] [--fitness_key FITNESS_KEY]
pypef sto2a2m --sto STO_MSA_FILE [--inter_gap INTER_GAP] [--intra_gap INTRA_GAP]
pypef hybrid
[--ts TEST_SET] [--ps PREDICTION_SET]
[--model MODEL] [--params PARAM_FILE]
[--ls LEARNING_SET] [--label] [--threads THREADS]
pypef hybrid --model MODEL --params PARAM_FILE
[--ts TEST_SET] [--label]
[--ps PREDICTION_SET] [--pmult] [--drecomb] [--trecomb] [--qarecomb] [--qirecomb]
[--ddiverse] [--tdiverse] [--qdiverse] [--negative]
[--threads THREADS]
pypef hybrid directevo --wt WT_FASTA --params PARAM_FILE
[--model MODEL]
[--input CSV_FILE] [--y_wt WT_FITNESS] [--numiter NUM_ITER]
[--numtraj NUM_TRAJ] [--temp TEMPERATURE]
[--negative] [--usecsv] [--csvaa] [--drop THRESHOLD]
pypef hybrid train_and_save --input CSV_FILE --params PARAM_FILE --wt WT_FASTA
[--fit_size REL_LEARN_FIT_SIZE] [--test_size REL_TEST_SIZE]
[--threads THREADS] [--sep CSV_COLUMN_SEPARATOR]
[--fitness_key FITNESS_KEY] [--rnd_state RND_STATE]
pypef hybrid low_n --input ENCODED_CSV_FILE
pypef hybrid extrapolation --input ENCODED_CSV_FILE
[--conc]
pypef ml --encoding ENCODING_TECHNIQUE --ls LEARNING_SET --ts TEST_SET
[--save NUMBER] [--regressor TYPE] [--nofft] [--all] [--params PARAM_FILE]
[--sort METRIC_INT] [--threads THREADS] [--label]
pypef ml --encoding ENCODING_TECHNIQUE --model MODEL --ts TEST_SET
[--nofft] [--params PARAM_FILE] [--threads THREADS] [--label]
pypef ml --show
[MODELS]
pypef ml --encoding ENCODING_TECHNIQUE --model MODEL --ps PREDICTION_SET
[--params PARAM_FILE] [--threads THREADS] [--nofft] [--negative]
pypef ml --encoding ENCODING_TECHNIQUE --model MODEL --pmult
[--drecomb] [--trecomb] [--qarecomb] [--qirecomb]
[--ddiverse] [--tdiverse] [--qdiverse]
[--regressor TYPE] [--nofft] [--negative] [--params PARAM_FILE] [--threads THREADS]
pypef ml --encoding ENCODING_TECHNIQUE directevo --model MODEL --wt WT_FASTA
[--input CSV_FILE] [--y_wt WT_FITNESS] [--numiter NUM_ITER] [--numtraj NUM_TRAJ] [--temp TEMPERATURE]
[--nofft] [--negative] [--usecsv] [--csvaa] [--drop THRESHOLD] [--params PARAM_FILE]
pypef ml low_n --input ENCODED_CSV_FILE
[--regressor TYPE]
pypef ml extrapolation --input ENCODED_CSV_FILE
[--regressor TYPE] [--conc]
Options:
--all Finally training on all data [default: False].
--conc Concatenating mutational level variants for predicting variants
from next higher level [default: False].
--csvaa Directed evolution csv amino acid substitutions,
requires flag "--usecsv" [default: False].
--ddiverse Create/predict double natural diverse variants [default: False].
--drecomb Create/predict double recombinants [default: False].
-d --drop THRESHOLD Variants below the threshold will be discarded from the
data [default: -9E09].
-e --encoding ENCODING_TECHNIQUE Sets technique used for encoding sequences for constructing regression models;
choose between 'aaidx' (AAIndex-based encoding), 'onehot' (OneHot-based encoding),
and DCA encoding using Gremlin/plmc (DCA-based encoding) [default: onehot].
--fitness_key FITNESS_KEY Label of CSV fitness column. Else uses second column.
-h --help Show this screen [default: False].
-i --input CSV_FILE Input data file in .csv format.
--inter_gap INTER_GAP Fraction to delete all positions with more than
'inter_gap' * 100 % gaps (columnar trimming) [default: 0.3].
--intra_gap INTRA_GAP Fraction to delete all sequences with more than
'intra_gap' * 100 % gaps after being columnar trimmed
(line trimming) [default: 0.5].
--label Label the plot instances [default: False].
-l --ls LEARNING_SET Input learning set in .fasta format.
-m --model MODEL Model (pickle file) for plotting of validation or for
performing predictions.
--msa MSA_FILE Multiple sequence alignment (MSA) in FASTA or A2M format for
inferring DCA parameters.
--mutation_sep MUTATION_SEP Mutation separator [default: /].
--mutation_extrapolation Mutation extrapolation [default: False].
--negative Set if more negative values define better variants [default: False].
--nofft Raw sequence input, i.e., no FFT for establishing protein spectra
as vector inputs, only implemented as option for AAindex-based
sequence encoding [default: False].
-n --numrnd NUMBER Number of randomly created Learning and Validation
datasets [default: 0].
--numiter NUM_ITER Number of mutation iterations per evolution trajectory [default: 5].
--numtraj NUM_TRAJ Number of trajectories, i.e., evolution pathways [default: 5].
-o --offset OFFSET Offset for shifting substitution positions of the input CSV file [default: 0].
--opt_iter N_ITER Number of iterations for GREMLIN-based optimization of local fields
and couplings [default: 100].
--params PARAM_FILE Input PLMC couplings parameter file.
-u --pmult Predict for all prediction files in folder for recombinants
or for diverse variants [default: False].
-p --ps PREDICTION_SET Prediction set for performing predictions using a trained Model.
--qdiverse Create quadruple natural diverse variants [default: False].
--qarecomb Create/predict quadruple recombinants [default: False].
--qirecomb Create/predict quintuple recombinants [default: False].
--regressor TYPE Type of regression (R.) to use, options: PLS CV R.: pls,
PLS LOOCV R.: pls_loocv, Random Forest CV R.: rf, SVM CV R.: svr,
MLP CV R.: mlp, Ridge CV R.: ridge (or l2),
LassoLars CV R.: lassolars (or l1) [default: pls].
--rnd_splits RND_SPLITS Number of random splits for Low N testing [default: 5].
--rnd_state RND_STATE Sets the random state for reproduction, only implemented
for hybrid train_and_save [default: 42].
-s --save NUMBER Number of models to be saved as pickle files [default: 5].
--sep CSV_COLUMN_SEPARATOR CSV Column separator [default: ;].
--show Show achieved model performances from Model_Results.txt.
--sort METRIC_INT Rank models based on metric {1: R^2, 2: RMSE, 3: NRMSE,
4: Pearson's r, 5: Spearman's rho} [default: 1].
--ssm Create single-saturation mutagenesis prediction set (does not
require CSV input) [default: False].
--sto STO_MSA_FILE The input MSA file in STO (Stockholm) format.
--tdiverse Create/predict triple natural diverse variants [default: False].
--temp TEMPERATURE "Temperature" of the Metropolis-Hastings criterion [default: 0.01].
--threads THREADS Number of threads used for parallel training and validation of
models; by default, no hyperthreading is used.
--fit_size REL_LEARN_FIT_SIZE Relative size of the train set used for initial fitting. The remaining
training data is used for hyperparameter optimization on train subsets
used for validation; in sum, the total training data is
training data = train_fit data + train_test (validation) data
= all data - test data.
The default of 0.66 means that 34 % of the train data is used for
train_test validation [default: 0.66].
--test_size REL_TEST_SIZE Relative size of the test set; if set to 0.0 the trained model
will not be tested [default: 0.2].
--trecomb Create/predict triple recombinants [default: False].
--usecsv Perform directed evolution on single variant csv position
data [default: False].
-t --ts TEST_SET Input validation set in .fasta format.
--version Show version [default: False].
-w --wt WT_FASTA Input wild-type sequence file (in FASTA format).
--wt_pos WT_POSITION Row position of encoded wild-type in encoding CSV file (0-indexed) [default: 0].
-y --y_wt WT_FITNESS Fitness value (y) of wild-type [default: 1.0].
encode Encoding [default: False].
hybrid Hybrid modeling based on DCA-derived sequence encoding [default: False].
ml Pure machine learning modeling based on encoded sequences [default: False].
MODELS Number of saved models to show [default: 5].
onehot OneHot-based encoding [default: False].
param_inference Inferring DCA params using the GREMLIN approach [default: False].
reformat_csv Reformat input CSV with indicated column and mutation separators to default
CSV style (column separator ';' and mutation separator '/') [default: False].
save_msa_info Optimize local fields and couplings of MSA based on GREMLIN DCA approach and
save resulting coupling matrix and highly coevolved amino acids.
shift_pos Shift positions of all variant substitutions of the input CSV
file (identical to reformat_csv when setting --offset to 0) [default: False].
sto2a2m Transform multiple sequence alignment from STO format to
A2M format [default: False].
"""
from os import environ
environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # 3 = TensorFlow INFO, WARNING, and ERROR messages are not printed
from sys import argv, version_info
from pypef import __version__
if version_info[:2] < (3, 9):
    raise SystemError(f"The current version of PyPEF (v {__version__}) requires "
                      f"Python 3.9 or higher.")
import time
from datetime import timedelta
import logging
from docopt import docopt
from schema import Schema, SchemaError, Optional, Or, Use
from pypef.ml.ml_run import run_pypef_pure_ml
from pypef.dca.dca_run import run_pypef_hybrid_modeling
from pypef.utils.utils_run import run_pypef_utils
logger = logging.getLogger("pypef")
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter(
"%(asctime)s.%(msecs)03d %(levelname)s %(filename)s:%(lineno)d -- %(message)s",
"%Y-%m-%d %H:%M:%S"
)
ch.setFormatter(formatter)
logger.addHandler(ch)
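# With the formatter above, an emitted record is rendered roughly as
# (illustrative example only; timestamp and line number depend on the run):
#     2024-01-01 12:00:00.123 INFO main.py:<lineno> -- Done! (Run time: 0 h 0 min 3 s)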
schema = Schema({
Optional('--all'): bool,
Optional('--conc'): bool,
Optional('--csvaa'): bool,
Optional('--ddiverse'): bool,
Optional('--drecomb'): bool,
Optional('--drop'): Use(float),
Optional('--encoding'): Use(str),
Optional('--fitness_key'): Or(None, str),
Optional('--fit_size'): Use(float),
Optional('--help'): bool,
Optional('--input'): Or(None, str),
Optional('--inter_gap'): Use(float),
Optional('--intra_gap'): Use(float),
Optional('--label'): bool,
Optional('--ls'): Or(None, str),
Optional('--model'): Or(None, str),
Optional('--msa'): Or(None, str),
Optional('--mutation_sep'): Or(None, str),
Optional('--negative'): bool,
Optional('--nofft'): bool,
Optional('--numrnd'): Use(int),
Optional('--numiter'): Use(int),
Optional('--numtraj'): Use(int),
Optional('--offset'): Use(int),
Optional('--opt_iter'): Use(int),
Optional('--params'): Or(None, str),
Optional('--pmult'): bool,
Optional('--ps'): Or(None, str),
Optional('--qdiverse'): bool,
Optional('--qarecomb'): bool,
Optional('--qirecomb'): bool,
Optional('--regressor'): Or(None, str),
Optional('--rnd_splits'): Use(int),
Optional('--rnd_state'): Use(int),
Optional('--save'): Use(int),
Optional('--sep'): Or(None, str),
Optional('--show'): Use(int),
Optional('--sort'): Use(int),
Optional('--ssm'): bool,
Optional('--sto'): Or(None, str),
Optional('--tdiverse'): bool,
Optional('--temp'): Use(float),
Optional('--test_size'): Use(float),
Optional('--threads'): Or(None, Use(int)),
Optional('--train_size'): Use(float),
Optional('--trecomb'): bool,
Optional('--usecsv'): bool,
Optional('--ts'): Or(None, str),
Optional('--wt'): Or(None, str),
Optional('--wt_pos'): Use(int),
Optional('--y_wt'): Or(None, Use(float)),
Optional('aaidx'): bool,
Optional('param_inference'): bool,
Optional('hybrid'): bool,
Optional('directevo'): bool,
Optional('encode'): bool,
Optional('extrapolation'): bool,
Optional('low_n'): bool,
Optional('mklsts'): bool,
Optional('mkps'): bool,
Optional('ml'): bool,
Optional('MODELS'): Or(None, Use(int)),
Optional('onehot'): bool,
Optional('reformat_csv'): bool,
Optional('save_msa_info'): bool,
Optional('shift_pos'): bool,
Optional('sto2a2m'): bool,
Optional('train_and_save'): bool,
})
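# Minimal sketch of what the Schema above does (hypothetical, abbreviated input):
# all keys are Optional and string values are coerced via Use(...)/Or(...), e.g.
#     schema.validate({'mklsts': True, '--drop': '-9E09', '--threads': '4'})
#     -> {'mklsts': True, '--drop': -9000000000.0, '--threads': 4}
# Values that cannot be coerced (or unknown keys) raise a SchemaError, which
# validate() below turns into a clean program exit.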
def validate(args):
"""
Validate (docopt) arguments.
Parameters
----------
args: dict
Key-value pairs of arguments,
e.g.,
{'mklsts': True,
'--wt': 'WT_Seq.fasta',
'--input': 'Variant-Fitness.csv'}
    Returns
    -------
    dict
        Validated and type-converted arguments.
    """
try:
args = schema.validate(args)
return args
except SchemaError as e:
exit(e)
def run_main():
"""
Entry point for pip-installed version.
Arguments are created from Docstring using docopt that
creates an argument dict.
"""
arguments = docopt(__doc__, version=__version__)
start_time = time.time()
logger.debug(f'main.py __name__: {__name__}, version: {__version__}')
logger.debug(str(argv)[1:-1].replace("\'", "").replace(",", ""))
logger.debug(f'\n{arguments}')
arguments = validate(arguments)
if arguments['directevo']:
run_pypef_utils(arguments)
elif arguments['ml']:
run_pypef_pure_ml(arguments)
elif arguments['hybrid'] or arguments['param_inference'] or arguments['save_msa_info']:
run_pypef_hybrid_modeling(arguments)
else:
run_pypef_utils(arguments)
elapsed = str(timedelta(seconds=time.time() - start_time)).split(".")[0]
elapsed = f'{elapsed.split(":")[0]} h {elapsed.split(":")[1]} min {elapsed.split(":")[2]} s'
logger.info(f'Done! (Run time: {elapsed})')
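# Illustrative example of the elapsed-time formatting in run_main() above
# (hypothetical run time of 3723.5 seconds):
#     str(timedelta(seconds=3723.5)).split(".")[0]  ->  '1:02:03'
#     which is then reported as '1 h 02 min 03 s'.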
if __name__ == '__main__':
    # Entry point for direct file run
    run_main()