This repo contains the results data for Round 2 of the EGFR Protein Design Competition, hosted by Adaptyv Bio in partnership with Polaris and Dimension.
Due to the great interest, Adaptyv is organizing a consortium of community members to create a post-competition write-up that compiles additional data, analyses, and learnings. Click here to join the effort.
Contributions so far include:
- FoldSeek and DE-STRESS scores, contributed by the team at https://github.com/wells-wood-research
📊 Processed binding affinity characterization data and sequence similarity metrics:
🔬 Kinetic curves and raw data:
https://api.adaptyvbio.com/storage/v1/object/public/egfr_design_competition_2/package.zip
🧱 AlphaFold2 Structure predictions (.pdb) for all 400 selected designs:
🌐 Embeddings from a variety of models (ESM C, ESM2, SaProt, ProTrek) for all submissions:
https://api.adaptyvbio.com/storage/v1/object/public/egfr_design_competition_2/embeddings.zip
More details on the metrics that were used for ranking the sequences can be found in the metrics repo.
Two of the metrics that we used for scoring the designs computationally are derived from AlphaFold2 predictions. To calculate them, we began by generating a structure prediction using ColabFold (with 5 models, 3 recycles, 3 seeds, with templates and without initial guess). The top-ranked model was selected for each design.
To compute PAE interaction, the Predicted Aligned Error (PAE) of the top-ranked prediction was averaged across residue pairs in which one residue belongs to the target and the other to the binder, as done here.
The second metric, ipTM, is predicted by the model directly.
Here we also show the predicted Local Distance Difference Test (pLDDT) scores, averaged over residues of the binder chain.
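As a rough illustration of the PAE interaction and binder pLDDT averages described above, here is a minimal sketch on a toy complex. The function names and the toy numbers are hypothetical, and averaging both directions of the PAE matrix (binder to target and target to binder) is an assumption; the linked implementation remains the reference.

```python
from statistics import mean


def pae_interaction(pae, binder_idx, target_idx):
    """Mean PAE over all binder-target residue pairs, in both directions.

    `pae` is a square matrix (list of lists) indexed by residue position
    in the predicted complex. Averaging both off-diagonal blocks is an
    assumption of this sketch.
    """
    cross = [pae[i][j] for i in binder_idx for j in target_idx]
    cross += [pae[j][i] for i in binder_idx for j in target_idx]
    return mean(cross)


def binder_plddt(plddt, binder_idx):
    """Mean per-residue pLDDT over the binder chain only."""
    return mean(plddt[i] for i in binder_idx)


# Toy 3-residue complex: residue 0 is the binder, residues 1-2 are the target.
pae = [
    [0.0, 4.0, 6.0],
    [8.0, 0.0, 1.0],
    [10.0, 1.0, 0.0],
]
plddt = [90.0, 80.0, 70.0]
binder, target = [0], [1, 2]

print(pae_interaction(pae, binder, target))  # (4 + 6 + 8 + 10) / 4 = 7.0
print(binder_plddt(plddt, binder))           # 90.0
```

Intra-chain entries (such as the target-target values of 1.0 here) are deliberately left out, since the metric is meant to capture confidence in the relative placement of the two chains.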
The third metric used in the ranking is ESM2 PLL (Pseudo Log Likelihood). We use the esm2_t33_650M_UR50D model for the calculation and we do not normalize by the length of the sequence.
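To make the pseudo log likelihood concrete, the sketch below masks one position at a time and sums the log-probability the model assigns to the true residue. It does not load ESM2: `uniform_model` is a hypothetical toy stand-in for a masked language model, and the `masked_log_probs` callable interface is an assumption made only so the example stays self-contained.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_log_likelihood(sequence, masked_log_probs):
    """Sum over positions of log p(true residue | sequence with that position masked).

    `masked_log_probs(masked_sequence, position)` is a hypothetical callable
    returning a dict that maps each amino acid to its log-probability.
    """
    total = 0.0
    for i, residue in enumerate(sequence):
        masked = sequence[:i] + "?" + sequence[i + 1:]  # "?" marks the masked position
        total += masked_log_probs(masked, i)[residue]
    return total  # no division by len(sequence): the score is not length-normalized


def uniform_model(masked_sequence, position):
    """Toy stand-in for a masked language model: all 20 amino acids equally likely."""
    log_p = math.log(1 / len(AMINO_ACIDS))
    return {aa: log_p for aa in AMINO_ACIDS}


pll = pseudo_log_likelihood("MKTAYIAK", uniform_model)
print(pll)  # equals 8 * log(1/20) for this toy model
```

Because the sum is not divided by the sequence length, longer sequences tend to receive more negative scores under any model, which is worth keeping in mind when comparing designs of different lengths.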
We checked each sequence against several sequence databases. As part of the initial competition rules, only proteins that were at least 10 amino acids (AA) away from a published sequence were considered valid and counted in the final leaderboard. The results of that similarity search are stored in the results folder. The similarity check metric is calculated as identity * coverage, where:
• Identity is the highest percentage of matching amino acids between a subsequence of the query and a subsequence of the database entry.
• Coverage is the fraction of the query sequence that aligns with the database entry.
Proteins with a distance of fewer than 10 amino acids to a database entry were excluded from the competition. A similarity_check value of “null” indicates that no matches were found in any of the databases.
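The identity * coverage product can be sketched as follows. This is a simplified illustration, not the script used for the competition: it assumes an ungapped alignment, expresses both terms as fractions between 0 and 1 (the released files may use percentages), and the function and argument names are hypothetical.

```python
def similarity_check(n_matches, aligned_length, query_length):
    """Return identity * coverage for the best hit, or None if there was no hit.

    identity = matching residues / aligned residues
    coverage = aligned residues of the query / full query length
    Both are fractions in [0, 1] in this sketch (an assumption).
    """
    if aligned_length == 0:
        return None  # mirrors the "null" value used when no match is found
    identity = n_matches / aligned_length
    coverage = aligned_length / query_length
    return identity * coverage


# Hypothetical hit: 45 identical residues over a 50-residue alignment
# of a 60-residue query.
score = similarity_check(n_matches=45, aligned_length=50, query_length=60)
print(score)  # about 0.75 (identity 0.9 times coverage of about 0.83)
print(similarity_check(0, 0, 60))  # None
```

Under the ungapped assumption the product reduces to the number of matching residues divided by the query length, so a short but perfect local match scores lower than a near-identical full-length match.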
The databases that we checked are SwissProt, THPdb, USPTO, sequences from the first round of the competition and binders designed by Cao et al. (2022). The scripts can be found in the first round data package repo.
We checked every design against the SwissProt and PDB databases using TM-score in FoldSeek by the Steinegger Lab. We calculated several metrics, including:
• evalue: The E-value, representing the number of alignments expected by chance with a score as good as or better than the one observed. Lower values indicate more significant alignments.
• alntmscore: TM-score of the alignment, which measures structural similarity on a scale from 0 to 1 (1 being identical structures).
• lddt: Local Distance Difference Test (LDDT) score for the aligned region, indicating alignment quality. Ranges from 0 to 1, with 1 indicating perfect alignment.
• prob: The probability of the alignment being correct or meaningful.
Full details with bash scripts and additional files can be found here.
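As an example of how the FoldSeek metrics listed above might be used, the sketch below keeps the hit with the highest alntmscore for each design from tab-separated results. The column order, the design and hit names, and the 0.5 TM-score cutoff are all assumptions for illustration; check the actual result files for their layout.

```python
import csv
import io

# Assumed column order; the real result files may differ.
COLUMNS = ["query", "target", "evalue", "alntmscore", "lddt", "prob"]


def best_hits(tsv_text, min_tmscore=0.5):
    """Return, per query, the row with the highest alntmscore at or above the cutoff."""
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        tm = float(row["alntmscore"])
        if tm < min_tmscore:
            continue  # skip hits below the illustrative cutoff
        current = best.get(row["query"])
        if current is None or tm > float(current["alntmscore"]):
            best[row["query"]] = row
    return best


# Hypothetical rows, not taken from the competition data.
example = (
    "design_1\thit_A\t1e-5\t0.72\t0.81\t0.99\n"
    "design_1\thit_B\t3e-2\t0.41\t0.55\t0.40\n"
    "design_2\thit_C\t2e-1\t0.35\t0.48\t0.10\n"
)
hits = best_hits(example)
for query, row in hits.items():
    print(query, row["target"], row["alntmscore"])  # design_1 hit_A 0.72
```

Designs with no hit above the cutoff (design_2 here) are simply absent from the result, which makes it easy to list designs without a close structural match in the searched databases.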
DE-STRESS is a tool from the Wells Wood Research Group that evaluates structural models of designed and engineered proteins. The program calculates roughly 70 physicochemical properties using a variety of software tools, including all-atom scoring functions (such as Rosetta, EvoEF2, BUDE), measures of geometric packing density and hydrogen bonding quality, aggregation propensity, isoelectric point, and many others.
We ran DE-STRESS on the AlphaFold structure predictions for all 400 selected designs from round 2 of the competition, both with EGFR (destress_binder_with_egfr.csv) and without (destress_binder_only.csv).
The full glossary of metrics and descriptions is available here. DE-STRESS can be used through a web server or through a command-line interface (CLI). The full paper is available here.
The submitted protein sequences were reverse-translated, and the corresponding DNA sequences were optimized using Adaptyv's internal pipeline. This process considered several parameters, including optimal codon usage for cell-free systems, mRNA secondary structure stability, and synthesizability factors. Additionally, non-coding regions at the 5' and 3' ends, optimized for cell-free expression, were incorporated into the coding sequences. Suitable gene constructs were successfully generated for all submitted protein sequences.
Protein synthesis was carried out using an optimized cell-free expression system, suitable for a wide range of proteins. The template DNA was added, and protein expression was conducted over a defined period. During the competition, at least two expression batches were performed for each sequence entry, with some sequences tested up to four times under varying conditions. Protein synthesis success was assessed via a label-free quantification assay. Sequences that yielded less than 0.02 µg/mL of protein were excluded from further experimental characterization.
The binding assay was conducted using Bio-Layer Interferometry (BLI), a label-free technology for biomolecular interaction measurement. A multi-cycle kinetic assay was performed against the target antigen. Expressed ligands were immobilized on the probe surface using tag-specific chemistry, and several concentrations of the antigen (ranging from 1000 nM to 10 nM) were flowed over the probe. The experiments were performed in duplicate using HBS-T buffer with 0.5% BSA at 25°C.
The binding signals were baseline-corrected and fitted using a 1:1 binding model across all tested concentrations for each replicate. This approach allowed us to extract the kinetic rates (association and dissociation) and calculate the affinity constants (KD) for each ligand. The predicted binding curves were generated based on the fit parameters, ensuring an accurate representation of the interaction dynamics. In cases where the maximum signal fell below the quantifiable threshold, or when the interaction kinetics were too fast relative to the device's temporal resolution, we employed equilibrium analysis to estimate the dissociation constant (KD). Each experimental replicate was analyzed independently.
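For readers unfamiliar with the 1:1 binding model, the sketch below writes out its standard equations and shows a simple equilibrium analysis. It is not Adaptyv's fitting code: the rate constants and the five concentrations are hypothetical (only the 1000 nM to 10 nM range comes from the text), and the Hanes-Woolf linearization is used here only to keep the example free of dependencies, whereas nonlinear least squares is the more common choice in practice.

```python
import math


def association(t, conc, ka, kd, rmax):
    """1:1 association phase: R(t) = Req * (1 - exp(-(ka*C + kd) * t))."""
    k_obs = ka * conc + kd
    r_eq = rmax * conc / (conc + kd / ka)  # KD = kd / ka
    return r_eq * (1 - math.exp(-k_obs * t))


def dissociation(t, r0, kd):
    """1:1 dissociation phase: R(t) = R0 * exp(-kd * t)."""
    return r0 * math.exp(-kd * t)


def equilibrium_response(conc, k_d, rmax):
    """Steady-state response: Req = Rmax * C / (KD + C)."""
    return rmax * conc / (k_d + conc)


def fit_equilibrium_kd(concs, responses):
    """Estimate (KD, Rmax) from the line C/Req = C/Rmax + KD/Rmax."""
    ys = [c / r for c, r in zip(concs, responses)]
    n = len(concs)
    mx, my = sum(concs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(concs, ys))
    slope /= sum((x - mx) ** 2 for x in concs)
    intercept = my - slope * mx
    rmax = 1 / slope
    return intercept * rmax, rmax


# Hypothetical binder: ka = 1e-4 per nM per s, kd = 1e-3 per s, so KD = 10 nM.
ka, kd, rmax = 1e-4, 1e-3, 1.0
concs_nm = [1000.0, 300.0, 100.0, 30.0, 10.0]  # assumed dilution series
plateaus = [equilibrium_response(c, kd / ka, rmax) for c in concs_nm]

kd_fit, rmax_fit = fit_equilibrium_kd(concs_nm, plateaus)
print(kd_fit, rmax_fit)  # close to 10 nM and 1.0 on this noise-free data
```

The kinetic fit extracts ka and kd from the curve shapes and derives KD as kd / ka, while the equilibrium analysis uses only the plateau heights, which is why it remains usable when association and dissociation are too fast to resolve.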
All code is licensed under Apache 2; all data is licensed under the ODbL. Contact [email protected] or anyone else responsible at Adaptyv if you require a different license.