Allow user-defined rules #7

VittorioRainaldi · 2024-09-18T18:54:58Z

I tested the tool on a couple of protein sequences, and for one of them the predicted DNA sequence is too complex for synthesis by both IDT and Twist.
Twist uses the following rules to determine whether a sequence is too complex:

- Avoid repeats of ≥ 20 bp or with Tm ≥ 60 °C
- Global GC content must be between 25% and 65%
- Avoid extreme differences in GC content within a gene (i.e. the difference in GC content between the highest and lowest 50 bp stretch should be no greater than 52%)
- Minimize homopolymers
- Minimize the number/length of small repeats scattered throughout the sequence
- For HIS tags use a combination of CAC and CAT codons i.e. CACCAT…

Output sequences could be screened for such issues and regenerated if needed.
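A minimal sketch of such a screen, covering only the three rules above that are easy to quantify (global GC content, GC spread between 50 bp windows, and homopolymer length). The function names are hypothetical and not part of the package, and the homopolymer limit of 9 is an assumed placeholder, since the Twist rule only says "minimize":

```python
import re

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def window_gc_spread(seq: str, window: int = 50) -> float:
    """Highest minus lowest GC fraction over all windows of the given size."""
    if len(seq) < window:
        return 0.0
    values = [gc_fraction(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    return max(values) - min(values)

def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)

def screen(seq, gc_min=0.25, gc_max=0.65, max_spread=0.52, max_homopolymer=9):
    """Return a list of rule violations; an empty list means the screen passed.

    max_homopolymer=9 is an assumed threshold, not a published Twist value.
    """
    problems = []
    gc = gc_fraction(seq)
    if not gc_min <= gc <= gc_max:
        problems.append(f"global GC {gc:.1%} outside {gc_min:.0%}-{gc_max:.0%}")
    spread = window_gc_spread(seq)
    if spread > max_spread:
        problems.append(f"50 bp GC spread {spread:.1%} exceeds {max_spread:.0%}")
    run = longest_homopolymer(seq)
    if run > max_homopolymer:
        problems.append(f"homopolymer of length {run} exceeds {max_homopolymer}")
    return problems
```

A sequence that returns a non-empty list could be regenerated and screened again. Repeat and Tm checks are left out here because they need a more careful definition than the rule text gives.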

Adibvafa · 2024-09-19T15:04:28Z

This is a great suggestion, we will work on it.
Do you mind sharing the protein/DNA sequence you ran into the issue with?

gui11aume · 2024-09-19T17:23:14Z

An option is to use non-deterministic mode to produce many good variants, then use a complexity metric based on those criteria and sort the variants on this metric. We should have a good default metric, but we should also allow users to set the parameters they want (e.g., they don't care about homopolymers so they would not penalize them).
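The ranking idea can be sketched as a weighted sum of penalties, where setting a weight to zero switches a criterion off. Everything here is hypothetical (the metric names, the default weights, and the allowed homopolymer length of 5), and sequences are assumed to be non-empty and upper-case:

```python
import re

def gc_excess(seq, low=0.25, high=0.65):
    """How far the global GC fraction lies outside [low, high] (0.0 if inside)."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return max(0.0, low - gc, gc - high)

def homopolymer_excess(seq, allowed=5):
    """Bases by which the longest single-base run exceeds the allowed length."""
    longest = max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)
    return float(max(0, longest - allowed))

# Hypothetical registry of penalties and default weights.
METRICS = {"gc": gc_excess, "homopolymer": homopolymer_excess}
DEFAULT_WEIGHTS = {"gc": 100.0, "homopolymer": 1.0}

def complexity(seq, weights=None):
    """Weighted sum of penalties; lower is better."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    return sum(w * METRICS[name](seq) for name, w in weights.items())

def rank_variants(variants, weights=None):
    """Sort variants from least to most complex under the given weights."""
    return sorted(variants, key=lambda s: complexity(s, weights))
```

A user who does not care about homopolymers could call `rank_variants(variants, {"gc": 100.0, "homopolymer": 0.0})`.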

VittorioRainaldi · 2024-09-19T18:00:28Z

The protein in question is called ecm (UniProt ID Q3IZ90).
I am pasting the predicted DNA sequence because, as far as I understand, torch is not always reproducible across systems:

ATGACTCAGAAAGACTCACCATGGCTGTTCAGGACCTATGCGGGACACAGCACAGCCAAAGCCTCCAATGCGCTGTACCGTACCAACCTGGCGAAAGGTCAGACCGGTCTGAGCGTGGCGTTTGATCTGCCGACCCAGACCGGCTATGACAGCGATGATGCGCTGGCCCGCGGCGAAGTCGGTAAAGTCGGTGTACCGATCTGCCACCTGGGTGACATGCGTATGCTGTTTGACCAGATCCCGCTGGAACAGATGAACACCTCTATGACCATCAATGCCACAGCACCGTGGCTGCTGGCGCTGTACATTGCCGTAGCTGAAGAGCAGGGTGCGGACATCAGCAAACTGCAGGGTACTGTTCAGAATGACCTGATGAAAGAGTATCTCAGCCGTGGCACCTACATCTGCCCGCCGCGTCCATCTCTGCGCATGATCACCGATGTGGCGGCTTACACCCGTGTTCATCTGCCGAAATGGAACCCGATGAACGTCTGCTCTTACCACCTGCAGGAAGCAGGTGCGACACCGGAACAGGAACTGGCGTTTGCGCTGGCCACCGGTATTGCGGTGCTGGATGACCTGCGCACCAAAGTGCCGGCAGAACATTTCCCGGCGATGGTTGGCCGCATCAGCTTCTTCGTTAACGCCGGTATCCGCTTTGTGACCGAAATGTGCAAAATGCGTGCGTTTGTTGACCTGTGGGATGAGATCTGCCGTGACCGTTACGGTATCGAAGAAGAGAAATACCGCCGTTTCCGCTACGGTGTGCAGGTTAACAGCCTGGGCCTGACCGAACAGCAGCCGGAGAACAACGTCTACCGCATCCTGATTGAGATGCTGGCGGTGACCCTGAGCAAGAAAGCGCGTGCGCGTGCTGTTCAGCTGCCGGCGTGGAACGAAGCGCTGGGTCTGCCGCGTCCGTGGGACCAGCAGTGGAGCCTGCGTATGCAGCAGATCCTGGCCTACGAGTCCGACCTGCTGGAGTATGAAGACCTGTTTGATGGTAACCCGGCGATCGAGCGTAAAGTTGAAGCGCTGAAAGACGGTGCGCGTGAGGAGCTGGCGCACATTGAGGCGATGGGTGGTGCGATTGAAGCGATCGACTACATGAAAGCGCGTCTGGTAGAGAGCAATGCCGAGCGTATTGCCCGTGTGGAGACCGGTGAAACCGTGGTGGTCGGTGTGAACCGCTGGACCTCTGGTGCACCATCTCCGCTGACCACTGGTGACGGTGCGATTATGGTTGCTGATCCGGAAGCAGAGCGCGATCAGATTGCCCGTCTGGAAGCATGGCGTGCGGGTCGTGATGGTGCGGCGGTGGCTGCGGCGCTGGCTGAACTGCGCCGTGCGGCGACCTCCGGTGAGAACGTCATGCCGGCCTCTATTGCCGCTGCGAAAGCCGGCGCCACCACCGGTGAATGGGCGGCAGAGCTGCGCCGTGCCTTCGGTGAGTTCCGCGGCCCGACCGGTGTTGCGCGTGCGCCAAGCAACCGCACCGAAGGTCTGGATCCGATCCGTGAAGCGGTTCAGGCGGTCTCCGCGCGTCTGGGCCGTCCGCTGAAATTTGTGGTCGGTAAACCGGGTCTGGATGGCCACTCCAACGGTGCGGAACAGATTGCCGCGCGCGCGCGCGACTGCGGCATGGATATCACCTACGATGGTATCCGCCTGACGCCAGCGGAGATCGTGGCGAAAGCGGCCGATGAGCGCGCGCACGTCCTCGGTCTGTCCATTCTGTCCGGCTCCCACATGCCGCTGGTGACCGAAGTGCTGGCTGAAATGCGCCGCGCGGGTCTGGATGTTCCGCTGATCGTTGGCGGTATCATTCCGGAAGAAGATGCGGCGGAGCTGCGTGCCTCCGGTGTTGCGGCGGTTTACACCCCGAAAGATTTTGAGCTGAACCGCATTATGATGGATATTGTCGGCCTGGTTGACCGCACTGCGCTGGCGGCGGAATAA

This sequence gives the following output on the IDT gblock analysis tool:

Denied - High Complexity (Scores of 10 or greater)

The identified complexities prevent manufacturing of this sequence.

Total Complexity Score: 18

| Complexity Description | Score |
| --- | --- |
| One or more repeated sequences greater than 8 bases comprise 61.9% of the overall sequence. Solution: Redesign to reduce the repeats to be less than 40% of the sequence. | 8.8 |
| The GC content of the segment from position 1001 to position 1800 is 64.2%. Solution: Redesign to reduce the GC content below 60%. | 4.2 |
| This sequence contains a window of 100 bases starting at base 1399 with a GC content of 74%. Solution: Redesign this region to have a GC content less than 69%. | 4 |
| A hairpin with the stem sequence CAGATGAACAC exists at the following locations: 250, 458. Solution: Modify the sequence to reduce the length of the stem or complement to less than 10 bases. | 1 |

Aside from repeated sequences, I believe a GC content of >60% is highly unlikely for E. coli coding sequences, so I would expect that to be reflected in the model.
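To locate GC-rich regions like the 100-base window reported above before submitting a sequence, a sliding-window scan is enough. This is a sketch under stated assumptions: `max_gc_window` is a hypothetical helper, positions are reported 1-based, and it is not IDT's actual scoring algorithm:

```python
def max_gc_window(seq, window=100):
    """Return (start, gc_percent) for the window with the highest GC content.

    start is 1-based; on ties the first such window is returned.
    """
    seq = seq.upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    is_gc = [base in "GC" for base in seq]
    count = sum(is_gc[:window])          # GC count of the first window
    best_count, best_start = count, 0
    for start in range(1, len(seq) - window + 1):
        # Slide by one base: add the base entering, drop the base leaving.
        count += is_gc[start + window - 1] - is_gc[start - 1]
        if count > best_count:
            best_count, best_start = count, start
    return best_start + 1, 100.0 * best_count / window
```

The returned percentage can then be compared against whatever limit the vendor states, for example the 69% mentioned in the report.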

VittorioRainaldi · 2024-09-19T18:08:30Z

Ah, I forgot to mention that I optimized the sequence with the "E. coli general" setting using the code snippet on PyPI.

Adibvafa · 2024-09-22T03:25:28Z

@VittorioRainaldi While we work on adding user-defined rules and restrictions, could you try non-deterministic generation of multiple sequences using the new version of the package to see if it solves your problem?

VittorioRainaldi · 2024-09-22T07:07:29Z

I tested it with the following settings:

temperature = 0.2, 0.5, and 0.8
top_p = 0.95
num_sequences = 10

Then I calculated the Hamming distance between the sequences. While increasing the temperature does lead to a more diverse set of sequences, none of the ones I tested passed the IDT screening.

Here is the output I get; the first number is the GC content (%) and the second is the Hamming distance.

temperature = 0.2

60.28586013272078 0
60.0816743236345 45
60.13272077590608 47
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48

temperature = 0.5

59.928534966819804 0
60.54109239407861 128
60.0816743236345 103
59.979581419091375 100
60.0816743236345 98
60.28586013272078 99
60.18376722817764 97
60.23481368044921 98
60.23481368044921 98
60.23481368044921 98

temperature = 0.8

59.060745278203164 0
59.315977539561004 206
60.13272077590608 195
60.54109239407861 202
60.33690658499234 181
60.18376722817764 186
60.49004594180705 174
59.87748851454824 167
59.826442062276676 175
60.33690658499234 176

By the way, is there a way to calculate how much each sequence differs from the "optimum"? Perhaps an internal scoring function for the model?

P.S.: The Hamming distance is not pairwise; I used the first sequence as the reference for all of them, which is why the first value is always zero.
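For reference, the two numbers per sequence can be computed along these lines. This is a sketch, not the script used for the output above; the function names are hypothetical, and all sequences are assumed to have the same length, which holds for synonymous recodings of one protein:

```python
def gc_percent(seq):
    """GC content of seq as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def report(seqs):
    """(GC %, Hamming distance to the first sequence) for every sequence."""
    reference = seqs[0]
    return [(gc_percent(s), hamming(reference, s)) for s in seqs]

for gc, dist in report(["ATGC", "ATGG", "TTTT"]):
    print(gc, dist)
```

Because the first sequence is compared with itself, its distance is always 0.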

Adibvafa self-assigned this Sep 19, 2024

Adibvafa added the bug Something isn't working label Sep 19, 2024

Adibvafa added enhancement New feature or request and removed bug Something isn't working labels Sep 22, 2024
