Allow user-defined rules #7

Open
VittorioRainaldi opened this issue Sep 18, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@VittorioRainaldi

I tested the tool on a couple of protein sequences, and for one of them the predicted DNA sequence is too complex for synthesis by both IDT and Twist.
Twist uses the following rules to determine whether a sequence is too complex:

- Avoid repeats of ≥ 20 bp or with Tm ≥ 60 °C
- Global GC content must be between 25% and 65%
- Avoid extreme differences in GC content within a gene (i.e. the difference in GC content between the highest and lowest 50bp stretch should be no greater than 52%)
- Minimize homopolymers
- Minimize the number/length of small repeats scattered throughout the sequence
- For His tags, use a combination of CAC and CAT codons, i.e. CACCAT…

Output sequences could be screened for such issues and regenerated if needed.
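
As a rough illustration of such a screen, here is a minimal Python sketch that checks a few of the rules listed above: global GC content, the GC spread across 50 bp windows, exact repeats of 20 bp or more, and homopolymer length. The function names are hypothetical, the Tm-based repeat rule is not implemented, and the `max_homopolymer` cutoff is an assumed placeholder because the rule only says "minimize".

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc_spread(seq: str, window: int = 50) -> float:
    """Highest minus lowest GC fraction over all windows of the given size."""
    if len(seq) < window:
        return 0.0
    values = [gc_fraction(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    return max(values) - min(values)


def has_repeat(seq: str, length: int = 20) -> bool:
    """True if any exact k-mer of this length occurs twice (overlaps count)."""
    seq = seq.upper()
    seen = set()
    for i in range(len(seq) - length + 1):
        kmer = seq[i:i + length]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single base."""
    longest = run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        longest = max(longest, run)
    return longest


def screen_sequence(
    seq: str,
    gc_min: float = 0.25,
    gc_max: float = 0.65,
    max_gc_spread: float = 0.52,
    repeat_length: int = 20,
    max_homopolymer: int = 8,  # assumed cutoff, not a published value
) -> List[str]:
    """Return a list of rule violations; an empty list means the screen passed."""
    problems = []
    gc = gc_fraction(seq)
    if not gc_min <= gc <= gc_max:
        problems.append(f"global GC content {gc:.1%} outside {gc_min:.0%}-{gc_max:.0%}")
    spread = window_gc_spread(seq)
    if spread > max_gc_spread:
        problems.append(f"50 bp window GC spread {spread:.1%} exceeds {max_gc_spread:.0%}")
    if has_repeat(seq, repeat_length):
        problems.append(f"exact repeat of at least {repeat_length} bp")
    if longest_homopolymer(seq) > max_homopolymer:
        problems.append(f"homopolymer longer than {max_homopolymer} bp")
    return problems
```

A sequence that returns a non-empty list could then be regenerated and screened again.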

Adibvafa self-assigned this Sep 19, 2024
Adibvafa added the bug (Something isn't working) label Sep 19, 2024
@Adibvafa
Owner

This is a great suggestion, we will work on it.
Would you mind sharing the protein/DNA sequence you ran into the issue with?

@gui11aume
Collaborator

One option is to use non-deterministic mode to produce many good variants, compute a complexity metric based on those criteria, and sort the variants by that metric. We should have a good default metric, but we should also let users set the parameters they want (e.g., users who don't care about homopolymers would not penalize them).
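
A minimal sketch of what such a ranking could look like, assuming the variants are already available as plain DNA strings. Everything here (`ComplexityWeights`, `complexity_score`, `rank_variants`, the three penalty terms and the 50% GC target) is hypothetical and only meant to show how user-adjustable weights could work; it is not part of the package.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComplexityWeights:
    """User-adjustable penalty weights; set one to 0.0 to ignore that criterion."""
    gc_deviation: float = 1.0
    homopolymer: float = 1.0
    repeats: float = 1.0


def gc_fraction(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    longest = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        prev = base
        longest = max(longest, run)
    return longest


def repeated_kmer_fraction(seq: str, k: int = 9) -> float:
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] > 1) / len(kmers)


def complexity_score(seq: str, weights: ComplexityWeights, target_gc: float = 0.5) -> float:
    """Weighted sum of penalties; lower is better. target_gc is an assumed default."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (
        weights.gc_deviation * abs(gc_fraction(seq) - target_gc)
        + weights.homopolymer * longest_homopolymer(seq) / len(seq)
        + weights.repeats * repeated_kmer_fraction(seq)
    )


def rank_variants(variants: List[str], weights: Optional[ComplexityWeights] = None) -> List[str]:
    """Return the variants sorted from least to most complex."""
    weights = weights or ComplexityWeights()
    return sorted(variants, key=lambda seq: complexity_score(seq, weights))
```

A user who does not care about homopolymers would call `rank_variants(variants, ComplexityWeights(homopolymer=0.0))`.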

@VittorioRainaldi
Author

The protein in question is called ecm (UniProt ID Q3IZ90).
I am pasting the predicted DNA sequence because, as far as I understand, torch is not always reproducible across systems:

ATGACTCAGAAAGACTCACCATGGCTGTTCAGGACCTATGCGGGACACAGCACAGCCAAAGCCTCCAATGCGCTGTACCGTACCAACCTGGCGAAAGGTCAGACCGGTCTGAGCGTGGCGTTTGATCTGCCGACCCAGACCGGCTATGACAGCGATGATGCGCTGGCCCGCGGCGAAGTCGGTAAAGTCGGTGTACCGATCTGCCACCTGGGTGACATGCGTATGCTGTTTGACCAGATCCCGCTGGAACAGATGAACACCTCTATGACCATCAATGCCACAGCACCGTGGCTGCTGGCGCTGTACATTGCCGTAGCTGAAGAGCAGGGTGCGGACATCAGCAAACTGCAGGGTACTGTTCAGAATGACCTGATGAAAGAGTATCTCAGCCGTGGCACCTACATCTGCCCGCCGCGTCCATCTCTGCGCATGATCACCGATGTGGCGGCTTACACCCGTGTTCATCTGCCGAAATGGAACCCGATGAACGTCTGCTCTTACCACCTGCAGGAAGCAGGTGCGACACCGGAACAGGAACTGGCGTTTGCGCTGGCCACCGGTATTGCGGTGCTGGATGACCTGCGCACCAAAGTGCCGGCAGAACATTTCCCGGCGATGGTTGGCCGCATCAGCTTCTTCGTTAACGCCGGTATCCGCTTTGTGACCGAAATGTGCAAAATGCGTGCGTTTGTTGACCTGTGGGATGAGATCTGCCGTGACCGTTACGGTATCGAAGAAGAGAAATACCGCCGTTTCCGCTACGGTGTGCAGGTTAACAGCCTGGGCCTGACCGAACAGCAGCCGGAGAACAACGTCTACCGCATCCTGATTGAGATGCTGGCGGTGACCCTGAGCAAGAAAGCGCGTGCGCGTGCTGTTCAGCTGCCGGCGTGGAACGAAGCGCTGGGTCTGCCGCGTCCGTGGGACCAGCAGTGGAGCCTGCGTATGCAGCAGATCCTGGCCTACGAGTCCGACCTGCTGGAGTATGAAGACCTGTTTGATGGTAACCCGGCGATCGAGCGTAAAGTTGAAGCGCTGAAAGACGGTGCGCGTGAGGAGCTGGCGCACATTGAGGCGATGGGTGGTGCGATTGAAGCGATCGACTACATGAAAGCGCGTCTGGTAGAGAGCAATGCCGAGCGTATTGCCCGTGTGGAGACCGGTGAAACCGTGGTGGTCGGTGTGAACCGCTGGACCTCTGGTGCACCATCTCCGCTGACCACTGGTGACGGTGCGATTATGGTTGCTGATCCGGAAGCAGAGCGCGATCAGATTGCCCGTCTGGAAGCATGGCGTGCGGGTCGTGATGGTGCGGCGGTGGCTGCGGCGCTGGCTGAACTGCGCCGTGCGGCGACCTCCGGTGAGAACGTCATGCCGGCCTCTATTGCCGCTGCGAAAGCCGGCGCCACCACCGGTGAATGGGCGGCAGAGCTGCGCCGTGCCTTCGGTGAGTTCCGCGGCCCGACCGGTGTTGCGCGTGCGCCAAGCAACCGCACCGAAGGTCTGGATCCGATCCGTGAAGCGGTTCAGGCGGTCTCCGCGCGTCTGGGCCGTCCGCTGAAATTTGTGGTCGGTAAACCGGGTCTGGATGGCCACTCCAACGGTGCGGAACAGATTGCCGCGCGCGCGCGCGACTGCGGCATGGATATCACCTACGATGGTATCCGCCTGACGCCAGCGGAGATCGTGGCGAAAGCGGCCGATGAGCGCGCGCACGTCCTCGGTCTGTCCATTCTGTCCGGCTCCCACATGCCGCTGGTGACCGAAGTGCTGGCTGAAATGCGCCGCGCGGGTCTGGATGTTCCGCTGATCGTTGGCGGTATCATTCCGGAAGAAGATGCGGCGGAGCTGCGTGCCTCCGGTGTTGCGGCGGTTTACACCCCGAAAGATTTTGAGCTGAACCGCATTATGATGGATATTGTCGGCCTGGTTGACCGCACTGCGCTGGCGGCGGAATAA

This sequence gives the following output in the IDT gBlock analysis tool:


Denied - High Complexity (Scores of 10 or greater)

The identified complexities prevent manufacturing of this sequence.

Total Complexity Score: 18

Complexity Description (Score)

  • One or more repeated sequences greater than 8 bases comprise 61.9% of the overall sequence. Solution: Redesign to reduce the repeats to be less than 40% of the sequence. (Score: 8.8)
  • The GC content of the segment from position 1001 to position 1800 is 64.2%. Solution: Redesign to reduce the GC content below 60%. (Score: 4.2)
  • This sequence contains a window of 100 bases starting at base 1399 with a GC content of 74%. Solution: Redesign this region to have a GC content less than 69%. (Score: 4)
  • A hairpin with the stem sequence CAGATGAACAC exists at the following locations: 250, 458. Solution: Modify the sequence to reduce the length of the stem or complement to less than 10 bases. (Score: 1)

Aside from repeated sequences, I believe a GC content of >60% is highly unlikely for E. coli coding sequences, so I would expect that to be reflected in the model.
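
For anyone who wants to pre-check candidates locally, here is a rough Python sketch of two of the criteria in the report above: the share of bases covered by repeats longer than 8 bases, and the most GC-rich 100-base window. This is only an approximation under my own assumptions; how IDT actually computes its scores is not described in this thread, the function names are hypothetical, and only direct repeats on the same strand are counted (no hairpin or reverse-complement check).

```python
from collections import defaultdict
from typing import Tuple


def repeat_coverage(seq: str, min_len: int = 9) -> float:
    """Fraction of bases inside any k-mer (k = min_len) that occurs more than once."""
    seq = seq.upper()
    if not seq:
        return 0.0
    positions = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        positions[seq[i:i + min_len]].append(i)
    covered = [False] * len(seq)
    for starts in positions.values():
        if len(starts) > 1:
            for start in starts:
                for j in range(start, start + min_len):
                    covered[j] = True
    return sum(covered) / len(seq)


def max_window_gc(seq: str, window: int = 100) -> Tuple[int, float]:
    """Return (1-based start, GC fraction) of the most GC-rich window."""
    seq = seq.upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    is_gc = [base in "GC" for base in seq]
    count = sum(is_gc[:window])
    best_start, best_count = 0, count
    # Slide the window one base at a time, updating the GC count incrementally.
    for start in range(1, len(seq) - window + 1):
        count += is_gc[start + window - 1] - is_gc[start - 1]
        if count > best_count:
            best_start, best_count = start, count
    return best_start + 1, best_count / window
```

Following the thresholds quoted in the report, a candidate would be flagged when `repeat_coverage(seq)` reaches 0.40 or when the GC fraction returned by `max_window_gc(seq)` reaches 0.69.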

@VittorioRainaldi
Author

Ah, I forgot to mention that I optimized the sequence with the "E. coli general" setting using the code snippet on PyPI.

@Adibvafa
Owner

Adibvafa commented Sep 22, 2024

@VittorioRainaldi While we work on adding user-defined rules and restrictions, could you try non-deterministic generation of multiple sequences using the new version of the package to see if it solves your problem?

Adibvafa added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Sep 22, 2024
@VittorioRainaldi
Author

VittorioRainaldi commented Sep 22, 2024

I tested it with the following settings:

  • temperature = 0.2, 0.5, and 0.8
  • top_p = 0.95
  • num_sequences = 10

Then I calculated the Hamming distance of the sequences. While increasing the temperature does lead to a more diverse set of sequences, none of the ones I tested passed the IDT screening.

Here is the output I get. The first number is the GC content (%), the second is the Hamming distance.

temperature = 0.2

60.28586013272078 0
60.0816743236345 45
60.13272077590608 47
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48
60.23481368044921 48

temperature = 0.5

59.928534966819804 0
60.54109239407861 128
60.0816743236345 103
59.979581419091375 100
60.0816743236345 98
60.28586013272078 99
60.18376722817764 97
60.23481368044921 98
60.23481368044921 98
60.23481368044921 98

temperature = 0.8

59.060745278203164 0
59.315977539561004 206
60.13272077590608 195
60.54109239407861 202
60.33690658499234 181
60.18376722817764 186
60.49004594180705 174
59.87748851454824 167
59.826442062276676 175
60.33690658499234 176

By the way, is there a way to calculate how much each sequence differs from the "optimum"? Perhaps an internal scoring function for the model?

P.S.: the Hamming distance is not pairwise; I just used the first sequence as the reference for all of them, which is why the first is always zero.
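
For reference, the numbers above can be reproduced with something along these lines. This is a minimal sketch with hypothetical function names, assuming all variants have the same length (true here, since they encode the same protein). It also includes a mean pairwise distance as an alternative to using the first sequence as the reference.

```python
from itertools import combinations
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """GC content as a percentage of the sequence length."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def summarize(variants: List[str]) -> List[Tuple[float, int]]:
    """(GC %, Hamming distance to the first variant) for each variant."""
    reference = variants[0]
    return [(gc_percent(seq), hamming(reference, seq)) for seq in variants]


def mean_pairwise_hamming(variants: List[str]) -> float:
    """Average Hamming distance over all unordered pairs of variants."""
    pairs = list(combinations(variants, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```

The pairwise mean gives a reference-independent view of diversity, since distances to the first sequence alone can hide variants that are identical to each other.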


3 participants