Allow user-defined rules #7
Comments
This is a great suggestion; we will work on it.
One option is to use non-deterministic mode to produce many good variants, then score each variant with a complexity metric based on those criteria and sort the variants by that score. We should have a good default metric, but we should also allow users to set the parameters they want (e.g., if they don't care about homopolymers, they would not penalize them).
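To make the metric idea concrete, here is a minimal sketch of a user-weighted complexity score. Everything in it is an assumption for illustration: the `ComplexityWeights` fields, the two criteria (GC deviation and longest homopolymer run), and the default thresholds are hypothetical and are not the package's API or any vendor's actual scoring rules.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional


@dataclass
class ComplexityWeights:
    """Hypothetical user-tunable penalties; set a weight to 0 to ignore a criterion."""
    gc_deviation: float = 1.0   # weight on distance from the target GC fraction
    homopolymer: float = 1.0    # weight on over-long single-base runs
    target_gc: float = 0.5      # assumed target GC fraction
    max_homopolymer: int = 6    # runs up to this length are not penalized


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def complexity_score(seq: str, w: Optional[ComplexityWeights] = None) -> float:
    """Lower is better. Weighted sum of the two illustrative penalties."""
    w = w or ComplexityWeights()
    gc_penalty = abs(gc_fraction(seq) - w.target_gc) * 100
    run_penalty = max(0, longest_homopolymer(seq) - w.max_homopolymer)
    return w.gc_deviation * gc_penalty + w.homopolymer * run_penalty


def rank_variants(variants: List[str], w: Optional[ComplexityWeights] = None) -> List[str]:
    """Sort candidate sequences from least to most complex."""
    return sorted(variants, key=lambda s: complexity_score(s, w))
```

A user who does not care about homopolymers would pass `ComplexityWeights(homopolymer=0.0)` to `rank_variants`, which leaves only the GC term in the score.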
The protein sequence in question is called ecm (uniprot ID Q3IZ90). ATGACTCAGAAAGACTCACCATGGCTGTTCAGGACCTATGCGGGACACAGCACAGCCAAAGCCTCCAATGCGCTGTACCGTACCAACCTGGCGAAAGGTCAGACCGGTCTGAGCGTGGCGTTTGATCTGCCGACCCAGACCGGCTATGACAGCGATGATGCGCTGGCCCGCGGCGAAGTCGGTAAAGTCGGTGTACCGATCTGCCACCTGGGTGACATGCGTATGCTGTTTGACCAGATCCCGCTGGAACAGATGAACACCTCTATGACCATCAATGCCACAGCACCGTGGCTGCTGGCGCTGTACATTGCCGTAGCTGAAGAGCAGGGTGCGGACATCAGCAAACTGCAGGGTACTGTTCAGAATGACCTGATGAAAGAGTATCTCAGCCGTGGCACCTACATCTGCCCGCCGCGTCCATCTCTGCGCATGATCACCGATGTGGCGGCTTACACCCGTGTTCATCTGCCGAAATGGAACCCGATGAACGTCTGCTCTTACCACCTGCAGGAAGCAGGTGCGACACCGGAACAGGAACTGGCGTTTGCGCTGGCCACCGGTATTGCGGTGCTGGATGACCTGCGCACCAAAGTGCCGGCAGAACATTTCCCGGCGATGGTTGGCCGCATCAGCTTCTTCGTTAACGCCGGTATCCGCTTTGTGACCGAAATGTGCAAAATGCGTGCGTTTGTTGACCTGTGGGATGAGATCTGCCGTGACCGTTACGGTATCGAAGAAGAGAAATACCGCCGTTTCCGCTACGGTGTGCAGGTTAACAGCCTGGGCCTGACCGAACAGCAGCCGGAGAACAACGTCTACCGCATCCTGATTGAGATGCTGGCGGTGACCCTGAGCAAGAAAGCGCGTGCGCGTGCTGTTCAGCTGCCGGCGTGGAACGAAGCGCTGGGTCTGCCGCGTCCGTGGGACCAGCAGTGGAGCCTGCGTATGCAGCAGATCCTGGCCTACGAGTCCGACCTGCTGGAGTATGAAGACCTGTTTGATGGTAACCCGGCGATCGAGCGTAAAGTTGAAGCGCTGAAAGACGGTGCGCGTGAGGAGCTGGCGCACATTGAGGCGATGGGTGGTGCGATTGAAGCGATCGACTACATGAAAGCGCGTCTGGTAGAGAGCAATGCCGAGCGTATTGCCCGTGTGGAGACCGGTGAAACCGTGGTGGTCGGTGTGAACCGCTGGACCTCTGGTGCACCATCTCCGCTGACCACTGGTGACGGTGCGATTATGGTTGCTGATCCGGAAGCAGAGCGCGATCAGATTGCCCGTCTGGAAGCATGGCGTGCGGGTCGTGATGGTGCGGCGGTGGCTGCGGCGCTGGCTGAACTGCGCCGTGCGGCGACCTCCGGTGAGAACGTCATGCCGGCCTCTATTGCCGCTGCGAAAGCCGGCGCCACCACCGGTGAATGGGCGGCAGAGCTGCGCCGTGCCTTCGGTGAGTTCCGCGGCCCGACCGGTGTTGCGCGTGCGCCAAGCAACCGCACCGAAGGTCTGGATCCGATCCGTGAAGCGGTTCAGGCGGTCTCCGCGCGTCTGGGCCGTCCGCTGAAATTTGTGGTCGGTAAACCGGGTCTGGATGGCCACTCCAACGGTGCGGAACAGATTGCCGCGCGCGCGCGCGACTGCGGCATGGATATCACCTACGATGGTATCCGCCTGACGCCAGCGGAGATCGTGGCGAAAGCGGCCGATGAGCGCGCGCACGTCCTCGGTCTGTCCATTCTGTCCGGCTCCCACATGCCGCTGGTGACCGAAGTGCTGGCTGAAATGCGCCGCGCGGGTCTGGATGTTCCGCTGATCGTTGGCGGTATCATTCCGGAAGAAGATGCGGCGGAGCTGCGTGCCTCCGGTGTTGCGGCGGTTTACACCCCGAAAGATTTTGAGCTGAACCGCATTATGATGGATATTGTCGGCCTGGTT
GACCGCACTGCGCTGGCGGCGGAATAA
This sequence gives the following output on the IDT gBlocks analysis tool:
Denied - High Complexity (Scores of 10 or greater)
The identified complexities prevent manufacturing of this sequence.
Total Complexity Score: 18
Complexity Description
Aside from repeated sequences, I believe a GC content of >60% is highly unlikely for E. coli coding sequences, so I would expect that to be reflected in the model.
Ah, I forgot to mention that I optimized the sequence with the "E. coli general" setting using the code snippet on PyPI.
@VittorioRainaldi While we work on adding user-defined rules and restrictions, could you try non-deterministic generation of multiple sequences with the new version of the package to see if it solves your problem?
I tested it with the following settings:
I then calculated the Hamming distance of the sequences. While increasing the temperature does lead to a more diverse set of sequences, none of the ones I tested passed the IDT screening. Here is the output I get; the first number is GC content (%), the second is Hamming distance.
temperature = 0.2
60.28586013272078 0
temperature = 0.5
59.928534966819804 0
temperature = 0.8
59.060745278203164 0
By the way, is there a way to calculate how much each sequence differs from the "optimum"? Perhaps an internal scoring function for the model?
P.S.: The Hamming distance is not pairwise. I used the first sequence as the reference for all of them, which is why the first value is always zero.
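For reference, the two numbers reported above can be computed as in the sketch below. This is not the script used in the comment; the function names are hypothetical. It only shows the calculation as described: GC content in percent, and Hamming distance of every sequence against the first one.

```python
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """GC content as a percentage of sequence length."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def report(seqs: List[str]) -> List[Tuple[float, int]]:
    """(GC %, Hamming distance to the first sequence) for each sequence."""
    reference = seqs[0]
    return [(gc_percent(s), hamming(reference, s)) for s in seqs]
```

Because the first sequence is its own reference, its distance is always 0. Replacing `reference` with the deterministic output would give a simple distance from the "optimum", though that is a sequence-level distance and not a model score.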
I tested the tool on a couple of protein sequences, and for one of them the predicted DNA sequence is too complex for synthesis at both IDT and Twist.
Twist has the following rules to determine whether a sequence is too complex:
Output sequences could be screened for such issues and regenerated if needed.
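A minimal sketch of that screen-and-regenerate loop follows. The `generate` and `passes_screen` callables are hypothetical placeholders: the first stands in for one non-deterministic generation call, the second for whatever synthesis rules apply. The GC threshold in the usage example is only an illustration, not a rule taken from IDT or Twist.

```python
from typing import Callable, Optional


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def generate_until_ok(
    generate: Callable[[], str],
    passes_screen: Callable[[str], bool],
    max_attempts: int = 20,
) -> Optional[str]:
    """Call `generate` until a candidate passes the screen or attempts run out.

    Returns the first passing sequence, or None if none passed.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if passes_screen(candidate):
            return candidate
    return None


# Usage with stand-in candidates and an illustrative GC <= 60% screen.
candidates = iter(["GGGGCCCC", "GGGCCCAT", "ATGCATGC"])
result = generate_until_ok(
    generate=lambda: next(candidates),
    passes_screen=lambda s: gc_fraction(s) <= 0.6,
)
```

Returning `None` after `max_attempts` lets the caller decide whether to raise the sampling temperature, relax the screen, or report the failure.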