FLASK (faster) RNA Secondary Structure App
Streamlit RNA Secondary Structure App
Interactive tool for the prediction and visualization of RNA secondary structures, using dynamic programming. It predicts base pairing and generates a visual representation of the structure based on the input RNA sequence. The algorithm I implemented in this code is related to the Nussinov’s algorithm.
Optimal secondary structure prediction of RNA sequence by maximizing the number of base pairs under specific constraints.
-
RNA Sequence:
- A string of nucleotides: adenine (A), uracil (U), guanine (G), and cytosine (C).
- Valid base-pair interactions:
- Canonical: A-U, U-A, G-C, C-G
- Wobble: G-U, U-G
-
Dot-Bracket Notation:
- The predicted structure is represented as a string of
(
,)
, and.
:.
: Unpaired base.(
,)
: Paired bases, indicating a bond.
- The predicted structure is represented as a string of
-
Dynamic Programming Approach:
- DP table
dp[i][j]
where:i
andj
represent indices of the RNA sequence.dp[i][j]
stores the maximum number of base pairs in the subsequence fromi
toj
.
- DP table
-
DP Table initialization:
- A 2D matrix
dp
of size (n \times n) (where (n) is the sequence length) is initialized to 0. - A separate
traceback
matrix is used to store the decisions for reconstructing the structure later.
- A 2D matrix
-
DP Table iterative filling:
- Start with short subsequences and expand to larger ones.
- For each pair of indices
(i, j)
(where (j > i)):- Case 1: Unpaired: The nucleotide at position
j
is left unpaired: [ dp[i][j] = dp[i][j-1] ] - Case 2: Paired: If
sequence[i]
can pair withsequence[j]
:- Add 1 (for the new pair) to the solution of the inner subsequence ((i+1, j-1)): [ dp[i][j] = \max(dp[i][j], 1 + dp[i+1][j-1]) ]
- Case 3: Bifurcation: Split the subsequence into two parts:
- Combine solutions from two non-overlapping subsequences ((i, k)) and ((k+1, j)): [ dp[i][j] = \max(dp[i][j], dp[i][k] + dp[k+1][j]) \quad \text{for } k \in (i + \text{MIN_LOOP}, j - \text{MIN_LOOP}) ]
- Case 1: Unpaired: The nucleotide at position
Constraints:
- Min-loop-size: Ensures there are at least
MIN_LOOP
unpaired bases between paired bases (i) and (j). - Valid base pairing: Ensures following of biological rules.
-
Traceback:
- For reconstructing the optimal solution.
- Starting from the full sequence
(0, n-1)
, decisions stored intraceback[i][j]
determine:- Whether
i
andj
are paired. - If the subsequence bifurcates into two smaller problems.
- Whether
-
Output:
- Dot-bracket notation for the secondary structure.
- A list of base pairs (i.e., indices of paired bases).
The algorithm can be described using the recurrence relation:
dp[i][j] = max {
- dp[i][j-1] (Unpaired)
- dp[i+1][j-1] + 1 (Paired, if can_pair(sequence[i], sequence[j]))
- maxk=i+MIN_LOOPj-MIN_LOOP(dp[i][k] + dp[k+1][j]) (Bifurcation) }
-
Table Filling:
- The outer loop iterates over subseq. lengths (L) from
MIN_LOOP + 2
to (n), and inner loops iterate over (i) and (j). - For each cell
(i, j)
, there is an additional loop over possible bifurcation points (k). - Complexity: (O(n^3)).
- The outer loop iterates over subseq. lengths (L) from
-
Traceback:
- Linear in the number of base pairs.
- Complexity: (O(n)).
Overall time complexity is (O(n^3)).
Simplified model for RNA secondary structure prediction:
- It maximizes base pairing without considering thermodynamic stability or pseudoknots.
- Extensions like Zuker’s algorithm can incorporate free-energy minimization for more accurate predictions.
For the sequence AUGCGUA
:
dp[i][j]
will compute the maximum number of pairs.- Traceback reconstructs the structure:
- Dot-bracket:
.(())..
- Base pairs: [(1, 6), (2, 5)]
- Dot-bracket: