A framework for identifying the structural motifs that are informative of whole-genome measurements across all the transcripts
Build Status |
---|
pyteiser identifies structural motifs that could explain genome-wide changes in gene expression, stability or other quantitative transcriptomic measures.
pyteiser encodes structural motifs as context-free grammars (CFG) that represents short stem-loop structures along with a known primary sequence.
First, pyteiser generates a comprehensive set of short seeds (of length 8-16) with a given information content: the seeds are generated in the way that they are not too specific but also not too general. An example seed would have a secondary structure of <<<<......>>>>
and a sequence of AAUNNGNGNUNAUU
.
Then, the given RNA sequences (for example, 3'UTRs) are scanned for the occurrence of all the seeds. At that stage, each seed is assigned a binary representation vector, showing which transcripts contain matches for this seed and which ones do not. For example, if a user provided expression values for N sequences, the representation vector will be of length N and each element will be set to "1" if the corresponding fragment has a match to the given seed or "0" if it doesn't.
Then, each seed's binary vector is tested to assess whether it is informative of the input whole-genome measurements. For example, if the expression change values for transcripts [A, B, C and D] are [High, High, Low, Low] and the binary representation vector for seed X is [1, 1, 0, 0] we make a conclusion that the seed X might be informative of the expression changes. To capture such dependency we use Mutual Information (MI) (see Cover and Thomas, 2006).
Then, pyteiser runs several non-parametric statistical tests to determine which seeds are significantly informative of expression changes and which ones are not. The seeds that did not pass the statistical tests get filtered out.
Then, the seeds that passed the tests are classified by "families" - groups of seeds with very similar representation vectors. The initial set of seeds pyteiser works with is redundant, so several seeds can be very similar to each other and match the same sequences. Such redundant groups of seeds get collapsed into groups or "families".
Then, a single representative seed is chosen for family. The representative seeds is then optimized by a greedy algorithm. The algorithm applies small consecutive changes to sequence and structure of the motif up until the changes don't make the seed more informative of genome-wide measurements anymore. For example, if changing the 1st nucleotide of the seed AAUNNGNGNUNAUU
to 'G' makes the seed's representation vector more informative of the observed expression values, the algorithm keeps such change. The final oprimized seed is being tested for robustness: the statistical tests for seed significance are being repeated with down-sampled expression data. If the seed is not robust to down-sampling of expression data, it gets filtered out.
Finally, the seeds that have passed all the tests are ranked by how informative are they in regard to the transcriptomic measures provided. Text and graphic reports are being printed, showing motif logos, patterns of their representation, their statistical significance etc.
pyteiser is a successor of TEISER (see link). pyteiser is built around the same concept. The overall pipeline is similar, however, several changes and improvements were made. pyteiser performs additional testing of seed matches using in silico RNA secondary structure prediction. The statistical tests and optimization algorithms were improved. pyteiser is also capable of handling SHAPE RNA secondary structure probing data; such data provide additional information about RNA secondary structures that do or do not exist in the cell in vivo.
pyteiser is intentionally designed in a way that allows storage of all the intermediate files in compressed binary format. The reason is that an academic lab is often focused on one model system; if using pyteiser for the same organism / the same transcriptome, there is no need to re-run time-consuming steps of seed occurence profile calculation over and over again; user can save computational resources by just storing intermediate files for a given transcriptome. We provide two different modes of compressing the seed occurence profiles: either store only indices of matching sequences (takes less space if running pyteiser on big number of short sequences); or store full profiles (takes less space if running pyteiser on small number of long sequences).
pyteiser is written in Python 3.7. The computationally heavy funcions are implemented efficiently through extensive usage on numpy arrays and numba Python compiler.
Install with pip:
First, install the requirements:
pip install numba==0.50.1 numpy>=1.19.1 pandas>=1.1.1
Then, install pyteiser:
pip install pyteiser
Small number of seeds, automatic pipeline that could be run on a PC:
We provide a single wrapper script that runs the whole pipeline (pyteiser/pyteiser/wrappers/pyteiser_pipeline.py
) starting with a set of sequences, corresponding measurements (expression or other ones) and a set of seeds, and reports a list of top candidate seeds and their matches in the input sequences. See the specifications of input files and description of parameters below.
Large number of seeds, a set of scripts that must be run on HPC:
Depending on the size of sequence set and on desired number of seeds to be analyzed, the pipeline might require a lot of computing resources; if searching among a large number of seeds (milliones), it might not be possible to run the pipeline on a PC, and a high-performance computing (HPC) machine might be required. Depending on the institution / company you are at the HPC you are using might have different job submission requirements. In particular, the keywords for memory or time requests might differ among individual HPC systems. Also, depending on cluster resources availability, it might be hard to reserve cores on HPC for long enough to be able to run the whole pipeline in a single run. In that case, we recommend running each step of the pipeline individually. We provide frameworks for either running the scripts on your own machine or submitting it to SGE-based HPCs through qsub
command. For each computationally heavy step of the pipeline, we provide a script that runs on its own and also a script that is adjusted for submission through qsub. Most scripts can be submitted to HPC with a universal script for qsub submission (named qsub_universal_submission.py
). It lets you (i) specify the keywords the HPC machine you're using requires, (ii) request how much time, memory and cores do you want to use and (iii) submit any of the computationally heavy scripts from the pipeline. A set of wrapper scripts that implement individual steps of the pipeline is provided in pyteiser/pyteiser/wrappers
.
Input files:
- Necessary:
rna_fastafile
: a fasta file with RNA sequences of interestexp_values_file
: expression values in a csv format. It should have 2 or more columns, one column with names of of sequences, another column with values. The names of these two columns have to be specified with the--anno_name_column
and--measur_column
arguments. All the names of the sequences listed in this file must also have a corresponding sequence record in the fasta file provided withrna_fastafile
. The measurement values might be integer or float numbers.seeds_file
: a binary file containing the seeds to search through. Such file can be created withseed_generator.py
script
- Optional:
- user can include a file with RNA structure probing data (SHAPE or DMS-seq) to guide the possible match selection. There is no commonly used standard format for SHAPE RNA reactivity data; therefore, we are using the two-column SHAPE file format used by RNAstructure package (link). SHAPE file provided by user should contain SHAPE profiles for multiple sequences, separated with
>
, like in fasta file. SHAPE file can be provided to thefilter_profiles_by_folding.py
script with the--shape_profile
argument
- user can include a file with RNA structure probing data (SHAPE or DMS-seq) to guide the possible match selection. There is no commonly used standard format for SHAPE RNA reactivity data; therefore, we are using the two-column SHAPE file format used by RNAstructure package (link). SHAPE file provided by user should contain SHAPE profiles for multiple sequences, separated with
Output files: The pipeline generates two files:
pyteiser_info.bin
: a table, each row corresponds to one motif that has been identified as significant. The columns include: the motif identifier, sequence and structure of the motif, mutual information of the seed occurence profile and the sequences expression data, pvalue for a permutation test, zscore, the result of robustness test and the number of sequences that contain this seedpyteiser_matches.bin
: a table, each row corresponds to one sequence from the user-provided fasta file. The first two columns contain the sequence and its name; every one of the following columns corresponds to one of the motifs that have been identified as significant; in these columns, cells contain sub-sequences matching the given motif
First, download the three example input files: sequences, measurements and seeds
Then, create folders for temporary files and for output files (you can do it with mkdir temp
and mkdir out
)
Finally, launch the pipeline with the command:
pyteiser_pipeline --rna_fastafile <path to test_seqs.fa> --exp_values_file <path to test_expression_values.csv> --seeds_file <path to test_seeds_20.bin> --temp_folder <path to the folder for temporary files> --out <path to the output folder>
Arguments for the automatic pipeline (parameters for all the individual steps included):
- input / output files:
rna_fastafile
: fasta file with RNA sequences, see aboveexp_values_file
: expression values in a csv format, see aboveanno_name_column
: column name in exp_values file that contains annotations, see abovemeasur_column
: column name in exp_values file that contains expression measurements, see aboveseeds_file
: file with seeds in binary format, see abovetemp_folder
: folder to write temporary files toout
: output folder
- optional arguments:
nbins
: number of bins for discretization of expression profilemin_occurences
: minimal number of seed occurence in the transcriptome for a seed to be considered at allmaxfreq
: maximal seed frequency in the sequences analyzed for a seed to be consideredn_permutations
: number of permutations for the rank test for a seedmax_pvalue
: p-value threshold for filtering seedsmin_zscore
: z-score threshold for filtering seedsstep_1_jump
: step size at the 1st round of empirical threshold searchstep_2_min_interval
: resolution of empirical threshold searchstep_1_min_fraction
: minimal fraction of passing seeds for the 1st round of empirical threshold searchstep_2_min_fraction
: minimal fraction of passing seeds for the 2nd round of empirical threshold searchstep_3_min_fraction
: minimal fraction of passing seeds for the 3rd round of empirical threshold searchmin_ratio
: threshold on ratio of CMI/MI for the conditional information test for seed noveltyare_input_seeds_degenerate
: do input seeds contain degenerate nucleotides (can be ignored in most cases)indices_mode
: choice of compression mode for storing seed occurence profiles: if True, store only indices of matching sequences (takes less space if running pyteiser on big number of short sequences); if False, store full profiles (takes less space if running pyteiser on small number of long sequences)index_bit_width
: number of bits per one index when compressing seed occurence profiles. Depends on the number of sequences provided. Default value 24, it's enough if number of sequences provided is smaller than 16.777.216random_noseed
: when choosing the order of positions to optimize, do not set the random number generator to a specific seedjackknife_n_permutations
: number of permutations for pvalue calculation in jackknife testjackknife_max_pvalue
: maximal pvalue for jackknife testjackknife_n_samples
: how many permutations to do in jackknife testjackknife_fraction_retain
: what fraction of the sample to retain for each testjackknife_min_fraction_passed
: what fraction of all iterations should
- arguments for
seed_generator.py
scriptoutfolder
: output folderprefix
: prefix for naming the seed filenum_motifs_per_file
: maximal number of seeds to write into a single filemin_stem_length
: minimal stem length to considermax_stem_length
: maximal stem length to considermin_loop_length
: minimal loop length to considermax_loop_length
: maximal loop length to considerprint_sequences
: print the sequences of generated seedsprint_structures
: print the structures of generated seedprint_first_motif
: print the first seed after each increase in stem or loop lengthmin_inf_bases
: minimal number of informative (non-N) basesmax_inf_bases
: maximal number of informative (non-N) basesminI
: minimal information contentmaxI
: maximal information content
You will have to specify the input and output folders you want to use. All the numeric parameters have preset default values; changing them is not recommended unless you have a very specific reason to do so.
Below, the steps of the pipeline are listed along with the name of the corresponding script.
Use pyteiser/seeds_generator.py
Use pyteiser/wrappers/binarize_sequences.py
Use pyteiser/wrappers/calculate_seed_profiles.py - run on HPC!
Use either pyteiser/wrappers/preprocess_expression_profile_ensembl.py or pyteiser/wrappers/preprocess_custom_expression_profile.py
Use pyteiser/wrappers/calculate_MI_profiles.py - run on HPC!
Use pyteiser/wrappers/filter_profiles_by_folding.py
Use pyteiser/wrappers/choose_significant_seeds_v3.py - run on HPC!
Use pyteiser/wrappers/combine_passed_seeds.py
Use pyteiser/wrappers/filter_passed_seeds_with_CMI.py
Use pyteiser/wrappers/optimize_seeds_single_chunk.py - run on HPC! You can submit it with pyteiser/wrappers/qsub_optimize_seeds.py
Use pyteiser/wrappers/combine_optimized_seeds.py
MIT license
Fish, Khoroshkin, Navickas, 2021
pyteiser has been developed in Goodarzi lab at UCSF by Matvei Khoroshkin and Hani Goodarzi