MiRACLe: an Individual-Specific Approach to Improve MicroRNA-Target Prediction Based on a Random Contact Model


Table of contents

  1. Introduction
  2. Executing miRACLe
    2.1 Files required
    2.2 Script Execution
    2.3 R package
  3. Benchmarking evaluations
  4. References

1. Introduction

The strength of miRNA-mRNA interactions (MMIs) in a biological system depends on both the sequence characteristics and the expression patterns of the RNAs. Integrating these two features into a random contact model, we propose miRACLe (miRNA Analysis by a Contact modeL) to achieve miRNA target prediction at both the individual and population levels. Evaluation by a variety of measures shows that fitting a sequence-based algorithm into the miRACLe framework improves its predictive power by a significant margin, and that the combination of miRACLe with the cumulative weighted context++ scores from TargetScan consistently outperforms state-of-the-art methods in prediction accuracy, regulatory potential and biological relevance. Empirical tests suggest that on a laptop with a 2.30 GHz Intel Core i7-4712HQ CPU and 16 GB of RAM, our source code implementation requires less than 10 seconds of CPU time to complete the prediction for an individual sample.


2. Executing miRACLe

2.1 Files required

To run the current version of miRACLe, users should provide two data files that describe the expression levels of each miRNA and mRNA for the same samples, plus one additional file that defines the correspondence of samples between the miRNA and mRNA data files. All files are tab-delimited ASCII text files and must comply with the following specifications:

  1. Input miRNA expression file is organized as follows:

    miRNA TCGA-05-4384-01A-01T-1754-13 TCGA-05-4390-01A-02T-1754-13 TCGA-05-4396-01A-21H-1857-13 TCGA-50-5066-01A-01T-1627-13
    hsa-let-7a-5p 19.0144 16.2421 19.2817 18.0721
    hsa-let-7a-3p 7.31298 6.2094 7.8392 6.2667
    hsa-let-7a-2-3p 6.5235 5.4594 3.7004 7.3837
    hsa-let-7b-5p 16.9613 15.5496 17.8444 16.9950
    hsa-let-7b-3p 7.9248 5.2094 7.6653 6.8201

    The first line contains the label Name followed by the identifiers for each sample in the dataset.

    Line format: Name(tab)(sample 1 name)(tab)(sample 2 name)(tab) ... (sample N name)
    Example: miRNAName sample_1 sample_2 ... sample_n

    The remainder of the file contains data for each of the miRNAs. There is one line for each miRNA. Each line contains the miRNA name and a value for each sample in the dataset.

  2. Input mRNA expression file is organized as follows:

    Gene TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4396-01 TCGA-50-5066-01
    AARS 10.7094 11.6932 12.4282 11.0464
    AASDHPPT 9.9081 9.6716 10.1113 9.98328
    AASDH 7.9471 7.2897 8.3216 7.6274
    AASS 9.9649 7.7752 9.1723 5.9506
    AATF 9.9525 9.5380 9.3670 8.4375

    The first line contains the label Name followed by the identifiers for each sample in the dataset.

    Line format: Name(tab)(sample 1 name)(tab)(sample 2 name)(tab) ... (sample N name)
    Example: GeneName sample_a sample_b ... sample_m

    The remainder of the file contains data for each of the mRNAs. There is one line for each mRNA. Each line contains the mRNA name and a value for each sample in the dataset.

    Note that each input expression file (miRNA and mRNA) must contain only non-negative values for the main program to execute correctly. Both microarray profiling and RNA sequencing data are accepted as input. For optimal prediction on sequencing data, we strongly recommend that users provide log2-transformed normalized counts (e.g. RSEM or RPM) as the input for our program.

  3. The sample matching file generally contains two columns that map the sample identifiers in the miRNA expression file to those in the mRNA expression file (miRNA identifiers must be in the first column and mRNA identifiers in the second). It also serves as an index denoting which samples to analyze. It is organized as follows:

    miRNA Gene
    TCGA-50-5066-01A-01T-1627-13 TCGA-50-5066-01
    TCGA-05-4384-01A-01T-1754-13 TCGA-05-4384-01
    TCGA-05-4390-01A-02T-1754-13 TCGA-05-4390-01
    TCGA-05-4396-01A-21H-1857-13 TCGA-05-4396-01

    The first line must contain the column labels for the samples in each expression dataset, with the first column for miRNA and the second column for mRNA.

    Line format: (sample name in miRNA file)(tab)(sample name in mRNA file)
    Example: sample_1 sample_a

    The remainder of the file contains sample identifiers used in the miRNA and mRNA expression files. There is one line for each sample. Each line contains the identifiers for that sample.
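The file layouts described above could be prepared and sanity-checked roughly as follows. This is a minimal sketch and not part of miRACLe: the toy RPM values, the log2(x + 1) pseudocount, the temporary file names, and the rule of truncating a TCGA barcode to its first 15 characters to obtain the mRNA sample identifier are all assumptions made for illustration. The mRNA expression file would follow the same pattern as the miRNA file.

```r
# Toy normalized counts (hypothetical values) for two miRNAs in two samples.
mir_samples <- c("TCGA-05-4384-01A-01T-1754-13", "TCGA-05-4390-01A-02T-1754-13")
mir_rpm <- matrix(c(528000, 77300, 158, 73), nrow = 2, byrow = TRUE,
                  dimnames = list(c("hsa-let-7a-5p", "hsa-let-7a-3p"), mir_samples))

# log2 transform with a pseudocount of 1 (an assumption); the result is non-negative.
mir_log2 <- log2(mir_rpm + 1)
stopifnot(all(mir_log2 >= 0))

# Assumed mapping: mRNA sample identifier = first 15 characters of the miRNA barcode.
sample_match <- data.frame(miRNA = mir_samples,
                           Gene = substr(mir_samples, 1, 15),
                           stringsAsFactors = FALSE)

# Write tab-delimited files; check.names = FALSE keeps the hyphens in sample names.
mir_file <- file.path(tempdir(), "toy_miRNA.txt")
match_file <- file.path(tempdir(), "toy_sampleMatch.txt")
write.table(data.frame(miRNA = rownames(mir_log2), mir_log2, check.names = FALSE),
            mir_file, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(sample_match, match_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Check that every miRNA identifier in the matching file appears in the header line.
header_ids <- strsplit(readLines(mir_file, n = 1), "\t")[[1]]
stopifnot(all(sample_match$miRNA %in% header_ids[-1]))
```

Running a check like this before calling the main program catches mismatched identifiers early, which is the most likely failure when the two expression files come from different pipelines.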

2.2 Script Execution

miRACLe is written in R and can be downloaded here along with test datasets. The source code of miRACLe consists of three parts, namely 'FUNCTIONS', 'DATA INPUT' and 'MAIN CODE'. The main function "miracle" in "MAIN PROGRAM" calculates the miRACLe score for each miRNA-mRNA pair at the individual and population levels, based on which all putative MMIs are ranked. The essential inputs that the miRACLe algorithm requires consist of two parts:

The first part contains the sequence-based interaction scores (seqScore) for putative miRNA-mRNA pairs. These scores were originally obtained from TargetScan v7.2 (TargetScan7_CWCS_cons and TargetScan7_CWCS), DIANA-microT-CDS (DIANA_microT_CDS), MirTarget v4 (MirTarget4) and miRanda-mirSVR (miRanda_mirSVR), and were compiled by the developers to fit the model. The default is TargetScan7_CWCS_cons. The other scores can be downloaded here.

seqScore = as.matrix(read.table("TargetScan7_CWCS_cons.txt", header = TRUE, sep = "\t"))

Users can also provide their own sequence matching scores, as long as the input file meets the format requirements. Specifically, the first line must contain the column labels for mRNAs, miRNAs and their associated interaction scores. The remainder of the file contains RNA identifiers corresponding to those used in the expression files and the score for each miRNA-mRNA pair. Note that the first column must contain the mRNA identifiers, the second column the miRNA identifiers, and the third column the associated scores.
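A custom score file in this three-column layout could be written and loaded along the following lines. This is only a sketch: the column labels, the file name and the score values are hypothetical placeholders, and the scale and sign of scores that miRACLe expects are not specified here.

```r
# Hypothetical custom scores; labels and values are illustrative only.
custom_scores <- data.frame(
  Gene  = c("AARS", "AASDH", "AATF"),                              # column 1: mRNA identifiers
  miRNA = c("hsa-let-7a-5p", "hsa-let-7b-5p", "hsa-let-7a-5p"),  # column 2: miRNA identifiers
  score = c(0.42, 0.17, 0.05),                                     # column 3: interaction scores
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header line.
score_file <- file.path(tempdir(), "custom_seqScore.txt")
write.table(custom_scores, score_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Load it in the same way as the bundled score files.
seqScore <- as.matrix(read.table(score_file, header = TRUE, sep = "\t"))

# Basic layout check: exactly three columns, one row per miRNA-mRNA pair.
stopifnot(ncol(seqScore) == 3, nrow(seqScore) == nrow(custom_scores))
```

The identifiers in the first two columns should match those used in the expression files, otherwise the corresponding pairs cannot be linked to expression values.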

The second part contains paired miRNA-mRNA expression profiles and should be provided by the users.

sampleMatch = as.matrix(read.table("Test_DLBC_sampleMatch.txt", header = TRUE, sep = "\t"))
mirExpr = as.matrix(read.table("Test_DLBC_miRNA.txt", header = FALSE, sep = "\t"))
tarExpr = as.matrix(read.table("Test_DLBC_mRNA.txt", header = FALSE, sep = "\t"))
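Note that the two expression files are read without a header line, so the sample identifiers end up in the first row of the matrix. One practical consequence, shown in the sketch below, is that hyphenated identifiers such as TCGA barcodes are kept verbatim, whereas `read.table` with a header and its default `check.names = TRUE` rewrites hyphens as dots. Whether this is the authors' reason for the choice is an assumption; the inline toy data is illustrative.

```r
# One-sample toy file in the documented miRNA layout (illustrative values).
txt <- "miRNA\tTCGA-05-4384-01A-01T-1754-13\nhsa-let-7a-5p\t19.0144\n"

# With a header, default check.names = TRUE converts the barcode to a syntactic name.
with_header <- read.table(text = txt, header = TRUE, sep = "\t")

# Without a header, the identifier stays in the first row as plain text.
no_header <- as.matrix(read.table(text = txt, header = FALSE, sep = "\t"))

mangled  <- colnames(with_header)[2]   # hyphens replaced by dots
verbatim <- unname(no_header[1, 2])    # original barcode preserved

# Identifiers kept this way can be compared directly with the sample matching file.
stopifnot(verbatim == "TCGA-05-4384-01A-01T-1754-13")
```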

The 'miracle' function also provides three optional parameters: samSelect (sample selection; users can select a subset of all samples to analyze; default is NULL, which means no selection is applied), exprFilter (expression profile filter; miRNAs/mRNAs that are not expressed in more than a given percentage of samples are removed; default is 1), and OutputSelect (logical; TRUE returns the top 10 percent of predictions ranked by score, and FALSE returns the whole prediction result; default is TRUE).

miracle(seqScore, sampleMatch, mirExpr, tarExpr, samSelect = NULL, exprFilter = 1, OutputSelect = TRUE)
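To make the effect of OutputSelect = TRUE concrete, the following sketch ranks a set of scored pairs and keeps the top 10 percent. It is an illustration of the idea only, not the package's implementation: the data frame, its column names, the random scores and the rounding rule are all hypothetical.

```r
# Hypothetical scored miRNA-mRNA pairs; names and scores are made up.
set.seed(1)
pred <- data.frame(
  miRNA = rep(c("hsa-let-7a-5p", "hsa-let-7b-5p"), each = 10),
  Gene  = rep(paste0("GENE", 1:10), times = 2),
  score = runif(20),
  stringsAsFactors = FALSE
)

# Rank all pairs by decreasing score.
ranked <- pred[order(pred$score, decreasing = TRUE), ]

# Keep the top 10 percent of pairs, rounded up so that at least one pair remains.
n_top <- ceiling(nrow(ranked) / 10)
top_pred <- ranked[seq_len(n_top), ]
```

With 20 candidate pairs this keeps the 2 highest-scoring ones; the analogue of OutputSelect = FALSE would simply be the full `ranked` table.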

2.3 R package

The R package of the miRACLe algorithm is provided here.


3. Benchmarking evaluations

  1. The code to reproduce the benchmarking evaluations is written in R.
  2. All of this code is arranged into three parts: 'FUNCTIONS', 'INPUT DATA' and 'MAIN CODE'. Users need to download and fill in the relevant input files before running the corresponding analyses.
  3. Files required for the reproduction of the evaluations can be broadly classified into three categories:
  • Sequence-based predictions (including the seqScore for integrative methods)

    | Data file | Description |
    | --- | --- |
    | TargetScan7_CWCS_cons.txt | cumulative weighted context++ scores for conserved target sites of conserved miRNA families obtained from TargetScan v7.2 |
    | TargetScan7_CWCS.txt | cumulative weighted context++ scores for all miRNA-mRNA pairs obtained from TargetScan v7.2 |
    | TargetScan7_qMRE_cons.txt | number of conserved target sites of conserved miRNA families obtained from TargetScan v7.2 |
    | TargetScan7_qMRE.txt | number of target sites for all miRNA-mRNA pairs obtained from TargetScan v7.2 |
    | DIANA_microT_CDS.txt | human interactions with miTG scores greater than 0.7 obtained from DIANA-microT-CDS |
    | miRanda_mirSVR.txt | human conserved miRNA predictions with good mirSVR score obtained from miRanda-mirSVR |
    | miRmap.txt | predictions from miRmap |
    | miRTar2GO.txt | predictions from the "Highly sensitive" prediction set of miRTar2GO |
    | miRTar2GO_HeLa.txt | predictions in HeLa cells from the "Highly sensitive" prediction set of miRTar2GO |
    | MirTarget4.txt | human predictions obtained from miRDB v6.0 |
    | miRWalk3.txt | human predictions restricted to the 3' UTR obtained from miRWalk v3.0 |
    | PITA.txt | the top human predictions with 3/15 flank obtained from PITA |
    | Combine_MMIs.txt | combined predictions from DIANA-microT-CDS, miRanda-mirSVR, MirTarget4, PITA and TargetScan7.CWCS |
    | Symbol_to_ID.txt | paired gene symbols and gene Entrez IDs downloaded from HGNC |

    These predictions are provided in a compressed file Sequence_based_predictions.7z.

  • Input expression data files

    | Data file | Description |
    | --- | --- |
    | HeLa expression data | normalized microarray/RNA-Seq expression data for HeLa cell line |
    | NCI60 data | normalized microarray data for 59 NCI-60 cancer cell lines |
    | TCGA data | log2-transformed RPM/RSEM data for 7991 cancer patients from 32 TCGA cancer types |
    | MCC data | normalized microarray data for 68 tumor tissues and 21 normal tissues |

    These expression data files are provided along with the relevant source code, except for the TCGA expression data files, which are provided in a compressed file TCGA_data.7z.

  • Validation data (Reference data)

    • Experimentally validated MMIs

    | Data file | Description | Validated MMI counts |
    | --- | --- | --- |
    | Vset_HeLa.txt | MMIs that are validated in HeLa cells from TarBase v8.0 | 34,263 |
    | Vset_celllines.txt | MMIs that are validated in cell lines from TarBase v8.0 | 349,726 |
    | Vset_all.txt | validated MMIs obtained from TarBase v8.0 | 376,205 |
    | Vset_hc.txt | high-confidence set compiled from TarBase v8.0, miRTarBase v7.0, miRecords and oncomirDB | 10,575 |

    • Curated miRNA transfection experiments

    | Data file | Description |
    | --- | --- |
    | Transet_HeLa_Array.txt | Unified dataset of 5 miRNA transfections in HeLa cell line in which gene expression changes are measured by microarray |
    | Transet_HeLa_Seq.txt | Unified dataset of 25 miRNA transfections in HeLa cell line in which gene expression changes are measured by RNA-Seq |
    | Transet_multi.txt | Unified dataset of 105 non-redundant miRNA transfections that are originally collected from 77 human cell lines or tissues |

    • Known cancer genes

    | Data file | Description | Molecule counts |
    | --- | --- | --- |
    | Cancer_gene_set | cancer genes obtained from the Cancer Gene Census | 723 |

    These reference data files are provided along with the relevant source code.


4. References

MiRACLe: an Individual-Specific Approach to Improve MicroRNA-Target Prediction Based on a Random Contact Model (in preparation)