forked from bcgsc/NanoSim
NanoSim 1.0.0
-------------------------------------------------------------------------------
NanoSim is a fast and scalable read simulator that captures the technology-
specific features of ONT data, and allows for adjustments upon improvement of
nanopore sequencing technology.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Dependencies:
LAST (Tested with version 581)
R (Tested with version 3.2.3)
Python (2.6 or above)
Numpy (Tested with version 1.10.1 or above)
-------------------------------------------------------------------------------
Usage
NanoSim is implemented using R for error model fitting and Python for read
length analysis and simulation. The first step of NanoSim is read
characterization, which provides a comprehensive alignment-based analysis, and
generates a set of read profiles serving as the input to the next step, the
simulation stage. The simulation tool uses the model built in the previous step
to produce in silico reads for a given reference genome. It also outputs a list
of introduced errors, consisting of the position on each read, error type and
reference bases.
1. Characterization stage
The characterization stage takes a reference and a training read set in FASTA
format as input. Users can also provide their own alignment file in MAF format.
Usage:
./read_analysis.py <options>
[options]:
-h : print usage message
-i : training ONT real reads, must be fasta files
-r : reference genome of the training reads
-m : user-provided alignment file in MAF format (.maf extension). Optional
-o : The prefix of output file, default = 'training'
* NOTICE: The -m option allows users to provide their own alignment file. Make
sure that the names of the query sequences are the same as they appear in the
fasta files. Some fasta headers contain spaces, and most aligners take only the
part of the header before the first whitespace or tab as the query name.
However, the truncated headers may not be unique when using the output of
poretools. We suggest that users pre-process the fasta files by joining all
elements of the header with '_' before alignment, and then feed the processed
fasta file to NanoSim as input.
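The header pre-processing suggested above can be sketched in a few lines of
Python. This is a minimal illustration, not part of NanoSim: the function names
and the example header are hypothetical, and it assumes header fields are
separated by spaces or tabs.

```python
def normalize_header(line):
    """Join all whitespace-separated fields of a FASTA header with '_'."""
    # 'line' is expected to start with '>'; the marker is kept as-is.
    fields = line[1:].split()
    return ">" + "_".join(fields)


def normalize_fasta(lines):
    """Yield FASTA lines with rewritten headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield normalize_header(line)
        else:
            yield line


# Hypothetical header, for illustration only.
example = [">channel_12 read_3 twodirections\n", "ACGTACGT\n"]
cleaned = list(normalize_fasta(example))
```

To process a real file, pass an open file object to normalize_fasta and write
each yielded line to a new fasta file, then use that file for both alignment
and the -i option.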
2. Simulation stage
The simulation stage takes a reference genome and read profiles as input and
outputs simulated reads in FASTA format.
Usage:
./simulator.py [command] <options>
[command]:
circular | linear
# Do not choose 'circular' when there is more than one sequence in the reference
<options>:
-h : print usage message
-r : reference genome in fasta file, specify path and file name. Required
-c : the prefix of training set profiles, same as the output prefix in
read_analysis.py, default = training
-o : The prefix of output file, default = 'simulated'
-n : Number of generated reads, default = 20,000 reads
--perfect: Output perfect reads, no mutations. Optional
--KmerBias: prohibits homopolymers with length >= 6 bases in output reads. Optional
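To check whether a read respects the --KmerBias rule described above (no
homopolymer of 6 or more bases), a small helper such as the following could be
used. It is only a sketch for inspecting output reads; the function names are
hypothetical and it does not reproduce how simulator.py enforces the rule.

```python
def longest_homopolymer(seq):
    """Return the length of the longest run of a single repeated base."""
    longest = 0
    run = 0
    prev = None
    for base in seq.upper():
        if base == prev:
            run += 1
        else:
            run = 1
            prev = base
        if run > longest:
            longest = run
    return longest


def violates_kmer_bias(seq, limit=6):
    """True if seq contains a homopolymer of at least 'limit' bases."""
    return longest_homopolymer(seq) >= limit
```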
For example:
1 To simulate the E. coli genome, choose the circular command, because the
genome is circular:
./simulator.py circular -r Ecoli_ref.fasta -c ecoli
2 To simulate only perfect reads, i.e. reads with no SNPs or indels that follow
only the read length distribution, add --perfect:
./simulator.py circular -r Ecoli_ref.fasta -c ecoli --perfect
3 To simulate the S. cerevisiae genome with k-mer bias, choose the linear
command, because its chromosomes are linear:
./simulator.py linear -r yeast_ref.fasta -c yeast --KmerBias
See example.sh for a more detailed example.
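Because 'circular' must not be used when the reference contains more than one
sequence, it can help to count the records in the reference before choosing a
command. The sketch below is an illustration only; the function names are
hypothetical, and a single-record reference is merely eligible for 'circular',
since whether the genome really is circular is a biological question.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def circular_allowed(n_records):
    """'circular' is only an option when the reference has exactly one sequence."""
    return n_records == 1


# In-memory stand-in for a reference with two sequences.
reference = [">chrI", "ACGT", ">chrII", "GGCC"]
n = count_fasta_records(reference)
command_hint = "circular or linear" if circular_allowed(n) else "linear"
```

For a real reference, pass an open file object to count_fasta_records instead
of the in-memory list.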
------------------------------------------------------------------------------------
Explanation of output files
1. Characterization stage
training_aligned_length_ecdf: Length distribution of aligned regions on aligned
reads
training_aligned_reads_ecdf: Length distribution of aligned reads
training_align_ratio: Empirical distribution of align ratio of each read
training_besthit.maf: The best alignment of each read based on length
training_match.hist/training_mis.hist/training_del.hist/training_ins.hist:
Histogram of match, mismatch, and indels
training_first_match.hist: Histogram of the first match length of each alignment
training_error_markov_model: Markov model of error types
training_ht_ratio: Empirical distribution of the head region vs total unaligned
region
training.maf: The output of LAST, alignment file in MAF format
training_match_markov_model: Markov model of the length of matches (stretches
of correct base calls)
training_model_profile: Fitted model for errors
training_processed.maf: A re-formatted MAF file for user-provided alignment file
training_unaligned_length_ecdf: Length distribution of unaligned reads
2. Simulation stage
simulated.log: Log file for simulation process
simulated_reads.fasta: FASTA file of simulated reads. Each read has "unaligned",
"aligned", or "perfect" in its header, which indicates its error rate.
"unaligned" means that the read has an error rate over 90% and cannot be
aligned. "aligned" reads have the same error rate as the training reads.
"perfect" reads have no errors.
To explain the information in the header, we have two examples:
a. >ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
Everything before the first _ is chromosome information. 468529 is the start
position, and unaligned indicates that the read is not expected to align to the
reference. The first 0 is the sequence index. F represents the forward strand.
0_3236_0 means that the sequence extracted from the reference is 3236 bases
long.
b. >ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
This is an aligned read from chromosome XI, starting at position 115406. 16565
is the sequence index. R represents the reverse complement strand. 92_12710_2
means that this read has a 92-base head region (which cannot be aligned),
followed by a 12710-base middle region, and then a 2-base tail region.
The information in the header can help users to locate the read easily.
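The header layout described in the two examples can be split into fields with
a short Python function. This is a sketch under assumptions: the names
ReadHeader and parse_read_header are hypothetical, and the code assumes every
header ends with the seven '_'-separated fields shown above. It splits from the
right so that a reference name containing '_' would not shift the fields.

```python
from collections import namedtuple

ReadHeader = namedtuple(
    "ReadHeader",
    "reference start kind read_index strand head middle tail",
)


def parse_read_header(header):
    """Split a simulated read header into its fields."""
    name = header.strip().lstrip(">")
    # The last seven fields are fixed; everything before them is the reference.
    parts = name.rsplit("_", 7)
    if len(parts) != 8:
        raise ValueError("unexpected header layout: %r" % header)
    reference, start, kind, read_index, strand, head, middle, tail = parts
    return ReadHeader(reference, int(start), kind, int(read_index), strand,
                      int(head), int(middle), int(tail))


h = parse_read_header(
    ">ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2")
read_length = h.head + h.middle + h.tail
```

Here h.kind is "aligned", h.strand is "R", and read_length is the sum of the
head, middle and tail regions.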
simulated_error_profile: Contains information on all errors introduced into
each read, including error type, position, original bases and current bases.