MASED - Methylation Analysis of SEgmental Duplications.
Analyse the history and methylation of segmental duplications of Arabidopsis thaliana.
- Which segments are duplicated from others?
- When did their duplication happen?
- How does the methylation of these duplications change during evolution and function acquisition?
- `mcscanx`: Scan multiple genomes or subgenomes to identify putative homologous chromosomal regions, then align these regions using genes as anchors.
- `colline2yml`: Write the collinear regions of the genomes to YAML files.
- `sort_sd`: Sort the segmental duplications by time of occurrence.
- `find_promoter`: Determine the promoter regions of the genomes.
- `analy_meth`: Analyse the methylation of segments.
First, install SRA Toolkit, MCScanX and Bismark.
cd MASED
unzip resource/MCScanX.zip
cd MCScanX
make
cd ~
cd MASED
unzip resource/bismark.zip
cd ~
echo '# Bismark' >> .bashrc
echo 'export PATH="$HOME/MASED/bismark:$PATH"' >> .bashrc
echo >> .bashrc
echo '# MCScanX' >> .bashrc
echo 'export PATH="$HOME/MASED/MCScanX:$PATH"' >> .bashrc
echo >> .bashrc
source .bashrc
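To confirm that the `PATH` changes took effect, a small check like the one below may help. It is a minimal sketch: `missing_tools` is a hypothetical helper, not part of MASED, and the executable names are assumptions about what each package installs.

```shell
#!/bin/sh
# Print the name of every tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of MASED.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing_tools prefetch fastq-dump bismark MCScanX
```

No output means every listed tool was found.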
We chose five species to build a timeline.
To download them, log in to the JGI website. We also provide some data in the `data/` folder. Others, such as Theobroma cacao and Brassica rapa, can be downloaded from other databases.
Species | Database | Version |
---|---|---|
Arabidopsis thaliana | Araport | V11 |
Arabidopsis lyrata | JGI | V2.1 |
Capsella rubella | JGI | V1.1 |
Brassica rapa | BRAD | V3.0 |
Theobroma cacao | Cocoa Genome Hub | V2 |
Take Theobroma cacao as an example:
(ATTENTION: If you want to try this for yourself, make sure you have removed the corresponding files in the folder!)
wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
mv Theobroma_cacaoV2_annot_annoted_clean.gff3 ~/MASED/data/Tcac/Tcac.gff3
rm Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_protein.faa.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_protein.faa.tar.gz
mv Theobroma_cacaoV2_annot_protein.faa ~/MASED/data/Tcac/Tcac.pep
rm Theobroma_cacaoV2_annot_protein.faa.tar.gz
We have a reference genome FASTA of Arabidopsis thaliana in the `data/` folder.
We then prepare the genome with App::Egaz.
We use SRX2871291 and SRX2871292 as the BS-Seq data for wild-type Arabidopsis thaliana (Columbia).
prefetch SRR5631389
prefetch SRR5631390
prefetch SRR5631391
prefetch SRR5631392
The data is now in ~/ncbi/public/sra/
mv ~/ncbi/public/sra/SRR56313* ~/MASED/data/.
cd ~/MASED/data/
fastq-dump --split-3 SRR56313*
rm SRR56313*.sra
cd ~/MASED/data
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Atha/Atha.gff3 > Atha.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Alyr/Alyr.gff3 > Alyr.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Crub/Crub.gff3 > Crub.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Brap/Brap.gff3 > Brap.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Tcac/Tcac.gff3 > Tcac.gene.gff
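To see what the `awk` filter above keeps, here is a minimal sketch run on two made-up GFF3 records (the IDs and coordinates are illustrative only, and `extract_genes` is a hypothetical wrapper name). Only the `gene` record survives, reduced to chromosome, start, end, attributes and strand.

```shell
#!/bin/sh
# Same filter as above, wrapped in a function that reads GFF3 from stdin.
# extract_genes is a hypothetical name used only for this illustration.
extract_genes() {
    awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}'
}

# Two made-up records: one gene and one mRNA.
printf 'Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\nChr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1.1\n' |
    extract_genes
```

Because `awk` splits on any whitespace by default, an attribute column that contains spaces would be cut short; passing `-F '\t'` keeps column 9 intact in that case.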
perl ../gff_pep_Atha.pl Atha.gene.gff AT.gff Atha/Atha.pep > AT.pep
perl ../gff_pep_Alyr.pl Alyr.gene.gff AL.gff Alyr/Alyr.pep > AL.pep
perl ../gff_pep_Crub.pl Crub.gene.gff CR.gff Crub/Crub.pep > CR.pep
perl ../gff_pep_Brap.pl Brap.gene.gff BR.gff Brap/Brap.pep > BR.pep
perl ../gff_pep_Tcac.pl Tcac.gene.gff TC.gff Tcac/Tcac.pep > TC.pep
makeblastdb -in AT.pep -dbtype prot -parse_seqids -out ATdb
nohup blastp -query AL.pep -db ATdb -out AT_AL.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query CR.pep -db ATdb -out AT_CR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query BR.pep -db ATdb -out AT_BR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query TC.pep -db ATdb -out AT_TC.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
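The `blastp` commands above end with `&`, so they run in the background and the `.blast` files are incomplete until each job exits. The pattern below is a minimal sketch of starting several jobs and blocking until all of them finish; the job body is a stand-in (`sleep` plus `echo`), and `run_and_wait` is a hypothetical helper name.

```shell
#!/bin/sh
# Start one background job per species code, then block until all have exited.
# The job body is a stand-in (assumption): replace it with the blastp call above.
run_and_wait() {
    outdir=$1
    shift
    for sp in "$@"; do
        ( sleep 1; echo "done $sp" > "$outdir/AT_$sp.blast" ) &
    done
    wait    # returns only after every background job has finished
}

# Demo in a temporary directory so no real output file is touched.
tmp=$(mktemp -d)
run_and_wait "$tmp" AL CR BR TC
ls "$tmp"
rm -r "$tmp"
```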
cat AT.gff AL.gff > AT_AL.gff
cat AT.gff CR.gff > AT_CR.gff
cat AT.gff BR.gff > AT_BR.gff
cat AT.gff TC.gff > AT_TC.gff
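Before running MCScanX, it can be worth checking the combined `.gff` files. The sketch below counts lines that do not match a four-column, tab-separated layout of chromosome, gene, start and end; that layout is an assumption based on the MCScanX documentation, and `count_bad_gff_lines` is a hypothetical helper. The sample records are made up.

```shell
#!/bin/sh
# Count lines that do not look like "chr<TAB>gene<TAB>start<TAB>end".
# The four-column layout is an assumption based on the MCScanX documentation.
count_bad_gff_lines() {
    awk -F '\t' 'NF != 4 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {bad++} END {print bad + 0}'
}

# Made-up records: the second one is missing its end coordinate.
printf 'at1\tAT1G01010\t3631\t5899\nat1\tAT1G01020\t6788\n' | count_bad_gff_lines
```

On real data the call would look like `count_bad_gff_lines < AT_AL.gff`, and a result of `0` means every line matched the assumed layout.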
MCScanX -s 3 -m 2 AT_AL
MCScanX -s 3 -m 2 AT_CR
MCScanX -s 3 -m 2 AT_BR
MCScanX -s 3 -m 2 AT_TC
# extract ranges of segmental duplications
perl ../colline2yml.pl AT.gff AT_AT.collinearity > AT_AT.yml
perl ../colline2yml_AL.pl AT.gff AT_AL.collinearity > AT_AL.yml
perl ../colline2yml.pl AT.gff AT_CR.collinearity > AT_CR.yml
perl ../colline2yml.pl AT.gff AT_BR.collinearity > AT_BR.yml
perl ../colline2yml.pl AT.gff AT_TC.collinearity > AT_TC.yml
cd ~/MASED/data/
mkdir SAAT
gzip -dc Atha/Atha.fa.gz > Atha/Atha.fa
faops filter -N -s Atha/Atha.fa stdout | faops split-name stdin .
egaz repeatmasker ./*.fa -o SAAT/. --gff --parallel 4
faops size ./*.fa > SAAT/chr.sizes
faToTwoBit ./*.fa SAAT/chr.2bit
cat ./*.fa | faops filter -U stdin SAAT/chr.fasta
mv Atha/Atha.gff3 SAAT/chr.gff
# create anno.yml
runlist gff --tag CDS --remove chr.gff -o cds.yml
runlist gff --remove ./*.rm.gff -o repeat.yml
runlist merge repeat.yml cds.yml -o anno.yml
rm repeat.yml cds.yml ./*.rm.gff ./*.rm.out
egaz template \
. \
--self -o AT_FSD \
--taxon ./ensembl_taxon.csv \
--circos --aligndb --parallel 4 -v
bash AT_FSD/1_self.sh
bash AT_FSD/3_proc.sh
bash AT_FSD/4_circos.sh
bash AT_FSD/6_chr_length.sh
bash AT_FSD/7_self_aligndb.sh
bash AT_FSD/9_pack_up.sh
Examples in S288c.txt
I:1-100
I(+):90-150
S288c.I(-):190-200
II:21294-22075
II:23537-24097
Simple rules:
- `chromosome` and `start` are required
- `species`, `strand` and `end` are optional
- `.` to separate `species` and `chromosome`
- `strand` is one of `+` and `-`, surrounded by round brackets
- `:` to separate names and digits
- `-` to separate `start` and `end`
- names should be alphanumeric and without spaces
species.chromosome(strand):start-end
--------^^^^^^^^^^--------^^^^^^----
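The rules above can be sketched as a small parser. This is an illustration only, not the parser used by the tools in this pipeline; `parse_range` is a hypothetical helper, and missing optional parts are simply left empty.

```shell
#!/bin/sh
# Split "species.chromosome(strand):start-end" into its parts.
# parse_range is a hypothetical helper; optional parts are left empty.
parse_range() {
    printf '%s\n' "$1" | awk '
    {
        species = ""; strand = ""; to = ""
        n = index($0, ":")                  # ":" separates names from digits
        if (n == 0) { print "invalid"; exit 1 }
        name = substr($0, 1, n - 1)
        digits = substr($0, n + 1)
        if (match(name, /[(][+-][)]$/)) {   # optional "(+)" or "(-)"
            strand = substr(name, RSTART + 1, 1)
            name = substr(name, 1, RSTART - 1)
        }
        d = index(name, ".")                # optional "species." prefix
        if (d > 0) { species = substr(name, 1, d - 1); name = substr(name, d + 1) }
        k = split(digits, pos, "-")         # "start" or "start-end"
        if (k > 1) to = pos[2]
        printf "species=%s chr=%s strand=%s start=%s end=%s\n", species, name, strand, pos[1], to
    }'
}

parse_range 'S288c.I(-):190-200'
parse_range 'I:1-100'
```

The first call prints `species=S288c chr=I strand=- start=190 end=200`; the second leaves `species` and `strand` empty.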
Examples in example.fas
>S288c.I(+):13267-13287|species=S288c
TCGTCAGTTGGTTGACCATTA
>YJM789.gi_151941327(-):5668-5688|species=YJM789
TCGTCAGTTGGTTGACCATTA
>RM11.gi_61385832(-):5590-5610|species=RM11
TCGTCAGTTGGTTGACCATTA
>Spar.gi_29362400(+):2477-2497|species=Spar
TCATCAGTTGGCAAACCGTTA
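In the records above, each sequence is as long as its header range (for example, 13267-13287 spans 21 positions and the sequence has 21 bases). A minimal sketch of that consistency check follows; `check_lengths` is a hypothetical helper, and it assumes every sequence sits on a single line, has no gap characters, and has both `start` and `end` in its header.

```shell
#!/bin/sh
# For each FASTA record, compare the sequence length with end - start + 1
# taken from the header. check_lengths is a hypothetical helper and assumes
# single-line, ungapped sequences.
check_lengths() {
    awk '
        /^>/ {
            # Keep the text between ":" and "|" (or end of line): "start-end".
            range = $0
            sub(/^[^:]*:/, "", range)
            sub(/[|].*$/, "", range)
            split(range, pos, "-")
            expected = pos[2] - pos[1] + 1
            next
        }
        { print (length($0) == expected ? "ok" : "mismatch") }
    '
}

printf '>S288c.I(+):13267-13287|species=S288c\nTCGTCAGTTGGTTGACCATTA\n' | check_lengths
```

On a file, the call would be `check_lengths < example.fas`, printing one `ok` or `mismatch` per record.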
Ivan Woo <[email protected]>
This is free software; you can redistribute it and/or modify it under the same terms.