Skip to content

It's my graduation project at NJU with the help of professor WangQ.

Notifications You must be signed in to change notification settings

IvanWoo22/MASED

Repository files navigation

MASED

Name Origin

MASED - Methylation Analysis of SEgmental Duplications.

Purpose

Analyse the history and methylation of segmental duplications of Arabidopsis thaliana.

  • Which segments are duplicated from others?
  • When did their duplication happen?
  • How does the methylation of these duplications change during evolution and function acquisition?

workflow

Directory organization

  • mcscanx: Scan multiple genomes or subgenomes to identify putative homologous chromosomal regions, then align these regions using genes as anchors.

  • colline2yml: Get the collinearity regions of the genomes to YAML files.

  • sort_sd: Sort the segmental duplications by different time of occurrence.

  • find_promoter: Determine the promoter region of genomes.

  • analy_meth: Analyse the methylation of segments.

Preparations

Software

Here we should firstly install Sra-toolkit, MCScanX and Bismark.

MCScanX

cd MASED
unzip resource/MCScanX.zip
cd MCScanX
make
cd ~

Bismark

cd MASED
unzip resource/bismark.zip
cd ~

Put into PATH

echo '# Bismark' >> .bashrc
echo 'export PATH="$HOME/MASED/bismark:$PATH"' >> .bashrc
echo >> .bashrc
echo '# MCScanX' >> .bashrc
echo 'export PATH="$HOME/MASED/MCScanX:$PATH"' >> .bashrc
echo >> .bashrc
source .bashrc

Data

Collinear

We choose 5 species to build up a time line.

tree

To download them, you should go to JGI website with login. We also provide some data in our folder data/. Others like Theobroma cacao and Brassica rapa can be downloaded from other database.

Specie Name Database Version
Arabidopsis thaliana Araport V11
Arabidopsis lyrata JGI V2.1
Capsella rubella JGI V1.1
Brassica rapa BRAD V3.0
Theobroma cacao Cocoa Genome Hub V2

Take Theobroma cacao as an example:
(ATTENTION: If you want to try this for yourself, make sure you have removed the corresponding files in the folder!)

wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
mv Theobroma_cacaoV2_annot_annoted_clean.gff3 ~/MASED/data/Tcac.gff3
rm Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz

wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_protein.faa.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_protein.faa.tar.gz
mv Theobroma_cacaoV2_annot_protein.faa ~/MASED/data/Tcac.pep
rm Theobroma_cacaoV2_annot_protein.faa.tar.gz

Segment Duplication

We have a reference genome fasta of Arabidopsis thaliana in our folder data/. And then we prepare the genome with App::Egaz.

BS-Seq

We choose SRX2871291 and SRX2871292 as the BS-Seq data of the Columbia wild type Arabidopsis thaliana.

prefetch SRR5631389
prefetch SRR5631390
prefetch SRR5631391
prefetch SRR5631392

The data is now in ~/ncbi/public/sra/

mv ~/ncbi/public/sra/SRR56313* ~/MASED/data/.
cd ~/MASED/data/
fastq-dump --split-3 SRR56313*
rm SRR56313*.sra

Analysis

Pretreatment for MCScanX

cd ~/MASED/data

awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Atha/Atha.gff3 > Atha.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Alyr/Alyr.gff3 > Alyr.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Crub/Crub.gff3 > Crub.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Brap/Brap.gff3 > Brap.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Tcac/Tcac.gff3 > Tcac.gene.gff

perl ../gff_pep_Atha.pl Atha.gene.gff AT.gff Atha/Atha.pep > AT.pep
perl ../gff_pep_Alyr.pl Alyr.gene.gff AL.gff Alyr/Alyr.pep > AL.pep
perl ../gff_pep_Crub.pl Crub.gene.gff CR.gff Crub/Crub.pep > CR.pep
perl ../gff_pep_Brap.pl Brap.gene.gff BR.gff Brap/Brap.pep > BR.pep
perl ../gff_pep_Tcac.pl Tcac.gene.gff TC.gff Tcac/Tcac.pep > TC.pep

makeblastdb -in AT.pep -dbtype prot -parse_seqids -out ATdb

nohup blastp -query AL.pep -db ATdb -out AT_AL.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query CR.pep -db ATdb -out AT_CR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query BR.pep -db ATdb -out AT_BR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query TC.pep -db ATdb -out AT_TC.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &

cat AT.gff AL.gff > AT_AL.gff
cat AT.gff CR.gff > AT_CR.gff
cat AT.gff BR.gff > AT_BR.gff
cat AT.gff TC.gff > AT_TC.gff

Run MCScanX

mcscanx -s 3 -m 2 AT_AL
mcscanx -s 3 -m 2 AT_CR
mcscanx -s 3 -m 2 AT_BR
mcscanx -s 3 -m 2 AT_TC

# extract ranges of segment duplications
perl ../colline2yml.pl AT.gff AT_AT.collinearity > AT_AT.yml
perl ../colline2yml_AL.pl AT.gff AT_AL.collinearity > AT_AL.yml
perl ../colline2yml.pl AT.gff AT_CR.collinearity > AT_CR.yml
perl ../colline2yml.pl AT.gff AT_BR.collinearity > AT_BR.yml
perl ../colline2yml.pl AT.gff AT_TC.collinearity > T_TC.yml

Find Segment Duplication

cd ~/MASED/data/
mkdir SAAT
gzip -dc Atha/Atha.fa.gz > Atha/Atha.fa

faops filter -N -s Atha/Atha.fa stdout | faops split-name stdin .
egaz repeatmasker ./*.fa -o SAAT/. --gff --parallel 4
faops size ./*.fa > SAAT/chr.sizes
faToTwoBit ./*.fa SAAT/chr.2bit
cat ./*.fa | faops filter -ßßU stdin SAAT/chr.fasta
mv Atha/Atha.gff3 SAAT/chr.gff

# create anno.yml
runlist gff --tag CDS --remove chr.gff -o cds.yml
runlist gff --remove ./*.rm.gff -o repeat.yml
runlist merge repeat.yml cds.yml -o anno.yml

rm repeat.yml cds.yml ./*.rm.gff /*.rm.out

egaz template \
    . \
    --self -o AT_FSD \
    --taxon ./ensembl_taxon.csv \
    --circos --aligndb --parallel 4 -v

bash AT_FSD/1_self.sh
bash AT_FSD/3_proc.sh
bash AT_FSD/4_circos.sh
bash AT_FSD/6_chr_length.sh
bash AT_FSD/7_self_aligndb.sh
bash AT_FSD/9_pack_up.sh

Positions

Examples in S288c.txt

I:1-100
I(+):90-150
S288c.I(-):190-200
II:21294-22075
II:23537-24097

positions

Simple rules:

  • chromosome and start are required
  • species, strand and end are optional
  • . to separate species and chromosome
  • strand is one of + and - and surround by round brackets
  • : to separate names and digits
  • - to separate start and end
  • names should be alphanumeric and without spaces
species.chromosome(strand):start-end
--------^^^^^^^^^^--------^^^^^^----

Runlists in YAML

App::RL

jrunlist

Blocked fasta files

Examples in example.fas

>S288c.I(+):13267-13287|species=S288c
TCGTCAGTTGGTTGACCATTA
>YJM789.gi_151941327(-):5668-5688|species=YJM789
TCGTCAGTTGGTTGACCATTA
>RM11.gi_61385832(-):5590-5610|species=RM11
TCGTCAGTTGGTTGACCATTA
>Spar.gi_29362400(+):2477-2497|species=Spar
TCATCAGTTGGCAAACCGTTA

blocked-fasta-files

App::Fasops

Ranges and links of ranges

App::Rangeops

jrange

Author

Ivan Woo <[email protected]>

Copyright and license

This is free software; you can redistribute it and/or modify it under the same terms.

About

It's my graduation project at NJU with the help of professor WangQ.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages