The reference-free Arabidopsis thaliana pangenome

This repository collects all the commands and scripts used to build the reference-free Arabidopsis thaliana pangenome. This work is part of the PhD project carried out by Lia Obinu.

Table of contents
Building the reference-free Arabidopsis thaliana pangenome graph
Pangenome annotation - reference based
Pangenome annotation - non-reference sequences
Final annotation screening
- genes
- pseudogenes
Gene ontology enrichment analysis
Sequence-based pangenome analysis
Similarity analysis
Other data visualisation

Building the reference-free Arabidopsis thaliana pangenome graph

Get the assemblies

We found 93 suitable assemblies in total, all at chromosome level or complete genome level. They come from the following BioProjects: PRJEB55353, PRJEB55632, PRJEB50694, PRJEB30763, PRJEB31147, PRJEB37252, PRJEB37257, PRJEB37258, PRJEB37260, PRJEB37261, PRJEB40125, PRJEB51511, PRJNA777107, PRJNA779205, PRJNA828135, PRJNA834751, PRJNA10719, PRJNA915353, PRJNA311266, PRJCA005809, PRJCA007112.

Quality check and trimming of the assemblies

Short contigs should be excluded from the pangenome, but first we need to set a length threshold. To choose it, we inspect the quality of the assemblies and their contig length distribution using QUAST:

quast accessions/*.fna.gz accessions/*.fasta.gz -o ./output_quast -r accessions/GCA_000001735.2_TAIR10.1_genomic.fna.gz -t 40 --eukaryote --large -k --plots-format png

Based on the QUAST results, we decided to remove the contigs shorter than 5 kbp. According to QUAST, the following assemblies will lose contigs:

Assembly	# contigs (>= 0 bp)	# contigs (>= 5000 bp)	Total length (>= 0 bp)	Total length (>= 5000 bp)
GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic	109	78	119626746	119530365
GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic	111	83	120129838	120059751
GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic	102	71	119749512	119657772
GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic	105	75	120338059	120246237
GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic	94	74	120289901	120224609
GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic	184	155	122202079	122112347
GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic	142	115	120795191	120728700

For the filtering we can use bioawk.

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna > GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna

#check:
grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna | wc -l
---
109

grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna | wc -l
---
78

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna > GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna

#check:
grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna | wc -l
---
111

grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna | wc -l
---
83

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna > GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna

#check:
grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna | wc -l
---
102

grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna | wc -l
---
71

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna > GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna

#check:
grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna | wc -l
---
105

grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna | wc -l
---
75

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna > GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna 

#check:
grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna | wc -l
---
94

grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna | wc -l
---
74

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna > GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna

#check:
grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna | wc -l
---
184

grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna | wc -l
---
155

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna > GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna

#check:
grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna | wc -l
---
142

grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna | wc -l
---
115
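The seven filtering commands above differ only in the file name, so they can be wrapped in a loop. The sketch below is a plain-awk stand-in for the bioawk call (an assumption, useful where bioawk is not installed). The function name filter_fasta_by_length and the list file assemblies_to_trim.txt are hypothetical. Like the bioawk commands, it keeps sequences strictly longer than the threshold, but it prints the full header line instead of only the sequence name.

```shell
# Keep only the records of a FASTA file whose sequence is longer than a threshold.
# $1 = input FASTA (uncompressed), $2 = length threshold in bp.
filter_fasta_by_length() {
    awk -v min="$2" '
        function emit() { if (name != "" && length(seq) > min) print name ORS seq }
        /^>/ { emit(); name = $0; seq = ""; next }   # new record: flush the previous one
        { seq = seq $0 }                              # join multi-line sequences
        END { emit() }
    ' "$1"
}

# Usage, assuming assemblies_to_trim.txt (hypothetical) lists the seven .fna files:
# while read -r f; do
#     filter_fasta_by_length "$f" 5000 > "${f%.fna}.trim.5k.fna"
#     echo "$f: $(grep -c '>' "$f") -> $(grep -c '>' "${f%.fna}.trim.5k.fna")"
# done < assemblies_to_trim.txt
```

The printed counts can then be compared with the "# contigs" columns of the QUAST table.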

Now we should compress the .trim.5k.fna files with gzip:

for i in *trim.5k.fna; do gzip "$i"; done

From now on, we are going to use the trimmed version of these assemblies.

Renaming the assemblies

We can follow the S. cerevisiae example:

ls *.gz | cut -f 1 -d '.' | uniq | while read f; do
    echo $f
    zcat $f.* > $f.fa
done

To change the sequence names according to PanSN-spec, we can use fastix:

ls *.fa | while read f; do
    sample_name=$(echo $f | cut -f 1 -d '.');
    echo ${sample_name}
    fastix -p "${sample_name}#1#" $f >> Arabidopsis.pg.in.fasta
done

With "${sample_name}#1#" we set haplotype_id to 1 for all the assemblies, as they are all haploid. The sequence names now follow the pattern:

[sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
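To illustrate what the prefixing does to a header, here is a small sketch that uses sed as a stand-in for fastix (an assumption, so that the example runs without fastix installed). The helper name pansn_prefix is hypothetical.

```shell
# Prepend "<sample>#<haplotype>#" to every FASTA header read from stdin.
# $1 = sample name, $2 = haplotype id (1 for all assemblies here).
pansn_prefix() {
    sed "s/^>/>$1#$2#/"
}

# Example with the TAIR10 chromosome 1 accession:
# printf '>CP002684.1\nACGT\n' | pansn_prefix GCA_000001735 1
```

For the example input, the header becomes >GCA_000001735#1#CP002684.1, which is the form used in the rest of this document.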

Format fasta headers and remove organelles from assemblies

We should remove everything after the first space or tab from the FASTA headers, as it can lead to errors in subsequent steps:

cat Arabidopsis.pg.in.fasta | sed -E '/^>/s/( +|\t).*//' > Arabidopsis.adj.pg.in.fasta
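Before moving on, we can check that no header still contains whitespace. This is a minimal sketch; the function name count_headers_with_whitespace is hypothetical.

```shell
# Count FASTA headers that still contain a space or a tab.
# $1 = FASTA file; prints 0 when all headers are clean.
count_headers_with_whitespace() {
    awk '/^>/ && /[ \t]/ { n++ } END { print n + 0 }' "$1"
}

# count_headers_with_whitespace Arabidopsis.adj.pg.in.fasta
```

A result other than 0 would mean that some headers were not trimmed.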

We should then remove the organelle scaffolds from our input file. Only three assemblies include mitochondrial (Mt) and plastid (Plt) sequences, and the corresponding scaffolds are:

>GCA_000001735#1#BK010421.1
>GCA_000001735#1#AP000423.1
>GCA_023115395#1#CP096029.1
>GCA_023115395#1#CP096030.1
>GCA_904420315#1#LR881472.1
>GCA_904420315#1#LR881471.1

We need to put these headers in a file that we will call Mt_and_Plt_headers.txt.

We can now exclude these scaffolds from our previous multifasta input file:

awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p = !($1 in toRemove) }
     p' Mt_and_Plt_headers.txt Arabidopsis.adj.pg.in.fasta > Arabidopsis.pg.input.fasta
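To confirm that the filtering did what we expect, we can compare the number of sequences before and after and check that none of the listed headers survived. This is a minimal sketch; the function name check_removed_headers is hypothetical.

```shell
# Report how many sequences were removed and whether any listed header survived.
# $1 = headers file, $2 = FASTA before filtering, $3 = FASTA after filtering.
check_removed_headers() {
    before=$(grep -c '^>' "$2" || true)
    after=$(grep -c '^>' "$3" || true)
    # -F fixed strings, -x whole-line match, -f patterns read from file
    left=$(grep -c -F -x -f "$1" "$3" || true)
    echo "removed=$((before - after)) still_present=$left"
}

# check_removed_headers Mt_and_Plt_headers.txt Arabidopsis.adj.pg.in.fasta Arabidopsis.pg.input.fasta
```

With the six headers listed above, the expected report would be removed=6 still_present=0.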

We need to compress and index our input:

bgzip -@ 16 Arabidopsis.pg.input.fasta
samtools faidx Arabidopsis.pg.input.fasta.gz

Sequence Partitioning using partition-before-pggb

We can now run partition-before-pggb on our multifasta (note: if it is not possible to estimate the divergence, you should try different settings to identify the right value for -p):

partition-before-pggb -i Arabidopsis.pg.input.fasta.gz -o output_pbp_p90 -n 93 -t 100 -Y '#' -p 90 -s 10k -V 'GCA_000001735:#:1000' -m -D ./temp

-i input
-o output directory
-n number of haplotypes
-t number of threads
-Y delimiter character; mappings between sequences that share the same name prefix before it are skipped
-p percent identity for mapping/alignment
-s segment length for mapping [default: 5000]
-V to produce vcf against ref
-m generate MultiQC report of graphs' statistics and visualizations, automatically runs odgi stats
-D temporary files directory
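After partitioning, it is worth checking how many sequences and how many distinct samples ended up in each community. The sketch below relies only on the PanSN naming (the sample is the part of the header before the first '#'); the function name community_summary is hypothetical, and the file pattern in the usage line assumes the community FASTA files are written to the output directory with names ending in .community.N.fa.

```shell
# For one community FASTA, print: file name, number of sequences,
# number of distinct samples (PanSN prefix before the first '#').
community_summary() {
    awk -F '#' -v f="$1" '
        /^>/ { n++; seen[substr($1, 2)] = 1 }
        END  { s = 0; for (k in seen) s++; printf "%s\t%d\t%d\n", f, n, s }
    ' "$1"
}

# for f in output_pbp_p90/*.community.*.fa; do community_summary "$f"; done
```

A community with far fewer samples than the others is a hint that -p or -s may need adjusting.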

Running pggb on each community

We can now run pggb on each community! :-)

To build the graph of each community we will use pggb -p 95.

community 0 - chr1

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.0.fa \
     -o output_pbp_p90/community.0.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 1 - chr2

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.1.fa \
     -o output_pbp_p90/community.1.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 2 - chr3

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.2.fa \
     -o output_pbp_p90/community.2.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 3 - chr4

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.3.fa \
     -o output_pbp_p90/community.3.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 4 - chr5

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.4.fa \
     -o output_pbp_p90/community.4.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100
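Since the five commands above differ only in the community index, they can also be generated in a loop. The sketch below prints the command for a given community so that it can be inspected first and then piped to bash; the function name pggb_community_cmd is hypothetical, and all flags are copied from the commands above.

```shell
# Print the pggb command for one community (dry run); pipe to bash to execute.
# $1 = community index (0-4).
pggb_community_cmd() {
    echo "pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.$1.fa" \
         "-o output_pbp_p90/community.$1.out" \
         "-s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30" \
         "-k 23 -f 0 -B 10000000" \
         "-j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_" \
         "-Y '#' -V 'GCA_000001735:#:1000' --multiqc --temp-dir ./temp" \
         "--threads 100 --poa-threads 100"
}

# for c in 0 1 2 3 4; do pggb_community_cmd "$c"; done          # inspect
# for c in 0 1 2 3 4; do pggb_community_cmd "$c"; done | bash   # run
```

Running the communities one after the other keeps the thread usage at the same level as the individual commands.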

Pangenome annotation - reference based

Get odgi untangle for annotation

To use odgi untangle for annotating the pangenome graph, we will switch to the untangle_for_annotation branch.

cd odgi
git checkout untangle_for_annotation && git pull && git submodule update --init --recursive
cmake -H. -Bbuild && cmake --build build -- -j 16

Prepare the annotation files for injection

Genes

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.gff.gz
gunzip GCA_000001735.2_TAIR10.1_genomic.gff.gz

We will inject only the genes, so we need to extract the gene rows from the TAIR10 GFF and split them by chromosome.

Let's have a look at the GFF file:

head GCA_000001735.2_TAIR10.1_genomic.gff
---
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10.1
#!genome-build-accession NCBI_Assembly:GCA_000001735.2
##sequence-region CP002684.1 1 30427671
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702
CP002684.1      Genbank region  1       30427671        .       +       .       ID=CP002684.1:1..30427671;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;mol_type=genomic DNA
CP002684.1      Genbank gene    3631    5899    .       +       .       ID=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
CP002684.1      Genbank mRNA    3631    5899    .       +       .       ID=rna-gnl|JCVI|mRNA.AT1G01010.1;Parent=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;locus_tag=AT1G01010;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1

community 0 - chr1

To inject the genes into the pangenome, we need to extract the relevant fields from the GFF and rename the sequences according to PanSN-spec:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002684.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr1.genes.adj.TAIR10.bed

head chr1.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002684.1  3631   5899   AT1G01010:3631-5899
GCA_000001735#1#CP002684.1  6788   9130   AT1G01020:6788-9130
GCA_000001735#1#CP002684.1  11101  11372  AT1G03987:11101-11372
GCA_000001735#1#CP002684.1  11649  13714  AT1G01030:11649-13714
GCA_000001735#1#CP002684.1  23121  31227  AT1G01040:23121-31227
GCA_000001735#1#CP002684.1  23312  24099  AT1G03993:23312-24099
GCA_000001735#1#CP002684.1  28500  28706  AT1G01046:28500-28706
GCA_000001735#1#CP002684.1  31170  33171  AT1G01050:31170-33171
GCA_000001735#1#CP002684.1  32727  33009  AT1G03997:32727-33009
GCA_000001735#1#CP002684.1  33365  37871  AT1G01060:33365-37871

Make sure that community 0 is chromosome 1 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.0.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002684.1

community 1 - chr2

To inject the genes into the pangenome, we need to extract the relevant fields from the GFF and rename the sequences according to PanSN-spec:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002685.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr2.genes.adj.TAIR10.bed

head chr2.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002685.1  1025   3173   AT2G01008:1025-3173
GCA_000001735#1#CP002685.1  2805   3176   AT2G03855:2805-3176
GCA_000001735#1#CP002685.1  3706   5513   AT2G01010:3706-5513
GCA_000001735#1#CP002685.1  5782   5945   AT2G01020:5782-5945
GCA_000001735#1#CP002685.1  6571   6672   AT2G01021:6571-6672
GCA_000001735#1#CP002685.1  7090   7505   AT2G03865:7090-7505
GCA_000001735#1#CP002685.1  7669   9399   AT2G03875:7669-9399
GCA_000001735#1#CP002685.1  9648   9767   AT2G01023:9648-9767
GCA_000001735#1#CP002685.1  9849   10177  AT2G00340:9849-10177
GCA_000001735#1#CP002685.1  50288  51487  AT2G01035:50288-51487

Make sure that community 1 is chromosome 2 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.1.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002685.1

community 2 - chr3

To inject the genes into the pangenome, we need to extract the relevant fields from the GFF and rename the sequences according to PanSN-spec:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002686.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr3.genes.adj.TAIR10.bed

head chr3.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002686.1  1609   4159   AT3G01015:1609-4159
GCA_000001735#1#CP002686.1  4342   4818   AT3G01010:4342-4818
GCA_000001735#1#CP002686.1  5104   6149   AT3G01020:5104-6149
GCA_000001735#1#CP002686.1  6657   7772   AT3G01030:6657-7772
GCA_000001735#1#CP002686.1  8723   12697  AT3G01040:8723-12697
GCA_000001735#1#CP002686.1  13046  15906  AT3G01050:13046-15906
GCA_000001735#1#CP002686.1  15934  16320  AT3G00970:15934-16320
GCA_000001735#1#CP002686.1  16631  18909  AT3G01060:16631-18909
GCA_000001735#1#CP002686.1  19409  20806  AT3G01070:19409-20806
GCA_000001735#1#CP002686.1  25355  27712  AT3G01080:25355-27712

Make sure that community 2 is chromosome 3 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.2.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002686.1

community 3 - chr4

To inject the genes into the pangenome, we need to extract the relevant fields from the GFF and rename the sequences according to PanSN-spec:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002687.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr4.genes.adj.TAIR10.bed

head chr4.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002687.1  1180   1536   AT4G00005:1180-1536
GCA_000001735#1#CP002687.1  2895   10504  AT4G00020:2895-10504
GCA_000001735#1#CP002687.1  10815  13359  AT4G00026:10815-13359
GCA_000001735#1#CP002687.1  13527  14413  AT4G00030:13527-14413
GCA_000001735#1#CP002687.1  14627  16079  AT4G00040:14627-16079
GCA_000001735#1#CP002687.1  17639  20183  AT4G00050:17639-20183
GCA_000001735#1#CP002687.1  21351  29064  AT4G00060:21351-29064
GCA_000001735#1#CP002687.1  29072  31426  AT4G00070:29072-31426
GCA_000001735#1#CP002687.1  32748  33756  AT4G00080:32748-33756
GCA_000001735#1#CP002687.1  33800  33872  AT4G00085:33800-33872

Make sure that community 3 is chromosome 4 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.3.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002687.1

community 4 - chr5

To inject the genes into the pangenome, we need to extract the relevant fields from the GFF and rename the sequences according to PanSN-spec:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002688.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr5.genes.adj.TAIR10.bed

head chr5.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002688.1  2      303    AT5G00730:2-303
GCA_000001735#1#CP002688.1  995    5156   AT5G01010:995-5156
GCA_000001735#1#CP002688.1  5256   5907   AT5G01015:5256-5907
GCA_000001735#1#CP002688.1  5339   5593   AT5G01017:5339-5593
GCA_000001735#1#CP002688.1  5917   8467   AT5G01020:5917-8467
GCA_000001735#1#CP002688.1  9780   13235  AT5G01030:9780-13235
GCA_000001735#1#CP002688.1  13128  16236  AT5G01040:13128-16236
GCA_000001735#1#CP002688.1  18086  20887  AT5G01050:18086-20887
GCA_000001735#1#CP002688.1  22684  24934  AT5G01060:22684-24934
GCA_000001735#1#CP002688.1  25012  25908  AT5G01070:25012-25908

Make sure that community 4 is chromosome 5 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.4.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002688.1
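The five gene extractions above use the same awk program with a different chromosome accession, so they can be expressed as one function and a loop. This is a sketch under two assumptions: the function name gene_bed_for_chromosome is hypothetical, and the accession is compared as an exact string instead of a regular expression, which should select the same rows.

```shell
# Write the PanSN-named gene BED for one TAIR10 chromosome to stdout.
# $1 = GFF file, $2 = chromosome accession (e.g. CP002684.1).
gene_bed_for_chromosome() {
    awk -F "\t|ID=gene-|;Dbxref" -v acc="$2" '
        $1 == acc && $3 == "gene" {
            print "GCA_000001735#1#" $1 "\t" $4 "\t" $5 "\t" $10 ":" $4 "-" $5
        }' "$1"
}

# n=1
# for acc in CP002684.1 CP002685.1 CP002686.1 CP002687.1 CP002688.1; do
#     gene_bed_for_chromosome GCA_000001735.2_TAIR10.1_genomic.gff "$acc" > "chr$n.genes.adj.TAIR10.bed"
#     n=$((n + 1))
# done
```

The accessions are listed in chromosome order, so the counter n gives the chromosome number used in the output file names.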

Pseudogenes

The original file from Mascagni et al., 2021 was modified as explained in scripts/bed_to_gff3.ipynb. The coordinates of pseudogenes annotated on the reverse strand were then converted to the forward strand for compatibility with the downstream analysis. This is what the final annotation file looks like:

head pseudogenes.txt
---
Chromosome	end	start	coverage	Genewise_score	position	size	size	Strand	Type	Pater	Species	Stop_codon	Frameshift
Chr1	6144	6269	0.09	56.39	UTR	129	plus	FRAG	AT1G01010.1	A.thaliana	0	0
Chr1	47087	47325	0.06	74.09	Intergenic	206	plus	FRAG	AT1G02730.1	A.thaliana	1	3
Chr1	89932	90153	0.02	24.75	Intergenic	51	plus	FRAG	AT5G41740.2	A.thaliana	1	3
Chr1	259011	259151	0.06	48.95	Intergenic	84	plus	FRAG	AT1G01690.1	A.thaliana	0	0
Chr1	266562	267495	0.99	68.96	UTR	979	plus	DUP	AT4G00525.1	A.thaliana	0	0
Chr1	426018	426176	0.06	17.28	Intergenic	36	plus	FRAG	AT3G04430.1	A.thaliana	1	1
Chr1	578305	578484	0.04	47.13	Intron_cds	102	plus	FRAG	AT1G05120.1	A.thaliana	0	0
Chr1	582401	582585	0.46	52.69	Intergenic	236	plus	SE	AT4G02160.1	A.thaliana	0	1
Chr1	582634	582971	0.57	117.33	Intergenic	247	plus	SE	AT5G61710.1	A.thaliana	0	1

community 0 - chr1

awk '$1 == "Chr1"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr1/GCA_000001735#1#CP002684.1/g' > Pseudogenes_TAIR10_chr1.bed

head Pseudogenes_TAIR10_chr1.bed 
---
GCA_000001735#1#CP002684.1	6144	6269	GCA_000001735#1#CP002684.1:6144-6269
GCA_000001735#1#CP002684.1	47087	47325	GCA_000001735#1#CP002684.1:47087-47325
GCA_000001735#1#CP002684.1	89932	90153	GCA_000001735#1#CP002684.1:89932-90153
GCA_000001735#1#CP002684.1	259011	259151	GCA_000001735#1#CP002684.1:259011-259151
GCA_000001735#1#CP002684.1	266562	267495	GCA_000001735#1#CP002684.1:266562-267495
GCA_000001735#1#CP002684.1	426018	426176	GCA_000001735#1#CP002684.1:426018-426176
GCA_000001735#1#CP002684.1	578305	578484	GCA_000001735#1#CP002684.1:578305-578484
GCA_000001735#1#CP002684.1	582401	582585	GCA_000001735#1#CP002684.1:582401-582585
GCA_000001735#1#CP002684.1	582634	582971	GCA_000001735#1#CP002684.1:582634-582971
GCA_000001735#1#CP002684.1	893047	893493	GCA_000001735#1#CP002684.1:893047-893493

community 1 - chr2

awk '$1 == "Chr2"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr2/GCA_000001735#1#CP002685.1/g' > Pseudogenes_TAIR10_chr2.bed

head Pseudogenes_TAIR10_chr2.bed 
---
GCA_000001735#1#CP002685.1	2629	2769	GCA_000001735#1#CP002685.1:2629-2769
GCA_000001735#1#CP002685.1	2939	3079	GCA_000001735#1#CP002685.1:2939-3079
GCA_000001735#1#CP002685.1	126312	126472	GCA_000001735#1#CP002685.1:126312-126472
GCA_000001735#1#CP002685.1	167479	167678	GCA_000001735#1#CP002685.1:167479-167678
GCA_000001735#1#CP002685.1	184200	184421	GCA_000001735#1#CP002685.1:184200-184421
GCA_000001735#1#CP002685.1	214567	214668	GCA_000001735#1#CP002685.1:214567-214668
GCA_000001735#1#CP002685.1	249464	249634	GCA_000001735#1#CP002685.1:249464-249634
GCA_000001735#1#CP002685.1	287831	287959	GCA_000001735#1#CP002685.1:287831-287959
GCA_000001735#1#CP002685.1	338842	339105	GCA_000001735#1#CP002685.1:338842-339105
GCA_000001735#1#CP002685.1	343391	343537	GCA_000001735#1#CP002685.1:343391-343537

community 2 - chr3

awk '$1 == "Chr3"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr3/GCA_000001735#1#CP002686.1/g' > Pseudogenes_TAIR10_chr3.bed

head Pseudogenes_TAIR10_chr3.bed 
---
GCA_000001735#1#CP002686.1	1009	1200	GCA_000001735#1#CP002686.1:1009-1200
GCA_000001735#1#CP002686.1	430020	430130	GCA_000001735#1#CP002686.1:430020-430130
GCA_000001735#1#CP002686.1	484416	484737	GCA_000001735#1#CP002686.1:484416-484737
GCA_000001735#1#CP002686.1	630152	630235	GCA_000001735#1#CP002686.1:630152-630235
GCA_000001735#1#CP002686.1	792374	792475	GCA_000001735#1#CP002686.1:792374-792475
GCA_000001735#1#CP002686.1	799113	799223	GCA_000001735#1#CP002686.1:799113-799223
GCA_000001735#1#CP002686.1	832148	832279	GCA_000001735#1#CP002686.1:832148-832279
GCA_000001735#1#CP002686.1	863502	863939	GCA_000001735#1#CP002686.1:863502-863939
GCA_000001735#1#CP002686.1	864622	864843	GCA_000001735#1#CP002686.1:864622-864843
GCA_000001735#1#CP002686.1	987088	990672	GCA_000001735#1#CP002686.1:987088-990672

community 3 - chr4

awk '$1 == "Chr4"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr4/GCA_000001735#1#CP002687.1/g' > Pseudogenes_TAIR10_chr4.bed

head Pseudogenes_TAIR10_chr4.bed 
---
GCA_000001735#1#CP002687.1	11253	11351	GCA_000001735#1#CP002687.1:11253-11351
GCA_000001735#1#CP002687.1	62136	62339	GCA_000001735#1#CP002687.1:62136-62339
GCA_000001735#1#CP002687.1	70243	70416	GCA_000001735#1#CP002687.1:70243-70416
GCA_000001735#1#CP002687.1	99056	99229	GCA_000001735#1#CP002687.1:99056-99229
GCA_000001735#1#CP002687.1	99355	99464	GCA_000001735#1#CP002687.1:99355-99464
GCA_000001735#1#CP002687.1	102138	102411	GCA_000001735#1#CP002687.1:102138-102411
GCA_000001735#1#CP002687.1	102614	103287	GCA_000001735#1#CP002687.1:102614-103287
GCA_000001735#1#CP002687.1	103362	103703	GCA_000001735#1#CP002687.1:103362-103703
GCA_000001735#1#CP002687.1	105289	105375	GCA_000001735#1#CP002687.1:105289-105375
GCA_000001735#1#CP002687.1	121936	122058	GCA_000001735#1#CP002687.1:121936-122058

community 4 - chr5

awk '$1 == "Chr5"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr5/GCA_000001735#1#CP002688.1/g' > Pseudogenes_TAIR10_chr5.bed

head Pseudogenes_TAIR10_chr5.bed 
---
GCA_000001735#1#CP002688.1	17952	18035	GCA_000001735#1#CP002688.1:17952-18035
GCA_000001735#1#CP002688.1	18038	18085	GCA_000001735#1#CP002688.1:18038-18085
GCA_000001735#1#CP002688.1	453356	453511	GCA_000001735#1#CP002688.1:453356-453511
GCA_000001735#1#CP002688.1	453793	454342	GCA_000001735#1#CP002688.1:453793-454342
GCA_000001735#1#CP002688.1	498633	498719	GCA_000001735#1#CP002688.1:498633-498719
GCA_000001735#1#CP002688.1	595390	595590	GCA_000001735#1#CP002688.1:595390-595590
GCA_000001735#1#CP002688.1	597848	597991	GCA_000001735#1#CP002688.1:597848-597991
GCA_000001735#1#CP002688.1	660001	660384	GCA_000001735#1#CP002688.1:660001-660384
GCA_000001735#1#CP002688.1	672914	673057	GCA_000001735#1#CP002688.1:672914-673057
GCA_000001735#1#CP002688.1	679891	680126	GCA_000001735#1#CP002688.1:679891-680126
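The per-chromosome pseudogene commands can likewise be collapsed into one function and a loop. In this sketch the awk filter, the column selection and the sed renaming are merged into a single awk call, which should give the same output; the function name pseudogene_bed_for_chromosome is hypothetical.

```shell
# Write the PanSN-named pseudogene BED for one chromosome to stdout.
# $1 = pseudogenes table, $2 = chromosome label in the table (e.g. Chr1),
# $3 = matching TAIR10 accession (e.g. CP002684.1).
pseudogene_bed_for_chromosome() {
    awk -v OFS='\t' -v chr="$2" -v name="GCA_000001735#1#$3" '
        $1 == chr { print name, $2, $3, name ":" $2 "-" $3 }
    ' "$1"
}

# n=1
# for acc in CP002684.1 CP002685.1 CP002686.1 CP002687.1 CP002688.1; do
#     pseudogene_bed_for_chromosome pseudogenes.txt "Chr$n" "$acc" > "Pseudogenes_TAIR10_chr$n.bed"
#     n=$((n + 1))
# done
```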