It was a multi-sequence FASTA file, just as you would have for a chromosome-level assembly; it simply contained many more sequences than a cleaner assembly would!
In the end, the Minigraph-Cactus pipeline was more appropriate for my data, as we had one chromosome-level assembly that could act as a backbone to which the other, more fragmented assemblies could be sequentially aligned. If I had a full set of chromosome-level assemblies, I would probably have better luck with PGGB.
I have been trying to build a pangenome graph of Arabidopsis arenosa (an outcrosser with high heterozygosity) from 10 unscaffolded assemblies (and 10 corresponding alternative haplotype assemblies) and 4 chromosome-level assemblies. Three of the chromosome-level assemblies are from closely related species, and one is a recent A. arenosa build. As the unscaffolded assemblies are quite fragmented (min. contig size of 4 kb), I used the sequence partitioning method to cluster the contigs into chromosome communities, and then built graphs per chromosome, using the following settings:
pggb -i $fasta -s 2000 -p 90 -k 29 -G 3079,3559 -n 24 -t 12 -v -L -U -S -m -o $outdir
and the output was very messy (for a single chromosome):
length: 85,478,416 (largest constituent chromosome is 24,241,940)
nodes: 5,153,596
edges: 7,275,076
paths: 3891
To try to simplify the graph, I used RagTag to scaffold the fragmented assemblies against the A. arenosa chromosome-level assembly, and then built a graph from the 10 "pseudo-scaffolded" primary (+ 10 alternative) assemblies, plus the A. arenosa reference. Here are a couple of different settings I tried for a single chromosome:
pggb -i $fasta -s 20000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir
length: 122,223,409 (longest constituent chromosome is 26,723,338)
nodes: 2,598,044
edges: 3,605,318
paths: 21
pggb -i $fasta -s 10000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir
length: 88,605,057 (longest constituent chromosome is 26,723,338)
nodes: 14,316,648
edges: 20,498,846
paths: 21
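To put the three runs above on the same footing, here is a minimal sketch that divides each graph length by the longest constituent chromosome and reports edges per node. The run labels (contigs_s2000, ragtag_s20000, ragtag_s10000) and the function name graph_ratios are made up for illustration; the numbers are copied from the stats listed above.

```shell
# Hypothetical helper: one run per input line, with the columns
#   label  graph_length  longest_chromosome  nodes  edges
# Prints: label, graph length / longest chromosome, edges / nodes.
graph_ratios() {
  awk '{ printf "%s\t%.2f\t%.2f\n", $1, $2 / $3, $5 / $4 }'
}

# Labels are illustrative; the values are the stats reported above.
graph_ratios <<'EOF'
contigs_s2000  85478416   24241940  5153596   7275076
ragtag_s20000  122223409  26723338  2598044   3605318
ragtag_s10000  88605057   26723338  14316648  20498846
EOF
```

In all three runs the graph is more than three times the length of the longest chromosome, which suggests that a lot of homologous sequence is still not being collapsed into shared nodes.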
The linearity has improved, but there are still some very complex-looking regions that might not be aligned properly. Do you have any recommendations for the parameters I should be using? For context, I want to capture structural variation among the lineages represented by the 10 assemblies, and then align short reads to the graph so I can genotype SVs in existing sequencing data. I saw there is a parameter -Y to avoid self-mappings, which could reduce complexity, but it’s unclear to me what argument needs to be passed to it. Thank you for any help with this!
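On the -Y question: as far as I understand the pggb/wfmash help text (please check `pggb --help` for your version, as this is an assumption on my part), -Y takes a single delimiter character and skips mappings between sequences that share the same name prefix up to the last occurrence of that character. That only helps if the sequence names follow a PanSN-style pattern such as sample#haplotype#contig. A minimal sketch for adding such a prefix to the FASTA headers is below; the function name pansn_rename and the sample and file names are hypothetical.

```shell
# Hypothetical helper: prefix every FASTA header with "SAMPLE#HAPLOTYPE#"
# (PanSN-style naming). Sequence lines are passed through unchanged.
# Usage: pansn_rename SAMPLE HAPLOTYPE < in.fa > out.fa
pansn_rename() {
  awk -v prefix="$1#$2#" '/^>/ { sub(/^>/, ">" prefix) } { print }'
}

# Example with an inline record; real input would be a scaffolded assembly,
# e.g. pansn_rename sample01 1 < sample01.primary.fa > sample01.1.fa
printf '>chr1_RagTag\nACGTACGT\n' | pansn_rename sample01 1

# With names in this form, the assumed invocation would add: -Y '#'
```

With this naming, all contigs of one haplotype share the prefix sample01#1#, so a delimiter-based filter can tell them apart from sequences of other samples or haplotypes.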