Building a graph from fragmented assemblies #268

evcurran · 2022-12-13T15:06:22Z

I have been trying to build a pangenome graph of Arabidopsis arenosa (an outcrosser with high heterozygosity) from 10 unscaffolded assemblies (and 10 corresponding alternative haplotype assemblies), plus 4 chromosome-level assemblies. Three of the chromosome-level assemblies are from closely related species, and one is a recent A. arenosa build. As the unscaffolded assemblies are quite fragmented (minimum contig size of 4 kb), I used the sequence partitioning method to cluster the contigs into chromosome communities, and then built graphs per chromosome, using the following settings:

pggb -i $fasta -s 2000 -p 90 -k 29 -G 3079,3559 -n 24 -t 12 -v -L -U -S -m -o $outdir

and the output was very messy (for a single chromosome):

length: 85,478,416 (largest constituent chromosome is 24,241,940)
nodes: 5,153,596
edges: 7,275,076
paths: 3891
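
For readers who want to reproduce summary figures like these from a graph file, the following is a minimal sketch. It assumes the graph is available as GFA v1 with sequences stored inline on the `S` lines. The function name `gfa_stats` and the toy graph are hypothetical, and this is not necessarily how the numbers above were produced.

```bash
# Summarise a GFA v1 file: total segment length plus node, edge and path counts.
# Assumes each S line carries its sequence in column 3 (not "*" with an LN tag).
gfa_stats() {
  awk -F'\t' '
    $1 == "S" { nodes++; len += length($3) }   # segment line: column 3 is the sequence
    $1 == "L" { edges++ }                      # link line
    $1 == "P" { paths++ }                      # path line
    END { printf "length=%d nodes=%d edges=%d paths=%d\n", len, nodes, edges, paths }
  ' "$1"
}

# Toy graph (hypothetical data): 3 segments, 3 links, 2 paths.
toy_gfa=$(mktemp)
{
  printf 'H\tVN:Z:1.0\n'
  printf 'S\t1\tACGT\n'
  printf 'S\t2\tG\n'
  printf 'S\t3\tTTCA\n'
  printf 'L\t1\t+\t2\t+\t0M\n'
  printf 'L\t2\t+\t3\t+\t0M\n'
  printf 'L\t1\t+\t3\t+\t0M\n'
  printf 'P\thapA\t1+,2+,3+\t*\n'
  printf 'P\thapB\t1+,3+\t*\n'
} > "$toy_gfa"

gfa_stats "$toy_gfa"   # prints: length=9 nodes=3 edges=3 paths=2
```

Counting paths this way also makes it easy to see how many input sequences ended up in a per-chromosome graph, which is where contig-level and scaffolded inputs differ most.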

To try to simplify the graph, I used the software ragtag to scaffold the fragmented assemblies to the arenosa chromosome-level assembly, and then built a graph using the 10 "pseudo-scaffolded" primary (+ 10 alternative) assemblies, plus the arenosa reference. Here are a couple of different settings I tried for a single chromosome:

pggb -i $fasta -s 20000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

length: 122,223,409 (longest constituent chromosome is 26,723,338)
nodes: 2,598,044
edges: 3,605,318
paths: 21

pggb -i $fasta -s 10000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

length: 88,605,057 (longest constituent chromosome is 26,723,338)
nodes: 14,316,648
edges: 20,498,846
paths: 21
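
One rough way to compare the three runs is the ratio of total graph length to the longest constituent chromosome. Treating this ratio as an indicator of how much sequence failed to collapse is an assumption made here for illustration, not a metric reported by pggb, and the helper name `graph_inflation` is hypothetical. The inputs are the figures quoted above.

```bash
# Ratio of total graph length to the longest input chromosome, to two decimals.
graph_inflation() {
  awk -v g="$1" -v c="$2" 'BEGIN { printf "%.2f\n", g / c }'
}

graph_inflation 85478416 24241940    # contig-level input, -s 2000:  3.53
graph_inflation 122223409 26723338   # scaffolded input, -s 20000:   4.57
graph_inflation 88605057 26723338    # scaffolded input, -s 10000:   3.32
```

On these numbers, all three graphs are more than three times the length of the longest chromosome they contain.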

The linearity has improved, but there are still some very complex-looking regions that might not be aligned properly. Do you have any recommendations for the parameters I should be using? For context, I want to capture structural variation among the lineages represented by the 10 assemblies, and then align short reads to the graph so I can genotype SVs in existing sequencing data. I saw there was a parameter -Y to avoid self-mappings, which could reduce complexity, but it’s unclear to me what argument needs to be passed to it. Thank you for any help with this!


jianshu93 · 2024-11-07T00:07:28Z

Hi @evcurran,

Just curious how you put 10 genomes together when each genome is fragmented. One FASTA file cannot do that, right?

Thanks,
Jianshu

evcurran · 2024-11-07T13:22:48Z

It was a multi-sequence FASTA file, just as you would have for a chromosome-level assembly; there were just many more sequences contained in it than you'd have in a cleaner assembly!
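
As an illustration of combining several fragmented assemblies into one multi-sequence FASTA, here is a minimal sketch. It assumes PanSN-style names (`sample#haplotype#contig`), which the PGGB documentation recommends. The thread does not say which naming scheme was actually used, and the function name `pansn_rename`, the sample name `asmA`, and the toy files are all hypothetical.

```bash
# Prefix every FASTA header with "<sample>#<haplotype>#" and drop any description.
pansn_rename() {
  awk -v prefix="$1#$2#" '
    /^>/ { sub(/^>/, ">" prefix); print $1; next }   # header: keep only the renamed ID
    { print }                                        # sequence lines pass through unchanged
  '
}

# Toy inputs (hypothetical data): a primary and an alternative assembly of one sample.
workdir=$(mktemp -d)
printf '>ctg1 len=8\nACGTACGT\n>ctg2\nGGCC\n' > "$workdir/asmA.primary.fa"
printf '>ctg1\nTTAA\n'                        > "$workdir/asmA.alt.fa"

# Concatenate the renamed assemblies into a single multi-sequence FASTA.
{
  pansn_rename asmA 1 < "$workdir/asmA.primary.fa"
  pansn_rename asmA 2 < "$workdir/asmA.alt.fa"
} > "$workdir/combined.fa"

grep '^>' "$workdir/combined.fa"
# >asmA#1#ctg1
# >asmA#1#ctg2
# >asmA#2#ctg1
```

Unique prefixes keep contigs with the same original name (here `ctg1`) distinct after concatenation, however many contigs each assembly contributes.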

In the end for my data the minigraph-cactus pipeline was more appropriate, as we had one chromosome-level assembly that could act as a backbone that the other more fragmented assemblies could be sequentially aligned to. If I had a full set of chromosome-level assemblies I would have better luck with PGGB.

jianshu93 · 2024-11-07T17:47:56Z

Unfortunately, I do not have such a complete assembly to map against. All of mine are fragmented. So there is no way out, at least for now?

Thanks,
Jianshu
