Building a graph from fragmented assemblies #268

Open
evcurran opened this issue Dec 13, 2022 · 3 comments

Comments

@evcurran

I have been trying to build a pangenome graph of Arabidopsis arenosa (an outcrosser with high heterozygosity) from 10 unscaffolded assemblies (plus 10 corresponding alternative haplotype assemblies) and 4 chromosome-level assemblies. Three of the chromosome-level assemblies are from closely related species, and one is a recent A. arenosa build. Because the unscaffolded assemblies are quite fragmented (minimum contig size of 4 kb), I used the sequence partitioning method to cluster the contigs into chromosome communities and then built a graph per chromosome, using the following settings:

pggb -i $fasta -s 2000 -p 90 -k 29 -G 3079,3559 -n 24 -t 12 -v -L -U -S -m -o $outdir

The output was very messy (shown here for a single chromosome):

[Image: scaffold_5_arenosa_pri_alt.fa.3dd2fe6.2ff309f.57e755b.smooth.og.lay.draw_mqc]

length: 85,478,416 (largest constituent chromosome is 24,241,940)
nodes: 5,153,596
edges: 7,275,076
paths: 3891

To try to simplify the graph, I used RagTag to scaffold the fragmented assemblies against the A. arenosa chromosome-level assembly, and then built a graph from the 10 "pseudo-scaffolded" primary (+ 10 alternative) assemblies plus the A. arenosa reference. Here are a couple of different settings I tried for a single chromosome:

pggb -i $fasta -s 20000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.66905ed.7bdde5a.c6d8610.smooth.og.viz_depth]

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.66905ed.7bdde5a.c6d8610.smooth.og.lay.draw_mqc]

length: 122,223,409 (longest constituent chromosome is 26,723,338)
nodes: 2,598,044
edges: 3,605,318
paths: 21

pggb -i $fasta -s 10000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.5f35582.7bdde5a.c6d8610.smooth.og.viz_depth]

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.5f35582.7bdde5a.c6d8610.smooth.og.lay.draw_mqc]

length: 88,605,057 (longest constituent chromosome is 26,723,338)
nodes: 14,316,648
edges: 20,498,846
paths: 21
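
As a rough way to compare the runs above, the graph length can be divided by the length of the longest constituent chromosome; a value well above 1 suggests that a lot of sequence is not being collapsed by the alignments. This is only a minimal sketch of that arithmetic using the numbers reported above, and the `expansion` helper is a hypothetical name, not part of pggb or odgi:

```shell
#!/usr/bin/env bash
# Ratio of total graph length to the longest constituent chromosome.
# "expansion" is a hypothetical helper written for this illustration.
expansion() {
  # $1 = graph length, $2 = longest constituent chromosome length
  awk -v g="$1" -v c="$2" 'BEGIN { printf "%.2f\n", g / c }'
}

expansion 85478416 24241940    # scaffold_5, unscaffolded contigs, -s 2000
expansion 122223409 26723338   # scaffold_1, RagTag-scaffolded, -s 20000
expansion 88605057 26723338    # scaffold_1, RagTag-scaffolded, -s 10000
```

By this crude measure, all three graphs are still more than three times the length of the longest chromosome, so it says nothing on its own about which settings are better; it is just a convenient single number to track between runs.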

The linearity has improved, but there are still some very complex-looking regions that might not be aligned properly. Do you have any recommendations for the parameters I should be using? For context, I want to capture structural variation among the lineages represented by the 10 assemblies, and then align short reads to the graph so that I can genotype SVs in existing sequencing data. I saw that there is a parameter, -Y, to avoid self-mappings, which could reduce complexity, but it is unclear to me what argument needs to be passed to it. Thank you for any help with this!

@jianshu93

Hi @evcurran,

Just curious how you put 10 genomes together when each genome is fragmented. A single FASTA file cannot do that, right?

Thanks,
Jianshu

@evcurran
Author

evcurran commented Nov 7, 2024

It was a multi-sequence FASTA file, just as you would have for a chromosome-level assembly; it simply contained many more sequences than a cleaner assembly would!
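
A minimal sketch of how such a combined file could be put together is below. The file names, the `add_prefix` helper, and the `asm1#1#` / `asm1#2#` labels are assumptions made up for the example (the labels follow the PanSN-style `sample#haplotype#contig` naming that the PGGB documentation recommends), not necessarily what was used for this dataset:

```shell
#!/usr/bin/env bash
# Prefix every FASTA header with a sample/haplotype label so that contigs
# from different assemblies stay distinguishable after concatenation.
# "add_prefix" is a hypothetical helper; the labels below are assumed.
add_prefix() {
  # $1 = prefix to add, $2 = input FASTA
  awk -v p="$1" '/^>/ { print ">" p substr($0, 2); next } { print }' "$2"
}

# Tiny stand-in assemblies so the sketch runs on its own.
tmp=$(mktemp -d)
printf '>contig1\nACGT\n>contig2\nGGCC\n' > "$tmp/asm1_pri.fa"
printf '>contig1\nTTAA\n' > "$tmp/asm1_alt.fa"

# Concatenate the relabelled assemblies into one multi-sequence FASTA.
{
  add_prefix 'asm1#1#' "$tmp/asm1_pri.fa"
  add_prefix 'asm1#2#' "$tmp/asm1_alt.fa"
} > "$tmp/combined.fa"

# List the resulting sequence names.
grep '^>' "$tmp/combined.fa"
```

With real data the same loop would simply run over every primary and alternative assembly, which is why the combined file ends up with thousands of sequences when the inputs are fragmented.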

In the end, the Minigraph-Cactus pipeline was more appropriate for my data, as we had one chromosome-level assembly that could act as a backbone to which the other, more fragmented assemblies could be sequentially aligned. If I had had a full set of chromosome-level assemblies, I would probably have had better luck with PGGB.

@jianshu93

Unfortunately, I do not have such a complete assembly to map against. All of mine are fragmented. So there is no way out, at least for now?

Thanks,
Jianshu
