Questions related to interpretation results phased assembly. #245
Comments
The initial assembly (the 7.3 Gbp file) is still phased; it's just not split into haplotypes. Each contig comes from a single haplotype, based on the HiFi + ONT data. In the second case, the assembly.fasta will still be the full assembly (haplotype1 + haplotype2 + unassigned), so it makes sense that it has not changed. It may have linked some sequences together within haplotypes that couldn't be linked without HiC, making them longer, which would be reflected in the sequence stats. I suspect the issue is that the assembly is almost completely phased already and there isn't anything for HiC to do except assign contigs to a haplotype. This requires homology detection, which may not be tolerating the diversity of this species vs human. You can try increasing the
Without the graph and colors it is hard to say anything, but in any case such a large unassigned file is not normal. Which species is it? Do you expect to have large rDNA arrays, and what is the estimated level of heterozygosity?
Aha, this is actually another problem. Graph nodes (which are constructed from ONT & HiFi but not HiC) are quite fragmented. This is not normal; those regions should be resolved with ONT reads. Without that there is just not enough HiC signal to assign labels to those short haplotype-specific regions - for some chromosome pairs there are thousands of nodes. The cause seems to be a verkko problem, but at the graph simplification/ONT alignment stages, not in HiC phasing - I see a huge number of fake bulges with really different coverage between "haplotypes" - @skoren can you have a look? --haplo-divergence should not actually change anything here.
Yet another assumption - is it a cell line or a tissue sample? Possibly those numerous 5x vs 50x bubbles can represent problems with the cell line? |
The genome seems to be a combination of quite diverged regions and some extremely homozygous regions (like the one @Dmitry-Antipov posted above). In normal homozygous genomes those low-coverage nodes would be simplified and you'd end up with a single large node, but here, because it's double the coverage of most of the genome, they aren't. I do think that
Yep, Sergey is right, I did not notice that in addition to that under-resolved problem there are some unlabeled large nodes - those labels can be fixed with haplo-divergence. But it definitely will not help with the fragmented+unassigned problem I was writing about.
Gentlemen, thanks so far, there is definitely some food for thought here before I provide some feedback (next week). In the meantime, I'll rerun the final steps with a larger haplotype divergence setting and include that in the feedback as well.
When you re-run with the haplo-divergence change, I'd suggest also adding the
Still working on understanding this issue. Besides the example you picked out, where that part of the genome is relatively homozygous (on one chromosome), on other chromosomes we might have much higher diversity. Could the diversity be too high, so that verkko fails to detect that two homologues belong to the same chromosome and therefore does not export these sequences to the phased fasta files? If so, this would probably also affect your previous suggestion of running the assembly as '--haploid'?
Right, there are two issues. One is that there is high diversity in some regions, and two is that there are some very homozygous regions. The
No improvement with --haploid and --haplo-divergence 0.2. What divergence does the value of 0.2 imply? |
Could you share graph and colors for the most recent run once again?
It should help to detect homologous regions and thus better assign colors (by Hi-C phasing) to them. But the corresponding regions (aka nodes in the graph) should be long enough to make it work.
share.tar.gz |
In addition to the above: HiFi heterozygous/homozygous depth = 40x/80x and ONT heterozygous/homozygous depth = 11x/21x.
I've checked, and I still see both problems we discussed above. As an alternative to the fastas, I can look at mashmap's mappings computed on your side - 8-hicPipeline/run_mashmap.sh & 8-hicPipeline/mashmap.out
With respect to sharing the fastas: we'll have to see how to do this in a secure manner. I'll generate the mashmap50.out file as an alternative. But as the mashmap.out file was cleaned up, I'll have to rerun 8-hicPipeline/run_mashmap.sh, which will take some days. Not sure if this helps, but the raw sizes are unitigs.fasta (7.4G) and unitigs.hpc.fasta (5.2G). The haploid genome is approx. 4.5-5Gb.
Yeah, sharing fastas can be a sensitive issue. I need them only to look at mashmap's results, so running it on your side and sending us the resulting mappings is perfectly fine.
mashmap60.out.gz |
OK, so the mashmap "alignments" are really fragmented; we just do not see enough similarity to phase the corresponding nodes. We can make some improvements on our side. To test that they are reasonable we'll also need hic.byread.compressed (just pairwise Hi-C mapping counts) from 8-hicPipeline. @skoren, do you have any suggestions about improving bulge removal in the haploid part of this genome (see attached figure)?
xab.gz @Dmitry-Antipov if I read your suggestion correctly, it would be ok to re-execute the run_mashmap.sh script with the modified --pi settings. Correct? Or would a rerun of the complete pipeline be necessary with the modified haplo-divergence 0.25 setting?
you can edit that script and then rerun hic_phasing.sh to see whether the number of non-phased nodes (phased ones are listed in hicverkko.colors.tsv) will be significantly reduced or not. |
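A minimal sketch of that before/after comparison, assuming hicverkko.colors.tsv is tab-separated with the node name in the first column (an assumption; check your file, and note that a header row, if present, would be counted as a shared "node"). The helper names `phased_nodes` and `compare_runs` are hypothetical and not part of verkko.

```python
def phased_nodes(tsv_lines, name_column=0):
    """Collect node names from the lines of a colors TSV.

    Assumes one node per line with its name in `name_column`
    (the column layout is an assumption, not verified against verkko).
    """
    names = set()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) > name_column:
            names.add(fields[name_column])
    return names


def compare_runs(before_lines, after_lines):
    """Count nodes phased in both runs, only before, and only after."""
    before = phased_nodes(before_lines)
    after = phased_nodes(after_lines)
    return {
        "shared": len(before & after),
        "only_before": len(before - after),
        "only_after": len(after - before),
    }
```

To use it on real files, pass open file handles, for example `compare_runs(open("old/hicverkko.colors.tsv"), open("new/hicverkko.colors.tsv"))`, where the two paths are placeholders for the previous and the rerun output.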
On the bubble popping issue, verkko doesn't expect to have such a wide range of heterozygosity in the same genome and we rely on a global coverage estimate from the graph. This is why the |
The size of hicverkko.colors.tsv from the previous run and after rerunning hic_phasing.sh is similar. There are a few changes, but the majority of contigs are common to both files. Food for thought. Also, I'm still puzzled by the 5x coverage bubbles in the graph above. Haploid HiFi depth is 40x / homozygous 80x. Could these bubbles not come from incorporation of the ONT or HiC data?
The coverage on those nodes is based solely on HiFi kmers so they couldn't have come from ONT data (which would have 0 HiFi k-mers) and Hi-C data doesn't add any sequence to the graph. I suspect there is either some low-level somatic variation (if this is a cell line or similar) or some kind of systematic error. |
It's a diploid plant species, haploid genome ~5Gb. I'm slowly getting some additional information from orthogonal datasets. Some chromosomes are similar and some chromosomes are diverged. At ~7.5Gb, the total size of the assembly is not that bad (Verkko / p_utg Hifiasm ~8Gb). I see signals that contigs from some chromosomes are nicely phase-separated, while in others the hamming rate is extremely high. I can share some more insights, but not via this medium. The update in terms of "using local component coverage and not global coverage" you commented on three days ago sounds like one strategy that would push things forward. I'm happy to give it a try, even if it is still in a test branch, but will wait until you give the go-ahead @skoren. @Dmitry-Antipov does it make sense to look further at the mashmap output / hicverkko.colors.tsv? Or is it clear enough that this was not the way forward?
I have all the data I need for now. I believe that it's possible to see at least some improvement on unphased contigs, but I haven't implemented the corresponding fix yet. I hope to post an update on this issue at the beginning of next week.
I have some other things to do, but can have more focus on this project again in August |
@Dmitry-Antipov is there an updated branch for me to test? Tnx!
Hi, you can try the branch het_fixes. In any case, I'd be very careful with the phasing results here, and definitely check them with gene-based verification like compleasm.
Reran everything from scratch, but it gets stuck processing files in 8-hic. See the DAG of jobs: HiC_rdnascaff 1. The error likely has to do with not being able to find a file, but I have not yet figured out which one. Is there a specific batch script left over by Snakemake (batch-script directory) to restart only the last part manually, so that I have more fine-grained control over where to start debugging?
Which job crashed? The tail of the log file should help to see what happened. If all mashmap jobs have finished, there are quite a few scripts to run: a bunch of scaffold_prefilter*.sh scripts, scaffold_mergeBWA.sh and hic_scaff.sh. Possibly they have already been created in your 8-hicPipeline folder - do you see them? My first guess is that rukki.paths.tsv & rukki.paths.gaf are the files that cause the rerun, because of suboptimal snakemake pipeline design. Do you see them in the 8-hicPipeline folder?
rukki.paths.tsv & rukki.paths.gaf (they are identical?) are there. The scaffold_prefilter*.sh, scaffold_mergeBWA.sh and hic_scaff.sh are lacking. Error executing rule prepareScaffolding on cluster (jobid: 110, external: 109538, jobscript: /media/ont/ASM3.91/.snakemake/tmp.enu3oklq/verkko.prepareScaffolding.110.sh). For error details see the cluster log and the log files of the involved rule(s). but cannot find more. |
Then it likely means that the third mashmap run crashed and paths2ref.mashmap is not valid. You can check the error in the log, 8-hicPipeline/prepare_ref_mashmap.err
---Preparing rukki paths fasta |
This output looks normal, strange. Can you rerun verkko with snakemake dry-run option (verkko <> --snakeopts '--dry-run' > dry_rerun.log) and upload the resulting log? |
dry_rerun.log |
Ok, it seems that there's a bug in our pipeline logic, which is present in the v2.2 release too and affects runs with the --cleanup option that you likely used. Thank you for the help with locating it. The simplest way to use the current run is to skip the scaffolding step completely. Hi-C phasing is already done, since you have the rukki.paths.gaf file. We added scaffolding after the v2.1 release, so results should be similar to what you had before, except that more contigs are phased because of my changes for higher heterozygosity in the het_fixes branch. If you actually want to run scaffolding and not only phasing I'll write up instructions, but since we do not want to rerun the mashmap jobs it would be a bit tricky. However, we have had no chance to test scaffolding on such heterozygous data (and the algorithm relies on the diploid structure of the genome), so possibly not running scaffolding is simply the right choice.
Ok, thanks. I'll try that over the weekend. I'm interested in the scaffolding as well, but that can be a later experiment. I'll first check the phasing step and validate with orthogonal datasets. Will keep you informed. tnx. |
OK, so for scaffolding:
For all runs I'd run in dry mode first to check that snakemake is not trying to rerun jobs that are already finished. You can compare prescaf_rukki.paths.tsv and scaff_rukki.paths.tsv to see the effect of scaffolding. |
What is the right file for --assembly to use? |
Not file but path to the assembly dir. |
Both haplotypes are now 0 bytes. No apparent errors in the logs. I'll dive a bit deeper. |
I'd guess the issue is the original assembly (passed in --assembly) was not fully complete so the consensus (--paths) option didn't know how to partition the new consensus into haplotypes. Do the names in the assembly.fasta file start with haplotype1 or haplotype2 or similar? If so, you can still use the assembly.fasta as your result and just take any sequence with the name haplotype1 as hap1 and haplotype2 as hap2. |
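A minimal sketch of that name-based split, assuming sequence names in assembly.fasta begin with `haplotype1` or `haplotype2` as described above; anything else is binned as unassigned. The helpers `read_fasta` and `split_by_prefix` are hypothetical, and the sketch keeps sequences in memory, so for a multi-Gb assembly you would want to stream records to output files instead.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from the lines of a FASTA file."""
    name, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            header = line[1:].split()
            name, seq = (header[0] if header else ""), []
        elif name is not None:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)


def split_by_prefix(records, prefixes=("haplotype1", "haplotype2")):
    """Bin records by name prefix; non-matching names go to 'unassigned'."""
    bins = {prefix: [] for prefix in prefixes}
    bins["unassigned"] = []
    for name, seq in records:
        for prefix in prefixes:
            if name.startswith(prefix):
                bins[prefix].append((name, seq))
                break
        else:
            # No prefix matched this record name.
            bins["unassigned"].append((name, seq))
    return bins
```

For example, `split_by_prefix(read_fasta(open("assembly.fasta")))` returns a dictionary with one list of records per haplotype plus the unassigned remainder.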
No, they were starting with contain-xxxxxx. Couldn't find (until now) any other reason.
Something must have gone wrong with the paths processing, the paths file names contain things like haplotype1/2 correct? What was the exact --paths command? |
verkko --paths ASM3.91/8-hicPipeline/rukki.paths.gaf |
If the file is coming from the 8-hicPipeline folder, it should be phased and not be named Perhaps given that the original issue you had is fixed in v2.2.1 and that you had an issue with re-generating the consensus, it is easiest to just run a new assembly from scratch with v2.2.1 to make sure we're not chasing restart/mixing problems in the runs. |
Will do so. At another moment/ticket we'll need to look at the memory allocation for mashmap. It is not sufficient (slurm submission) and I need to restart things manually (not a problem for now).
Ok, we are making progress here. The latest version does indeed scaffold more into haplotypes (2x). Of the 7.2Gb basal assembly, 2Gb could not be assigned. The rest is split approx. 50/50 over the two haplotypes. I'm getting some orthogonal data as well to confirm some stuff. We can discuss things in more detail later; I first need to process this.
Glad to hear that it is improving. The initial assembly you shared had quite a few homozygous regions w/no bubbles (except sequencing errors) so I suspect that many of the unassigned are from regions w/o any heterozygosity and likely belong to both haplotypes. I'd suggest checking the coverage of the nodes (the noseq.gfa has hifi coverage) to see if they are double the average. Or you can share |
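A minimal sketch of that coverage check, assuming the noseq GFA stores node length in an `LN:i:` tag and HiFi coverage in an `ll:f:` tag on each S-line (the coverage tag name is an assumption; pass a different `cov_tag` if your file differs). The helper names and the 1.5x threshold are hypothetical choices for illustration, and the median is taken over nodes rather than weighted by length.

```python
from statistics import median


def node_coverages(gfa_lines, cov_tag="ll:f:"):
    """Return {node: (length, coverage)} parsed from GFA S-lines.

    `cov_tag` is an assumed tag prefix for coverage; adjust as needed.
    Nodes lacking either tag are skipped.
    """
    nodes = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        length, cov = None, None
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                length = int(tag[len("LN:i:"):])
            elif tag.startswith(cov_tag):
                cov = float(tag[len(cov_tag):])
        if length is not None and cov is not None:
            nodes[fields[1]] = (length, cov)
    return nodes


def likely_homozygous(nodes, unassigned, ratio=1.5):
    """Unassigned nodes whose coverage is at least `ratio` x the median."""
    if not nodes:
        return []
    med = median(cov for _, cov in nodes.values())
    return sorted(
        name for name in unassigned
        if name in nodes and nodes[name][1] >= ratio * med
    )
```

Unassigned nodes flagged this way are candidates for regions without heterozygosity that belong to both haplotypes; the list of unassigned node names has to come from your own run.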
Hi Sergey, I'll share this data one of these weeks. In the meantime, I have some orthogonal sets (marker data from a population, to check phasing as well). Together, they might give better insight into where things go well and where they do not. Any chance you'll be at PAG?
Yes, I'll be at PAG. |
I’ll try to bump into you, otherwise I'll make an appointment for Tuesday.
I'm attempting an assembly (relatively heterozygous diploid genome; 5Gb haploid size; with HiFi, ONT & Hi-C).
In the directory 7-consensus, the unitig-popped fasta is 7.3Gb; however, the unitig-popped haplotype1, haplotype2 and unassigned files are 0 bytes in size.
Does this imply that the ONT data made no contribution to the phasing, even though there are reads in ont_subset.fasta.gz?
The final assembly.fasta is still 7.3Gb, with the haplotype1 and haplotype2 fasta files both being 1.3G in size. Is this phasing the sole contribution of the Hi-C data?
What would be good parameters to re-attempt this assembly, to deal with the higher diversity in this species (compared to human)?
Tnx!