How are you adding mutations to your tree sequence? If you are using |
I'm working on a research project building a DNA damage simulation pipeline for ancestry inference, where I:
Current issue
I notice that write_vcf() doesn't take a reference sequence parameter, while write_fasta() does accept reference_sequence. This creates a misalignment between my files:
FASTA generation allows a reference sequence:
ts.write_fasta("output.fa", reference_sequence=ref_seq)
VCF generation has no equivalent parameter:
ts.write_vcf("output.vcf")  # No reference_sequence parameter available
While write_vcf() has a position_transform parameter, this only allows approximate position adjustments and does not ensure that alleles are correctly aligned to a reference.
Position inconsistency issue
Even when using the same tree sequence for both outputs, the positions in the VCF are not consistent with the FASTA. The VCF positions from ts.write_vcf() start at 0 by default and don't correspond to the actual positions in the reference sequence. This creates fundamental problems for my simulation pipeline, as I need the VCF to reflect the exact same positions as the FASTA file to properly analyze the effects of DNA damage against a known reference.
Based on my observations, write_vcf() appears to use relative positions where mutations are encoded as discrete markers beginning at 0. For FASTA generation, this mask is mapped to physical genomic coordinates by replacing unmasked positions (0s) with reference nucleotides. However, there is no direct mechanism to export VCF positions as absolute genomic coordinates rather than normalized values, creating a mismatch between the VCF's relative positions and the FASTA's physical positions.
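The coordinate mismatch described above can be sketched in a few lines. This is only an illustration in plain Python, not tskit code: ref_seq, site_positions, and both helper functions are hypothetical names, and the sketch assumes site positions are 0-based integer indices into the reference string.

```python
# Hypothetical illustration; none of these names come from tskit.
ref_seq = "ACGTACGTAC"
site_positions = [0, 3, 7]  # assumed 0-based site positions


def ref_allele_at(reference, pos0):
    """Return the reference base at a 0-based position."""
    return reference[pos0]


def to_vcf_pos(pos0):
    """VCF POS is 1-based, so shift a 0-based position by one."""
    return pos0 + 1


for pos0 in site_positions:
    # Print the 1-based VCF position next to the base the FASTA holds there.
    print(to_vcf_pos(pos0), ref_allele_at(ref_seq, pos0))
```

A VCF that reports position 0 for the first site, or a REF allele other than the base found at that index in the reference, would disagree with a FASTA built from the same reference.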
Impact on DNA damage analysis
In ancient DNA/damaged-sample pipelines, precise VCF-FASTA alignment is critical because:
Current workarounds using relative positions (0-1 range) or approximate alignments fail to meet these requirements for real-world damaged data analysis.
Attempted workarounds
I've tried several approaches without success:
position_transform, but this proved too approximate and failed to achieve precise positioning.
These workarounds produce approximations when precision is critical. In my pipeline, the FASTA files are the "ground truth" and serve as the baseline for all comparisons. The VCF needs to perfectly match this ground truth, not approximate it. The current methods introduce unnecessary complexity and error into what should be a straightforward alignment.
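One way to quantify how far a VCF drifts from the FASTA ground truth is to check each record's REF allele against the reference at that position. The following is a minimal sketch, assuming tab-separated VCF data lines with 1-based POS in column 2 and REF in column 4; check_vcf_against_reference is a hypothetical helper, not part of tskit.

```python
def check_vcf_against_reference(vcf_lines, reference):
    """Return (pos, vcf_ref, observed) for every record whose REF allele
    does not match the reference at its 1-based POS."""
    mismatches = []
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pos, ref = int(fields[1]), fields[3]
        # Convert 1-based POS to a 0-based slice. POS 0 yields an empty
        # slice and is therefore reported as a mismatch.
        observed = reference[pos - 1 : pos - 1 + len(ref)]
        if observed != ref:
            mismatches.append((pos, ref, observed))
    return mismatches


reference = "ACGTACGTAC"
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t4\t.\tT\tG\t.\tPASS\t.",  # REF matches reference[3]
    "1\t5\t.\tC\tT\t.\tPASS\t.",  # REF disagrees with reference[4]
]
print(check_vcf_against_reference(vcf_lines, reference))
```

An empty result means every REF allele agrees with the reference at the stated position. Any entries point to records that are shifted or anchored to a different ancestral state.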
Desired functionality
I need to generate VCF files that precisely match the genomic positions and mutations in my FASTA files before damage is introduced. This alignment is critical for realistic comparisons between original and damaged sequences.
My ideal solution would be either:
reference_sequence parameter for write_vcf()
Is there a recommended approach to ensure exact position/allele alignment between FASTA and VCF outputs? Or should this functionality be added?
Perhaps write_vcf() could be enhanced to accept a reference_sequence parameter, similar to write_fasta(). Since write_fasta() already applies mutations at precise positions on the reference sequence, implementing this for VCF output would ensure perfect alignment between VCF and FASTA files and eliminate the need for approximate position transformations.
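To make the request concrete, here is a rough sketch of what reference-anchored output could mean. It is not how tskit implements write_fasta() or write_vcf(): apply_variants and vcf_records are hypothetical helpers, and the sketch assumes a single haploid sample, single-base substitutions, and 0-based integer positions.

```python
def apply_variants(reference, variants):
    """Build the FASTA-style sequence by substituting each derived allele
    at its 0-based position in the reference."""
    seq = list(reference)
    for pos0, alt in variants:
        seq[pos0] = alt
    return "".join(seq)


def vcf_records(reference, variants, contig="1"):
    """Build VCF data lines whose REF comes from the same reference string,
    so POS and REF agree with the FASTA by construction."""
    records = []
    for pos0, alt in sorted(variants):
        ref = reference[pos0]
        if alt == ref:
            continue  # not a variant relative to the reference
        # VCF POS is 1-based, hence pos0 + 1.
        records.append(f"{contig}\t{pos0 + 1}\t.\t{ref}\t{alt}\t.\tPASS\t.")
    return records


reference = "ACGTACGTAC"
variants = [(3, "G"), (7, "A")]  # (0-based position, derived allele)
print(apply_variants(reference, variants))
for record in vcf_records(reference, variants):
    print(record)
```

Because both outputs read from the same reference string and the same variant list, neither needs an approximate position adjustment to match the other.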
Would this be feasible? I'm happy to help test implementations.
Thank you very much for your help.
PS: While discussion #2838 addresses 0/1-based coordinate conflicts, this request focuses on biological alignment: ensuring that VCF positions and REF alleles exactly match reference-anchored FASTA outputs, which is crucial for damage artifact analysis. The solutions proposed there don't resolve reference sequence mismatches.