Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject contigs with names that would break bcftools #65

Open
fedarko opened this issue Jan 26, 2023 · 0 comments
Open

Reject contigs with names that would break bcftools #65

fedarko opened this issue Jan 26, 2023 · 0 comments
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@fedarko
Copy link
Owner

fedarko commented Jan 26, 2023

I'm currently working with a contig named "160+,38-,88-,106-,51+,23+,92-,34+,158-,167+,41+,131+,81+,58+,102+,192-(circular)". I was able to align reads to this contig and perform r-mutation calling in it; however, when I get to the part of strainFlye call r-mutation that attempts to convert the resulting VCF file to a BCF file, I get the following error:

...
call r-mutation @ 1,989.53s: Converting the VCF file to a compressed BCF file...
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=160+,38-,88-,106-,51+,23+,92-,34+,158-,167+,41+,131+,81+,58+,102+,192-(circular),length=388097>"
...

I'm pretty sure the commas and parentheses are the kicker here. The VCF v4.3 spec confirms this:

Contig names follow the same rules as the SAM format’s reference sequence names: they may contain any printable
ASCII characters in the range [!-~] apart from ‘\ , "‘’ () [] {} <>’ and may not start with ‘*’ or ‘=’. Thus they
match the following regular expression:

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

In particular, excluding commas facilitates parsing ##contig lines, and excluding the characters ‘<>[]’ and initial ‘*’
avoids clashes with symbolic alleles. The contig names must not use a reserved symbolic allele name.

The obvious solution here is thus updating strainFlye's FASTA-reading function (?) to reject badly-named contigs like this. We could maybe make this check only required for strainFlye commands that will use bcftools or something, but it's less of a headache (and probably good practice) to just apply this check to every time we read a FASTA file.

This is a similar idea to #57 -- however, addressing that issue in its entirety would require some extra work. (For example: The regex given by the VCF spec apparently allows contigs with / characters in their names, but I have to imagine that if your contigs actually have / in their names then a swarm of locusts will devour your terminal or something when you start trying to use these names in files.)

@fedarko fedarko added bug Something isn't working good first issue Good for newcomers labels Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

1 participant