ID field as semicolon-separated list #8

Closed
wants to merge 446 commits into from

myourshaw

From the VCF spec: "ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)"
In parser.py, next(self):

    if row[2] != '.':
        ID = row[2].split(';')
    else:
        ID = None
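
As a standalone illustration of the proposed behaviour, the same logic can be sketched as a small helper. The function name `parse_id_field` is hypothetical and not part of PyVCF; it only mirrors the snippet above, returning a list of identifiers, or `None` for the missing value.

```python
def parse_id_field(raw):
    """Split a VCF ID column into a list of identifiers.

    Hypothetical helper mirroring the snippet above; not PyVCF API.
    Returns None when the column holds the missing value '.'.
    """
    if raw == '.':
        return None
    # The spec forbids semicolons inside an identifier, so a plain split is enough.
    return raw.split(';')


print(parse_id_field('rs6054257;rs6040355'))
print(parse_id_field('.'))
```

Note that this changes the type of `ID` from a string to a list, so callers that compare `ID` to a string would need to be adjusted.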

James Casbon and others added 30 commits June 12, 2012 01:50
Make metadata RE reluctant (stop on first = not last)
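
The difference between a greedy and a reluctant key match can be seen in a minimal sketch. The two patterns below are illustrative assumptions, not the expressions actually used in PyVCF's parser.

```python
import re

# Illustrative patterns only; the real PyVCF expression may differ.
GREEDY = re.compile(r'##(.+)=(.*)')      # key runs up to the LAST '='
RELUCTANT = re.compile(r'##(.+?)=(.*)')  # key stops at the FIRST '='

line = '##source=myProgram opt=1'

greedy_key, greedy_val = GREEDY.match(line).groups()
reluctant_key, reluctant_val = RELUCTANT.match(line).groups()

print(greedy_key, '|', greedy_val)
print(reluctant_key, '|', reluctant_val)
```

With the reluctant quantifier, a value that itself contains `=` stays intact instead of being split at its last `=`.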
Fix writing of Number=A and G INFO/FORMAT fields
martijnvermaat and others added 29 commits June 8, 2014 20:30
Adds _Record.affected_start and .affected_end.
making alternate allele frequency work in the case of non-diploid genotypes
As reported in #164, we previously crashed on flag INFO fields declared
as strings (and the number of values declared as 1). This is indeed not
according to spec, but we should probably allow it anyway.
It is not valid according to the spec, but issue #164 shows a VCF file
where the FORMAT column contains just a dot character. We have no way
of interpreting the subsequent genotype columns in that case, so this
patch ignores them.
Allow flag INFO field to be declared as string
Don't crash when FORMAT is set to the missing value (.)
The spec actually does not allow for metadata lines without value, but we
shouldn't crash on them.

Fixes #168
Until we figure out what causes this, let's keep the test suite working by
pinning pysam to the latest working release.

Traceback:

    Traceback (most recent call last):
      File "/home/travis/build/jamescasbon/PyVCF/build/lib.linux-x86_64-3.3/vcf/test/test_vcf.py", line 1109, in testNoVariantsInRange
        fetched_variants = self.reader.fetch('20', 14370, 17329)
      File "/home/travis/build/jamescasbon/PyVCF/build/lib.linux-x86_64-3.3/vcf/parser.py", line 623, in fetch
        self.reader = self._tabix.fetch(chrom, start, end)
      File "ctabix.pyx", line 345, in pysam.ctabix.Tabixfile.fetch (pysam/ctabix.c:4241)
    TypeError: expected bytes, str found

See #175
- Add R as an INFO field count (number of alleles including reference).
- Support the optional Source and Version keys on INFO metainformation.

Thanks a lot @travc for contributing these fixes!

See #172
Partial support for VCFv4.2
The VCF 4.0 and newer specifications say the ALT field is a comma
separated list that includes "base Strings made up of the bases
A,C,G,T,N". Notably, the last case was not handled by `Record.is_snp`,
causing it to erroneously report `False` for records with "N" as the ALT.
Bugfix: SNP records with N as ALT now noted as SNPs.
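
A minimal sketch of the check described in this commit message, assuming a record is a SNP when REF and every ALT are single bases drawn from A, C, G, T, N. The function `is_snp` below is a simplified stand-in and not PyVCF's actual `Record.is_snp` implementation.

```python
# Bases allowed by the spec for a single-base allele, including N.
SNP_BASES = frozenset('ACGTN')


def is_snp(ref, alts):
    """Return True if REF and all ALT alleles are single bases.

    Simplified stand-in for illustration; not PyVCF's implementation.
    """
    if len(ref) != 1 or not alts:
        return False
    # Multi-character or missing alleles never match a single-base entry.
    return all(str(alt).upper() in SNP_BASES for alt in alts)


print(is_snp('A', ['N']))
print(is_snp('A', ['AT']))
```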
* Remember the ploidy of uncalled genotypes such that
  the sample genotypes written by PyVCF.Writer match the
  sample genotypes read by PyVCF.Reader.
* For uncalled _Calls, gt_nums and gt_bases are None;
  gt_alleles is a list of "None" with a length of _Call.ploidity.
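
The idea behind this change can be sketched as follows. The helpers `parse_gt` and `format_uncalled` are hypothetical names used for illustration only; they show how keeping the allele count lets an uncalled genotype such as `./.` or `.` be written back unchanged.

```python
import re


def parse_gt(gt):
    """Split a GT string into (ploidy, alleles); uncalled alleles become None.

    Hypothetical helper for illustration; not PyVCF API.
    """
    alleles = re.split(r'[/|]', gt)  # '/' is unphased, '|' is phased
    gt_alleles = [None if a == '.' else a for a in alleles]
    return len(alleles), gt_alleles


def format_uncalled(ploidy):
    """Write an uncalled genotype with the remembered number of alleles."""
    return '/'.join(['.'] * ploidy)


ploidy, gt_alleles = parse_gt('./.')
print(ploidy, gt_alleles, format_uncalled(ploidy))
```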
Warnings about open file handles muddle the output of unit tests
and are potentially confusing to those interpreting
the tests.
The sample.data.GT attribute is no longer set to None for
uncalled calls, which means that _format_sample can now
rely on obtaining the original sample genotype.
Fix double quoting issue when writing VCFs
The issue in 0.8.0 seems to be fixed in 0.8.1, so it's now safe to
just blacklist 0.8.0 specifically.

See #175
Support ##contig headers with only ID attributes. Generated by bcftools 1.2 when inputs have no ##contig information
Allow for whitespace after commas in metadata lines
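
One way to tolerate whitespace after commas is to allow `\s*` after each separator in the metadata pattern. The pattern and the `parse_info_line` helper below are a sketch under that assumption; they are not the expression used in PyVCF and cover only the four required INFO keys.

```python
import re

# Illustrative INFO pattern; '\s*' after each comma accepts optional whitespace.
INFO_PATTERN = re.compile(
    r'##INFO=<'
    r'ID=(?P<id>[^,]+),\s*'
    r'Number=(?P<number>-?\d+|\.|[AGR]),\s*'
    r'Type=(?P<type>Integer|Float|Flag|Character|String),\s*'
    r'Description="(?P<desc>[^"]*)"'
    r'>'
)


def parse_info_line(line):
    """Return the INFO fields as a dict, or None if the line does not match."""
    match = INFO_PATTERN.match(line)
    return match.groupdict() if match else None


print(parse_info_line('##INFO=<ID=DP, Number=1, Type=Integer, Description="Total Depth">'))
```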