Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some patients ids are in gene tables but do not exist #329

Open
pontikos opened this issue Apr 7, 2021 · 16 comments
Open

some patients ids are in gene tables but do not exist #329

pontikos opened this issue Apr 7, 2021 · 16 comments
Assignees

Comments

@pontikos
Copy link
Contributor

pontikos commented Apr 7, 2021

example if you go to gene

ENSG00000140505

then patient

PH0000043392

is listed in the variants but does not exist in patients table.

I'm guessing new schema will fix this as the gene API call will cross reference with the patients tables via the individual_variants table.

@alanwilter alanwilter self-assigned this Apr 8, 2021
@alanwilter
Copy link
Collaborator

gene.py still uses the old schema and though we have NewGene db.model, the refactoring of this module will require several other linked tables.

And indeed, the data is wrong in the old schema. New schema should solve this.

@alanwilter
Copy link
Collaborator

@pontikos there's something I need to understand: for a given gene, get the variants associated BUT, e.g.:
https://dev-live.phenopolis.org/gene/ENSG00000140505

  • demo sees NO variants, since it cannot access the patients in HET and HOM, but shouldn't we only omit the patients and still show the variants?

BTW, I found redundant for the backend api to return the col genes for the variants objs.

@pontikos
Copy link
Contributor Author

pontikos commented Apr 8, 2021

I think that's right the demo account should only be able to see 2-3 genes and only the see the variants ids not the individual ids.

@alanwilter
Copy link
Collaborator

For demo I can understand but and for other users? Using your example:

--- for a gene, get the variants 
select distinct ui."user", v.id from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join phenopolis.individual i on i.id = iv.individual_id
join public.users_individuals ui on ui.internal_id = i.phenopolis_id and ui."user" <> 'Admin'
where vg.gene_id = 'ENSG00000140505'
;

Admin can see 28 variants, but other users would see just a few:

user		id
AliceDavidson	3518137
AliceDavidson	3518144
AliceDavidson	3518185
AliceDavidson	3518187
AliceDavidson	3518202
AliceDavidson	3518322
Tian		3518202
Tian		10981748
TonySegal	3518176
TS		3518176
UKIRDC		3518269

I'm still wondering if any other user, except for demo, should be able to see all variants. I would just be careful to show in HOM and HET cols only the patients the user is allowed to see.

@pontikos
Copy link
Contributor Author

pontikos commented Apr 8, 2021

I see your point. On the gene page, users other than demo should be able to see all the variants (even those of patients they do not own). However they cannot go to the patient page of a patient they do not own.

@alanwilter
Copy link
Collaborator

Just to clarify, user "John_doe" would see all variants but won't see any of those patients it doesn't own, so not even a chip link will be shown.

@pontikos
Copy link
Contributor Author

pontikos commented Apr 8, 2021

The way it works at the moment is that the patient chip is shown but the preivew you get says "you are not allowed to view this patient". And you cannot click through to the patient page.

However in the gene table all variants from all patients are shown, it's just that you can't information about the patients that you do not own, if that makes sense?

@alanwilter
Copy link
Collaborator

alanwilter commented Apr 8, 2021

I tested that, the trouble is if you click on one of the patients you can't access, it takes you back to dashboard and you loose what you're doing. I thinks it's better omit the patients you can't access and yet see all the variants.

I'm collecting all the elements (columns), for the variants, that you have in the old schema and try to rebuild with new schema. So far I couldn't find, in the new schema:
AF, AC, AN, HET, HOM, MLEAC, MLEAF, af_jirdc, af_tommo, af_krgdb, af_converge, af_hgvd, old ID
On the other hand, in the new phenopolis.variant I have extra:
new ID, dbsnp, variant_class, revel, fathom_score
and in phenopolis.individual_variant, I have:
zygosity, status, clinvar_id, pubmed_id, qd, comment
I'm wondering which elements do you want to see the gene page to compose the columns for the variants associated.

I'm fetching data from kaviar and nomad tables and yet I have cadd table, old schema had no reference to cadd. Should we add cadd? Just af_cadd for example?

@alanwilter
Copy link
Collaborator

I see I can use phenopolis.individual_variant.zigosity to build HET and HOM.

@alanwilter
Copy link
Collaborator

@pontikos this is the query the working the best so far using the new schema:

select distinct 
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
	where ui."user" = 'demo'
	and vg.variant_id = iv2.variant_id 
) as HOM,
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
	where ui."user" = 'demo'
	and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos
;

It gives the same results if user is Admin, AliceDavidson or demo with one critical difference: HOM and HET will show only the patients the users are allowed. And its Execution Time is ok (155.292 ms).

Now what I need from you is to decide which values (cols) do you want to send to the frontend. The table below shows what I got so far (Common), what is new and you may want to add (New schema) and what I didn't get so far (Old schema).

I do need help with the values from Old schema. For some of them I can build/derive it but, for most of them, I don't find where to get them.

I looked at e.g. kaviar table but I don't know how to join it, would it be with chrom+pos+ref+alt combination?

Old schema New schema Common
AF dbsnp chrom
AC variant_class pos
AN revel ref
HET_COUNT (can be calculated) fathmm_score alt
HOM_COUNT (can be calculated) qd dann
af_kaviar impact cadd_phred
af_gnomad_genomes canonical dp
af_jirdc fs
af_tommo mq
af_krgdb filter
af_converge hgvs_c
af_hgvd hgvs_p
gene_symbol (redundant) most_severe_consequence
MLEAC hom
MLEAF het
variant_id (can be composed)
id (different meaning) e.g. rs41279194 id (different meaning)*

* id | bigint | not null | nextval('phenopolis.variant_id_seq'::regclass)

@alanwilter
Copy link
Collaborator

alanwilter commented Apr 9, 2021

Trying to get an idea about kaviar, cadd and gnomad, I did this:

select distinct
a.phred as phred_cadd, a.raw_score as raw_score_cadd, -- cadd : phred_cadd <> v.cadd_phred
av.ac as ac_gnomad, av.af  as af_gnomad, -- gnomad
ah.ac as ac_kaviar, ah.af as af_kaviar, ah.an, -- kaviar
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
	where ui."user" = 'Admin'
	and vg.variant_id = iv2.variant_id 
) as HOM,
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
	where ui."user" = 'Admin'
	and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
left outer join cadd.annotation a on a.chrom = v.chrom and a.pos = v.pos and a."ref" = v."ref" and a.alt = v.alt
left outer join gnomad.annotation_v3 av on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt 
left outer join kaviar.annotation_hg19 ah on ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos

however, I can't explain why I have multiple row for kaviar e.g.:

phenopolis_dev_db=> select * from kaviar.annotation_hg19 where chrom='2' and pos=166854780 and "ref" = 'T' and alt = 'TA';
    id     | chrom |    pos    | ref | alt | ac  |    af     |  an   |
                         ds
-----------+-------+-----------+-----+-----+-----+-----------+-------+--------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 244761705 | 2     | 166854780 | T   | TA  |  35 | 0.0013269 | 26378 | ISB_founders-Nge3
 244761706 | 2     | 166854780 | T   | TA  | 400 | 0.0151642 | 26378 | GS000011735|GS000011816|GS000015156|GS000015893|GS000016450|Inova_CGI_founders-Nge3|Inova_Illumina_founders-N
ge3|SSIP|UK10K|gonl|phase3-ASW|phase3-BEB|phase3-CEU|phase3-CLM|phase3-FIN|phase3-GBR|phase3-GIH|phase3-IBS|phase3-MXL|phase3-PJL|phase3-PUR|phase3-STU|phase3-TSI
(2 rows)

Because of potential multiple kaviar rows I got more than 452 rows expected.

@pontikos
Copy link
Contributor Author

Thanks for all these investigations @alanwilter !

Regarding joining on other tables yes: chrom, pos, ref and alt is the composite foreign key to use.

Regarding multiple rows for kaviar: I wasn't aware of this issue. I think workaround is to only keep one row per variant. Either by selecting the larger of the two AC values or only keeping the one from the most sources (ds column).

@alanwilter
Copy link
Collaborator

I forgot to call your attention to phred_cadd, present in phenopolis.variant.phred_cadd and different from cadd.annotation.phred, which value is correct?

For kaviar, in my example above, ac correlates with ds, so it would be simpler to use ac, but what is it and what it stands for? I'm going to investigate a bit further on kaviar and possibly test cadd and gnomad for duplications.

But, still, these are all collateral issues, my main point is to define which elements we're going to expose to the frontend, see table above.

@pontikos
Copy link
Contributor Author

pontikos commented Apr 10, 2021 via email

@alanwilter
Copy link
Collaborator

I've updated table Old x New schema for GENE web page.
This is the piece of code in gene.py that uses the new schema and I'm trying to compare with old schema, using gene TTLL5:

def _get_variants_by_gene(gene_id):
    sqlq = sql.SQL(
        """
        select distinct v.chrom as "CHROM", v.pos as "POS", v."ref" as "REF", v.alt as "ALT", v.cadd_phred, v.dann,
        v.fathmm_score, v.revel, -- new added
        -- removed: v.id
        vg.most_severe_consequence, vg.hgvs_c as hgvsc, vg.hgvs_p as hgvsp, -- via variant_gene
        iv.dp as "DP", iv."fs" as "FS", iv.mq as "MQ", iv."filter" as "FILTER", -- via individual_variant
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
            where vg.variant_id = iv2.variant_id
        ) as "HOM",
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
            where vg.variant_id = iv2.variant_id
        ) as "HET",
        (
            select distinct on (ah.chrom,ah.pos,ah."ref",ah.alt) ah.af from kaviar.annotation_hg19 ah
            where ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
            order by ah.chrom,ah.pos,ah."ref",ah.alt,ah.ac desc
        ) as af_kaviar,
        av.af as af_gnomad_genomes -- gnomad
        -- deprecated: MLEAF, MLEAC
        -- not used: gene_symbol, gene_id
        -- need to be added (by Daniele): af_converge, af_hgvd, af_jirdc, af_krgdb, af_tommo,
        from phenopolis.variant v
        join phenopolis.variant_gene vg on vg.variant_id = v.id
        join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id
        left outer join gnomad.annotation_v3 av
            on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt
        where vg.gene_id = %s
        order by v.chrom, v.pos, v."ref", v.alt
    """
    )
    with get_db() as conn:
        with conn.cursor() as cur:
            cur.execute(sqlq, [gene_id])
            variants = cursor2dict(cur)
    for v in variants:
        v["variant_id"] = [{"display": f'{v["CHROM"]}-{v["POS"]}-{v["REF"]}-{v["ALT"]}'}]
        het = v["HET"] or []
        hom = v["HOM"] or []
        v["HET_COUNT"] = len(het)
        v["HOM_COUNT"] = len(hom)
        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

    return variants

First thing @pontikos is if you can tell me if the formulae are correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Another thing is why phenopolis.individual_variant sometimes offer more than one row for a given variant (14-76098265-T-C which shows in new query but not in old schema), e.g.:

select distinct iv.variant_id, iv.chrom, iv.pos, iv."ref", iv.alt, iv.zygosity, iv.dp, iv.mq, iv.qd
from phenopolis.individual_variant iv 
where iv.chrom = '14' and iv.pos = 76098265 and iv."ref" = 'T' and iv.alt = 'C';

 variant_id | chrom |   pos    | ref | alt | zygosity |  dp   |  mq   |  qd
------------+-------+----------+-----+-----+----------+-------+-------+-------
    3959887 | 14    | 76098265 | T   | C   | HET      | 33676 | 56.54 | 13.76
    3959887 | 14    | 76098265 | T   | C   | HET      | 35864 | 55.11 | 13.83

while this is not the case in the old schema (all rows are unique):

select * from public.variants v where 
v.gene_id = 'ENSG00000119685' -- 564 variants
order by "CHROM", "POS", "REF", "ALT" ; 

In the end, I have only 95 (or 94, variant_id 10846523 -- 14-76114226-A-ATG -- is not in phenopolis.individual_variant, but it's not in public.het_variants or public.hom_variants so it may explain why) unique variants in the new schema (if I remove the multi-rows with different DPs for a given variant). And only few are common to old x new schema.

I added another sheet (Compare52) to table Old x New schema for GENE web page where I try to analyse the 52 variants I found in common (but note new schema may have multiple rows for a given variant). Looking there and only af_kaviar seems to be stable (but not a perfect match).

I understand that phenopolis.variant_gene may not be complete and the scripts in our repository don't show we did a "migration" on these data. The only mention I see about populating phenopolis.variant_gene is scripts/import_individual_variants.py which does "Import an individual's VAR.tsv file".

As for phenopolis.individual_variant it seems more reliable since we have scripts/import_individual_variants.py, which did "Migrate variants data from public.hom_variants and public.het_variants to phenopolis.individual_variant"

Bottom line: as I don't know much about phenopolis.variant_gene I can only hope my code, barring some blatant mistakes, is doing what it's supposed to do yet I can't compare with previous schema.

@pontikos
Copy link
Contributor Author

pontikos commented Apr 21, 2021

This is correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Re phenopolis.individual_variant:
if variant is seen in two people then the the MQ, DP etc might have been different between the two people, but anyway we do not display this information to users

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants