some patients ids are in gene tables but do not exist #329

pontikos · 2021-04-07T22:29:51Z

example if you go to gene

ENSG00000140505

then patient

PH0000043392

is listed in the variants but does not exist in patients table.

I'm guessing new schema will fix this as the gene API call will cross reference with the patients tables via the individual_variants table.

alanwilter · 2021-04-08T10:52:59Z

gene.py still uses the old schema and though we have NewGene db.model, the refactoring of this module will require several other linked tables.

And indeed, the data is wrong in the old schema. New schema should solve this.

alanwilter · 2021-04-08T12:02:23Z

@pontikos there's something I need to understand: for a given gene, get the variants associated BUT, e.g.:
https://dev-live.phenopolis.org/gene/ENSG00000140505

demo sees NO variants, since it cannot access the patients in HET and HOM, but shouldn't we only omit the patients and still show the variants?

BTW, I found redundant for the backend api to return the col genes for the variants objs.

pontikos · 2021-04-08T12:07:56Z

I think that's right the demo account should only be able to see 2-3 genes and only the see the variants ids not the individual ids.

alanwilter · 2021-04-08T12:30:56Z

For demo I can understand but and for other users? Using your example:

--- for a gene, get the variants 
select distinct ui."user", v.id from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join phenopolis.individual i on i.id = iv.individual_id
join public.users_individuals ui on ui.internal_id = i.phenopolis_id and ui."user" <> 'Admin'
where vg.gene_id = 'ENSG00000140505'
;

Admin can see 28 variants, but other users would see just a few:

user		id
AliceDavidson	3518137
AliceDavidson	3518144
AliceDavidson	3518185
AliceDavidson	3518187
AliceDavidson	3518202
AliceDavidson	3518322
Tian		3518202
Tian		10981748
TonySegal	3518176
TS		3518176
UKIRDC		3518269

I'm still wondering if any other user, except for demo, should be able to see all variants. I would just be careful to show in HOM and HET cols only the patients the user is allowed to see.

pontikos · 2021-04-08T13:22:02Z

I see your point. On the gene page, users other than demo should be able to see all the variants (even those of patients they do not own). However they cannot go to the patient page of a patient they do not own.

alanwilter · 2021-04-08T13:31:12Z

Just to clarify, user "John_doe" would see all variants but won't see any of those patients it doesn't own, so not even a chip link will be shown.

pontikos · 2021-04-08T18:35:24Z

The way it works at the moment is that the patient chip is shown but the preivew you get says "you are not allowed to view this patient". And you cannot click through to the patient page.

However in the gene table all variants from all patients are shown, it's just that you can't information about the patients that you do not own, if that makes sense?

alanwilter · 2021-04-08T19:34:12Z

I tested that, the trouble is if you click on one of the patients you can't access, it takes you back to dashboard and you loose what you're doing. I thinks it's better omit the patients you can't access and yet see all the variants.

I'm collecting all the elements (columns), for the variants, that you have in the old schema and try to rebuild with new schema. So far I couldn't find, in the new schema:
AF, AC, AN, HET, HOM, MLEAC, MLEAF, af_jirdc, af_tommo, af_krgdb, af_converge, af_hgvd, old ID
On the other hand, in the new phenopolis.variant I have extra:
new ID, dbsnp, variant_class, revel, fathom_score
and in phenopolis.individual_variant, I have:
zygosity, status, clinvar_id, pubmed_id, qd, comment
I'm wondering which elements do you want to see the gene page to compose the columns for the variants associated.

I'm fetching data from kaviar and nomad tables and yet I have cadd table, old schema had no reference to cadd. Should we add cadd? Just af_cadd for example?

alanwilter · 2021-04-08T20:08:39Z

I see I can use phenopolis.individual_variant.zigosity to build HET and HOM.

alanwilter · 2021-04-09T11:50:17Z

@pontikos this is the query the working the best so far using the new schema:

select distinct 
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
	where ui."user" = 'demo'
	and vg.variant_id = iv2.variant_id 
) as HOM,
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
	where ui."user" = 'demo'
	and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos
;

It gives the same results if user is Admin, AliceDavidson or demo with one critical difference: HOM and HET will show only the patients the users are allowed. And its Execution Time is ok (155.292 ms).

Now what I need from you is to decide which values (cols) do you want to send to the frontend. The table below shows what I got so far (Common), what is new and you may want to add (New schema) and what I didn't get so far (Old schema).

I do need help with the values from Old schema. For some of them I can build/derive it but, for most of them, I don't find where to get them.

I looked at e.g. kaviar table but I don't know how to join it, would it be with chrom+pos+ref+alt combination?

Old schema	New schema	Common
AF	dbsnp	chrom
AC	variant_class	pos
AN	revel	ref
HET_COUNT (can be calculated)	fathmm_score	alt
HOM_COUNT (can be calculated)	qd	dann
af_kaviar	impact	cadd_phred
af_gnomad_genomes	canonical	dp
af_jirdc		fs
af_tommo		mq
af_krgdb		filter
af_converge		hgvs_c
af_hgvd		hgvs_p
gene_symbol (redundant)		most_severe_consequence
MLEAC		hom
MLEAF		het

variant_id (can be composed)
id (different meaning) e.g. rs41279194	id (different meaning)*

* id | bigint | not null | nextval('phenopolis.variant_id_seq'::regclass)

alanwilter · 2021-04-09T13:30:19Z

Trying to get an idea about kaviar, cadd and gnomad, I did this:

select distinct
a.phred as phred_cadd, a.raw_score as raw_score_cadd, -- cadd : phred_cadd <> v.cadd_phred
av.ac as ac_gnomad, av.af  as af_gnomad, -- gnomad
ah.ac as ac_kaviar, ah.af as af_kaviar, ah.an, -- kaviar
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
	where ui."user" = 'Admin'
	and vg.variant_id = iv2.variant_id 
) as HOM,
(
	select array_agg(i.phenopolis_id)
	from phenopolis.individual i
	join public.users_individuals ui on ui.internal_id = i.phenopolis_id
	join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
	where ui."user" = 'Admin'
	and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
left outer join cadd.annotation a on a.chrom = v.chrom and a.pos = v.pos and a."ref" = v."ref" and a.alt = v.alt
left outer join gnomad.annotation_v3 av on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt 
left outer join kaviar.annotation_hg19 ah on ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos

however, I can't explain why I have multiple row for kaviar e.g.:

phenopolis_dev_db=> select * from kaviar.annotation_hg19 where chrom='2' and pos=166854780 and "ref" = 'T' and alt = 'TA';
    id     | chrom |    pos    | ref | alt | ac  |    af     |  an   |
                         ds
-----------+-------+-----------+-----+-----+-----+-----------+-------+--------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 244761705 | 2     | 166854780 | T   | TA  |  35 | 0.0013269 | 26378 | ISB_founders-Nge3
 244761706 | 2     | 166854780 | T   | TA  | 400 | 0.0151642 | 26378 | GS000011735|GS000011816|GS000015156|GS000015893|GS000016450|Inova_CGI_founders-Nge3|Inova_Illumina_founders-N
ge3|SSIP|UK10K|gonl|phase3-ASW|phase3-BEB|phase3-CEU|phase3-CLM|phase3-FIN|phase3-GBR|phase3-GIH|phase3-IBS|phase3-MXL|phase3-PJL|phase3-PUR|phase3-STU|phase3-TSI
(2 rows)

Because of potential multiple kaviar rows I got more than 452 rows expected.

pontikos · 2021-04-10T02:50:12Z

Thanks for all these investigations @alanwilter !

Regarding joining on other tables yes: chrom, pos, ref and alt is the composite foreign key to use.

Regarding multiple rows for kaviar: I wasn't aware of this issue. I think workaround is to only keep one row per variant. Either by selecting the larger of the two AC values or only keeping the one from the most sources (ds column).

alanwilter · 2021-04-10T06:20:01Z

I forgot to call your attention to phred_cadd, present in phenopolis.variant.phred_cadd and different from cadd.annotation.phred, which value is correct?

For kaviar, in my example above, ac correlates with ds, so it would be simpler to use ac, but what is it and what it stands for? I'm going to investigate a bit further on kaviar and possibly test cadd and gnomad for duplications.

But, still, these are all collateral issues, my main point is to define which elements we're going to expose to the frontend, see table above.

pontikos · 2021-04-10T08:59:51Z

If cadd.annotation.phred exists for a variant then use that. That table contains all the precomputed case scores for known variants. If a variant is known I think we can stick to using that table (I think you need to make sure we are using the correct build which is 37 in our case). Sometimes cadd score may be missing from that table (for indels) in which case we need to compute it. In that case it will probably end up in the variants table. However to be consistent I think we should maybe insert it as an extra row in the case.annotarion table ? About your more general question of what fields are displayed, it sounds like we need to agree on what the returned json looks like?

…

On Sat, 10 Apr 2021, 07:20 Alan Silva, ***@***.***> wrote: I forgot to call your attention to phred_cadd, present in phenopolis.variant.phred_cadd and different from cadd.annotation.phred, which value is correct? For kaviar, in my example above, ac correlates with ds, so it would be simpler to use ac, but what is it and what it stands for? I'm going to investigate a bit further on kaviar and possibly test cadd and gnomad for duplications. But, still, these are all collateral issues, my main point is to define which elements we're going to expose to the frontend, see table above. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#329 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA5MN5HXRY675LEJGVPHLM3TH7UZ5ANCNFSM42RU5L5Q> .

alanwilter · 2021-04-20T13:46:57Z

I've updated table Old x New schema for GENE web page.
This is the piece of code in gene.py that uses the new schema and I'm trying to compare with old schema, using gene TTLL5:

def _get_variants_by_gene(gene_id):
    sqlq = sql.SQL(
        """
        select distinct v.chrom as "CHROM", v.pos as "POS", v."ref" as "REF", v.alt as "ALT", v.cadd_phred, v.dann,
        v.fathmm_score, v.revel, -- new added
        -- removed: v.id
        vg.most_severe_consequence, vg.hgvs_c as hgvsc, vg.hgvs_p as hgvsp, -- via variant_gene
        iv.dp as "DP", iv."fs" as "FS", iv.mq as "MQ", iv."filter" as "FILTER", -- via individual_variant
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
            where vg.variant_id = iv2.variant_id
        ) as "HOM",
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
            where vg.variant_id = iv2.variant_id
        ) as "HET",
        (
            select distinct on (ah.chrom,ah.pos,ah."ref",ah.alt) ah.af from kaviar.annotation_hg19 ah
            where ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
            order by ah.chrom,ah.pos,ah."ref",ah.alt,ah.ac desc
        ) as af_kaviar,
        av.af as af_gnomad_genomes -- gnomad
        -- deprecated: MLEAF, MLEAC
        -- not used: gene_symbol, gene_id
        -- need to be added (by Daniele): af_converge, af_hgvd, af_jirdc, af_krgdb, af_tommo,
        from phenopolis.variant v
        join phenopolis.variant_gene vg on vg.variant_id = v.id
        join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id
        left outer join gnomad.annotation_v3 av
            on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt
        where vg.gene_id = %s
        order by v.chrom, v.pos, v."ref", v.alt
    """
    )
    with get_db() as conn:
        with conn.cursor() as cur:
            cur.execute(sqlq, [gene_id])
            variants = cursor2dict(cur)
    for v in variants:
        v["variant_id"] = [{"display": f'{v["CHROM"]}-{v["POS"]}-{v["REF"]}-{v["ALT"]}'}]
        het = v["HET"] or []
        hom = v["HOM"] or []
        v["HET_COUNT"] = len(het)
        v["HOM_COUNT"] = len(hom)
        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

    return variants

First thing @pontikos is if you can tell me if the formulae are correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Another thing is why phenopolis.individual_variant sometimes offer more than one row for a given variant (14-76098265-T-C which shows in new query but not in old schema), e.g.:

select distinct iv.variant_id, iv.chrom, iv.pos, iv."ref", iv.alt, iv.zygosity, iv.dp, iv.mq, iv.qd
from phenopolis.individual_variant iv 
where iv.chrom = '14' and iv.pos = 76098265 and iv."ref" = 'T' and iv.alt = 'C';

 variant_id | chrom |   pos    | ref | alt | zygosity |  dp   |  mq   |  qd
------------+-------+----------+-----+-----+----------+-------+-------+-------
    3959887 | 14    | 76098265 | T   | C   | HET      | 33676 | 56.54 | 13.76
    3959887 | 14    | 76098265 | T   | C   | HET      | 35864 | 55.11 | 13.83

while this is not the case in the old schema (all rows are unique):

select * from public.variants v where 
v.gene_id = 'ENSG00000119685' -- 564 variants
order by "CHROM", "POS", "REF", "ALT" ;

In the end, I have only 95 (or 94, variant_id 10846523 -- 14-76114226-A-ATG -- is not in phenopolis.individual_variant, but it's not in public.het_variants or public.hom_variants so it may explain why) unique variants in the new schema (if I remove the multi-rows with different DPs for a given variant). And only few are common to old x new schema.

I added another sheet (Compare52) to table Old x New schema for GENE web page where I try to analyse the 52 variants I found in common (but note new schema may have multiple rows for a given variant). Looking there and only af_kaviar seems to be stable (but not a perfect match).

I understand that phenopolis.variant_gene may not be complete and the scripts in our repository don't show we did a "migration" on these data. The only mention I see about populating phenopolis.variant_gene is scripts/import_individual_variants.py which does "Import an individual's VAR.tsv file".

As for phenopolis.individual_variant it seems more reliable since we have scripts/import_individual_variants.py, which did "Migrate variants data from public.hom_variants and public.het_variants to phenopolis.individual_variant"

Bottom line: as I don't know much about phenopolis.variant_gene I can only hope my code, barring some blatant mistakes, is doing what it's supposed to do yet I can't compare with previous schema.

pontikos · 2021-04-21T08:55:02Z

This is correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Re phenopolis.individual_variant:
if variant is seen in two people then the the MQ, DP etc might have been different between the two people, but anyway we do not display this information to users

alanwilter self-assigned this Apr 8, 2021

pontikos added this to the migration_to_new_db_schema milestone May 18, 2021

This was referenced May 21, 2021

implement search by rsid #298

Open

keep GQ and AD information in variant het and hom table #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

some patients ids are in gene tables but do not exist #329

some patients ids are in gene tables but do not exist #329

pontikos commented Apr 7, 2021

alanwilter commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021 •

edited

Loading

alanwilter commented Apr 8, 2021

alanwilter commented Apr 9, 2021

alanwilter commented Apr 9, 2021 •

edited

Loading

pontikos commented Apr 10, 2021

alanwilter commented Apr 10, 2021

pontikos commented Apr 10, 2021 via email

alanwilter commented Apr 20, 2021

pontikos commented Apr 21, 2021 •

edited

Loading

some patients ids are in gene tables but do not exist #329

some patients ids are in gene tables but do not exist #329

Comments

pontikos commented Apr 7, 2021

alanwilter commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021

pontikos commented Apr 8, 2021

alanwilter commented Apr 8, 2021 • edited Loading

alanwilter commented Apr 8, 2021

alanwilter commented Apr 9, 2021

alanwilter commented Apr 9, 2021 • edited Loading

pontikos commented Apr 10, 2021

alanwilter commented Apr 10, 2021

pontikos commented Apr 10, 2021 via email

alanwilter commented Apr 20, 2021

pontikos commented Apr 21, 2021 • edited Loading

alanwilter commented Apr 8, 2021 •

edited

Loading

alanwilter commented Apr 9, 2021 •

edited

Loading

pontikos commented Apr 21, 2021 •

edited

Loading