Issue-447: Investigation notebook for new somatic/oncogenic clinical classifications #448

Merged
merged 3 commits into from
Dec 3, 2024
clean up notebook a bit
apriltuesday committed Oct 24, 2024
commit aaedf844b35814cee5e88be5d464a81fa5466f95
128 changes: 4 additions & 124 deletions data-exploration/other/somatic-oncogenic-records.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 55,
"id": "94fb1395-8914-4aed-8004-6299f103efcd",
"metadata": {
"tags": []
@@ -17,7 +17,7 @@
{
"cell_type": "code",
"execution_count": 45,
- "id": "25f84a99-c91c-4885-9f9a-95623f434827",
+ "id": "c8c06c5c-2fd8-4356-b29a-42ad736d5aaa",
"metadata": {
"tags": []
},
@@ -213,7 +213,6 @@
"# e.g. (somatic, somatic) or (germline, oncogenic) - nb. everything *not* in this list of 505 is just (germline,)\n",
"rcv_classifications = defaultdict(list)\n",
"\n",
- "# i = 0\n",
"for r in dataset:\n",
" rcv_all_class = []\n",
" for c in r.clinical_classifications:\n",
@@ -246,11 +245,7 @@
" print(\"unknown classification type:\", class_type)\n",
" \n",
" rcv_all_class = tuple(sorted(rcv_all_class))\n",
- " rcv_classifications[rcv_all_class].append(r.accession)\n",
- " \n",
- " # i += 1\n",
- " # if i > 10:\n",
- " # break"
+ " rcv_classifications[rcv_all_class].append(r.accession)"
]
},
{
@@ -453,9 +448,7 @@
"Summary:\n",
"* All values and all fields are being used to varying degrees\n",
"* Most data involves oncogenic classification, so no assertion types etc.\n",
- "* A fully future-proof implementation would support everything here, but a simple inclusion of the oncogenic classification terms in the `clinicalSignificances` enum would cover 87% of the missing data (on the other hand, if we're not future-proofing what's the point)\n",
- "\n",
- "I did **not** check how presence of somatic/oncogenic/germline clinical classification corresponds to allele origins - does this matter to Open Targets? Note that most somatic evidence still has germline clinical classifications, and likely always will as ClinVar has said it's not doing a wholesale migration of old records. But we could perhaps try to associate \"like for like\" if both are present in both fields."
+ "* A fully future-proof implementation would support everything here, but a simple inclusion of the oncogenic classification terms in the `clinicalSignificances` enum would cover 87% of the missing data (on the other hand, if we're not future-proofing what's the point)"
]
},
{
@@ -464,119 +457,6 @@
"id": "7789c0d9-0d84-42c6-80fe-91eea184ea15",
"metadata": {},
"outputs": [],
- "source": [
- "for r in dataset:\n",
- " if r.accession == 'RCV000426735':\n",
- " pprint(r.record_xml)\n",
- " break"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "eb3bfbee-317e-4f8c-8b72-41b36325c849",
- "metadata": {},
- "source": [
- "Schema proposals\n",
- "\n",
- "* Multiplicity:\n",
- " * Keep reporting 1 confidence, choose most/least confident rating if multiple\n",
- " * Report a flat list - parallel to clinical significances\n",
- " * Report clinical significances as a list of objects, containing descriptive text & review status\n",
- "* New terms:\n",
- " * Add new description terms and ignore the other parts\n",
- " * Might still need to think of the legibility of \"tier i - strong\"\n",
- " * Add as \"compound\" terms - bit like how ClinVar does on website\n",
- " * e.g. \"tier i - diagnostic - supports diagnosis\"\n",
- " * Report as structured object"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8fce48ee-6722-41ca-b1a9-eeb7a458e67b",
- "metadata": {},
- "source": [
- "Current:\n",
- "```\n",
- "{\n",
- " \"clinicalSignificances\": [\"likely pathogenic\", \"pathogenic\"],\n",
- " \"confidence\": \"criteria provided, multiple submitters, no conflicts\",\n",
- "}\n",
- "```\n",
- "\n",
- "Structured:\n",
- "```\n",
- "{\n",
- " \"clinicalSignificances\": [\n",
- " {\n",
- " \"type\": \"germline\",\n",
- " \"terms\": [\"likely pathogenic\", \"pathogenic\"],\n",
- " \"confidence\": \"criteria provided, multiple submitters, no conflicts\"\n",
- " },\n",
- " {\n",
- " \"type\": \"somatic\",\n",
- " \"terms\": [\"tier i - strong\"],\n",
- " \"confidence\": \"criteria provided, single submitter\",\n",
- " \"somaticAssertionType\": \"diagnostic\",\n",
- " \"somaticClinicalSignificance\": \"supports diagnosis\"\n",
- " },\n",
- " {\n",
- " \"type\": \"somatic\",\n",
- " \"terms\": [\"tier ii - potential\"],\n",
- " \"confidence\": \"criteria provided, single submitter\",\n",
- " \"somaticAssertionType\": \"prognostic\",\n",
- " \"somaticClinicalSignificance\": \"poor outcome\"\n",
- " }\n",
- " ]\n",
- "}\n",
- "```\n",
- "* Terms need to be lists in all cases as RCV clinical classifications are aggregates - see [here](https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/)\n",
- "\n",
- "\n",
- "Semi-structured:\n",
- "```\n",
- "{\n",
- " \"clinicalSignificances\": [\n",
- " {\n",
- " \"terms\": [\"likely pathogenic\", \"pathogenic\"],\n",
- " \"confidence\": \"criteria provided, multiple submitters, no conflicts\"\n",
- " },\n",
- " {\n",
- " \"terms\": [\"tier i - diagnostic - supports diagnosis\"],\n",
- " \"confidence\": \"criteria provided, single submitter\",\n",
- " }\n",
- " {\n",
- " \"terms\": [\"tier i - prognostic - poor outcome\"],\n",
- " \"confidence\": \"criteria provided, single submitter\",\n",
- " }\n",
- " ]\n",
- "}\n",
- "```\n",
- "* Could include type as separate field or part of the string\n",
- "\n",
- "\n",
- "Flat:\n",
- "```\n",
- "{\n",
- " \"clinicalSignificances\": [\"likely pathogenic\", \"pathogenic\", \"tier i - strong\", \"tier ii - potential\"],\n",
- " \"confidences\": [\"criteria provided, multiple submitters, no conflicts\", \"criteria provided, single submitter\"]\n",
- "}\n",
- "```\n",
- "* `clinicalSignificances` could be compound terms as above\n",
- "* Lists could be always have the same length and be fully parallel - so if we explode a term, we repeat the confidence value"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3410c33a-dfd7-44d4-a74b-f0334d285cdb",
- "metadata": {},
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "15146bb1-f038-4df5-b119-7d5a9c096a1e",
- "metadata": {},
- "outputs": [],
"source": []
}
],
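
Note on the removed "Schema proposals" cells: the relationship between the "Structured" and "Flat" proposals can be made concrete with a small sketch. The following is only an illustration of the "fully parallel lists" idea from the notes (repeat the confidence value whenever a term list is exploded); `flatten_classifications` is a hypothetical helper name, not something from this repository or pipeline, and the input simply mirrors the "Structured" example.

```python
from typing import Any


def flatten_classifications(structured: dict[str, Any]) -> dict[str, list[str]]:
    """Hypothetical helper: turn the 'Structured' proposal into parallel flat lists.

    Each term is emitted once, and the confidence of its parent object is
    repeated so that both lists always have the same length.
    """
    terms: list[str] = []
    confidences: list[str] = []
    for entry in structured["clinicalSignificances"]:
        for term in entry["terms"]:
            terms.append(term)
            confidences.append(entry["confidence"])
    return {"clinicalSignificances": terms, "confidences": confidences}


# Input mirroring the "Structured" example from the removed cell.
structured = {
    "clinicalSignificances": [
        {
            "type": "germline",
            "terms": ["likely pathogenic", "pathogenic"],
            "confidence": "criteria provided, multiple submitters, no conflicts",
        },
        {
            "type": "somatic",
            "terms": ["tier i - strong"],
            "confidence": "criteria provided, single submitter",
            "somaticAssertionType": "diagnostic",
            "somaticClinicalSignificance": "supports diagnosis",
        },
        {
            "type": "somatic",
            "terms": ["tier ii - potential"],
            "confidence": "criteria provided, single submitter",
            "somaticAssertionType": "prognostic",
            "somaticClinicalSignificance": "poor outcome",
        },
    ]
}

flat = flatten_classifications(structured)
for term, confidence in zip(flat["clinicalSignificances"], flat["confidences"]):
    print(f"{term}: {confidence}")
```

In this sketch the flat form drops `type`, `somaticAssertionType` and `somaticClinicalSignificance`, which is the trade-off the notes point at: keeping that information in a flat layout would need compound terms or further parallel lists.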