Refactoring Generic Assay Data Model? #8714

inodb (Maintainer) · 2020-06-12T20:35:49Z

We're currently trying to use generic assay more. The current data model works, but it is not very intuitive when (1) looking at the data in the database and, by extension, when (2) creating the data files. One of the reasons is that both treatment and, later, generic assay were shoehorned into genetic_profile. I think that worked well for prototyping, but I worry about long-term maintainability. Now that we are also starting to add microbiome and mutational signature data, it might be worth revisiting.

Image from: RFC51: Generic Assay

Data model

The table genetic_profile contains GENERIC_ASSAY. It might be clearer if this table were renamed to generic_profile and generic_assay_type were set to GENETIC for all datatypes of MAF/DISCRETE/CONTINUOUS/Z-SCORE/LOG2-VALUE/FUSION/SV. One issue here is that the datatype field for GENETIC is always going to be different from that of generic_assay, which is not clear from the data model. It might make sense to put all generic_assays in a separate table instead.
Treatment Profile specific columns in genetic_profile: PIVOT_THRESHOLD, SORT_ORDER
The connection generic_entity_properties -> genetic_entity -> genetic_alteration -> genetic_profile. Looking at the schema image above, I find it hard to imagine what any of these things mean without looking at the data. In addition, there are NULL values in genetic_entity for each ENTITY_TYPE of GENE. I get why that is, but it's not very clean, so it might make sense to refactor this?
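To make the chain of tables above a bit more concrete, here is a rough sketch using an in-memory SQLite database. The four table names come from the points above; every column name, the sample rows, and the `data_values` column are simplified assumptions for illustration and are not the real cBioPortal schema.

```python
import sqlite3

# Simplified, hypothetical columns; only the table names come from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genetic_profile (
    genetic_profile_id INTEGER PRIMARY KEY,
    stable_id TEXT,
    genetic_alteration_type TEXT,
    generic_assay_type TEXT
);
CREATE TABLE genetic_entity (
    id INTEGER PRIMARY KEY,
    entity_type TEXT,
    stable_id TEXT            -- left NULL for ENTITY_TYPE = 'GENE'
);
CREATE TABLE genetic_alteration (
    genetic_profile_id INTEGER,
    genetic_entity_id INTEGER,
    data_values TEXT
);
CREATE TABLE generic_entity_properties (
    genetic_entity_id INTEGER,
    name TEXT,
    value TEXT
);

INSERT INTO genetic_profile VALUES
    (1, 'treatment_ic50', 'GENERIC_ASSAY', 'TREATMENT_RESPONSE'),
    (2, 'mrna', 'MRNA_EXPRESSION', NULL);
INSERT INTO genetic_entity VALUES
    (1, 'GENE', NULL),
    (2, 'GENERIC_ASSAY', 'compound_a');
INSERT INTO genetic_alteration VALUES
    (2, 1, '0.5,1.2'),
    (1, 2, '0.1,>8');
INSERT INTO generic_entity_properties VALUES
    (2, 'NAME', 'Compound A');
""")

# Walk generic_entity_properties -> genetic_entity -> genetic_alteration -> genetic_profile.
rows = conn.execute("""
    SELECT gp.stable_id, ge.stable_id, gep.name, gep.value
    FROM generic_entity_properties AS gep
    JOIN genetic_entity AS ge ON ge.id = gep.genetic_entity_id
    JOIN genetic_alteration AS ga ON ga.genetic_entity_id = ge.id
    JOIN genetic_profile AS gp ON gp.genetic_profile_id = ga.genetic_profile_id
    WHERE gp.genetic_alteration_type = 'GENERIC_ASSAY'
""").fetchall()

# Count the GENE entities that carry a NULL stable_id.
null_gene_rows = conn.execute(
    "SELECT COUNT(*) FROM genetic_entity "
    "WHERE entity_type = 'GENE' AND stable_id IS NULL"
).fetchone()[0]

print(rows)
print(null_gene_rows)
```

Even in this toy version it takes three joins to get from a property to the profile it belongs to, and the GENE rows carry a column that is only meaningful for generic assay entities.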

Data files

The documentation is here: https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#generic-assay. The description is very clear, but the files themselves are a bit hard to follow:

cancer_study_identifier: study_es_0
genetic_alteration_type: GENERIC_ASSAY
generic_assay_type: TREATMENT_RESPONSE
datatype: LIMIT-VALUE
stable_id: treatment_ic50
profile_name: IC50 values of compounds on cellular phenotype readout
profile_description: IC50 (compound concentration resulting in half maximal inhibition) of compounds on cellular phenotype readout of cultured mutant cell lines.
data_filename: data_treatment_ic50.txt
show_profile_in_analysis_tab: true
pivot_threshold_value: 0.1
value_sort_order: ASC
generic_entity_meta_properties: NAME,DESCRIPTION,URL

genetic_alteration_type: this might be tricky for a user to understand. The field is called genetic_alteration_type, but we are talking about non-genetic data? Let's allow an alias of profile_type.
datatype: LIMIT-VALUE is a pretty hard-to-understand datatype, so it may be better to use a simpler example first?
pivot_threshold_value: very specific to treatment; is it optional?
value_sort_order: seems like a bit of an edge case?
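The profile_type alias suggested above could be handled when the meta file is read. The following is only a sketch under assumptions: `parse_meta` and `ALIASES` are hypothetical names, not part of the existing cBioPortal importer, and the example uses a shortened version of the meta file shown above.

```python
# Hypothetical alias table: the friendlier name maps onto the existing field.
ALIASES = {"profile_type": "genetic_alteration_type"}


def parse_meta(text):
    """Parse 'key: value' lines and normalise aliased keys."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        key = ALIASES.get(key.strip(), key.strip())
        if key in meta:
            # Also catches a file that uses both the alias and the original name.
            raise ValueError(f"duplicate field: {key}")
        meta[key] = value.strip()
    return meta


example = """\
cancer_study_identifier: study_es_0
profile_type: GENERIC_ASSAY
generic_assay_type: TREATMENT_RESPONSE
datatype: LIMIT-VALUE
stable_id: treatment_ic50
"""

meta = parse_meta(example)
print(meta["genetic_alteration_type"])
```

Normalising the alias at parse time would keep the rest of the loading code unchanged, since downstream code would still see genetic_alteration_type.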

It might make sense to show the simplest example first with these data files and then show a complete reference for all possible properties. I guess once we have mutational signature data we can maybe use that as the more basic example.

Anyway, just wanted to capture my thoughts while going through this for future reference. We can discuss later.

jjgao (Maintainer) · 2020-06-15T21:13:11Z

Adding @n1zea144 @sheridancbio @dippindots

@inodb these are good points. I agree that the naming has evolved to be very confusing...

To simplify the concepts, maybe we can view all data in a study as one big matrix organized by profiles. I think it would be good to rename to general terms, e.g.:

genetic_profile -> profile
genetic_alteration -> profile_data
genetic_profile_sample -> profiled_sample
genetic_entity -> profiled_measurement (?)
generic_entity_properties -> profiled_measurement_properties (?)
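As a rough illustration of what the database side of such a rename might involve, the mapping above can be written down once and turned into statements. This is a sketch only: `PROPOSED_RENAMES` and `rename_statements` are hypothetical names, the table names are taken from the list above (the real schema may spell them differently), and foreign keys, column names, and the API naming would all need separate handling.

```python
# Old name -> proposed name, copied from the list above (two of them marked "?").
PROPOSED_RENAMES = {
    "genetic_profile": "profile",
    "genetic_alteration": "profile_data",
    "genetic_profile_sample": "profiled_sample",
    "genetic_entity": "profiled_measurement",
    "generic_entity_properties": "profiled_measurement_properties",
}


def rename_statements(renames):
    """Build one ALTER TABLE ... RENAME TO statement per table."""
    return [f"ALTER TABLE {old} RENAME TO {new};" for old, new in renames.items()]


statements = rename_statements(PROPOSED_RENAMES)
for statement in statements:
    print(statement)
```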

If we change this, we should also change the API naming, so it would be a fairly big change...
