Normalizing/standardizing data storage (enforcing best practices) through PL/pgSQL functions #124
Here is one use case I have: I manage stocks and I would like to add a trait value to a given stock. Depending on the trait, I could use stockprop if the value is continuous, or the stock_cvterm table if the value is discrete. That would be fine, but... I could also use the phenotype table and associate a trait phenotype with my stock. Then the process becomes a bit more complex. I have to identify the observable part I am looking at and the attribute I am measuring. Then I have to create a phenotype entry for that "observable" (find the cvterm), that "attribute" (find another cvterm) and that "value" (which can be either a discrete value in the cvalue_id column or a continuous value in the value column). Then I have to create an nd_experiment entry corresponding to the experiment that led to the measure. That experiment requires a geolocation as well, although I may use an empty one if needed (i.e. nd_geolocation_id 1). After that, I have to link that nd_experiment to the phenotype (nd_experiment_phenotype) and to the stock (nd_experiment_stock). It is by far more complex than just inserting one row in the stockprop or stock_cvterm tables. Therefore, I believe most people would be too lazy to use the nd_experiment and phenotype tables. I even heard that some people added a (wild! ;) ) stock_phenotype table to simplify things while still taking advantage of the observable and attribute fields.

For this use case, I would see a function like the sketch at the end of this comment. Such a function would greatly simplify the use of the natural diversity tables, which are quite complex. Their complexity makes them hard to use and query, but materialized views are here to solve that issue. Their interest is that they normalize data if they are used. If such data are "normalized", it is then possible to write generic user interfaces that work on those data. Otherwise, you can't, because you don't know whether people stored things in stockprop, stock_cvterm or somewhere else...

This use case is probably not the only one. I believe the same kind of approach could be used for each case in the Chado best practices. I have yet another use case when I need to add a stock_cvterm relationship: depending on the cvterm used, I may want to allow several relationships between a stock and the cvterms of a given CV, only one relationship at a time, or only one for a given subset of terms of a CV (using the CV graph to select the subtree of a "root" term that provides the allowed subset of cvterms). A parametrized function could help here as well.

Note: I initially stored trait values in the stockprop/stock_cvterm tables, and sometimes I think I should have created a custom CV for some discrete values stored in the stockprop table instead of using stock_cvterm, because no official CV existed for those traits at that time. Sometimes I regret it (time pressure, too lazy?)! Now I'm moving things to phenotype and it's a pain... Am I just the "1%" of people doing such things? ;-) Discussion is open! :-)
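Here is a minimal sketch of what such a function could look like. The name add_stock_phenotype, its parameters and the uniquename convention are assumptions for illustration only (the column names come from the Chado phenotype and natural diversity tables; cvterm lookups and error handling are left out):

```sql
CREATE OR REPLACE FUNCTION add_stock_phenotype(
    p_stock_id           integer,
    p_observable_id      integer,  -- cvterm of the observed part
    p_attribute_id       integer,  -- cvterm of the measured attribute
    p_experiment_type_id integer,  -- cvterm describing the experiment (required by Chado)
    p_value              text    DEFAULT NULL,  -- continuous value
    p_cvalue_id          integer DEFAULT NULL,  -- discrete value (a cvterm)
    p_geolocation_id     integer DEFAULT 1      -- "empty" geolocation by convention
) RETURNS integer AS $$
DECLARE
    v_phenotype_id  integer;
    v_experiment_id integer;
BEGIN
    -- 1. One phenotype row holding observable, attribute and value.
    INSERT INTO phenotype (uniquename, observable_id, attr_id, value, cvalue_id)
    VALUES (
        concat('stock', p_stock_id, '-attr', p_attribute_id, '-', clock_timestamp()),
        p_observable_id, p_attribute_id, p_value, p_cvalue_id
    )
    RETURNING phenotype_id INTO v_phenotype_id;

    -- 2. The experiment that produced the measure, with its geolocation.
    INSERT INTO nd_experiment (nd_geolocation_id, type_id)
    VALUES (p_geolocation_id, p_experiment_type_id)
    RETURNING nd_experiment_id INTO v_experiment_id;

    -- 3. Link the experiment to the phenotype and to the stock.
    INSERT INTO nd_experiment_phenotype (nd_experiment_id, phenotype_id)
    VALUES (v_experiment_id, v_phenotype_id);

    INSERT INTO nd_experiment_stock (nd_experiment_id, stock_id, type_id)
    VALUES (v_experiment_id, p_stock_id, p_experiment_type_id);

    RETURN v_phenotype_id;
END;
$$ LANGUAGE plpgsql;
```

A single call would then replace the five inserts described above, and the caller never has to know how the phenotype, nd_experiment and linking tables fit together.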
I agree that using the ND experiment tables is a pain... We're actually using a different approach, documented here, to bypass them for many of the reasons you mentioned. Such a function would have greatly improved the chances of us using the ND experiment tables. I support such an approach. The benefit of using functions in this manner is that it still provides support for both of our approaches:

Since warnings and such only show up when using the function, it doesn't reduce the flexibility of Chado; it just provides an easier means of following best practices. Great idea!
I'd like to bring a discussion (held yesterday on gather.town) here to the community.
The main strength of a schema like Chado is that it normalizes how biological data are stored. We can then create generic tools that work with everybody's Chado instance, as long as they store things the (exact) same way. However, it is often possible to store the same thing in different ways, which makes writing generic tools for Chado a harder task.
In order to enforce standard ways of storing things, there is best-practices documentation, but people may not read it, may forget about it, or may misunderstand things.
I propose an additional approach: adding embedded functions to the Chado schema whose role is to store data appropriately. I see several advantages: first, they enforce the way things are stored (i.e. they use the appropriate tables and create the appropriate links between things); they can check data integrity (i.e. if a date is stored in a *prop.value column, the function can raise an error if the date is incorrect); and they can adapt to schema changes (i.e. the same function can remain from one schema version to another, even in case of major table changes). There are some drawbacks: there is a lot of code to write and maintain, there will be many functions to write to fulfill many use cases, and how can we decide what is needed? I will post some examples in a later comment.
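To make the integrity-check idea concrete, here is a minimal sketch for the date example. The function name store_stock_date_prop and its behavior are assumptions, not existing Chado code; only the stockprop columns are Chado's:

```sql
CREATE OR REPLACE FUNCTION store_stock_date_prop(
    p_stock_id integer,
    p_type_id  integer,  -- cvterm describing the property
    p_value    text      -- expected to be an ISO date (YYYY-MM-DD)
) RETURNS integer AS $$
DECLARE
    v_stockprop_id integer;
BEGIN
    -- Reject values that are not ISO-formatted before they end up in a text column.
    IF p_value !~ '^\d{4}-\d{2}-\d{2}$' THEN
        RAISE EXCEPTION 'store_stock_date_prop: "%" is not an ISO date', p_value;
    END IF;
    -- Let PostgreSQL validate the calendar date itself (raises on e.g. 2021-02-30).
    PERFORM p_value::date;

    INSERT INTO stockprop (stock_id, type_id, value, rank)
    VALUES (
        p_stock_id,
        p_type_id,
        p_value,
        -- next free rank for this stock/type pair
        COALESCE((SELECT max(rank) + 1 FROM stockprop
                  WHERE stock_id = p_stock_id AND type_id = p_type_id), 0)
    )
    RETURNING stockprop_id INTO v_stockprop_id;

    RETURN v_stockprop_id;
END;
$$ LANGUAGE plpgsql;
```

The pattern is always the same: validate, then insert into the right table with the right links, so the function enforces the best practice rather than just documenting it.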
So far, it's just a thought (related to a problem I faced), with no code written. It would require a lot of work, but I'd like to have the community's opinion about this approach.
Edit: well, there is already some code for basic stuff (store_db, store_dbxref, store_organism, store_feature, store_featureloc, store_feature_synonym, store_analysis), but it could be generalized. "store_feature" could be used by new functions like store_dna, store_snp, ...
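As an illustration of that generalization, a specialized wrapper might look like the sketch below. The name store_snp and its parameters are assumptions; a real implementation would likely delegate to the existing store_feature function rather than insert into feature directly, but the point is that the SNP type is resolved inside the function so callers cannot pick the wrong term:

```sql
CREATE OR REPLACE FUNCTION store_snp(
    p_organism_id integer,
    p_uniquename  text,
    p_name        text DEFAULT NULL
) RETURNS integer AS $$
DECLARE
    v_type_id    integer;
    v_feature_id integer;
BEGIN
    -- Look up the Sequence Ontology term for a SNP (cv "sequence").
    SELECT c.cvterm_id INTO v_type_id
    FROM cvterm c JOIN cv ON cv.cv_id = c.cv_id
    WHERE cv.name = 'sequence' AND c.name = 'SNP';

    IF v_type_id IS NULL THEN
        RAISE EXCEPTION 'store_snp: SNP term not found in the sequence CV';
    END IF;

    -- Create the feature with the enforced type.
    INSERT INTO feature (organism_id, uniquename, name, type_id)
    VALUES (p_organism_id, p_uniquename, p_name, v_type_id)
    RETURNING feature_id INTO v_feature_id;

    RETURN v_feature_id;
END;
$$ LANGUAGE plpgsql;
```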