Sanity checker #42
@didiermonselesan and @willrhobbs in Tasmania pretty much know all the sanity checking that would need to be done. They are going to list the types of things that need checking below...
Also related to this (and a wider issue than just the plugin) is to work out what to do when the sanity checker fails - how do we deal with 'faulty' data in the archive at NCI?
Tim, agreed. The first thing would be to log the failures and the reasons why the workflow fails on particular files.
Then possibly moving to more 'subjective' checks.
Note that some packages do perform conventions checking, either for single files or when applied across files. Cheers, Didier
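A minimal sketch of the per-file failure logging described above, assuming a hypothetical check routine that returns a list of problem descriptions for each file; the log file name and function names are illustrative only, not part of any existing tool:

```python
import logging

logging.basicConfig(filename="sanity_check.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def log_failures(filename, problems):
    """Record why the workflow failed (or didn't) on a particular file."""
    if not problems:
        logging.info("%s: passed all checks", filename)
        return
    for reason in problems:
        logging.warning("%s: %s", filename, reason)
```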
My feeling is that rather than using the 'sanity checker' to handle exceptions, it's probably best just to flag issues that can then be investigated. I also agree that, as a first run, checking that the metadata on each file is complete and accurate would be a good start (we would hope that PCMDI would do this, but that's unlikely).
Physical-consistency checks would also be helpful, but we'd need to consider each variable separately. Some variables lend themselves to 'hard limits', e.g. salinity and precip can never be less than zero (although they are in some CMIP5 models!); others are less obvious (what should the maximum allowable temperature be?).
I would add a test of the time coords on all merged lists of variables across a range of files. Time arrays should be consistent across all files within the same experiment, monotonic, have no gaps, and no 'overlaps'. In my experience these types of errors have been the biggest and most persistent headache.
There are some simple 'experiment' tests based on global energy balance that both Didier and I have found can highlight some sneaky issues; these would not be applied to all files of course, but are a useful model diagnostic. Will
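To make the time-axis and hard-limit checks above concrete, here is a minimal sketch. It assumes the time values have already been read from the files (e.g. with the netCDF4 or xarray libraries) and concatenated in file order; the function names and the gap tolerance are illustrative, not an existing API:

```python
import numpy as np

def check_time_axis(times, max_gap):
    """Flag non-monotonic steps, gaps and overlaps in a merged time axis.

    times   : 1-D sequence of time values (e.g. days since a reference date)
    max_gap : largest step treated as 'no gap' (e.g. 31.5 for monthly data)
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    steps = np.diff(np.asarray(times, dtype=float))
    if np.any(steps <= 0):
        problems.append("time axis not strictly monotonic (overlap or repeated value)")
    if np.any(steps > max_gap):
        problems.append("gap in time axis larger than %s" % max_gap)
    return problems

def check_physical_limits(values, name, lower=None, upper=None):
    """Hard-limit check, e.g. precipitation and salinity can never be negative."""
    problems = []
    arr = np.asarray(values, dtype=float)
    if lower is not None and np.nanmin(arr) < lower:
        problems.append("%s below allowed minimum %s" % (name, lower))
    if upper is not None and np.nanmax(arr) > upper:
        problems.append("%s above allowed maximum %s" % (name, upper))
    return problems
```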
In order to get this into the tool it would have to be implemented in a script that takes in the name of the file to check on the command line. Something like:
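The original example was not preserved in the thread, so the script name and argument below are purely hypothetical; the point is just that the checker is a stand-alone script that receives the file to check on the command line (e.g. `python sanity_check.py /path/to/file.nc`):

```python
#!/usr/bin/env python
"""Hypothetical sanity-check script: takes the file to check as its only argument."""
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Run sanity checks on a single data file")
    parser.add_argument("infile", help="path of the file to check")
    args = parser.parse_args()
    # A real implementation would run the agreed checks here and log
    # any failures rather than just printing the file name.
    print("checking %s" % args.infile)

if __name__ == "__main__":
    main()
```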
etc. We could have it integrated with every workflow, but I think it could be a better approach to have a specific workflow just for data checking. Another issue is what to do when problems are found - should they be fixed in the downloaded data? Should the files be deleted from the archive? Do the modelling groups need to be contacted this far after release? I know that people at Aspendale have been working on this problem as part of the data archive download. I am getting in touch with NCI to seek their input on this issue.
During the CWSLab phone meeting last week @taerwin suggested that the checking/cleaning of CMIP5 data could be handled by NCI. Instead of pointing at the raw CMIP5 data directory, the CWSLab workflow tool could instead be pointed at a directory of data that has passed a quality check. If users found an error in the quality-checked data, they could report it to NCI and that data could be removed from the directory until it was fixed (and an appropriate test could be added to the quality check). NCI would report errors back to PCMDI if necessary. NCI would obviously need assistance from people like @didiermonselesan and @willrhobbs in defining the checks that should be included in the quality-checking procedure. For the sake of transparency it would be good if the code for the tests were hosted on a public GitHub repo (possibly in the CWSL repo).
Late to the party, sorry. I'm watching this space now so you can contact me here, but email's probably still better for me having a searchable record :) I think you all have my address.
We need a sanity checker to check for problems with the data.
In a workflow you would go:
Constraint Builder
-> CMIP5
-> Sanity Check
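This is not the actual CWSL workflow-tool module API, just an illustrative sketch of where a sanity-check step could sit in that chain, assuming a hypothetical `checker` callable that returns a list of problems per file:

```python
def sanity_check_step(filenames, checker):
    """Hypothetical workflow step: run `checker` on each file selected by the
    upstream modules, keep the clean files and flag the rest for investigation."""
    passed, flagged = [], []
    for name in filenames:
        problems = checker(name)
        if problems:
            flagged.append((name, problems))
        else:
            passed.append(name)
    return passed, flagged
```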