---
output:
  github_document:
    df_print: kable
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# hydrogauge
<!-- badges: start -->
[![R-CMD-check](https://github.com/galenholt/hydrogauge/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/galenholt/hydrogauge/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## Purpose
This package was originally designed to query various state water gauge networks and the Australian Bureau of Meteorology (BoM) APIs. Some Australian states use [Kisters](https://resources.kisters.com.au/public/kisters-web-publishing/) Hydstra (hydllp), while the BoM uses Kisters WISKI/KiWIS. Thus, this package is likely to work for many other services that use these APIs with a simple change to the URL (argument `portal`).
I have taken an approach that tries to blend functionality with ease of use. I use a set of core functions that give the user access to most of the API functionality when needed, but have then wrapped these in an outer set of functions that attempt to make the most common use-cases easier and avoid the need to dig into API documentation.
## Installation
You can install the development version of hydrogauge from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("galenholt/hydrogauge")
```
## Background
This package was originally conceived to enable a workflow of querying to discover data availability (variables, time spans, locations), followed by pulling data based on those findings. It has sets of functions that try to be as consistent as possible with the underlying API calls to expose as much functionality as possible and avoid making too many decisions for the user. Thus, the core functions try to stay as close as possible to simple translations of arguments from R to Kisters formats, and of output JSON back to R. This approach lets us get closer to exposing the full functionality and makes for clearer mapping to the [Kisters documentation](https://kisters.com.au/doco/hydllp.htm).
We do provide convenience functions (primarily `fetch_timeseries`, `fetch_kiwis_timeseries`, and `fetch_hydstra_timeseries`) that automate some of the workflow (such as querying period of record and pulling that range). With the recent inclusion of BoM gauges, it is likely that harmonized wrapper functions will be developed to allow calling both types of databases with the same arguments and returning equivalent outputs.
Caveat: while the functionality provided maps as closely as possible to the underlying API calls, that functionality is not complete. This package does not have complete coverage of the available API calls or their arguments, but is under active development to add missing capabilities. Initial focus has been on identifying what is available to pull and pulling timeseries, along with including new data portals.
### Supported sources
Users can input any url for a KiWIS or HYDSTRA (hydllp) portal, but there are also some that can be called by name (not case sensitive):
- 'vic': the [Victorian water data](https://data.water.vic.gov.au/static.htm) (API url <https://data.water.vic.gov.au/WMIS/cgi/webservice.exe?>)
- 'nsw' : [New South Wales water data](https://realtimedata.waternsw.com.au/) (API url <https://realtimedata.waternsw.com.au/cgi/webservice.exe?>)
- 'qld': [Queensland water data](https://water-monitoring.information.qld.gov.au/) (API url <https://water-monitoring.information.qld.gov.au/cgi/webservice.exe?>)
- 'bom': [Australian Bureau of Meteorology](http://www.bom.gov.au/waterdata) (API url <http://www.bom.gov.au/waterdata/services>)
Other Australian states are in progress and haven't been looked into very thoroughly; for the moment, use BoM as a fallback if you don't know their API urls.
- [WA](https://wir.water.wa.gov.au/Pages/Water-Information-Reporting.aspx) seems to use HYDSTRA, but the API url hasn't been found yet.
- Unclear what [SA](https://water.data.sa.gov.au/) is using as a backend
- [Tasmania](https://portal.wrt.tas.gov.au/Data) and [NT](https://ntg.aquaticinformatics.net/Data) have maps that look a lot like BoM
If you're outside Australia, just use your URL. Create an issue (or pull request) to add other named portals.
The [Kisters website](https://resources.kisters.com.au/public/kisters-web-publishing/) provides a list of sites that are likely to work with these functions, but it may take work to find the correct API url, and those not in the list above are untested.
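As a rough illustration of how a case-insensitive name lookup like this can work, the named portals above can be mapped to their API URLs, with anything else passed through as a URL. This is a sketch only: `portal_urls` and `resolve_portal` are hypothetical names, not hydrogauge functions, and the package's own handling may differ.

```r
# Hypothetical lookup table (URLs are those listed above)
portal_urls <- c(
  vic = "https://data.water.vic.gov.au/WMIS/cgi/webservice.exe?",
  nsw = "https://realtimedata.waternsw.com.au/cgi/webservice.exe?",
  qld = "https://water-monitoring.information.qld.gov.au/cgi/webservice.exe?",
  bom = "http://www.bom.gov.au/waterdata/services"
)

# Hypothetical helper (not part of hydrogauge): return the URL for a
# named portal, ignoring case, or the input itself if it is not a known name.
resolve_portal <- function(portal) {
  key <- tolower(portal)
  if (key %in% names(portal_urls)) portal_urls[[key]] else portal
}

resolve_portal("Vic")
resolve_portal("https://example.org/KiWIS?")
```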
## Example
```{r setup}
library(hydrogauge)
library(ggplot2)
```
At present, we largely assume the user will have a list of gauges they want, as the current ability to programmatically obtain gauge numbers according to criteria is limited (but see `get_sites_by_datasource`, which allows asking for all sites that have a given datasource).
To get the timeseries, the user *should*, but does not need to (examples in development), know the available variables. However, finding what they are can be done through the functions here.
This vignette will proceed with a set of sites chosen to span a range of characteristics (which we know from test queries).
- The Upper Steavenson (405328) only has flow
- Barwon (233217) has many variables, but their start dates differ
- Taggerty (405331) is no longer in operation; it ran 2010-2013
- Marysville golf course (405837) is only rainfall
The functions all require gauges to be given as their numeric codes, stored as characters. The API needs a comma-separated string (`"number1, number2"`), but the functions here will accept a vector (`c("number1", "number2")`) and convert it internally. This is typically easier and reflects more common R workflows such as having a column of site numbers in a dataframe.
```{r}
barwon <- '233217'
steavenson <- '405328'
taggerty <- '405331'
golf <- '405837'
```
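The API itself wants those codes as a single comma-separated string. The conversion from a vector is roughly the following; `collapse_sites` is a hypothetical helper written for illustration, not the package's internal code.

```r
# Hypothetical helper (not part of hydrogauge): turn a vector of gauge
# numbers into the comma-separated string described above.
collapse_sites <- function(site_list) {
  paste(as.character(site_list), collapse = ", ")
}

# Works for a plain vector ...
collapse_sites(c("233217", "405328"))

# ... or for a column of site numbers in a dataframe
gauges <- data.frame(site = c("405331", "405837"), stringsAsFactors = FALSE)
collapse_sites(gauges$site)
```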
## Times
Both APIs (Hydstra/hydllp and KiWIS) expect times to be gauge-local and return times local to the gauge. This package provides some functionality to return times in other formats (see the `timetype` argument), including a standard parseable character format, UTC, and local time objects. In general, working with UTC and offsets is preferable, but care should be taken when giving desired start and end times, as these need to be in gauge-local time.
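To make the local/UTC distinction concrete, the base R sketch below converts between the two. It assumes a gauge that reports in a fixed UTC+10 offset, which is an assumption for illustration only; check the offset that applies to your gauges.

```r
# Assumption: the gauge reports in a fixed UTC+10 offset.
# "Etc/GMT-10" is the Olson name for UTC+10 (the sign is inverted).
gauge_tz <- "Etc/GMT-10"

# A gauge-local timestamp, as returned by the API
local_time <- as.POSIXct("2020-01-01 09:00:00", tz = gauge_tz)

# The same instant expressed in UTC
utc_string <- format(local_time, "%Y-%m-%d %H:%M:%S", tz = "UTC", usetz = TRUE)
utc_string
#> [1] "2019-12-31 23:00:00 UTC"

# Going the other way: a start time chosen in UTC, written as the
# 14-digit gauge-local string the API expects
utc_start <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
local_start <- format(utc_start, "%Y%m%d%H%M%S", tz = gauge_tz)
local_start
#> [1] "20200101100000"
```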
## Definitions and available options
I have tried to keep the function argument names the same as in the API, and the API determines which values those arguments accept. The API functions are documented by [Kisters](https://kisters.com.au/doco/hydllp.htm) (the creators), and there is a bit more information about options from [Queensland](https://water-monitoring.information.qld.gov.au/wini/Documents/RDMW_API_doco.pdf), though there are discrepancies between states.
After quite a lot of digging and testing, an incomplete set of definitions of common arguments and their potential values follows:
- `site_list` is the gauge number as a character (and the functions here accept a vector of these gauge numbers). Can be any gauge number in the database. Obtaining them programmatically is limited currently, except in `get_sites_by_datasource`.
- `datasource` is the type of data. I am currently aware of `"A"`, `"TELEM"`, and `"TELEMCOPY"`, but I found these in a roundabout way, and there may be others at sites I have not examined. `A` is likely to mean 'archive', `TELEM` seems to mean 'telemetry', and I'm not sure why there's a copy.
  - Some quick testing with `get_sites_by_datasource` shows that there are many more sites with `A` than `TELEM`, but that `TELEM` is not a subset: there are sites with `TELEM` and not `A`. Initial testing of `get_db_info` also finds sites that do not appear with any of these, and seem to have no data.
- `var_list` is the type of variable, e.g. rainfall, flow, temp. Should be a character of the numeric code, with or without trailing ".00", and can be a vector, e.g. `c("100", "210.00")`.
  - I do not currently have a comprehensive list of possible variables and their meaning, but `get_variables_by_site` will provide one for a set of sites.
  - The [Queensland documentation](https://water-monitoring.information.qld.gov.au/wini/Documents/RDMW_API_doco.pdf) gives some more information, but the numbers are not always the same.
  - Some variables (typically discharge) are calculated, and *do not appear in queries of available variables* such as `get_variables_by_site`. Those I'm aware of are `"141"` (discharge in ML) and `"140"` (discharge in cumecs, $m^3/sec$).
- `start_time` and `end_time` are the start and end times of the period requested. The API is strict that these should be 14-digit strings "YYYYMMDDHHIIEE", but the functions here will take them in date formats (POSIX), character, or numeric, and they do not need to be 14 digits.
  - If not dates, they *should* be at least YYYYMMDD (either character or numeric), and the rest will be padded with zeros.
- `interval` is the time interval of the return values for timeseries. Options seem to be (based on API error messages)
  - `"year"`, `"month"`, `"day"`, `"hour"`, `"minute"`, `"second"`. I don't think capitalisation matters.
  - I have not thoroughly tested whether only some are available for some variables at some sites.
- `multiplier` I \*think\* this allows intervals like 5 days, by passing `interval = 'day'` and `multiplier = 5`. Not tested other than 1 at present.
- `data_type` is the statistic to apply within each interval to get the values.
  - Options (from API error messages): `"mean"`, `"max"`, `"min"`, `"start"`, `"end"`, `"first"`, `"last"`, `"tot"`, `"maxmin"`, `"point"`, `"cum"`. Not all are currently tested.
  - *Warning:* any given API call can only take one value, which is applied to all variables. *This is unlikely to be appropriate if asking for many variables.*
  - Two options (both requiring *a priori* knowledge of available variables):
    1. Run `get_ts_traces` multiple times, with different subsets of `var_list`, each with an appropriate `data_type`.
    2. Use `get_ts_traces2`, which allows matched vectors of `var_type` and `data_type`, effectively automating option 1. (Deprecated, new approach pending)
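The zero-padding of `start_time` and `end_time` described above can be sketched as follows. `pad_time` is a hypothetical helper, not the package's internal implementation, and it assumes any date-times passed in are already in gauge-local time.

```r
# Hypothetical helper (not part of hydrogauge): build the 14-digit
# "YYYYMMDDHHIIEE" string from a date-time, Date, number, or string.
pad_time <- function(x) {
  if (inherits(x, "POSIXt")) {
    # Date-times are written out in full, in their own time zone
    return(format(x, "%Y%m%d%H%M%S"))
  }
  if (inherits(x, "Date")) {
    x <- format(x, "%Y%m%d")
  } else if (is.numeric(x)) {
    # Avoid scientific notation for long numbers
    x <- format(x, scientific = FALSE, trim = TRUE)
  }
  x <- as.character(x)
  # Require at least YYYYMMDD and at most the full 14 digits
  stopifnot(nchar(x) >= 8, nchar(x) <= 14)
  # Right-pad with zeros up to 14 digits
  paste0(x, strrep("0", 14 - nchar(x)))
}

pad_time(20200101)
#> [1] "20200101000000"
pad_time("2020010112")
#> [1] "20200101120000"
```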
## Finding datasources
To see what datasources are available for a site, use `get_datasources_by_site`. I've largely just been using "A" to test, but it's worth looking to see what datasources are available for a target site(s), and then doing the next step (finding variables) for each, to see whether the available variables (or timeperiods) differ. I'd like to automate that, but haven't yet.
```{r}
ds <- get_datasources_by_site(portal = 'Vic',
site_list = c(barwon, steavenson,
taggerty, golf))
```
```{r rows.print = 15}
ds
```
And we can plot that to get a visualisation. I'm planning to have this sort of plot for lots of the functions, but for now this is it.
```{r}
plot_datasources_by_site(ds)
```
## Finding available variables
Assuming for the moment the user knows the gauge numbers of interest, we then need to know what variables are available to extract timeseries of. We use `get_variable_list` to get this information, including both their names and numbers, as well as other details such as the time period of record and units.
To demonstrate with the sites above that have a range of variable types and starts,
```{r}
var_info <- get_variable_list(portal = 'Vic',
site_list = c(barwon, taggerty,
steavenson, golf),
datasource = "A")
```
That returns a tibble with information about each gauge and variable. A few things to note: it gives the names of the gauges, the names and values of the variables, and a start and end date for each. For example, the Barwon's start date for stage (100) is 1961, while the others (pH, ppm, etc) didn't start until 2010.
*Note that this does **not** return derived discharge variables (140 and 141)*. If variable 100 (stage height) exists, the other two always seem to, though I'm not positive.
```{r rows.print = 15}
var_info
```
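Because the derived discharge variables are not listed, one option is to add them to a request whenever stage height is present. The sketch below does that for a vector of variable codes. `add_derived_discharge` is a hypothetical helper, and it leans on the unverified observation above that 140 and 141 accompany 100.

```r
# Hypothetical helper (not part of hydrogauge): append the derived
# discharge codes when stage height (100) is among the listed variables.
add_derived_discharge <- function(var_codes) {
  # Compare numerically so "100" and "100.00" are treated alike
  has_stage <- any(as.numeric(var_codes) == 100)
  if (has_stage) {
    # union() keeps the original order and avoids duplicates
    var_codes <- union(var_codes, c("140", "141"))
  }
  var_codes
}

# Stage height present: 140 and 141 are appended
add_derived_discharge(c("100.00", "450.00"))

# Rainfall only: returned unchanged
add_derived_discharge("10.00")
```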
## Obtaining timeseries
This is the main use of the package, with the other bits getting us to the point of knowing what to ask for. Specifically, `get_variable_list` gives us a reference to know what the variables are to ask for and the relevant timeperiods.
There are a few ways to request the timeseries, and some pitfalls to avoid.
If we just want a set of variables that all need the same statistic applied (e.g. daily mean flows), we can pass that in as a vector. For example, to get daily mean stage height, discharge, and temperature, we can do that in one call, even for multiple gauges. Asking here for a single year (2020) to keep the demo reasonable.
```{r}
ts_days <- get_ts_traces(portal = 'Vic',
site_list = c(barwon, steavenson, taggerty, golf),
datasource = "A",
var_list = c("100", "141", "450"),
start_time = 20200101,
end_time = 20201231,
interval = "day",
data_type = "mean",
multiplier = 1,
returnformat = 'df')
```
That returns a tall dataframe, which the user can then split up or plot how they want. There are other options that return lists of dataframes if the user does not want all sites and variables combined-
- `returnformat = "varlist"` a list with one tibble per variable
- `returnformat = "sitelist"` a list with one tibble per site
- `returnformat = "sxvlist"` a list with one tibble per site x variable combo (including empty lists for missing combos)
```{r rows.print = 30}
# rows.print doesn't really work with devtools::build_readme(), so use head
head(ts_days, 30)
```
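If you already have the tall dataframe, similar lists can be produced after the fact with base R's `split`. The toy data below stands in for `ts_days`: the column names `site_short_name` and `variable_short_name` match those used for plotting later in this document, but the site names and values are invented.

```r
# Toy stand-in for the tall dataframe (names and values are invented)
ts_toy <- data.frame(
  site_short_name = c("site_a", "site_a", "site_b"),
  variable_short_name = c("Level", "Temp", "Level"),
  time = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01")),
  value = c(0.5, 18.2, 0.3),
  stringsAsFactors = FALSE
)

# One dataframe per variable (compare returnformat = "varlist")
by_variable <- split(ts_toy, ts_toy$variable_short_name)

# One dataframe per site (compare returnformat = "sitelist")
by_site <- split(ts_toy, ts_toy$site_short_name)

# One per site x variable combination; combinations with no data are
# kept as zero-row dataframes (compare returnformat = "sxvlist")
by_site_var <- split(
  ts_toy,
  list(ts_toy$site_short_name, ts_toy$variable_short_name)
)
names(by_site_var)
```

Here `site_b` has no temperature rows, so its `Temp` entry in `by_site_var` is an empty dataframe rather than being dropped.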
Note that if a variable isn't available for a gauge it just isn't returned, and same with timeperiods. So, the Barwon returns all variables, the golf course gauge does not return anything because it does not collect these variables, the Steavenson returns level and discharge but not temp, and the Taggerty doesn't appear at all despite having these variables because we've asked for data after it was decommissioned.
### Multiple variables, multiple statistics
Now, if we want another set of variables that should have a different statistic (e.g. rainfall makes sense as the daily sum, not the mean), there are two options: a separate call of `get_ts_traces`, or `get_ts_traces2`.
First, the separate `get_ts_traces`. Note that again this will ignore gauges without the info (Barwon).
```{r}
ts_rain <- get_ts_traces(portal = 'Vic',
site_list = c(barwon, golf),
datasource = "A",
var_list = c("10"),
start_time = 20200101,
end_time = 20201231,
interval = "day",
data_type = "tot",
multiplier = 1,
returnformat = 'df')
```
```{r}
head(ts_rain, 30)
```
To combine the results of separate calls, the user can apply `dplyr::bind_rows` post-hoc.
An automated approach is pending.
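Combining separate calls post-hoc looks roughly like the sketch below, which uses small invented dataframes in place of `ts_days` and `ts_rain`. The column names and the `statistic` label are assumptions for illustration; the label is added only to keep track of which `data_type` produced each row.

```r
library(dplyr)

# Invented stand-ins for the results of two separate get_ts_traces calls
mean_ts <- data.frame(site = "233217", variable = "100.00", value = 0.42)
tot_ts <- data.frame(site = "405837", variable = "10.00", value = 3.6)

# Stack the two, recording the source of each row in a new column
combined <- bind_rows(list(mean = mean_ts, tot = tot_ts), .id = "statistic")
combined
```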
## Notes
I have not done much systematic testing of speed for big requests (long periods of record for lots of sites and variables). I am working on optimizing the API calls. It's certainly possible to overflow your memory if you ask for everything for all sites.
## Development plans
This package is under active development, and has been put on GitHub as soon as the main functionality (`get_ts_traces`) was working. The current high-priority next steps are:
- Diagnostic plots (heatmaps of data availability in a few dimensions- see e.g. `plot_datasources_by_site` for a simple implementation)
  - And some plots in here
- Simple timeseries plots for data inspection, though those will in general be made by the user
  - And throw some plots in here
- Selecting and finding sites based on criteria
  - *Especially* geographic
- Smarter/faster handling of `data_type` matching to `var_list`
- Other states- NSW and QLD use a similar system, though it looks like there are small differences at least in `datasource` values and some variable codes. Still, shouldn't take much to extend, I don't think.
# Other states
I haven't done as much investigation of the datasources and similar variables, but the code should work for NSW and Queensland now by simply passing the state name. South Australia seems to use BoM, but I haven't looked into it much, nor any of the non-basin states. They likely will work if they use Kisters Hydstra.
## Datasources
Using `get_datasources_by_site` to see what datasources are available:
```{r}
nsw_ds <- get_datasources_by_site(portal = 'NSW',
site_list = c("422028", "410007"))
```
```{r}
nsw_ds
```
```{r}
qld_ds <- get_datasources_by_site(portal = 'QLD',
site_list = c("423203A", "424201A"))
```
```{r}
qld_ds
```
## Traces and plots
### NSW
Now we can `get_ts_traces` for `A`, just to keep things consistent
```{r}
nsw_ts_days <- get_ts_traces(portal = 'NSW',
site_list = c("422028", "410007"),
datasource = "A",
var_list = c("100", "141", "450"),
start_time = 20200101,
end_time = 20201231,
interval = "day",
data_type = "mean",
multiplier = 1,
returnformat = 'df')
```
```{r}
ggplot(nsw_ts_days, aes(x = time, y = value, color = site_short_name)) +
facet_wrap(~variable_short_name, scales = 'free') +
geom_line()
```
### QLD
I didn't pick very interesting examples here.
```{r}
qld_ts_days <- get_ts_traces(portal = 'QLD',
site_list = c("423203A", "424201A"),
datasource = "A",
var_list = c("100", "141", "450"),
start_time = 20200101,
end_time = 20201231,
interval = "day",
data_type = "mean",
multiplier = 1,
returnformat = 'df')
```
```{r}
ggplot(qld_ts_days, aes(x = time, y = value, color = site_short_name)) +
facet_wrap(~variable_short_name, scales = 'free') +
geom_line()
```
## Similar packages
The [bomWater](https://github.com/buzacott/bomWater) and [kiwisR](https://github.com/rywhale/kiwisR) packages have much of the functionality needed to call the BoM API, targeted at Australia (bomWater) and generally (kiwisR). I have learned a lot from them, and have only chosen to reimplement the BoM work because neither quite had what I needed for my workflow without being convoluted.
There are similar Python packages: [mdba-gauge-getter](https://github.com/MDBAuth/MDBA_Gauge_Getter), which calls state and BoM, and [bomwater](https://github.com/csiro-hydroinformatics/pybomwater), which is the BoM interface. Both of these are tailored for flow and stage timeseries, with less emphasis than here on identifying available data and exposing API functionality.
An obvious missing piece is USGS gauges. The USGS provides the in-house [dataRetrieval](https://github.com/DOI-USGS/dataRetrieval) package. At present, I have not explored wrapping that here.
## Useful API documentation
### HYDSTRA
I have tried to keep the function argument names the same as in the API, and the API restricts the options to the function arguments. The API functions are documented by [Kisters](https://kisters.com.au/doco/hydllp.htm) (the creators), and there is a bit more information about options from [Queensland](https://water-monitoring.information.qld.gov.au/wini/Documents/RDMW_API_doco.pdf), though there are discrepancies between states.
### WISKI/KiWIS
Most of the documentation I've found here is from the [Scottish Environment Protection Agency](https://www.sepa.org.uk) (SEPA); I have not found good docs straight from Kisters. The [Kisters docs themselves](https://timeseries.sepa.org.uk/KiWIS/KiWIS?datasource=0&service=kisters&type=queryServices&request=getrequestinfo) do seem to be available from SEPA though. These provide information about available functions, allowed variables to request, etc.
SEPA also provides a good [walkthrough](https://timeseriesdoc.sepa.org.uk/api-documentation/api-function-reference/principal-query-functions) of the principal query functions.