add ability to check if local file exists & is up to date #46
Comments
Have you started on this, @MattCowgill? Do you have any thoughts on how to check the "AND is up to date" condition?
Hi @daviddiviny, @HughParsonage has done some work on this but (unless I've misunderstood) hasn't yet done the "and is up to date" part. The only ways I can think to infer whether or not a file is up to date are:
Thoughts?
& FYI I haven't done much work on
I would be happy with an argument to
Is the alternative to just make the argument
I want to revive/revisit this... The function
I have some version of that in various analysis scripts of my own, but I think it would be useful functionality to build into the package. I'm just scratching my head a bit about the best way forward. I'd be grateful for any thoughts!
My instinct would be that [*] I'm not talking about something fancy; I'm more thinking of a basic text file, hosted on this repository, that says when each catalogue was last updated.
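A minimal sketch of how such a "last updated" text file might be consumed, using base R only. The file format, the column names `cat_no` and `last_updated`, and the helper `needs_refresh()` are all hypothetical; nothing like this exists in readabs as far as this thread shows. The index is inlined as text here so the example runs without network access; in practice it would be read from wherever the file is hosted.

```r
# Hypothetical index format: one row per catalogue with its last update date.
# Inlined here for illustration; a real index would be fetched from a URL.
index_text <- "cat_no,last_updated
6202.0,2022-03-17"

# Keep catalogue numbers as text ("6202.0") and parse the dates as Date
index <- read.csv(text = index_text, colClasses = c("character", "Date"))

# Hypothetical helper: TRUE when the index reports an update after the
# date on which the local copy was saved.
needs_refresh <- function(cat_no, local_saved_on, index) {
  updated <- index$last_updated[index$cat_no == cat_no]
  if (length(updated) == 0) return(TRUE)  # unknown catalogue: refresh to be safe
  as.Date(local_saved_on) < max(updated)
}

needs_refresh("6202.0", "2022-03-01", index)  # saved before the listed update
needs_refresh("6202.0", "2022-04-01", index)  # saved after the listed update
```

The appeal of this design is that the check is a single small download rather than a query per table, though it depends on someone (or some scheduled job) keeping the index current.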
I'm not sure I understand what you have in mind, @HughParsonage. At the moment, the process goes:
If I understand correctly, your proposal would modify step (2). When a user requests a table, the URL for that table would be obtained from a text file hosted in this repo rather than directly from the ABS TSD. Have I understood correctly? If so, I'm not sure how that relates to the "check if updated" process, other than probably saving a fraction of a second (because querying GitHub will likely be faster than the ~0.5 seconds it takes to query the ABS TSD).
As I understand it, the fundamental problem this feature request is trying to solve is the tradeoff between downloading every time and using local files. Downloading every time is much slower, but using local files risks their not being up to date. So if we can reduce the time it takes to search, download, and clean the tables enough that the tradeoff is negligible, we've solved the problem. So now consider running the existing method, which goes from a user request for a table to the cleaned table itself, as an automated, regular operation that stores the cleaned table for each request along with the metadata associated with each table. Then the user-visible functions of readabs will only need to access this metadata file and, if the file requires updating, the data stored. Much depends on the real timing differences of these approaches vis-à-vis typical user operations. One could, for example, download the metadata file
Your TSD looks like it has columns that could help you decide whether to invalidate your local cached copy:
glimpse(xml_dfs)
# Rows: 114
# Columns: 18
# $ ProductNumber <chr> "6202.0", "6202.0", "6202.0", "6202.0", "6202.0", "…
# $ ProductTitle <chr> "Labour Force, Australia", "Labour Force, Australia…
# $ ProductIssue <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ ProductReleaseDate <date> 2022-03-17, 2022-03-17, 2022-03-17, 2022-03-17, 20…
# $ ProductURL <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableURL <chr> "https://www.abs.gov.au/statistics/labour/employmen…
# $ TableTitle <chr> "Table 1. Labour force status by Sex, Australia - T…
# $ TableOrder <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,…
# $ Description <chr> "Employed total ; Persons ;", "Employed total ; P…
# $ Unit <chr> "000", "000", "000", "000", "000", "000", "000", "0…
# $ SeriesType <chr> "Trend", "Seasonally Adjusted", "Original", "Trend"…
# $ DataType <chr> "20", "20", "20", "20", "20", "20", "20", "20", "20…
# $ Frequency <chr> "Month", "Month", "Month", "Month", "Month", "Month…
# $ CollectionMonth <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
# $ SeriesStart <date> 1978-02-01, 1978-02-01, 1978-02-01, 1978-02-01, 19…
# $ SeriesEnd <date> 2022-02-01, 2022-02-01, 2022-02-01, 2022-02-01, 20…
# $ NoObs <chr> "529", "529", "529", "529", "529", "529", "529", "5…
# $ SeriesID <chr> "A84423127L", "A84423043C", "A84423085A", "A8442311…
Could you use the
(Also, I had a look at whether you could use HTTP caching tools, e.g. with
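One way the ProductReleaseDate column in the listing above could drive cache invalidation is to compare it with the modification time of the local file. The sketch below uses base R only; `local_is_current()` is a hypothetical helper, not a readabs function, and it assumes the file's modification time reflects when the file was downloaded.

```r
# Hypothetical helper: `release_date` would come from the ProductReleaseDate
# column of the Time Series Directory result for the requested table.
local_is_current <- function(path, release_date) {
  if (!file.exists(path)) return(FALSE)    # nothing cached yet
  saved_on <- as.Date(file.mtime(path))    # date the local copy was written
  # Strict comparison: a file saved on the release day itself is treated
  # as stale, which costs an extra download but avoids serving old data.
  saved_on > max(as.Date(release_date))
}

# Toy usage with a temporary file standing in for a cached table
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = 1), path)

local_is_current(path, as.Date("2022-03-17"))  # written well after this release
local_is_current(path, Sys.Date() + 7)         # a later release makes it stale
```

Modification times can be changed by copying or syncing files, so storing the release date alongside the cached data would be a more robust variant of the same idea.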
Hi @jimjam-slam: yes!
if local file exists AND is up to date, load local file
if not, get file from ABS
This could be used to form a new argument to read_abs(), something like try_local = TRUE
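The two-step logic above could be sketched roughly as follows. This is an illustration only, not how read_abs() is implemented: `read_with_local_check()` is a hypothetical name, and `is_current` and `download` are stand-ins for whatever freshness check and fetch step the package would use. The usage example relies on fake versions of both so that it runs without touching the ABS website.

```r
# Hypothetical helper, not part of readabs.
read_with_local_check <- function(path, is_current, download) {
  if (file.exists(path) && isTRUE(is_current(path))) {
    return(readRDS(path))   # local file exists AND is up to date
  }
  data <- download()        # otherwise get the file from the source
  saveRDS(data, path)       # and refresh the local copy
  data
}

# Stand-ins for illustration: count how often the "download" is triggered
counter <- new.env()
counter$n <- 0
fake_download <- function() {
  counter$n <- counter$n + 1
  data.frame(series_id = "A84423043C", value = 1)
}
always_current <- function(path) TRUE

cache <- tempfile(fileext = ".rds")
first  <- read_with_local_check(cache, always_current, fake_download)
second <- read_with_local_check(cache, always_current, fake_download)
counter$n  # the second call is served from the local file
```

Passing the freshness check in as a function keeps the open question in this issue (how to decide "is up to date") separate from the load-or-download control flow.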