We seem to be running into a similar problem in several projects, including http://github.com/subugoe/hoad/, http://github.com/subugoe/openairegraph/ and the crossref dump situation http://github.com/njahn82/cr_dump/:
There's big-ish (>1MB) serialised data, usually JSON, CSV or the same compressed, which is either/or (I'm not talking about databases here; that's a separate concern).

These files cause several problems / face limitations:

- they cannot be `git commit`ed (too large)

Possible straightforward solutions might be:

- store only locally (no reproducibility)
- store on a network drive (no reproducibility)
- set up a database (too expensive / too much hassle unless absolutely necessary)

I think we need something else which neatly abstracts away all this. There's probably a good solution out there already.
One avenue to pursue would be git lfs.
Ideally, we should have a solution which understands serialised data and knows how to diff rows (order does not matter).
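For illustration, here's a minimal sketch in base R of what an order-insensitive row diff could look like; `diff_rows()` is a hypothetical helper, not an existing function:

```r
# Hypothetical sketch: order-insensitive row diff between two
# data frames, reporting rows added and rows removed.
diff_rows <- function(old, new) {
  # Collapse each row into a single key string ("\x1f" is an
  # unlikely-to-collide field separator).
  old_keys <- do.call(paste, c(old, sep = "\x1f"))
  new_keys <- do.call(paste, c(new, sep = "\x1f"))
  list(
    added   = new[!(new_keys %in% old_keys), , drop = FALSE],
    removed = old[!(old_keys %in% new_keys), , drop = FALSE]
  )
}

old <- data.frame(doi = c("10.1/a", "10.1/b"), cites = c(3L, 5L))
new <- data.frame(doi = c("10.1/b", "10.1/c"), cites = c(5L, 7L))
diff_rows(old, new)
```

(Set semantics only; duplicated rows would need extra care.)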
Anyway, this shouldn't be too complicated and we might start with something small.
I'm going to look into this when I find the time. I think it could save us all a lot of time.
Among other things, the repeated downloads of the big dumps via `download.file()` should be transparently cached.
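A minimal sketch of what that could look like, assuming a local cache directory; `cached_download()` is a hypothetical wrapper, not an existing function, and the URL is a placeholder:

```r
# Hypothetical wrapper around download.file(): only hits the
# network when the file is not already in the local cache.
cached_download <- function(url, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Repeated calls (locally or on CI) reuse the cached copy.
dump_path <- cached_download("https://example.com/cr_dump.json.gz")
```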
Referenced in commit 21f6972: enable ci and make stuff reproduce as per #2 (also opens #7 #6 #5 #4).
This would also be a genuine feature for a lot of users, who might face the same problem when they run this in CI or collaboratively.