
streamline data situation #53

Open
maxheld83 opened this issue Apr 29, 2020 · 2 comments
Labels
needs-votes 👍 Please upvote, if this is worthwhile


@maxheld83
Contributor

maxheld83 commented Apr 29, 2020

We seem to be running into a similar problem in several projects, including http://github.com/subugoe/hoad/, http://github.com/subugoe/openairegraph/ and the crossref dump situation http://github.com/njahn82/cr_dump/:

Each involves big-ish (>1MB) serialised data, usually JSON or CSV (sometimes compressed), which is one or both of:

  • expensive to recompute
  • very large / slow to download.

(I'm not talking about databases here, that's a separate concern).

These files cause several problems / face limitations:

  • they're too big/expensive to recompute often
  • they can't be committed to git (too large)
  • they can make or break results; i.e. they are crucial for reproducibility
  • we need them in CI and for tests
  • we need to share them

Possible straightforward solutions might be:

  • store only locally (no reproducibility)
  • store on a network drive (no reproducibility)
  • set up a database (too expensive/too much hassle unless absolutely necessary)
  • store in GitHub releases (not easily automated, very limited storage)
  • ...

I think we need something else which neatly abstracts away all this.
There's probably a good solution out there already.

One avenue to pursue would be Git LFS.

Ideally, we'd have a solution which understands serialised data and can diff it row by row, treating two files as equal when they contain the same rows in a different order.
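
To make the row-diffing idea concrete, here is a minimal sketch in Python (for illustration only; the projects above are R packages, and nothing here is an existing tool). It assumes CSV input with a header row and treats the rows as a multiset, so reordering is ignored but duplicated rows still count. The function name `row_diff` is hypothetical.

```python
import csv
import io
from collections import Counter


def row_diff(old_csv: str, new_csv: str):
    """Return (added, removed) rows between two CSV texts, ignoring row order.

    Hypothetical helper: rows are compared as a multiset, so a row that
    appears twice in one file and once in the other shows up in the diff.
    """
    def parse(text):
        reader = csv.reader(io.StringIO(text))
        header = next(reader, None)  # first line is assumed to be the header
        return header, Counter(tuple(row) for row in reader)

    old_header, old_rows = parse(old_csv)
    new_header, new_rows = parse(new_csv)
    if old_header != new_header:
        raise ValueError("headers differ; compare columns first")

    # Counter subtraction keeps only positive counts, i.e. the surplus rows.
    added = sorted((new_rows - old_rows).elements())
    removed = sorted((old_rows - new_rows).elements())
    return added, removed
```

With this, a file whose rows were merely shuffled yields two empty lists, while a genuinely new row shows up under `added`.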

Anyway, this shouldn't be too complicated and we might start with something small.

I'm going to look into this when I have the time.
I think this could save us all a lot of time.

@maxheld83 maxheld83 self-assigned this Apr 29, 2020
@maxheld83
Contributor Author

Among other things, the repeated downloads of the big dumps via download.file() should be transparently cached.
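
A rough sketch of what "transparently cached" could look like, written in Python for illustration (in the R projects the same idea would wrap download.file()). The names `cached_download` and `fetch_url` and the cache layout are assumptions, not an existing API: the URL is hashed to a file name, and the network is only hit when that file is missing.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_url(url: str) -> bytes:
    """Default fetcher: read the whole response body into memory."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url: str, cache_dir, fetch=fetch_url) -> Path:
    """Return a local path for `url`, downloading only on a cache miss.

    Hypothetical helper: the cache key is the SHA-256 of the URL, so the
    cache does not notice when the remote file changes under the same URL.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # never leaves a half-written file under the final name.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(fetch(url))
        tmp.replace(target)
    return target
```

Callers would use the returned path exactly as they would a freshly downloaded file; pointing `cache_dir` at a directory that CI persists between runs would avoid the repeated downloads there as well.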

maxheld83 referenced this issue in subugoe/openairegraph Apr 29, 2020
@maxheld83
Contributor Author

This would also be a feature for a lot of users, who might face the same problem when they run this in CI or collaboratively.

@maxheld83 maxheld83 transferred this issue from subugoe/openairegraph Apr 30, 2020
@maxheld83 maxheld83 added the needs-votes 👍 Please upvote, if this is worthwhile label Apr 30, 2020