A tool to manage file metadata in a git repo, while keeping the file contents elsewhere.
It's generally not a good idea to put large binary files in git. Gitshed allows you to put only the file name and version metadata in git, while storing the file contents outside the git repo.
Gitshed is similar in spirit to git-annex, but simpler and more tailored to the author's requirements.
The name "gitshed" is a triple pun:
- It helps you "shed" binary files from your git repo.
- The files live outside the repo, in a "shed".
- The project was named in a bikeshedding session.
Gitshed stores file contents in a key->value content store. The content store is typically hosted on a remote server, so multiple collaborators can use it.
File contents are pulled from the content store and stored locally in the shed
(<repo root>/.gitshed/files
). The shed directory must be gitignored.
Files are represented in the repo as relative symlinks into the shed directory. A file managed by Gitshed can be in one of two states: unsynced or synced.
- An unsynced file has no content in the shed, and is represented by a broken symlink.
- A synced file has content in the shed, and is represented by a symlink to that content.
Syncing a file pulls the content in from the content store into the shed, healing the symlink.
Gitshed is typically installed as a custom git command, and invoked thus: git shed <subcommand> <args>
.
It may also be invoked directly, but for the rest of this documentation we'll assume the former invocation style.
Gitshed has several subcommands. Some of these take file paths as arguments, and these can be specified in two ways:
- As a list of globs:
git shed manage data/*.bin archives/*.jar
- As a file containing paths, one per line:
git shed manage -f path/to/argfile
orgit shed manage --argfile=path/to/argfile
Run git shed
or git shed --help
to get help. Run git shed <subcommand> --help
to get help for that subcommand.
Places files under gitshed's management.
git shed manage <file args>
Each file is uploaded to the content store, moved into the shed and replaced by a symlink.
Syncs files into the shed.
git shed sync <file args>
The contents of all managed but unsynced files are pulled from the content store to their symlinked locations in the shed.
To sync every managed file in the repo, omit the file arguments:
git shed sync
Re-download managed files into the shed.
git shed resync <file args>
The contents of all managed files, even those already synced, are pulled from the content store to their symlinked location in the shed.
To resync every managed file in the repo, omit the file arguments:
git shed resync
Prints a short status message with counts of synced and unsynced files.
git shed status
Lists all synced files.
git shed synced
Lists all unsynced files.
git shed unsynced
Remove files from gitshed's management.
git shed unmanage <file args>
Symlinks are overwritten with the content they symlinked to.
To unmanage all managed files in the repo, omit the file arguments:
git shed unmanage
Verifies that the Gitshed instance is working (e.g., that it's able to communicate with the
content store, and that the shed directory <repo root>/.gitshed/files
is gitignored).
git shed setup
When adding a new file to the shed, typical workflow is:
- Add the file at the relevant path in the repo.
git manage path/to/file
to place the file under Gitshed's management. The file will be replaced by a symlink.- Commit the symlink.
After pulling, other contributors will have a broken symlink at path/to/file
and will
need to git shed sync
to heal it and have access to the file content.
You may want to use git hooks to have git shed sync
called automatically when the workspace changes.
The relevant hooks are: post-applypatch
, post-checkout
, post-commit
, post-merge
and post-rewrite
.
In the future we hope to create proper pypi releases. For now you have to build Gitshed from source.
Gitshed uses the pants build tool to create a .pex
file.
This is a self-contained, standalone python executable that requires only a python interpreter
to run.
To build gitshed:
pants binary src/python/gitshed
This will create dist/gitshed.pex
.
You can run this file directly, but, as mentioned above, it's convenient to install it as a custom git command called
'shed'. You can do so either using a wrapper script named git-shed
on your PATH
, or using git config alias
.
The only setup steps are:
- Add
<repo root>/.gitshed/files
to your.gitignore
file. - Create a configuration file.
Config lives in the file <repo root>/.gitshed/config.json
. For example, to set up a remote content store:
{
...
"content_store": {
"chunk_size": 20,
"remote": {
"host": "mycontentstore",
"root_path": "/data/gitshed/myrepo/",
}
}
...
}
This will upload and download using rsync, to/from the specified root path on the specified host. Each invocation of rsync will read/write chunk_size files.
This is currently the only remote content store implementation available, but it would be very straightforward to write new ones (e.g., a RESTful content store). Feel free to contribute one.
{
...
"exclude": [".pants.d", ".pants.bootstrap"],
...
}
This will prevent pants from searching for symlinks into the shed in those directories, which can help performance.
{
...
"concurrency": {
"get": 6,
"put": 4
}
...
}
This will cause git shed to use 6 threads for downloading content while syncing and 4 threads when uploading content while putting files under management.
There are example config files in this repo:
.gitshed/config.json.local
: For a local content store, useful for playing around..gitshed/config.json.remote
: For a remote content store.