Skip to content

Latest commit

 

History

History
120 lines (84 loc) · 2.27 KB

dvc.md

File metadata and controls

120 lines (84 loc) · 2.27 KB

DVC

Data Version Control.

Common commands

Initialize dvc - needs to be inside a git repo:

dvc init

Add remote storage:

dvc remote add -d remote_storage path/to/your/dvc_remote

Test adding remote storage using /tmp/:

dvc remote add -d remote_storage /tmp/local_dvc_storage/

Turn off analytics collection:

dvc config core.analytics false

Track data files or folder with dvc:

dvc add data/raw/train

It will do the following operations:

  • Add data/raw/train to .gitignore
  • Create a file data/raw/train.dvc containing md5, nfiles and other informations about the tracked data
  • Copy data/raw/train to a staging area - a cache located in .dvc/cache

Push data to the remote storage:

dvc push

Get the data from the remote storage:

dvc checkout data/raw/train.dvc

On a freshly cloned repo, one can fetch all data from the remote in cache:

dvc fetch

Or fetch only some files:

dvc fetch data/raw/val.dvc

To combine fetching and checking out:

dvc pull

Modify content:

  1. Unlink the file/folder with dvc unprotect: dvc unprotect data/raw/train
  2. Update the data or download new one
  3. Add the file/folder back to dvc: dvc add data/raw/train

Note: Often, it is better to version the raw data and just track a new version of this. Consider the raw data as immutable.

Commit new data: Record changes to files or directories tracked by DVC.

dvc commit

Pipelines & DAGs

DVC makes it possible to execute data pipelines and track generated artificats.

dvc stage add \
  -n prepare \                 # name
  -d src/prepare.py \          # deps
  -d data/raw \                # deps
  -o data/prepared/train.csv \ # outs
  -o data/prepared/test.csv \  # outs
  python src/prepare.py        # cmd 

which is equivalent to this section in the dvc.yaml file:

stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw
      - src/prepare.py
    outs:
      - data/prepared/test.csv
      - data/prepared/train.csv

Misc

The configuration file is located in .dvc/config. Use a share cache when multiple repos points to the same data files: dvc cache dir path/to/shared_cache and mv .dvc/cache/* path/to/shared_cache.