From 4e94a4523201cea05b9c2b4359ecfba6175606c8 Mon Sep 17 00:00:00 2001
From: Emre Sahin
Date: Mon, 1 Jan 2024 16:17:45 +0300
Subject: [PATCH] Update intro doc to link to DVC comparison and benchmarks

---
 book/src/how-to/benchmark-versus-dvc.md | 2 +-
 book/src/intro/index.md                 | 6 +-----
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/book/src/how-to/benchmark-versus-dvc.md b/book/src/how-to/benchmark-versus-dvc.md
index 3c789a234..c3ae89645 100644
--- a/book/src/how-to/benchmark-versus-dvc.md
+++ b/book/src/how-to/benchmark-versus-dvc.md
@@ -13,7 +13,7 @@ We'll test the tools in the following scenarios:
 - Checking in and out large files: We'll create 100 large files using `xvc-test-helper` and repeat the above tests.
 - Running small pipelines: We'll create a pipeline with 10 steps to run simple commands.
 - Running medium sized pipelines: We'll create a pipeline with 100 steps to run simple commands.
-- Running large pipelines: We'll create a pipeline with 10000 steps to run simple commands.
+- Running large pipelines: We'll create a pipeline with 1000 steps to run simple commands.
 
 ## Setup
 
diff --git a/book/src/intro/index.md b/book/src/intro/index.md
index d86869e67..7b7540313 100644
--- a/book/src/intro/index.md
+++ b/book/src/intro/index.md
@@ -21,16 +21,12 @@ There are many similar tools for managing large files on Git, managing machine l
 
 Similar tools for file management on Git are the following:
 
+- `dvc`: See [Xvc for DVC Users](./start/from-dvc.md) and [Benchmarks against DVC](./how-to/benchmarks-versus-dvc.md) documents for a detailed comparison.
 - `git-annex`: One of the earliest and most successful projects to manage large files on Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to [`xvc storage new generic`](xvc-book/ref/xvc-storage-new-generic.md). It features an assistant aimed to make it easier for common use cases.
   It uses SHA-256 as the single digest option and uses symlinks as a [recheck method][recheck-method] It doesn't have data pipeline features.
 - `git-lfs`: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. Uses the same digest function Git uses. (By default, SHA-1). Uses `.gitattributes` mechanism to track certain files by default. It doesn't have data pipeline features.
-- `dvc`: Uses YAML files _in the working directory_ to track file content. It uses MD5 sums. It can use different
-  [recheck method][recheck-method] for all the files in the repository. It has experiments tracking features, data pipelines, and a
-  [SaaS GUI.](https://studio.iterative.ai)
-
-I have done some preliminary benchmarks to measure _time to add_ files. I added 70.000 files with a single command. `xvc file track` (0.3.1) finished in 19 seconds, `git lfs track '*.png' ; git add 'data/images/**/*.png'` in 56 seconds, `dvc add data/images` in 80 seconds and `git-annex add data/images` in around 11 minutes. Note that these measurements are affected by output behavior and commands may gain some speed by turning off the default terminal output. Some finer benchmarks may be provided in the future, when Xvc is optimized.
 
 [recheck-method]: ./concepts/recheck.md