
Commit

Merge pull request #38 from InfuseAI/docs/20220412
Update document
popcornylu authored Apr 12, 2022
2 parents 0475d74 + 0f17b3f commit 6895c74
Showing 17 changed files with 359 additions and 47 deletions.
21 changes: 7 additions & 14 deletions README.md
@@ -1,26 +1,19 @@
# ArtiVC

[ArtiVC](https://artivc.io/) (**Arti**facts **V**ersion **C**ontrol) is a version control system for large files.

To store and share large files, we may use NFS or object storage (e.g. s3, MinIO). However, if we would like to do versioning on top of them, it is not a trivial thing. ArtiVC is a CLI tool to enable you to version files on your storage without pain. You don't need to install any additional server or gateway and we turn your storage into the versioned repository.
[ArtiVC](https://artivc.io/) (**Arti**facts **V**ersion **C**ontrol) is a handy command-line tool for data versioning on cloud storage. With only one command, it helps you neatly snapshot your data and switch between versions. Even better, it integrates seamlessly with your existing cloud environment. ArtiVC supports three major cloud providers (AWS S3, Google Cloud Storage, Azure Blob Storage) and remote filesystems over SSH.

[![asciicast](https://asciinema.org/a/6JEhzpJ5QMiSkiC74s5CyT257.svg)](https://asciinema.org/a/6JEhzpJ5QMiSkiC74s5CyT257?autoplay=1)

Try it out with the [Getting Started](https://artivc.io/usage/getting-started/) guide.

# Features

- **Use your own storage**: If you store data in NFS or S3, just use the storage you already use.
- **No additional server required**: ArtiVC is a CLI tool. No server or gateway is required to install or operate.
- **Multiple backend support**: Currently, we support local, NFS (by local repo), and s3. And more in the future

- **Reproducible**: A commit is stored in a single file and cannot be changed. There is no way to add/remove/modify a single file in a commit.
- **Expose your data publicly**: Expose your repository with a public HTTP endpoint, then you can download your data in this way
```
avc get -o /tmp/dataset https://mybucket.s3.ap-northeast-1.amazonaws.com/path/to/my/dataset@<version>
```
- **Smart storage and transfer**: For the same content of files, there is only one instance stored in the artifact repository. If a file has been uploaded by other commits, no upload is required because we know the file is already there in the repository. Under the hood, we use [content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage) to put the objects.

- **Data Versioning**: Version your data like versioning code. ArtiVC supports commit history, commit messages, and version tags. You can diff two commits and pull data from a specific version (see the workflow sketch after this list).
- **Use your own storage**: We are used to putting large files in NFS or S3. To use ArtiVC, you can keep putting your files on the same storage without changes.
- **No additional server is required**: ArtiVC is a CLI tool. No server or gateway needs to be installed or operated.
- **Multiple backends support**: ArtiVC natively supports local filesystem, remote filesystem (by SSH), AWS S3, Google Cloud Storage, and Azure Blob Storage as backends, and 40+ more backends are supported through [Rclone](https://artivc.io/backends/rclone/) integration. [Learn more](https://artivc.io/backends/)
- **Painless Configuration**: No one likes to configure, so we leverage your existing configuration as much as possible: use `.ssh/config` for SSH access, and `aws configure`, `gcloud auth application-default login`, or `az login` for the cloud platforms.
- **Efficient storage and transfer**: The file structure of the repository is stored and transferred efficiently by [design](https://artivc.io/design/how-it-works/). It avoids storing duplicated content and minimizes the number of files to upload when pushing a new version. [Learn more](https://artivc.io/design/benchmark/)
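
A minimal sketch of a typical workflow, assuming an `avc init` command to bind a workspace to a repository (the bucket path below is hypothetical):

```shell
# Bind the current directory to a repository (assumed command; hypothetical bucket path)
avc init s3://mybucket/path/to/mydataset
# Snapshot the current files as a new version
avc push
# Elsewhere: clone the repository and pull the data
avc clone s3://mybucket/path/to/mydataset
cd mydataset/
avc pull
```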

# Documentation

20 changes: 9 additions & 11 deletions docs/content/en/_index.md
@@ -7,12 +7,10 @@ geekdocAnchor: false
---

{{< columns >}}
### ArtiVC (Artifact Version Control) is a version control system for large files.


**rsync** is an ssh-based tool that provides fast incremental file transfer.<br>
**Rclone** is a rsync-like tool for cloud storage.<br>
**ArtiVC** is like Git for files versioning and like Rclone for cloud storage.
<p style="text-align: left">
<b>ArtiVC (Artifact Version Control) is a handy command-line tool for data versioning on cloud storage.</b> With only one command, it helps you neatly snapshot your data and switch between versions. Even better, it integrates seamlessly with your existing cloud environment. ArtiVC supports three major cloud providers (AWS S3, Google Cloud Storage, Azure Blob Storage) and remote filesystems over SSH.
</p>

<--->
[![asciicast](https://asciinema.org/a/6JEhzpJ5QMiSkiC74s5CyT257.svg)](https://asciinema.org/a/6JEhzpJ5QMiSkiC74s5CyT257?autoplay=1)
@@ -25,17 +23,17 @@
{{< columns >}}
### Data Versioning

Version your data like versioning code. ArtiVC supports commmit history, commit message, version tag. You can diff two commits, pull data from speciifc version.
Version your data like versioning code. ArtiVC supports commit history, commit messages, and version tags. You can diff two commits and pull data from a specific version.

<--->

### Use your own storage

We are used to putting large files in NFS or S3. To use ArtiVC, you can keep put your files on the same storage without changes.
We are used to putting large files in NFS or S3. To use ArtiVC, you can keep putting your files on the same storage without changes.

<--->

### No additional server required
### No additional server is required

ArtiVC is a CLI tool. No server or gateway needs to be installed or operated.

@@ -45,19 +43,19 @@

### Multiple backends support

ArtiVC natively supports local filesystem, remote filesystem (by SSH), AWS S3, Google Cloud Storage, Azure Blob Storage as backend. And 40+ backends are supported through [Rclone](backends/rclone/) integration. [Learn more](backends/)
ArtiVC natively supports local filesystem, remote filesystem (by SSH), AWS S3, Google Cloud Storage, and Azure Blob Storage as backends, and 40+ more backends are supported through [Rclone](backends/rclone/) integration. [Learn more](backends/)

<--->

### Painless Configuration

No one like to configure. So we leverage the original configuraion as much as possible. Use `.ssh/config` for ssh access, and use `aws configure`, `gcloud auth application-default login`, `az login` for the cloud platforms.
No one likes to configure, so we leverage your existing configuration as much as possible: use `.ssh/config` for SSH access, and `aws configure`, `gcloud auth application-default login`, or `az login` for the cloud platforms.

<--->

### Efficient storage and transfer

The file structure of repository is storage and transfer effiecntly by [design](design/how-it-works/). It prevents from storing duplicated content and minimum the number of files to upload when pushing a new version. [Learn more](design/benchmark/)
The file structure of the repository is stored and transferred efficiently by [design](design/how-it-works/). It avoids storing duplicated content and minimizes the number of files to upload when pushing a new version. [Learn more](design/benchmark/)


{{< /columns >}}
4 changes: 4 additions & 0 deletions docs/content/en/backends/azureblob.md
@@ -3,6 +3,8 @@ title: Azure Blob Storage
weight: 13
---

{{< toc >}}

Use [Azure Blob Storage](https://azure.microsoft.com/services/storage/blobs/) as the repository backend.

## Configuration
@@ -22,6 +24,8 @@ The logged-in account requires **Storage Blob Data Contributor** role to the storage account
For more information, please see https://docs.microsoft.com/azure/storage/blobs/assign-azure-role-data-access
{{< /hint >}}

The Azure Blob Storage backend authenticates using the default procedure defined by the [Azure SDK for Go](https://docs.microsoft.com/azure/developer/go/azure-sdk-authentication).

### Use Azure CLI to login

This backend supports using the [Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli) to configure the login account. It will open the browser and start the login process.
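
For example, a minimal sketch of logging in with the Azure CLI and then cloning a repository; the container URL form, storage account, and path below are assumptions for illustration:

```shell
# Log in interactively with the Azure CLI (opens a browser)
az login
# Clone a repository from Azure Blob Storage (hypothetical account, container, and path)
avc clone https://mystorageaccount.blob.core.windows.net/mycontainer/path/to/mydataset
cd mydataset/
```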
11 changes: 10 additions & 1 deletion docs/content/en/backends/gcs.md
@@ -3,6 +3,8 @@ title: Google Cloud Storage
weight: 12
---

{{< toc >}}

Use [Google Cloud Storage (GCS)](https://cloud.google.com/storage) as the repository backend.

Note that Google Cloud Storage is not [Google Drive](https://www.google.com.tw/drive/). They are different Google products.
@@ -30,7 +32,7 @@ Before using the backend, you have to configure the service account credential.
1. Use the service account attached to GCP resources (e.g. GCE, GKE). This is the recommended way if `ArtiVC` runs in the GCP environment. Please see the [default service accounts](https://cloud.google.com/iam/docs/service-accounts#default) document.
The GCS backend finds credentials using the default procedure defined by [Google Cloud](https://cloud.google.com/docs/authentication/production).
@@ -46,3 +48,10 @@ Clone a repository
avc clone gs://mybucket/path/to/mydataset
cd mydataset/
```


## Environment Variables

| Name | Description | Default value |
| --- | --- | --- |
| `GOOGLE_APPLICATION_CREDENTIALS` | The location of the service account key file (JSON) | |
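
For example, a minimal sketch of pointing the GCS backend at a service-account key file (the key path and bucket below are hypothetical):

```shell
# Let the default credential lookup find a service-account key (hypothetical path)
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/artivc-sa.json"
# Clone a repository from GCS (hypothetical bucket and path)
avc clone gs://mybucket/path/to/mydataset
cd mydataset/
```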
23 changes: 22 additions & 1 deletion docs/content/en/backends/s3.md
@@ -3,6 +3,8 @@ title: AWS S3
weight: 11
---

{{< toc >}}

Use AWS S3 as the repository backend.

## Features
@@ -12,7 +14,17 @@ Use the S3 as the repository backend.

## Configuration

Prepare the `~/.aws/credentials` to access the s3 backend. Please see the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
1. Install the [AWS CLI](https://aws.amazon.com/cli/)
2. Configure the AWS CLI. Please see the [AWS documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
```
aws configure
```
3. Check current config
```
aws configure list
```

The S3 backend loads configuration using the default procedure of the [AWS SDK for Go](https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/#specifying-credentials).

## Usage

@@ -26,3 +38,12 @@ Clone a repository
avc clone s3://mybucket/path/to/mydataset
cd mydataset/
```

## Environment Variables

| Name | Description | Default value |
| --- | --- | --- |
| `AWS_ACCESS_KEY_ID` | The access key | |
| `AWS_SECRET_ACCESS_KEY` | The secret access key | |
| `AWS_PROFILE` | The profile to use in the credential file | `default` |
| `AWS_REGION` | The region to use | The region from the profile |
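
For example, a minimal sketch of selecting a profile and region through environment variables instead of editing config files (the profile name, region, and bucket below are hypothetical):

```shell
# Pick a named profile from ~/.aws/credentials and a region for this shell (hypothetical values)
export AWS_PROFILE=myprofile
export AWS_REGION=ap-northeast-1
# Clone a repository from S3 (hypothetical bucket and path)
avc clone s3://mybucket/path/to/mydataset
cd mydataset/
```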
2 changes: 2 additions & 0 deletions docs/content/en/backends/ssh.md
@@ -3,6 +3,8 @@ title: Remote Filesystem (SSH)
weight: 2
---

{{< toc >}}

Use remote filesystem through SSH as the repository backend.
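
For example, a minimal sketch of cloning over SSH; the scp-style repository address is an assumption, and `myserver` is a hypothetical host alias resolved through `~/.ssh/config`:

```shell
# "myserver" is resolved via ~/.ssh/config (HostName, User, IdentityFile, ...)
# The host:path repository form below is an assumption, not confirmed by this page
avc clone myserver:path/to/mydataset
cd mydataset/
```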

## Features
2 changes: 1 addition & 1 deletion docs/content/en/design/images/benchmark1.svg
2 changes: 1 addition & 1 deletion docs/content/en/design/images/benchmark2.svg
2 changes: 1 addition & 1 deletion docs/content/en/design/images/benchmark3.svg
36 changes: 36 additions & 0 deletions docs/content/en/usage/dryrun.md
@@ -0,0 +1,36 @@
---
title: Dry Run
weight: 11
---

Pushing and pulling data is time-consuming and should be double-checked before any transfer. Dry run is the feature that lists the changeset before anything is sent.


## Push

1. Dry run before pushing
```shell
avc push --dry-run
```

1. Do the actual push
```shell
avc push
```

## Pull

1. Dry run before pulling
```shell
avc pull --dry-run
# or check in delete mode
# avc pull --delete --dry-run
```

1. Do the actual pull

```shell
avc pull
# avc pull --delete
```

@@ -1,9 +1,9 @@
---
title: Expose the dataset
weight: 3
title: Expose the data
weight: 20
---

ArtiVC repository can be exposed as a http endpoint. In S3, we can just make the bucket and give the data consumer the http endpiont of the repository. In this way, we can download data through CDN or other reverse proxies.
An ArtiVC repository can be exposed as an HTTP endpoint. In S3, we can just make the bucket public and give the data consumer the HTTP endpoint of the repository. In this way, we can download data through a CDN or other reverse proxies.

1. [Make your S3 bucket public](https://aws.amazon.com/premiumsupport/knowledge-center/read-access-objects-s3-bucket/?nc1=h_ls)
1. Copy the public URL of your repository. For example

0 comments on commit 6895c74
