Commit: Add pluggable data loaders (#59)
* Add pluggable containerized data loaders (see
https://github.com/substratusai/dataset-squad).
* Rename "data-puller" to "data-loader" - a data loader may generate data
rather than pull it.
* Add Model training parameters `.spec.training.params`.
* Fix bug with Notebook error handling when source Model does not exist.
* Fix bug with (training) Model error handling when source Model does
not exist.
* Fix bug with ModelServer error handling when source Model does not
exist.
* Add "ai" category to each CRD. Now `kubectl get ai` will show all
Models, Datasets, Notebooks, ModelServers.
* Set the default container for kubectl (logs, etc.) (see
https://kubernetes.io/docs/reference/labels-annotations-taints/#kubectl-kubernetes-io-default-container); a sketch of this annotation follows the list.
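
For context, this is the well-known `kubectl.kubernetes.io/default-container` annotation, which tells kubectl which container to target by default for `logs` and `exec` on multi-container Pods. A minimal sketch (Pod and image names here are illustrative, not taken from this commit):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-trainer  # illustrative name
  annotations:
    # kubectl logs/exec will target this container unless -c is given.
    kubectl.kubernetes.io/default-container: trainer
spec:
  containers:
  - name: trainer
    image: substratusai/trainer:example  # illustrative image
  - name: sidecar
    image: busybox                       # illustrative second container
```

With this set, `kubectl logs model-trainer` streams the trainer container's logs without needing `-c trainer`.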
Authored by nstogner on Jul 1, 2023 · 1 parent 73d3fb3 · commit 0572e07
Showing 30 changed files with 697 additions and 175 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@

 # Image URL to use all building/pushing image targets
-IMG ?= docker.io/substratusai/controller-manager:v0.1.0-alpha
+IMG ?= docker.io/substratusai/controller-manager:v0.3.0-alpha
 # ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
 ENVTEST_K8S_VERSION = 1.26.1

22 changes: 14 additions & 8 deletions README.md
@@ -69,12 +69,16 @@ The Model API is capable of building base Models from Git repositories, or finet
 apiVersion: substratus.ai/v1
 kind: Model
 metadata:
-  name: my-model
+  name: fb-opt-125m-squad
 spec:
   source:
     modelName: facebook-opt-125m
   training:
-    datasetName: favorite-colors
+    datasetName: squad
+    params:
+      epochs: 30
+      batchSize: 3
+      dataLimit: 120
   # TODO: This should be copied from the source Model.
   size:
     parameterBits: 32
@@ -92,9 +96,9 @@ The ModelServer API runs a web server that serves the Model for inference (FUTUR
 apiVersion: substratus.ai/v1
 kind: ModelServer
 metadata:
-  name: my-model-server
+  name: fb-opt-125m-squad
 spec:
-  modelName: my-model
+  modelName: fb-opt-125m-squad
```
### Dataset API
@@ -106,11 +110,12 @@ The Dataset API snapshots and locally caches remote datasets to facilitate effic
 apiVersion: substratus.ai/v1
 kind: Dataset
 metadata:
-  name: favorite-colors
+  name: squad
 spec:
+  filename: all.jsonl
   source:
-    url: https://raw.githubusercontent.com/substratusai/model-facebook-opt-125m/main/hack/sample-data.jsonl
-    filename: fav-colors.jsonl
+    git:
+      url: https://github.com/substratusai/dataset-squad
```
### Notebook API
@@ -124,8 +129,9 @@ Notebooks can be opened using the `kubectl open notebook` command (which is a su
 apiVersion: substratus.ai/v1
 kind: Notebook
 metadata:
-  name: nick-fb-opt-125m
+  name: facebook-opt-125m
 spec:
+  suspend: true
   modelName: facebook-opt-125m
```

8 changes: 4 additions & 4 deletions api/v1/dataset_types.go
@@ -6,13 +6,12 @@ import (
 
 // DatasetSpec defines the desired state of Dataset
 type DatasetSpec struct {
-    Source DatasetSource `json:"source,omitempty"`
+    Filename string        `json:"filename"`
+    Source   DatasetSource `json:"source,omitempty"`
 }
 
 type DatasetSource struct {
-    // URL supports http and https schemes.
-    URL string `json:"url"`
-    Filename string `json:"filename"`
+    Git *GitSource `json:"git,omitempty"`
 }

 // DatasetStatus defines the observed state of Dataset
@@ -22,6 +21,7 @@ type DatasetStatus struct {
     URL string `json:"url,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
19 changes: 18 additions & 1 deletion api/v1/model_types.go
@@ -34,7 +34,23 @@ type ModelSize struct {
 }
 
 type Training struct {
-    DatasetName string `json:"datasetName"`
+    DatasetName string         `json:"datasetName"`
+    Params      TrainingParams `json:"params"`
 }
 
+type TrainingParams struct {
+    //+kubebuilder:default:=3
+    // Epochs is the total number of iterations that should be run through the training data.
+    // Increasing this number will increase training time.
+    Epochs int64 `json:"epochs,omitempty"`
+    //+kubebuilder:default:=1000000000000
+    // DataLimit is the maximum number of training records to use. In the case of JSONL, this would be the total number of lines
+    // to train with. Increasing this number will increase training time.
+    DataLimit int64 `json:"dataLimit,omitempty"`
+    //+kubebuilder:default:=1
+    // BatchSize is the number of training records to use per (forward and backward) pass through the model.
+    // Increasing this number will increase the memory requirements of the training process.
+    BatchSize int64 `json:"batchSize,omitempty"`
+}

 type ModelSource struct {
@@ -76,6 +92,7 @@ type ModelStatus struct {
     Servers []string `json:"servers,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
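
Given the kubebuilder defaults above, a Model that sets `training.params` but omits individual fields is defaulted by the API server. A minimal sketch, assuming the other required Model fields (e.g. `spec.compute`, per the CRD below) are filled in; the resource name is illustrative:

```yaml
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-defaults  # illustrative name
spec:
  training:
    datasetName: squad
    # An empty params object is valid; the API server persists the CRD defaults:
    #   epochs: 3, batchSize: 1, dataLimit: 1000000000000
    params: {}
```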
1 change: 1 addition & 0 deletions api/v1/modelserver_types.go
@@ -14,6 +14,7 @@ type ModelServerStatus struct {
     Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
1 change: 1 addition & 0 deletions api/v1/notebook_types.go
@@ -17,6 +17,7 @@ type NotebookStatus struct {
     Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
25 changes: 23 additions & 2 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

26 changes: 18 additions & 8 deletions config/crd/bases/substratus.ai_datasets.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Dataset
     listKind: DatasetList
     plural: datasets
@@ -42,17 +44,25 @@ spec:
           spec:
             description: DatasetSpec defines the desired state of Dataset
             properties:
+              filename:
+                type: string
               source:
                 properties:
-                  filename:
-                    type: string
-                  url:
-                    description: URL supports http and https schemes.
-                    type: string
-                required:
-                - filename
-                - url
+                  git:
+                    properties:
+                      branch:
+                        type: string
+                      path:
+                        description: Path within the git repository referenced
+                          by url.
+                        type: string
+                      url:
+                        description: 'URL to the git repository. Example: github.com/my-account/my-repo'
+                        type: string
+                    type: object
                 type: object
+            required:
+            - filename
             type: object
           status:
             description: DatasetStatus defines the observed state of Dataset
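
Per the schema above, `git.branch` and `git.path` are optional alongside `git.url`. A hedged sketch of a Dataset that pins its data loader to a branch and subdirectory (the branch and path values are illustrative, not taken from this commit):

```yaml
apiVersion: substratus.ai/v1
kind: Dataset
metadata:
  name: squad
spec:
  filename: all.jsonl        # now required at the spec level
  source:
    git:
      url: github.com/substratusai/dataset-squad
      branch: main           # optional; illustrative value
      path: loaders/squad    # optional; illustrative path within the repository
```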
29 changes: 29 additions & 0 deletions config/crd/bases/substratus.ai_models.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Model
     listKind: ModelList
     plural: models
@@ -87,8 +89,35 @@ spec:
               properties:
                 datasetName:
                   type: string
+                params:
+                  properties:
+                    batchSize:
+                      default: 1
+                      description: BatchSize is the number of training records to
+                        use per (forward and backward) pass through the model. Increasing
+                        this number will increase the memory requirements of the
+                        training process.
+                      format: int64
+                      type: integer
+                    dataLimit:
+                      default: 1000000000000
+                      description: DataLimit is the maximum number of training records
+                        to use. In the case of JSONL, this would be the total number
+                        of lines to train with. Increasing this number will increase
+                        training time.
+                      format: int64
+                      type: integer
+                    epochs:
+                      default: 3
+                      description: Epochs is the total number of iterations that
+                        should be run through the training data. Increasing this
+                        number will increase training time.
+                      format: int64
+                      type: integer
+                  type: object
                 required:
                 - datasetName
+                - params
                 type: object
             required:
             - compute
2 changes: 2 additions & 0 deletions config/crd/bases/substratus.ai_modelservers.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: ModelServer
     listKind: ModelServerList
     plural: modelservers
2 changes: 2 additions & 0 deletions config/crd/bases/substratus.ai_notebooks.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Notebook
     listKind: NotebookList
     plural: notebooks
2 changes: 1 addition & 1 deletion config/manager/kustomization.yaml
@@ -5,4 +5,4 @@ kind: Kustomization
 images:
 - name: controller
   newName: docker.io/substratusai/controller-manager
-  newTag: v0.1.0-alpha
+  newTag: v0.3.0-alpha
28 changes: 22 additions & 6 deletions docs/arch.md
@@ -18,14 +18,30 @@ Training is triggered by creating a `kind: Model` with `.spec.training` and `.sp
 
 The Dataset API is used to describe data that can be referenced for training Models.
 
-* Training models typically requires a large dataset. Pulling this dataset from a remote source every time you train a model is slow and unreliable. The Dataset API pulls a dataset once and stores it on fast Persistent Disks, mounted directly to training Jobs.
-* The Dataset controller pulls in a remote dataset once, and stores it, guaranteeing every model that references that dataset uses the same exact data (reproducible training results).
-* The Dataset API allows users to query datasets on the platform (`kubectl get datasets`).
+* Datasets pull in remote data sources using containerized data loaders.
+* Users can specify their own ETL logic by referencing a repository from a Dataset.
+* Users can leverage pre-built data loader integrations with various sources, including the Huggingface Hub, materializing the output of SQL queries, and scraping and downloading an entire Confluence site.
+* Training typically requires a large dataset. Pulling this dataset from a remote source every time you train a model is slow and unreliable. The Dataset API pulls a dataset once and stores it in a bucket, which is mounted in training Jobs.
+* The Dataset controller pulls in a remote dataset once, and stores it, guaranteeing that every model referencing that dataset uses the same exact data (facilitating reproducible training results).
+* The Dataset API allows users to query ready-to-use datasets (`kubectl get datasets`).
 * The Dataset API allows Kubernetes RBAC to be applied as a mechanism for controlling access to data.
-* Similar to the Model API, the Dataset API contains metadata about datasets (size of dataset, which can be used to inform training job resource requirements).
-* Dataset API provides a central place to define the auth credentials for remote dataset sources.
-* Dataset API could provide integrations with many data sources including the Huggingface Hub, materializing the output of SQL queries, scraping and downloading an entire confluence site, etc.
+
+### Possible Future Dataset Features
+
+* Scheduled recurring loads
+* Continuous loading
+* Present metadata about datasets (e.g. dataset size, which can inform training Job resource requirements).
+* Provide a central place to define auth credentials for remote dataset sources.
+* If Models have consistent, or at least declarative, expectations about training data format, the Dataset API enables a programmatic way to couple those Models to a large number of Datasets, producing a matrix of trained models (see the sketch after this section).
 
 <img src="datasets.excalidraw.png" width="70%"></img>
+
+## Notebooks
+
+Notebooks can be used to quickly spin up a development environment backed by high performance compute.
+
+* Integration with the Model and Dataset APIs allows for quick iteration.
+* Local directory syncing streamlines the developer experience.
+
+<img src="labs.excalidraw.png" width="70%"></img>
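
The matrix-of-trained-models idea in the bullets above could be expressed with the APIs from this commit as one Model per (base model, dataset) pair. A sketch; the second Dataset and the Model names are illustrative:

```yaml
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-squad
spec:
  source:
    modelName: facebook-opt-125m
  training:
    datasetName: squad
    params:
      epochs: 3
---
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-fav-colors    # illustrative second cell of the matrix
spec:
  source:
    modelName: facebook-opt-125m
  training:
    datasetName: favorite-colors  # illustrative second Dataset
    params:
      epochs: 3
```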
