Commit: Add pluggable data loaders (#59)
* Add pluggable containerized data loaders (see
https://github.com/substratusai/dataset-squad).
* Rename "data-puller" to "data-loader" - a data loader may generate data
rather than pull it.
* Add Model training parameters `.spec.training.params`.
* Fix bug with Notebook error handling when source Model does not exist.
* Fix bug with (training) Model error handling when source Model does
not exist.
* Fix bug with ModelServer error handling when source Model does not
exist.
* Add "ai" category to each CRD. Now `kubectl get ai` will show all
Models, Datasets, Notebooks, ModelServers.
* Set the default container for kubectl (logs, etc.) (see
https://kubernetes.io/docs/reference/labels-annotations-taints/#kubectl-kubernetes-io-default-container); a sketch of this annotation follows the list.
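
For context, this is the well-known `kubectl.kubernetes.io/default-container` annotation, which tells kubectl which container to target by default for `logs` and `exec` on multi-container Pods. A minimal sketch (Pod and image names here are illustrative, not taken from this commit):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-trainer  # illustrative name
  annotations:
    # kubectl logs/exec will target this container unless -c is given.
    kubectl.kubernetes.io/default-container: trainer
spec:
  containers:
  - name: trainer
    image: substratusai/trainer:example  # illustrative image
  - name: sidecar
    image: busybox                       # illustrative second container
```

With this set, `kubectl logs model-trainer` streams the trainer container's logs without needing `-c trainer`.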
Authored by nstogner on Jul 1, 2023 · 1 parent 73d3fb3 · commit 0572e07
Showing 30 changed files with 697 additions and 175 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@

 # Image URL to use all building/pushing image targets
-IMG ?= docker.io/substratusai/controller-manager:v0.1.0-alpha
+IMG ?= docker.io/substratusai/controller-manager:v0.3.0-alpha
 # ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
 ENVTEST_K8S_VERSION = 1.26.1

22 changes: 14 additions & 8 deletions README.md
@@ -69,12 +69,16 @@ The Model API is capable of building base Models from Git repositories, or finet
 apiVersion: substratus.ai/v1
 kind: Model
 metadata:
-  name: my-model
+  name: fb-opt-125m-squad
 spec:
   source:
     modelName: facebook-opt-125m
   training:
-    datasetName: favorite-colors
+    datasetName: squad
+    params:
+      epochs: 30
+      batchSize: 3
+      dataLimit: 120
   # TODO: This should be copied from the source Model.
   size:
     parameterBits: 32
@@ -92,9 +96,9 @@ The ModelServer API runs a web server that serves the Model for inference (FUTUR
 apiVersion: substratus.ai/v1
 kind: ModelServer
 metadata:
-  name: my-model-server
+  name: fb-opt-125m-squad
 spec:
-  modelName: my-model
+  modelName: fb-opt-125m-squad
```
### Dataset API
@@ -106,11 +110,12 @@ The Dataset API snapshots and locally caches remote datasets to facilitate effic
 apiVersion: substratus.ai/v1
 kind: Dataset
 metadata:
-  name: favorite-colors
+  name: squad
 spec:
+  filename: all.jsonl
   source:
-    url: https://raw.githubusercontent.com/substratusai/model-facebook-opt-125m/main/hack/sample-data.jsonl
-    filename: fav-colors.jsonl
+    git:
+      url: https://github.com/substratusai/dataset-squad
```
### Notebook API
@@ -124,8 +129,9 @@ Notebooks can be opened using the `kubectl open notebook` command (which is a su
 apiVersion: substratus.ai/v1
 kind: Notebook
 metadata:
-  name: nick-fb-opt-125m
+  name: facebook-opt-125m
 spec:
+  suspend: true
   modelName: facebook-opt-125m
```

8 changes: 4 additions & 4 deletions api/v1/dataset_types.go
@@ -6,13 +6,12 @@ import (
 
 // DatasetSpec defines the desired state of Dataset
 type DatasetSpec struct {
-    Source DatasetSource `json:"source,omitempty"`
+    Filename string        `json:"filename"`
+    Source   DatasetSource `json:"source,omitempty"`
 }
 
 type DatasetSource struct {
-    // URL supports http and https schemes.
-    URL string `json:"url"`
-    Filename string `json:"filename"`
+    Git *GitSource `json:"git,omitempty"`
 }

 // DatasetStatus defines the observed state of Dataset
@@ -22,6 +21,7 @@ type DatasetStatus struct {
     URL string `json:"url,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
19 changes: 18 additions & 1 deletion api/v1/model_types.go
@@ -34,7 +34,23 @@ type ModelSize struct {
 }
 
 type Training struct {
-    DatasetName string `json:"datasetName"`
+    DatasetName string         `json:"datasetName"`
+    Params      TrainingParams `json:"params"`
 }
 
+type TrainingParams struct {
+    //+kubebuilder:default:=3
+    // Epochs is the total number of iterations that should be run through the training data.
+    // Increasing this number will increase training time.
+    Epochs int64 `json:"epochs,omitempty"`
+    //+kubebuilder:default:=1000000000000
+    // DataLimit is the maximum number of training records to use. In the case of JSONL, this would be the total number of lines
+    // to train with. Increasing this number will increase training time.
+    DataLimit int64 `json:"dataLimit,omitempty"`
+    //+kubebuilder:default:=1
+    // BatchSize is the number of training records to use per (forward and backward) pass through the model.
+    // Increasing this number will increase the memory requirements of the training process.
+    BatchSize int64 `json:"batchSize,omitempty"`
+}

 type ModelSource struct {
@@ -76,6 +92,7 @@ type ModelStatus struct {
     Servers []string `json:"servers,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
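
Given the kubebuilder defaults above, a Model that sets `training.params` but omits individual fields is defaulted by the API server. A minimal sketch, assuming the other required Model fields (e.g. `spec.compute`, per the CRD below) are filled in; the resource name is illustrative:

```yaml
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-defaults  # illustrative name
spec:
  training:
    datasetName: squad
    # An empty params object is valid; the API server persists the CRD defaults:
    #   epochs: 3, batchSize: 1, dataLimit: 1000000000000
    params: {}
```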
1 change: 1 addition & 0 deletions api/v1/modelserver_types.go
@@ -14,6 +14,7 @@ type ModelServerStatus struct {
     Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
1 change: 1 addition & 0 deletions api/v1/notebook_types.go
@@ -17,6 +17,7 @@ type NotebookStatus struct {
     Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
+//+kubebuilder:resource:categories=ai
 //+kubebuilder:object:root=true
 //+kubebuilder:subresource:status
 //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
25 changes: 23 additions & 2 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

26 changes: 18 additions & 8 deletions config/crd/bases/substratus.ai_datasets.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Dataset
     listKind: DatasetList
     plural: datasets
@@ -42,17 +44,25 @@ spec:
           spec:
             description: DatasetSpec defines the desired state of Dataset
             properties:
+              filename:
+                type: string
               source:
                 properties:
-                  filename:
-                    type: string
-                  url:
-                    description: URL supports http and https schemes.
-                    type: string
-                required:
-                - filename
-                - url
+                  git:
+                    properties:
+                      branch:
+                        type: string
+                      path:
+                        description: Path within the git repository referenced
+                          by url.
+                        type: string
+                      url:
+                        description: 'URL to the git repository. Example: github.com/my-account/my-repo'
+                        type: string
+                    type: object
                 type: object
+            required:
+            - filename
             type: object
           status:
             description: DatasetStatus defines the observed state of Dataset
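
Per the schema above, `git.branch` and `git.path` are optional alongside `git.url`. A hedged sketch of a Dataset that pins its data loader to a branch and subdirectory (the branch and path values are illustrative, not taken from this commit):

```yaml
apiVersion: substratus.ai/v1
kind: Dataset
metadata:
  name: squad
spec:
  filename: all.jsonl        # now required at the spec level
  source:
    git:
      url: github.com/substratusai/dataset-squad
      branch: main           # optional; illustrative value
      path: loaders/squad    # optional; illustrative path within the repository
```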
29 changes: 29 additions & 0 deletions config/crd/bases/substratus.ai_models.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Model
     listKind: ModelList
     plural: models
@@ -87,8 +89,35 @@ spec:
               properties:
                 datasetName:
                   type: string
+                params:
+                  properties:
+                    batchSize:
+                      default: 1
+                      description: BatchSize is the number of training records to
+                        use per (forward and backward) pass through the model. Increasing
+                        this number will increase the memory requirements of the
+                        training process.
+                      format: int64
+                      type: integer
+                    dataLimit:
+                      default: 1000000000000
+                      description: DataLimit is the maximum number of training records
+                        to use. In the case of JSONL, this would be the total number
+                        of lines to train with. Increasing this number will increase
+                        training time.
+                      format: int64
+                      type: integer
+                    epochs:
+                      default: 3
+                      description: Epochs is the total number of iterations that
+                        should be run through the training data. Increasing this
+                        number will increase training time.
+                      format: int64
+                      type: integer
+                  type: object
                 required:
                 - datasetName
+                - params
                 type: object
             required:
             - compute
2 changes: 2 additions & 0 deletions config/crd/bases/substratus.ai_modelservers.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: ModelServer
     listKind: ModelServerList
     plural: modelservers
2 changes: 2 additions & 0 deletions config/crd/bases/substratus.ai_notebooks.yaml
@@ -9,6 +9,8 @@ metadata:
 spec:
   group: substratus.ai
   names:
+    categories:
+    - ai
     kind: Notebook
     listKind: NotebookList
     plural: notebooks
2 changes: 1 addition & 1 deletion config/manager/kustomization.yaml
@@ -5,4 +5,4 @@ kind: Kustomization
 images:
 - name: controller
   newName: docker.io/substratusai/controller-manager
-  newTag: v0.1.0-alpha
+  newTag: v0.3.0-alpha
28 changes: 22 additions & 6 deletions docs/arch.md
@@ -18,14 +18,30 @@ Training is triggered by creating a `kind: Model` with `.spec.training` and `.sp
 
 The Dataset API is used to describe data that can be referenced for training Models.
 
-* Training models typically requires a large dataset. Pulling this dataset from a remote source every time you train a model is slow and unreliable. The Dataset API pulls a dataset once and stores it on fast Persistent Disks, mounted directly to training Jobs.
-* The Dataset controller pulls in a remote dataset once, and stores it, guaranteeing every model that references that dataset uses the same exact data (reproducible training results).
-* The Dataset API allows users to query datasets on the platform (`kubectl get datasets`).
+* Datasets pull in remote data sources using containerized data loaders.
+* Users can specify their own ETL logic by referencing a repository from a Dataset.
+* Users can leverage pre-built data loader integrations with various sources, including the Huggingface Hub, materializing the output of SQL queries, and scraping and downloading an entire Confluence site.
+* Training typically requires a large dataset. Pulling this dataset from a remote source every time you train a model is slow and unreliable. The Dataset API pulls a dataset once and stores it in a bucket, which is mounted in training Jobs.
+* The Dataset controller pulls in a remote dataset once, and stores it, guaranteeing that every model referencing that dataset uses the same exact data (facilitating reproducible training results).
+* The Dataset API allows users to query ready-to-use datasets (`kubectl get datasets`).
 * The Dataset API allows Kubernetes RBAC to be applied as a mechanism for controlling access to data.
-* Similar to the Model API, the Dataset API contains metadata about datasets (size of dataset, which can be used to inform training job resource requirements).
-* Dataset API provides a central place to define the auth credentials for remote dataset sources.
-* Dataset API could provide integrations with many data sources including the Huggingface Hub, materializing the output of SQL queries, scraping and downloading an entire confluence site, etc.
+
+### Possible Future Dataset Features
+
+* Scheduled recurring loads
+* Continuous loading
+* Present metadata about datasets (e.g. dataset size, which can inform training Job resource requirements).
+* Provide a central place to define auth credentials for remote dataset sources.
+* If Models have consistent, or at least declarative, expectations about training data format, the Dataset API enables a programmatic way to couple those Models to a large number of Datasets, producing a matrix of trained models (see the sketch after this section).
 
 <img src="datasets.excalidraw.png" width="70%"></img>
+
+## Notebooks
+
+Notebooks can be used to quickly spin up a development environment backed by high performance compute.
+
+* Integration with the Model and Dataset APIs allows for quick iteration.
+* Local directory syncing streamlines the developer experience.
+
+<img src="labs.excalidraw.png" width="70%"></img>
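
The matrix-of-trained-models idea in the bullets above could be expressed with the APIs from this commit as one Model per (base model, dataset) pair. A sketch; the second Dataset and the Model names are illustrative:

```yaml
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-squad
spec:
  source:
    modelName: facebook-opt-125m
  training:
    datasetName: squad
    params:
      epochs: 3
---
apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: fb-opt-125m-fav-colors    # illustrative second cell of the matrix
spec:
  source:
    modelName: facebook-opt-125m
  training:
    datasetName: favorite-colors  # illustrative second Dataset
    params:
      epochs: 3
```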
