
The task
The job
The scheduler
The manager
The worker
The cluster
The CLI

processes commands which can be run against tasks at runtime besides [stop|]
command interface which can be implemented by each task (see the sketch below)
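A minimal Go sketch of such a per-task command interface; the `Command` name and its method set are assumptions, not settled design:

```go
package task

import "context"

// Command is a hypothetical interface for commands that can be run
// against a task at runtime (stop, restart, inspect, ...). Each task
// type can implement it to expose its own set of runtime commands.
type Command interface {
	// Name is the keyword as typed on the CLI, e.g. "stop".
	Name() string
	// Run executes the command against the task identified by taskID.
	Run(ctx context.Context, taskID string, args []string) error
}
```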

The first workload type is long-running services that should “never” go
down and that handle short-lived, latency-sensitive requests (a few ms
to a few hundred ms).

Task:

Granularity: A task is typically smaller in scope and represents a specific action or activity that contributes to achieving a larger goal or project.
Duration: Tasks are generally shorter in duration and can be completed within hours or days.
Dependencies: Tasks can have dependencies on other tasks and often form part of a sequential workflow.
Assignee: Tasks are often assigned to individuals or team members responsible for their completion.

Job:

Granularity: A job can refer to a broader unit of work, which may encompass multiple tasks or activities.
Duration: Jobs are usually longer in duration, potentially taking weeks, months, or even years to complete.
Dependencies: Jobs may involve complex dependencies but can also encompass multiple tasks or sub-jobs.
Assignee: Jobs may be assigned to teams, departments, or individuals responsible for overseeing and managing the entire job.

JOB

  • name
  • owner
  • number of tasks + replicas
  • priority
  • resources required for the job
  • time limit per task
  • command line template
  • environment variables
  • input files/directories
  • output directory
  • log file location
  • list of dependencies on other jobs or tasks. If any
    dependency fails then this job will not run at all.
  • status: pending | running | completed | failed | killed
  • start time
  • end time
  • exit code, if it has finished
  • reason why it stopped

Job descriptions are written in a declarative language
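As a sketch of what such a declarative job description could look like in Go, with the properties from the list above; field names, tags, and types here are assumptions:

```go
package job

import "time"

// Job is a hypothetical declarative job description covering the
// properties listed above.
type Job struct {
	Name        string            `json:"name"`
	Owner       string            `json:"owner"`
	TaskCount   int               `json:"taskCount"`
	Replicas    int               `json:"replicas"`
	Priority    int               `json:"priority"`
	Resources   map[string]string `json:"resources"`   // e.g. "cpu": "500m", "memory": "256Mi"
	TimeLimit   time.Duration     `json:"timeLimit"`   // per task
	CmdTemplate string            `json:"cmdTemplate"` // command line template
	Env         map[string]string `json:"env"`
	Inputs      []string          `json:"inputs"` // input files/directories
	OutputDir   string            `json:"outputDir"`
	LogFile     string            `json:"logFile"`
	DependsOn   []string          `json:"dependsOn"` // if any dependency fails, the job does not run
	Status      string            `json:"status"`    // pending | running | completed | failed | killed
	StartTime   time.Time         `json:"startTime"`
	EndTime     time.Time         `json:"endTime"`
	ExitCode    int               `json:"exitCode"`
	StopReason  string            `json:"stopReason"`
}
```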

Jobs can have constraints that force their tasks to run on machines with
particular attributes such as processor architecture, OS version, or an
external IP address.
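A hedged sketch of how such constraints might be represented and checked; `Constraint` and `Satisfies` are hypothetical names:

```go
package scheduler

// Constraint is a key/value machine-attribute constraint,
// e.g. {"arch", "amd64"} or {"os", "ubuntu-22.04"}.
type Constraint struct {
	Attribute string
	Value     string
}

// Satisfies reports whether a machine, described by its attribute map,
// meets every constraint attached to a job's tasks.
func Satisfies(machineAttrs map[string]string, constraints []Constraint) bool {
	for _, c := range constraints {
		if machineAttrs[c.Attribute] != c.Value {
			return false
		}
	}
	return true
}
```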

Each task maps to a set of Linux processes running in
a container on a machine

no VMs or virtualization

A task has properties too, such as its resource requirements and the task’s index within the job. Most task properties are the same across all tasks in a job, but can be overridden – e.g., to provide task-specific command-line flags.
Each resource dimension (CPU cores, RAM, disk space, disk access rate,
TCP ports, etc.) is specified independently at fine granularity; we
don’t impose fixed-sized buckets or slots (§5.4).
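A small sketch of resource requests with each dimension expressed independently rather than in fixed slots; the units chosen here (millicores, bytes, MB/s) are assumptions:

```go
package resources

// Resources expresses each dimension independently at fine granularity,
// rather than in fixed-size buckets or slots.
type Resources struct {
	CPUMillicores int64
	MemoryBytes   int64
	DiskBytes     int64
	DiskMBps      int64 // disk access rate
	TCPPorts      []int
}

// Fits reports whether a request fits into what a machine has free.
// Port matching is simplified to a count comparison for this sketch.
func Fits(req, free Resources) bool {
	return req.CPUMillicores <= free.CPUMillicores &&
		req.MemoryBytes <= free.MemoryBytes &&
		req.DiskBytes <= free.DiskBytes &&
		req.DiskMBps <= free.DiskMBps &&
		len(req.TCPPorts) <= len(free.TCPPorts)
}
```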

A user can change the properties of some or all of the
tasks in a running job by pushing a new job configuration
to Borg, and then instructing Borg to update the tasks to
the new specification

STATES
submit -> accept (Pending) | reject

  • Pending: update keeps it Pending; fail, kill, or lost -> Dead; schedule -> Running
  • Running: update keeps it Running; evict -> back to Pending; finish, fail, kill, or lost -> Dead
  • Dead (Completed / Failed / Killed): submit again -> Pending
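A minimal Go sketch of this state machine, using the statuses from the job notes above; the exact transition set is an assumption drawn from these notes:

```go
package task

// State mirrors the statuses used in these notes:
// pending | running | completed | failed | killed.
type State int

const (
	Pending State = iota
	Running
	Completed
	Failed
	Killed
)

// validTransitions is a sketch of the allowed state changes.
var validTransitions = map[State][]State{
	Pending:   {Running, Failed, Killed},            // schedule; or fail/kill/lost before running
	Running:   {Completed, Failed, Killed, Pending}, // finish; fail/kill/lost; evict back to Pending
	Completed: {},                                   // terminal
	Failed:    {Pending},                            // resubmit
	Killed:    {Pending},                            // resubmit
}

// ValidTransition reports whether a task may move from src to dst.
func ValidTransition(src, dst State) bool {
	for _, s := range validTransitions[src] {
		if s == dst {
			return true
		}
	}
	return false
}
```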

look into handling different workload types

  • container runtimes
  • stateful workloads
  • batch workloads
  • interactive workloads
  • streaming workloads
  • data processing workloads
  • distributed training workloads
  • serverless workloads

Task Definition

  • name of container image for task
  • amount of resources for task
  • number of replicas
  • restart policies
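A sketch of this task definition as a Go struct; field names and the example restart-policy values are assumptions:

```go
package task

// Task is a hypothetical task definition covering the fields above.
type Task struct {
	Name          string
	Image         string // container image to run
	CPUMillicores int64  // resources for the task
	MemoryBytes   int64
	DiskBytes     int64
	Replicas      int
	RestartPolicy string // e.g. "always", "on-failure", "never"
}
```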

The scheduler should perform these functions:

  1. Determine a set of candidate machines on which a task could run.
  2. Score the candidate machines from best to worst.
  3. Pick the machine with the best score

E-PVM scheduler (a scoring model used as part of Google’s Borg scheduler)
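A sketch of the three scheduler steps as a Go interface plus a naive greedy implementation (feasibility filter, score by remaining CPU, pick the highest score). The greedy scoring is only a placeholder and is not E-PVM; all names here are assumptions:

```go
package scheduler

// Scheduler captures the three steps above: select feasible candidates,
// score them, then pick the best one.
type Scheduler interface {
	SelectCandidateNodes(t Task, nodes []Node) []Node
	Score(t Task, nodes []Node) map[string]float64
	Pick(scores map[string]float64, candidates []Node) Node
}

// Node and Task are placeholder types for this sketch.
type Node struct {
	Name       string
	FreeCPU    int64
	FreeMemory int64
}

type Task struct {
	Name   string
	CPU    int64
	Memory int64
}

// Greedy treats any node with enough free resources as a candidate,
// scores by CPU left over after placement, and picks the highest score.
type Greedy struct{}

var _ Scheduler = Greedy{}

func (g Greedy) SelectCandidateNodes(t Task, nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.FreeCPU >= t.CPU && n.FreeMemory >= t.Memory {
			out = append(out, n)
		}
	}
	return out
}

func (g Greedy) Score(t Task, nodes []Node) map[string]float64 {
	scores := make(map[string]float64)
	for _, n := range nodes {
		scores[n.Name] = float64(n.FreeCPU - t.CPU)
	}
	return scores
}

// Pick assumes at least one candidate survived the feasibility filter.
func (g Greedy) Pick(scores map[string]float64, candidates []Node) Node {
	best := candidates[0]
	for _, n := range candidates[1:] {
		if scores[n.Name] > scores[best.Name] {
			best = n
		}
	}
	return best
}
```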

The manager is the brain of an orchestrator and the main entry point for
users. The manager also periodically collects metrics from each of its
workers, which are used in scheduling decisions.

The manager should do the following:

  1. Accept requests from users to start and stop tasks.
  2. Schedule tasks onto worker machines.
  3. Keep track of tasks, their states, and the machine on which they run.
  4. Persist the system's jobs to a datastore.
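A minimal sketch of the manager's bookkeeping for these responsibilities; field and method names are assumptions, and persistence to a datastore is left out:

```go
package manager

// Manager accepts task requests, schedules them onto workers, and
// tracks each task's state and placement.
type Manager struct {
	Workers   []string          // addresses of known workers
	TaskState map[string]string // task ID -> state
	TaskNode  map[string]string // task ID -> worker it was placed on
	Pending   []string          // queue of task IDs waiting to be scheduled
}

// NewManager initialises the bookkeeping maps.
func NewManager(workers []string) *Manager {
	return &Manager{
		Workers:   workers,
		TaskState: make(map[string]string),
		TaskNode:  make(map[string]string),
	}
}

// SubmitTask accepts a request from a user and queues the task.
func (m *Manager) SubmitTask(taskID string) {
	m.Pending = append(m.Pending, taskID)
	m.TaskState[taskID] = "pending"
}

// ScheduleNext pops the next pending task, assigns it to a worker
// chosen by pick (the scheduler sketched earlier), and records where
// the task is running.
func (m *Manager) ScheduleNext(pick func(taskID string, workers []string) string) {
	if len(m.Pending) == 0 {
		return
	}
	taskID := m.Pending[0]
	m.Pending = m.Pending[1:]
	worker := pick(taskID, m.Workers)
	m.TaskNode[taskID] = worker
	m.TaskState[taskID] = "running" // in a real system, set only after the worker confirms
}
```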

The worker provides the muscles of an orchestrator.

(Tainting Workers)
The worker is responsible for the following:

  1. Running tasks as Docker containers.
  2. Accepting tasks to run from a manager.
  3. Providing relevant statistics to the manager for the purpose of
    scheduling tasks.
  4. Keeping track of its tasks and their state.
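A sketch of the worker side, with the container runtime hidden behind an interface so these notes don't commit to a specific Docker client API; all names are assumptions:

```go
package worker

// ContainerRuntime abstracts however containers are actually started
// (e.g. via the Docker API); the interface is an assumption.
type ContainerRuntime interface {
	Run(image string, cmd []string) (containerID string, err error)
	Stop(containerID string) error
}

type Task struct {
	Name  string
	Image string
	Cmd   []string
}

// Worker accepts tasks from the manager, runs them as containers, and
// tracks their state so it can be reported back for scheduling.
type Worker struct {
	Runtime    ContainerRuntime
	Queue      []Task            // tasks accepted from the manager
	TaskState  map[string]string // task name -> state
	Containers map[string]string // task name -> container ID
}

// NewWorker initialises the bookkeeping maps.
func NewWorker(rt ContainerRuntime) *Worker {
	return &Worker{
		Runtime:    rt,
		TaskState:  make(map[string]string),
		Containers: make(map[string]string),
	}
}

// RunNext starts the next queued task as a container and records it.
func (w *Worker) RunNext() error {
	if len(w.Queue) == 0 {
		return nil
	}
	t := w.Queue[0]
	w.Queue = w.Queue[1:]
	id, err := w.Runtime.Run(t.Image, t.Cmd)
	if err != nil {
		w.TaskState[t.Name] = "failed"
		return err
	}
	w.Containers[t.Name] = id
	w.TaskState[t.Name] = "running"
	return nil
}
```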

Finally, our CLI, the main user interface, should allow a user to:

  1. Start and stop tasks
  2. Get the status of tasks
  3. See the state of machines (i.e. the workers)
  4. Add workers
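A standard-library-only sketch of such a CLI; the binary name `orch` and the subcommand names are assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// A minimal CLI covering the four actions above. Each case would call
// the manager's API in a real implementation.
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: orch [start|stop|status|node] ...")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "start":
		fmt.Println("start task:", os.Args[2:]) // submit a task to the manager
	case "stop":
		fmt.Println("stop task:", os.Args[2:]) // ask the manager to stop a task
	case "status":
		fmt.Println("task status") // query the manager for task states
	case "node":
		fmt.Println("worker nodes") // list or add workers
	default:
		fmt.Println("unknown command:", os.Args[1])
		os.Exit(1)
	}
}
```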

servers - leader & followers
  • replication across servers

workers - clients
  • use RPC to communicate with servers
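A minimal sketch of the worker-to-server RPC using Go's net/rpc; the `Cluster`/`Join` names and the port are assumptions, and leader election plus replication to followers are not shown:

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/rpc"
	"sync"
)

type JoinArgs struct{ Addr string }
type JoinReply struct{ Accepted bool }

// Cluster exposes RPC methods that workers (clients) call on the
// leader; followers would replicate this state from the leader.
type Cluster struct {
	mu      sync.Mutex
	workers []string
}

// Join registers a worker's address with the leader.
func (c *Cluster) Join(args *JoinArgs, reply *JoinReply) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.workers = append(c.workers, args.Addr)
	reply.Accepted = true
	return nil
}

func main() {
	if err := rpc.Register(new(Cluster)); err != nil {
		log.Fatal(err)
	}
	rpc.HandleHTTP()
	l, err := net.Listen("tcp", ":7777")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(l, nil))
}
```

A worker (client) would then connect with `rpc.DialHTTP("tcp", "leader:7777")` and call `client.Call("Cluster.Join", &JoinArgs{Addr: myAddr}, &reply)`.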

Operators should define how Tasks are executed => BashOperator, PythonOperator, GoOperator, DockerOperator

DataSources define data/params for the task -> move data between Tasks
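A sketch of a shared interface behind those operators, plus a DataSource for moving data between tasks; both interfaces are assumptions:

```go
package operator

import "context"

// Operator is a hypothetical interface behind BashOperator, GoOperator,
// DockerOperator, etc.: each one knows how to execute a task.
type Operator interface {
	Execute(ctx context.Context, params map[string]string) error
}

// DataSource is a hypothetical interface for passing data/params
// between tasks: one task's output becomes another task's input.
type DataSource interface {
	Read(key string) ([]byte, error)
	Write(key string, value []byte) error
}
```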

=> Components of Master Node - Control Plane

  • In-memory or persistent key/value storage - etcd
  • scheduler
  • api-server
  • controller manager
  • Plugin Controller - connect to trigger external workloads

=> Components of Worker Nodes

  • proxy - for network routing
  • Task Operators

Task Agent

TaskWeaver: Orchestrates and weaves tasks as jobs across the cluster, managing their distribution, dependencies and execution.

NodeLoom: Manages the nodes within the cluster, weaving them into the fabric of the system and overseeing their availability and health.

ThreadScheduler: Coordinates the scheduling and weaving of individual tasks onto available nodes, akin to arranging threads in a fabric.

TaskSpinner: Executes and spins tasks on the nodes, overseeing their lifecycle as they're woven into completion.

FabricRegistry: Acts as the central registry or loom holding information about available tasks, their dependencies, and weaving status.

ClusterForge: Shapes and forms the distributed cluster, ensuring its solidity and resilience in the system's architecture. Monitors the overall cluster's fabric, weaving together information about node health, performance, and resource utilization.

EventSeamstress: Seamlessly handles event notifications and communication between the fabric's components, triggering actions based on task updates and system events.

MetricSilkweaver: Weaves together the threads of performance metrics and data from tasks and nodes, providing a comprehensive view of the fabric's health.

AuthWeft: Manages the authentication and authorization threads within the fabric, ensuring secure access and interactions between its components.

LogQuilter: Weaves and organizes the threads of logs generated by tasks and system components, providing a cohesive fabric of data for analysis and auditing.

SchedulerMaestro: Directs the scheduling and orchestration of tasks and jobs across the distributed system.

ManagerChoreographer: Coordinates and orchestrates various managerial tasks, ensuring efficient management of resources and components.

client CLI: Acts as the conductor, orchestrating interactions and commands via the command-line interface, providing a unified way to communicate with the system.