- Glossary
- Access Scopes
- Ansible
- BigQuery
- Compute Engine
- Federated Cluster
- Firewall Rules
- GCP
- GCP Marketplace
- IAM
- IAM Roles
- Instance Template
- Multi-Cluster
- MUNGE
- OS Login
- Packer
- PackerUser
- Packer Project
- Preemptible VM
- Private Google Access
- Python
- Pip
- GCP Quota
- Service Account
- Secret Manager
- Self Link
- Slurm
- Spot VM
- Terraform
- TerraformUser
- Terraform Project
- Terraform Registry
- TPU
- VM
https://cloud.google.com/compute/docs/access/service-accounts#accesscopesiam
Access scopes are the legacy method of specifying authorization for your instance. They define the default OAuth scopes used in requests from the gcloud CLI or the client libraries.
Access scopes apply on a per-instance basis. You set access scopes when creating an instance, and the access scopes persist only for the life of the instance.
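As an illustration, access scopes can be set when a VM is defined in Terraform. The sketch below uses placeholder project, service account, and instance names and requests the broad `cloud-platform` scope, leaving fine-grained control to IAM roles.

```hcl
# Minimal sketch: setting access scopes on an instance at creation time.
# Names and the service account email are placeholders.
resource "google_compute_instance" "scoped_vm" {
  name         = "scoped-vm"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
  }

  service_account {
    email  = "my-sa@my-project.iam.gserviceaccount.com"
    scopes = ["cloud-platform"] # effective access is then limited by the service account's IAM roles
  }
}
```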
Red Hat® Ansible® Automation Platform is a foundation for building and operating automation across an organization. The platform includes all the tools needed to implement enterprise-wide automation.
https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html
https://cloud.google.com/bigquery
Serverless, highly scalable, and cost-effective multicloud data warehouse designed for business agility.
https://cloud.google.com/compute
Secure and customizable compute service that lets you create and run virtual machines on Google’s infrastructure.
https://slurm.schedmd.com/federation.html
Slurm includes support for creating a federation of clusters and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a federation receive a job ID that is unique among all clusters in the federation. A job is submitted to the local cluster (the cluster defined in the slurm.conf) and is then replicated across the clusters in the federation. Each cluster then independently attempts to schedule the job based on its own scheduling policies. The clusters coordinate with the "origin" cluster (the cluster the job was submitted to) to schedule the job.
https://cloud.google.com/vpc/docs/firewalls
VPC firewall rules let you allow or deny connections to or from your virtual machine (VM) instances based on a configuration that you specify. Enabled VPC firewall rules are always enforced, protecting your instances regardless of their configuration and operating system, even if they have not started up.
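For example, a VPC firewall rule allowing inbound SSH to tagged instances might be declared in Terraform as sketched below (network name, tag, and source range are illustrative):

```hcl
# Sketch: allow TCP/22 from anywhere to instances tagged "ssh-enabled".
resource "google_compute_firewall" "allow_ssh" {
  name    = "allow-ssh"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["ssh-enabled"]
}
```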
Google Cloud Platform (GCP).
https://console.cloud.google.com/marketplace
Marketplace lets you quickly deploy software on Google Cloud Platform.
https://cloud.google.com/iam/docs/overview
IAM lets you grant granular access to specific Google Cloud resources and helps prevent access to other resources. IAM lets you adopt the security principle of least privilege, which states that nobody should have more permissions than they actually need.
https://cloud.google.com/iam/docs/understanding-roles
A role contains a set of permissions that allows you to perform specific actions on Google Cloud resources. To make permissions available to principals, including users, groups, and service accounts, you grant roles to the principals.
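For instance, granting a predefined role to a service account at the project level can be expressed in Terraform roughly as below (the project ID and member are placeholders):

```hcl
# Sketch: grant the Compute Admin role to a service account on a project.
resource "google_project_iam_member" "compute_admin" {
  project = "my-project"
  role    = "roles/compute.admin"
  member  = "serviceAccount:deployer@my-project.iam.gserviceaccount.com"
}
```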
https://cloud.google.com/compute/docs/instance-templates
Instance templates define the machine type, boot disk image or container image, labels, startup script, and other instance properties. You can then use an instance template to create a MIG or to create individual VMs. Instance templates are a convenient way to save a VM instance's configuration so you can use it later to create VMs or groups of VMs.
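A minimal Terraform sketch of an instance template (machine type, image, and labels are illustrative) might look like:

```hcl
# Sketch: an instance template that a MIG or individual VMs can be created from.
resource "google_compute_instance_template" "compute_node" {
  name_prefix  = "compute-node-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-11"
    boot         = true
  }

  network_interface {
    network = "default"
  }

  labels = {
    environment = "dev"
  }
}
```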
https://slurm.schedmd.com/multi_cluster.html
A cluster comprises all the nodes managed by a single slurmctld daemon. Slurm offers the ability to target commands to other clusters instead of, or in addition to, the local cluster on which the command is invoked. When this behavior is enabled, users can submit jobs to one or many clusters and receive status from those remote clusters.
https://github.com/dun/munge/wiki/Man-7-munge
MUNGE (MUNGE Uid 'N' Gid Emporium) is an authentication service for creating and validating user credentials. It is designed to be highly scalable for use in an HPC cluster environment. It provides a portable API for encoding the user's identity into a tamper-proof credential that can be obtained by an untrusted client and forwarded by untrusted intermediaries within a security realm. Clients within this realm can create and validate credentials without the use of root privileges, reserved ports, or platform-specific methods.
https://cloud.google.com/compute/docs/oslogin
Use OS Login to manage SSH access to your instances using IAM without having to create and manage individual SSH keys. OS Login maintains a consistent Linux user identity across VM instances and is the recommended way to manage many users across multiple instances or projects.
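OS Login is typically turned on through instance or project metadata; a hedged Terraform sketch for project-wide enablement:

```hcl
# Sketch: enable OS Login for all VMs in the project via project metadata.
resource "google_compute_project_metadata_item" "oslogin" {
  key   = "enable-oslogin"
  value = "TRUE"
}
```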
Create identical machine images for multiple platforms from a single source configuration.
https://www.packer.io/downloads.html
The PackerUser is the user who invokes the `packer` command. This user must have correct permissions and GCP IAM roles to create, delete, and modify resources as defined in the Packer project.
A Packer project is any directory that contains a set of Packer files (`*.pkr.hcl`) which define providers, resources, and data.
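A minimal `*.pkr.hcl` sketch for building a Compute Engine image (project ID, zone, image family, and names are placeholders) could look like:

```hcl
# Sketch: a Packer project that builds a custom image from a Debian base.
packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = ">= 1.0.0"
    }
  }
}

source "googlecompute" "base" {
  project_id          = "my-project"
  source_image_family = "debian-11"
  zone                = "us-central1-a"
  image_name          = "example-image"
  ssh_username        = "packer"
}

build {
  sources = ["source.googlecompute.base"]
}
```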
https://cloud.google.com/compute/docs/instances/preemptible
Preemptible VM instances are available at a much lower price—a 60-91% discount—compared to the price of standard VMs. However, Compute Engine might stop (preempt) these instances if it needs to reclaim the compute capacity for allocation to other VMs. Preemptible instances use excess Compute Engine capacity, so their availability varies with usage.
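In Terraform, a preemptible instance is requested through its `scheduling` block; the sketch below uses placeholder names and a Debian image:

```hcl
# Sketch: a preemptible VM (legacy option; see also Spot VMs).
resource "google_compute_instance" "preemptible_vm" {
  name         = "preemptible-vm"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  scheduling {
    preemptible       = true
    automatic_restart = false # preemptible VMs cannot restart automatically
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
  }
}
```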
https://cloud.google.com/vpc/docs/configure-private-google-access
VM instances that have only internal IP addresses (no external IP addresses) can use Private Google Access to reach the external IP addresses of Google APIs and services. Private Google Access also allows access to the external IP addresses used by App Engine, including third-party App Engine-based services.
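Private Google Access is enabled per subnetwork; a hedged Terraform sketch (network, region, and CIDR range are illustrative):

```hcl
# Sketch: a subnetwork whose internal-only VMs can still reach Google APIs.
resource "google_compute_network" "private_net" {
  name                    = "private-net"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "private_subnet" {
  name                     = "private-subnet"
  ip_cidr_range            = "10.10.0.0/24"
  region                   = "us-central1"
  network                  = google_compute_network.private_net.id
  private_ip_google_access = true
}
```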
https://docs.python.org/3/faq/general.html
Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. It supports multiple programming paradigms beyond object-oriented programming, such as procedural and functional programming. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants including Linux and macOS, and on Windows.
https://en.wikipedia.org/wiki/Pip_(package_manager)
pip is a package-management system written in Python used to install and manage software packages. It connects to an online repository of public packages, called the Python Package Index. pip can also be configured to connect to other package repositories (local or remote), provided that they comply with Python Enhancement Proposal 503.
https://cloud.google.com/docs/quota
Google Cloud uses quotas to restrict how much of a particular shared Google Cloud resource you can use. Each quota represents a specific countable resource, such as API calls to a particular service, the number of load balancers used concurrently by your project, or the number of projects that you can create.
There are two categories for quotas:
- Rate quotas are typically used for limiting the number of requests you can make to an API or service. Rate quotas reset after a time interval that is specific to the service—for example, the number of API requests per day.
- Allocation quotas are used to restrict the use of resources that don't have a rate of usage, such as the number of VMs used by your project at a given time. Allocation quotas don't reset over time; instead, you must explicitly release the resource when you no longer want to use it—for example, by deleting a GKE cluster.
https://cloud.google.com/iam/docs/service-accounts
A service account is a special kind of account used by an application or compute workload, such as a Compute Engine virtual machine (VM) instance, rather than a person. Applications use service accounts to make authorized API calls, authorized as either the service account itself, or as Google Workspace or Cloud Identity users through domain-wide delegation.
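A service account and a role grant for it can be declared together in Terraform; the sketch below uses placeholder project and account IDs:

```hcl
# Sketch: a service account for a compute workload, plus a log-writing role.
resource "google_service_account" "workload" {
  account_id   = "compute-workload"
  display_name = "Compute workload service account"
}

resource "google_project_iam_member" "workload_logging" {
  project = "my-project"
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.workload.email}"
}
```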
https://cloud.google.com/secret-manager
Secret Manager is a secure and convenient storage system for API keys, passwords, certificates, and other sensitive data. Secret Manager provides a central place and single source of truth to manage, access, and audit secrets across Google Cloud.
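A secret and an initial version might be managed in Terraform as sketched below (the secret ID and value are placeholders; the `replication` syntax assumes a recent google provider):

```hcl
# Sketch: create a secret and store one version of its value.
resource "google_secret_manager_secret" "example" {
  secret_id = "example-secret"

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "example_v1" {
  secret      = google_secret_manager_secret.example.id
  secret_data = "change-me" # placeholder; real values should not be hard-coded
}
```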
The URI to a resource within GCP (e.g. `projects/my-project/zones/us-central1/instances/my-instance`).
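In Terraform, most Google resources expose this URI as a `self_link` attribute that other resources and outputs can reference, for example:

```hcl
# Sketch: expose a network's self link so other configurations can reference it.
resource "google_compute_network" "example" {
  name = "example-network"
}

output "network_self_link" {
  # e.g. "https://www.googleapis.com/compute/v1/projects/my-project/global/networks/example-network"
  value = google_compute_network.example.self_link
}
```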
https://slurm.schedmd.com/overview.html
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
https://slurm.schedmd.com/slurmctld.html
slurmctld is the central management daemon of Slurm. It monitors all other Slurm daemons and resources, accepts work (jobs), and allocates resources to those jobs. Given the critical functionality of slurmctld, there may be a backup server to assume these functions in the event that the primary server fails.
https://slurm.schedmd.com/slurmdbd.html
slurmdbd provides a secure enterprise-wide interface to a database for Slurm. This is particularly useful for archiving accounting records.
https://slurm.schedmd.com/slurmrestd.html
slurmrestd is the REST API interface for Slurm.
https://slurm.schedmd.com/slurmd.html
slurmd is the compute node daemon for Slurm.
https://slurm.schedmd.com/slurmstepd.html
slurmstepd is a job step manager for Slurm. It is spawned by the slurmd daemon when a job step is launched and terminates when the job step does. It is responsible for managing input and output (stdin, stdout and stderr) for the job step along with its accounting and signal processing. slurmstepd should not be initiated by users or system administrators.
https://cloud.google.com/compute/docs/instances/spot
Spot VMs are available at a much lower price—a 60-91% discount—compared to the price of standard VMs. However, Compute Engine might preempt Spot VMs if it needs to reclaim those resources for other tasks. When it preempts a Spot VM, Compute Engine either stops (default) or deletes it, depending on the termination action you specify for each VM. Spot VMs are excess Compute Engine capacity, so their availability varies with usage. Spot VMs do not have a minimum or maximum runtime.
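A Spot VM differs from the legacy preemptible VM mainly in its `provisioning_model` and configurable termination action; a hedged Terraform sketch with placeholder names:

```hcl
# Sketch: a Spot VM that is stopped (not deleted) when preempted.
resource "google_compute_instance" "spot_vm" {
  name         = "spot-vm"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  scheduling {
    provisioning_model          = "SPOT"
    preemptible                 = true
    automatic_restart           = false
    instance_termination_action = "STOP"
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
  }
}
```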
Terraform is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.
The TerraformUser is the user who invokes the `terraform` command. This user must have correct permissions and GCP IAM roles to create, delete, and modify resources as defined in the Terraform project.
A Terraform project is any directory that contains a set of Terraform files (`*.tf`) which define providers, resources, and data.
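A minimal Terraform project might contain a single `main.tf` along the lines of the sketch below (project ID, region, and resource names are placeholders):

```hcl
# Sketch: a minimal *.tf file declaring a provider and one resource.
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

resource "google_compute_network" "example" {
  name = "example-network"
}
```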
Discover Terraform providers that power all of Terraform’s resource types, or find modules for quickly deploying common infrastructure configurations.
TPUs, or Tensor Processing Units, are machine learning accelerators optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition. TPUs are specialized hardware that accelerate ML workloads at scale. Features such as the matrix multiply unit (MXU) optimize large matrix operations, while optical circuit switch (OCS) technology enables high-bandwidth memory (HBM). TPUs can be connected into groups called Pods that scale up and accelerate ML training and inference. Cloud TPUs offer high performance, ease of development, and cost efficiency for ML development and production.
https://cloud.google.com/learn/what-is-a-virtual-machine
A virtual machine (VM) is a digital version of a physical computer. Virtual machine software can run programs and operating systems, store data, connect to networks, and do other computing functions, and requires maintenance such as updates and system monitoring. Multiple VMs can be hosted on a single physical machine, often a server, and then managed using virtual machine software. This provides flexibility for compute resources (compute, storage, network) to be distributed among VMs as needed, increasing overall efficiency. This architecture provides the basic building blocks for the advanced virtualized resources we use today, including cloud computing.