Linting and precommit #45

Open · wants to merge 5 commits into base: master
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.80.0
    hooks:
      - id: terraform_fmt
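The pinned `rev` values above will drift behind upstream releases over time; pre-commit can refresh them for you. A minimal sketch, assuming pre-commit is already installed:

```sh
# Bump each hook's `rev` in .pre-commit-config.yaml to the latest tagged release.
pre-commit autoupdate
```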
8 changes: 6 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
[![](https://img.shields.io/badge/[email protected]?logo=slack )](http://slack.outerbounds.co/)

# ⚒️ Metaflow Admin Tools

This repository contains various configuration files, tools and utilities for operating [Metaflow](https://github.com/Netflix/metaflow) in production. See [Metaflow documentation](https://docs.metaflow.org) for more information about Metaflow architecture. Top level folders are structured as follows:
@@ -30,5 +30,9 @@ are used to drive end-to-end CI coverage internally at Outerbounds. They live u
## Utility scripts (/scripts)
Scripts that make life easier when deploying or using your new Metaflow stacks.

## Precommit Hooks

We use [pre-commit](https://pre-commit.com/#install) to do some basic linting and other checks on this repo. You can see all linting steps in [the precommit config](.pre-commit-config.yaml).
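A minimal sketch of getting the hooks running locally, assuming a working Python/pip environment:

```sh
pip install pre-commit      # one-time install
pre-commit install          # register the git hook in this clone
pre-commit run --all-files  # check every file, not just staged changes
```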

# Questions?
Talk to us on [Slack](http://slack.outerbounds.co/).
4 changes: 2 additions & 2 deletions aws/cloudformation/README.md
@@ -56,12 +56,12 @@ Please note: This section can be ignored if `EnableUI` is set to false (this is

This template deploys the UI with authentication using Amazon Cognito. For Cognito to work, you'll need to provide a DNS name and SSL certificate from AWS ACM. That means you'll need a few additional steps if using the UI:

1. Figure out what DNS name to use, one that you have control of. You can either register a new domain name, or create a subdomain.
2. Generate and verify an SSL certificate valid for that name using AWS ACM. Follow [the instructions from AWS](https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html) for this.
3. Deploy this CloudFormation template. You'll need to set `EnableUI` to "true", and in addition to this:
   - set `PublicDomainName` to the domain name you chose
   - set `CertificateArn` to the certificate ARN from step 2 above
4. After the CloudFormation template is deployed, make note of the `LoadBalancerUIDNSName` output value. You'll need to modify DNS settings to point your domain name to that name (a CLI sketch follows this list).
   * If you're using Route53, create an A record that is an Alias and choose the load balancer from the drop-down.
   * If using a different DNS management tool/registrar, create a CNAME record that points to `LoadBalancerUIDNSName`
5. After DNS changes propagate, you should be able to navigate to the DNS name in your browser and see a login prompt. To create a user, go to AWS Console -> Cognito -> User Pools, find the pool that corresponds to this stack and create a new user under "Users and Groups".
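If you manage the zone with Route53 and prefer the CLI, the alias record from step 4 can be created with something like the sketch below. Every value shown is a hypothetical placeholder; note that `AliasTarget.HostedZoneId` is the load balancer's canonical hosted-zone ID (visible via `aws elbv2 describe-load-balancers`), not your domain's zone ID:

```sh
# All IDs and names below are placeholders -- substitute your own.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZEXAMPLEDOMAIN \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "metaflow-ui.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "ZEXAMPLEALB",
          "DNSName": "my-loadbalancer-ui-dns-name.us-west-2.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
```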
4 changes: 2 additions & 2 deletions aws/terraform/README.md
@@ -1,6 +1,6 @@
# Complete Metaflow Terraform Example

This directory contains a set of Terraform configuration files for deploying a complete, end-to-end set of resources for running Metaflow on AWS using Terraform modules from [terraform-aws-metaflow](https://github.com/outerbounds/terraform-aws-metaflow).

This repo only contains configuration for non-Metaflow-specific resources, such as AWS VPC infra and a SageMaker notebook instance; Metaflow-specific parts are provided by reusable modules from [terraform-aws-metaflow](https://github.com/outerbounds/terraform-aws-metaflow).

@@ -40,7 +40,7 @@ terraform apply --var-file prod.tfvars

### Metaflow stack

The metaflow sub-project uses modules from [terraform-aws-metaflow](https://github.com/outerbounds/terraform-aws-metaflow) to provision the Metaflow service, AWS Step Functions, and AWS Batch resources.

Copy `example.tfvars` to `prod.tfvars` (or whatever environment name you prefer) and update the `env` name and the `region` as needed. These variables are used to construct unique names for infrastructure resources.
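Concretely, the workflow looks something like this sketch (`prod` is only an example environment name):

```sh
cp example.tfvars prod.tfvars
# edit prod.tfvars: set the env name and region for your deployment
terraform init
terraform apply --var-file prod.tfvars
```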

4 changes: 2 additions & 2 deletions aws/terraform/infra/example.tfvars
@@ -1,2 +1,2 @@
env = "prod"
aws_region = "us-west-2"
3 changes: 1 addition & 2 deletions azure/README.md
@@ -1,8 +1,7 @@
# Metaflow-on-Azure - infrastructure code samples

This folder houses resources to help you get started with running
Metaflow on Azure.

## terraform/
These sample templates bring up a minimal Metaflow services stack on Azure.

28 changes: 14 additions & 14 deletions azure/terraform/README.md
@@ -42,7 +42,7 @@ Next, apply the `infra` module (creates Azure cloud resources only).

If you do not create Azure PostgreSQL Flexible Server instances often, the Azure API may be flaky initially:

| Error: waiting for creation of the Postgresql Flexible Server "metaflow-database-server-xyz" (Resource Group "rg-db-metaflow-xyz"):
| Code="InternalServerError" Message="An unexpected error occured while processing the request. Tracking ID: 'xyz'"
|
| with module.infra.azurerm_postgresql_flexible_server.metaflow_database_server,
@@ -57,22 +57,22 @@ on real-time availability of such instances in your region or availability zone,

**VM Availability** issues might look something like this:

| Error: waiting for creation of Node Pool: (Agent Pool Name "taskworkers" / Managed Cluster Name "metaflow-kubernetes-xyz" /
| Resource Group "rg-k8s-metaflow-xyz"): Code="ReconcileVMSSAgentPoolFailed" Message="Code=\"AllocationFailed\" Message=\"Allocation failed.
| We do not have sufficient capacity for the requested VM size in this region. Read more about improving likelihood of allocation success
| at http://aka.ms/allocation-guidance\""

**VM quotas** may also cause provisioning to fail:

| Error: creating Node Pool: (Agent Pool Name "taskworkers" / Managed Cluster Name "metaflow-kubernetes-default" / Resource Group "rg-k8s-metaflow-default"):
| containerservice.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="PreconditionFailed"
| Message="Provisioning of resource(s) for Agent Pool taskworkers failed. Error: {\n \"code\": \"InvalidTemplateDeployment\",\n
| \"message\": \"The template deployment '8b1a99f1-e35e-44be-a8ac-0f82009b7149' is not valid according to the validation procedure.
| The tracking id is 'xyz'. See inner errors for details.\",\n \"details\":
| [\n {\n \"code\": \"QuotaExceeded\",\n \"message\": \"Operation could not be completed as it results in exceeding approved standardDv5Family Cores quota.
| Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 0, Current Usage: 0,
| Additional Required: 4, (Minimum) New Limit Required: 4.
| Submit a request for Quota increase at https://<AZURE_LINK> by specifying parameters listed in the ‘Details’ section for deployment to succeed.
| Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n }\n ]\n }"
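When these errors appear, it can help to check capacity and quota up front. A sketch using standard Azure CLI commands (the region and VM size are examples):

```sh
# Current vCPU usage vs. approved quota, per VM family, in the target region
az vm list-usage --location westeurope --output table

# Available SKUs (and any restrictions) for a candidate VM size
az vm list-skus --location westeurope --size Standard_D4 --output table
```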

Then, apply the `services` module (deploys Metaflow services to AKS).
@@ -121,4 +121,4 @@ Some reasons include:
a single copy of tfstate.
* You wish to mitigate the risk of data-loss on your local disk.

For more details, see [Terraform docs](https://www.terraform.io/language/settings/backends/configuration).
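As one illustration, an `azurerm` backend can be wired in at `terraform init` time via partial configuration. This sketch assumes an empty `backend "azurerm" {}` block has been added to the `terraform {}` settings and that the storage account already exists; all names are hypothetical:

```sh
# Keep tfstate in an Azure Storage container instead of on local disk.
terraform init \
  -backend-config="resource_group_name=rg-tfstate" \
  -backend-config="storage_account_name=mytfstateaccount" \
  -backend-config="container_name=tfstate" \
  -backend-config="key=metaflow.terraform.tfstate"
```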
10 changes: 5 additions & 5 deletions azure/terraform/infra/airflow.tf
@@ -1,22 +1,22 @@
# An Airflow-related blob-storage container is required as part of the deployment when deploying Airflow

# This is done because Airflow doesn't allow any way of configuring the container name in the Azure blob store and assumes `airflow-logs` as the container name where Airflow writes its logs (Airflow 2.3.3 with its associated helm chart).

# Hence `airflow_container` is not declared at the top level; we set it in the locals here.
locals {
airflow_container = "airflow-logs"
}

resource "azurerm_storage_container" "airflow_logs_container" {
name = local.airflow_container
storage_account_name = azurerm_storage_account.metaflow_storage_account.name
container_access_type = "private"
count = var.deploy_airflow ? 1 : 0
}

resource "azurerm_role_assignment" "airflow_storage_role_permissions" {
scope = azurerm_storage_container.airflow_logs_container[0].resource_manager_id
role_definition_name = "Storage Blob Data Contributor"
principal_id = azuread_service_principal.service_principal.id
count = var.deploy_airflow ? 1 : 0
}
6 changes: 3 additions & 3 deletions azure/terraform/infra/credentials.tf
@@ -13,13 +13,13 @@ resource "azuread_application" "service_principal_application" {

resource "azuread_service_principal" "service_principal" {
application_id = azuread_application.service_principal_application.application_id
owners = [data.azuread_client_config.current.object_id]
}

# This will be used as the AZURE_CLIENT_SECRET in Metaflow's AKS workloads
resource "azuread_service_principal_password" "service_principal_password" {
service_principal_id = azuread_service_principal.service_principal.id
display_name = azuread_service_principal.service_principal.display_name
}

# Allow the new service principal to access the storage container
@@ -41,4 +41,4 @@ resource "azurerm_role_assignment" "aks_user_role_assignment" {
scope = azurerm_kubernetes_cluster.metaflow_kubernetes.id
role_definition_name = "Azure Kubernetes Service Cluster User Role"
principal_id = azuread_service_principal.service_principal.id
}
6 changes: 3 additions & 3 deletions azure/terraform/infra/database.tf
@@ -7,12 +7,12 @@ resource "azurerm_private_dns_zone_virtual_network_link" "metaflow_database_priv
name = "metaflowDatabaseVnetZone.com"
private_dns_zone_name = azurerm_private_dns_zone.metaflow_database_private_dns_zone.name
virtual_network_id = azurerm_virtual_network.metaflow_virtual_network.id
resource_group_name = azurerm_resource_group.metaflow_resource_group.name
}

resource "azurerm_postgresql_flexible_server" "metaflow_database_server" {
name = var.database_server_name
resource_group_name = azurerm_resource_group.metaflow_resource_group.name
location = azurerm_resource_group.metaflow_resource_group.location
version = "12"
delegated_subnet_id = azurerm_subnet.metaflow_database_subnet.id
@@ -44,4 +44,4 @@ resource "azurerm_postgresql_flexible_server_database" "metaflow_database_server
server_id = azurerm_postgresql_flexible_server.metaflow_database_server.id
collation = "en_US.utf8"
charset = "utf8"
}
22 changes: 11 additions & 11 deletions azure/terraform/infra/kubernetes.tf
@@ -4,13 +4,13 @@ resource "azurerm_kubernetes_cluster" "metaflow_kubernetes" {
resource_group_name = azurerm_resource_group.metaflow_resource_group.name

default_node_pool {
name = "default"
node_count = 1
vm_size = "Standard_D2_v2"
enable_auto_scaling = true
min_count = 1
max_count = 10
vnet_subnet_id = azurerm_subnet.metaflow_kubernetes_subnet.id
}
lifecycle {
ignore_changes = [default_node_pool.0.node_count]
@@ -36,10 +36,10 @@ resource "azurerm_kubernetes_cluster_node_pool" "metaflow_kubernetes_compute_nod
kubernetes_cluster_id = azurerm_kubernetes_cluster.metaflow_kubernetes.id
vm_size = "Standard_D4_v5"
node_count = 1
enable_auto_scaling = true
vnet_subnet_id = azurerm_subnet.metaflow_kubernetes_subnet.id
min_count = 1
max_count = 50

lifecycle {
ignore_changes = [node_count]
@@ -55,4 +55,4 @@ output "kube_config" {
value = azurerm_kubernetes_cluster.metaflow_kubernetes.kube_config_raw

sensitive = true
}
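Because `kube_config` is marked sensitive, Terraform redacts it in normal output. A small sketch for feeding it to kubectl, run from the directory that defines this output (the file name is arbitrary):

```sh
# Write the raw kubeconfig out of Terraform state, then point kubectl at it.
terraform output -raw kube_config > aks.kubeconfig
KUBECONFIG=$PWD/aks.kubeconfig kubectl get nodes
```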
4 changes: 2 additions & 2 deletions azure/terraform/infra/main.tf
@@ -1,11 +1,11 @@
terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "3.14.0"
}
azuread = {
source = "hashicorp/azuread"
version = "2.26.1"
}
}
2 changes: 1 addition & 1 deletion azure/terraform/infra/outputs.tf
@@ -8,4 +8,4 @@ output "service_principal_client_id" {

output "service_principal_client_secret" {
value = azuread_service_principal_password.service_principal_password.value
}
1 change: 0 additions & 1 deletion azure/terraform/infra/storage.tf
@@ -15,4 +15,3 @@ resource "azurerm_storage_container" "metaflow_storage_container" {
storage_account_name = azurerm_storage_account.metaflow_storage_account.name
container_access_type = "private"
}
2 changes: 1 addition & 1 deletion azure/terraform/infra/variables.tf
@@ -49,4 +49,4 @@ variable "k8s_subnet_name" {

variable "deploy_airflow" {
type = bool
}
8 changes: 4 additions & 4 deletions azure/terraform/infra/virtual_network.tf
@@ -11,7 +11,7 @@ resource "azurerm_virtual_network" "metaflow_virtual_network" {

resource "azurerm_subnet" "metaflow_database_subnet" {
name = var.db_subnet_name
resource_group_name = azurerm_resource_group.metaflow_resource_group.name
virtual_network_name = azurerm_virtual_network.metaflow_virtual_network.name
address_prefixes = ["172.16.0.0/24"]
service_endpoints = ["Microsoft.Storage"]
@@ -28,8 +28,8 @@ resource "azurerm_subnet" "metaflow_database_subnet" {

resource "azurerm_subnet" "metaflow_kubernetes_subnet" {
name = var.k8s_subnet_name
resource_group_name = azurerm_resource_group.metaflow_resource_group.name
virtual_network_name = azurerm_virtual_network.metaflow_virtual_network.name
# 65k addresses is a lot... but not a lot. This will be used by AKS workloads (1 IP per pod)
address_prefixes = ["172.17.0.0/16"]
}
28 changes: 14 additions & 14 deletions azure/terraform/main.tf
@@ -9,35 +9,35 @@ terraform {
version = "3.14.0"
}
helm = {
source = "hashicorp/helm"
version = "2.6.0"
}
}
}


data "azurerm_kubernetes_cluster" "default" {
depends_on = [module.infra] # refresh cluster state before reading
resource_group_name = local.metaflow_resource_group_name
name = local.kubernetes_cluster_name
}

data "azurerm_postgresql_flexible_server" "default" {
depends_on = [module.infra] # refresh cluster state before reading
resource_group_name = local.metaflow_resource_group_name
name = local.database_server_name
}

data "azurerm_storage_account" "default" {
depends_on = [module.infra] # refresh cluster state before reading
resource_group_name = local.metaflow_resource_group_name
name = local.storage_account_name

}

data "azurerm_storage_container" "default" {
depends_on = [module.infra] # refresh cluster state before reading
name = local.storage_container_name
storage_account_name = local.storage_account_name
}

@@ -95,15 +95,15 @@ module "services" {
metaflow_db_user = local.metaflow_database_server_admin_login
metaflow_db_password = local.metaflow_db_password
metaflow_kubernetes_secret_name = local.metaflow_kubernetes_secret_name
azure_storage_credentials = {
AZURE_CLIENT_ID = module.infra.service_principal_client_id
AZURE_TENANT_ID = module.infra.service_principal_tenant_id
AZURE_CLIENT_SECRET = module.infra.service_principal_client_secret
}

deploy_airflow = var.deploy_airflow
deploy_argo = var.deploy_argo

airflow_version = local.airflow_version
airflow_frenet_secret = local.airflow_frenet_secret
}