Zscaler "Base_cc_vmss" deployment type

This deployment type is intended for greenfield/POV/lab purposes. It deploys a fully functioning sandbox environment in a new Resource Group/VNet with test workload VMs, effectively creating all network infrastructure dependencies for an Azure environment. The full set of provisioned resources is listed below. It includes everything from the "Base" deployment type: 1 new Resource Group; 1 VNet with 1 public subnet and 1 private/workload subnet; 1 CentOS server workload in the private subnet; 1 Bastion Host in the public subnet assigned a Public IP; and a locally generated key pair .pem file for SSH access.

Additionally, depending on the configuration, it creates 1 or more Flexible Orchestration Virtual Machine Scale Sets (VMSS) and scaling policies for Cloud Connector in private subnet(s); 1 Function App for VMSS; a Standard Azure Load Balancer; and workload private subnet UDR routing to the Load Balancer Frontend IP.

Terraform client requirements

If the run_manual_sync variable is true (the default), the bash script scripts/manual_sync.sh is invoked to perform the Function App manual sync (more information in the Caveats/Considerations section). In that case, it is advised that you run from a macOS or Linux workstation with the following tools installed: bash | curl | jq

Caveats/Considerations

  • WSL2 DNS bug: If you are running these Azure Terraform deployments from a Windows WSL2 instance such as Ubuntu and receive an error containing a message similar to "dial tcp: lookup management.azure.com on 172.21.240.1:53: cannot unmarshal DNS message", please refer to microsoft/WSL#5420 (comment) for a WSL2 resolv.conf fix.
  • Function App Manual Sync: When the Function App used for managing Cloud Connectors in the Scale Set is created, Azure requires that a "Manual Sync" operation be performed. This can be done through an API call or by simply navigating to the Function App on the Azure console and letting the page load. This action tells the Function App to load the zip file from the Storage Account and start running the Functions. We have attempted to automate this Manual Sync call through Terraform by triggering scripts/manual_sync.sh through a provisioner in the Function App Terraform module. If this attempt fails, an output message (shown below) will be written to testbed.txt and printed to the screen at the end of the deployment. If the Manual Sync operation fails during terraform apply, the steps listed in the message can be used to remediate the issue. This is a one-time action at Function App creation time.
**IMPORTANT (ONLY APPLICABLE FOR INITIAL CREATE OF FUNCTION APP)**
Based on the recorded output, the manual sync to start your Azure Function App failed. To perform this manual sync perform one of the following steps:
  1. Navigate to the Azure Function App /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<function-app> on the Azure Portal. The loading of the Function App page triggers the manual sync and will start your Function App.
  2. Attempt to rerun the manual_sync.sh script manually using the following command (path to file is based on root of the repo):
      ../../modules/terraform-zscc-function-app-azure/manual_sync.sh <subscription-id> <resource-group> <function-app>
**IMPORTANT (ONLY APPLICABLE FOR INITIAL CREATE OF FUNCTION APP)**

Components

[Figure: VMSS topology diagram]

Topology Details

  • Security Stack will be deployed into its own Resource Group.
  • Based on zonal needs, a VMSS will be created in each configured zone.
  • An Azure Internal Load Balancer (ILB) is deployed on top of all the Scale Sets and is used as the entry point for the Security Stack.
  • A NAT Gateway will be deployed in each configured zone with a dedicated IP associated with it; this is used for outbound traffic from the Cloud Connectors.

It is recommended that this security stack is deployed into its own VNet (Security VNet) and Workload VNets are peered with it. Once the security stack is deployed, route tables in the Workload VNets should have a User Defined Route steering traffic to the ILB sitting on top of the Cloud Connectors.

Azure Function App

The Azure Function App will contain two Azure Functions.

  1. Health Monitoring Function - Responsible for using the custom metrics published by each CC to determine whether any unhealthy CCs need to be replaced. If a CC is found to be unhealthy, the function will terminate the instance and replace it with a new one. This function runs every minute.
  2. Resource Sync Function - Responsible for ensuring the VMs advertised in your Cloud Connector Group on the Zscaler Cloud Connector Portal match what exists in your Azure Scale Set. If it finds that a CC exists in the Cloud Connector Group but not in the Azure Scale Set, it will clean up that instance from the Cloud Connector Group to ensure the two entities are in sync. This function runs every 30 minutes.
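
The comparison performed by the Resource Sync Function can be pictured as a set difference. The following is a minimal sketch only, not the Function App's actual code: the function name and the instance identifiers are hypothetical, and how the real function obtains either list is not shown here.

```python
from __future__ import annotations


def find_stale_connectors(group_members: set[str], scale_set_instances: set[str]) -> set[str]:
    """Return CCs registered in the Cloud Connector Group but absent from the Scale Set."""
    # Only group entries with no matching Scale Set instance are candidates for clean up.
    return group_members - scale_set_instances


# Hypothetical identifiers, for illustration only.
group = {"cc-vm-0", "cc-vm-1", "cc-vm-2"}
scale_set = {"cc-vm-0", "cc-vm-2", "cc-vm-3"}

stale = find_stale_connectors(group, scale_set)
print(sorted(stale))  # ['cc-vm-1']
```

Note that the difference is one-directional: an instance present in the Scale Set but not yet in the Cloud Connector Group (cc-vm-3 above) is not treated as stale.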

How to deploy:

Option 1 (guided):

From the examples directory, run the zsec bash script that walks through all required inputs.

  • ./zsec up
  • enter "greenfield"
  • enter "base_cc_vmss"
  • follow the remainder of the authentication and configuration input prompts.
  • script will detect client operating system and download/run a specific version of terraform in a temporary bin directory
  • inputs will be validated and terraform init/apply will automatically execute.
  • verify all resources that will be created/modified and enter "yes" to confirm

Option 2 (manual):

Modify/populate any required variable input values in base_cc_vmss/terraform.tfvars file and save.

From base_cc_vmss directory execute:

  • terraform init
  • terraform apply

How to destroy:

Option 1 (guided):

From the examples directory, run the zsec bash script that walks through all required inputs.

  • ./zsec destroy

Option 2 (manual):

From base_cc_vmss directory execute:

  • terraform destroy

Special Features

Isolated Permissions

This solution includes two entities that perform Azure operations: the Cloud Connectors and the Azure Function App. Each needs a Managed Identity associated with it. For the Azure Function App to perform the operations described above, it needs an expanded permission set that the Cloud Connector does not necessarily require. To enforce proper RBAC, two Managed Identities can be used: one specifically for the Cloud Connector with the reduced permission set and one for the Function App with the expanded permission set.

Terraform Configuration

Set the following variables:

cc_vm_managed_identity_name         = <cc-managed-identity-name>
cc_vm_managed_identity_rg           = <cc-managed-identity-resource-group>
function_app_managed_identity_name  = <function-app-managed-identity-name>
function_app_managed_identity_rg    = <function-app-managed-identity-resource-group>

ZSEC Configuration

Configure the following options:

Cloud Connector User Managed Identity Information:
Is the Managed Identity in the same Subscription ID? [yes/no]: yes
Managed Identity is in the same Subscription
Enter Managed Identity Name: <cc-managed-identity-name>
Enter Managed Identity Resource Group: <cc-managed-identity-resource-group>
Function App User Managed Identity Information:
Assign the same User Managed Identity (<cc-managed-identity-name>) to Function App? [yes/no]: no
Enter Function App designated Managed Identity Name: <function-app-managed-identity-name>
Enter Function App designated Managed Identity Resource Group: <function-app-managed-identity-resource-group>

Scheduled Scaling

  • Enables you to redefine minimum Cloud Connectors in Scale Set for specific time periods.
  • Should be used if you have predictable traffic patterns (for example, 9am-5pm Monday-Friday).

Terraform Configuration

Set the following variables:

scheduled_scaling_enabled         = true
scheduled_scaling_vmss_min_ccs    = 3
scheduled_scaling_days_of_week    = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
scheduled_scaling_start_time_hour = 7
scheduled_scaling_start_time_min  = 30
scheduled_scaling_end_time_hour   = 18
scheduled_scaling_end_time_min    = 30

ZSEC Configuration

Configure the following options:

Do you want to enable scheduled scaling on the VMSS? [yes/no]: yes
Enter the minimum amount of scheduled Cloud Connectors in VMSS? [Default=2]: 3
Apply Scheduled Scaling Policy on Sunday? [yes/no]: no
Not configuring Sunday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Monday? [yes/no]: yes
Adding Monday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Tuesday? [yes/no]: yes
Adding Tuesday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Wednesday? [yes/no]: yes
Adding Wednesday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Thursday? [yes/no]: yes
Adding Thursday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Friday? [yes/no]: yes
Adding Friday on Scheduled Scaling configuration.
Apply Scheduled Scaling Policy on Saturday? [yes/no]: no 
Not configuring Saturday on Scheduled Scaling configuration.
Configuring the following days on the Scheduled Scaling Policy: Monday Tuesday Wednesday Thursday Friday 
Enter the start time hour for the scheduled scaling configuration? [Default=9]: 7
Enter the start time min for the scheduled scaling configuration? [Default=0]: 30
Enter the end time hour for the scheduled scaling configuration? [Default=17]: 18
Enter the end time min for the scheduled scaling configuration? [Default=30]: 30
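
To illustrate how the example values above combine, here is a minimal sketch that decides which minimum applies at a given moment. It is an illustration under assumptions, not how Azure autoscale is implemented: the function names are hypothetical, the start time is treated as inclusive and the end time as exclusive, and the datetime is assumed to already be expressed in the configured scheduled_scaling_timezone.

```python
from datetime import datetime, time

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Values copied from the example configuration above.
SCHEDULED_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
WINDOW_START = time(7, 30)   # scheduled_scaling_start_time_hour / _min
WINDOW_END = time(18, 30)    # scheduled_scaling_end_time_hour / _min
SCHEDULED_MIN_CCS = 3        # scheduled_scaling_vmss_min_ccs
DEFAULT_MIN_CCS = 2          # vmss_min_ccs default


def in_scheduled_window(moment: datetime) -> bool:
    """True when `moment` is on a configured day and inside the time window."""
    day = DAY_NAMES[moment.weekday()]  # weekday() returns 0 for Monday
    # Assumption: start is inclusive, end is exclusive.
    return day in SCHEDULED_DAYS and WINDOW_START <= moment.time() < WINDOW_END


def effective_min_ccs(moment: datetime) -> int:
    """Minimum number of CCs that applies at `moment`."""
    return SCHEDULED_MIN_CCS if in_scheduled_window(moment) else DEFAULT_MIN_CCS


print(effective_min_ccs(datetime(2024, 7, 23, 8, 0)))  # Tuesday morning -> 3
print(effective_min_ccs(datetime(2024, 7, 27, 8, 0)))  # Saturday morning -> 2
```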

Debugging Tips

Viewing Cloud Connector Health Metrics

Cloud Connector Health metrics are published every 1 minute by the Cloud Connector and are managed by Application Insights. One easy way to view the metrics is to navigate to one of the running instances: Resource Group -> Scale Set -> Instances (tab on left) -> select instance -> Metrics (tab on left). Next create a metric query where:

  • Scope = vm-name
  • Metric Namespace = zscaler/cloudconnectors
  • Metric = cloud_connector_aggr_health
  • Aggregation = average

Viewing Virtual Machine Scale Set Scaling Metrics

Cloud Connectors in a Scale Set publish scaling metrics to the Scale Set resource once a minute. These scaling metrics include smedge_cpu_utilization, smedge_mem_utilization, smedge_bytes_in and smedge_bytes_out. The scaling rules in the Scale Set scaling configuration will look at the smedge_cpu_utilization and compare it to the defined threshold.
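
As a rough mental model of that comparison, the sketch below applies the default scale_out_threshold (70), scale_in_threshold (50), vmss_min_ccs (2) and vmss_max_ccs (16) values from the Inputs table. It is an assumption-laden simplification: the function name is hypothetical, strict comparisons are assumed, and real Azure autoscale rules also involve aggregation windows and cooldowns that are not modeled here.

```python
def scaling_decision(cpu_utilization, current_ccs,
                     scale_out_threshold=70, scale_in_threshold=50,
                     min_ccs=2, max_ccs=16):
    """Return 'scale_out', 'scale_in' or 'none' for one metric sample."""
    # Scale out only while there is room below the configured maximum.
    if cpu_utilization > scale_out_threshold and current_ccs < max_ccs:
        return "scale_out"
    # Scale in only while the Scale Set is above the configured minimum.
    if cpu_utilization < scale_in_threshold and current_ccs > min_ccs:
        return "scale_in"
    # Between the two thresholds, or at a capacity limit, nothing changes.
    return "none"


print(scaling_decision(85, 2))  # scale_out
print(scaling_decision(30, 4))  # scale_in
print(scaling_decision(60, 4))  # none
```

The gap between the two thresholds keeps the Scale Set from oscillating when utilization hovers around a single value.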

To view these metrics navigate to the Scale Set you are interested in: Resource Group -> select Scale Set -> Metrics (tab on left). Next create a metrics query where:

  • Scope = scale-set-name
  • Metric Namespace = zscaler/cloudconnectors
  • Metric = smedge_metrics
  • Aggregation = average

Lastly, create a filter where:

  • metric_name = smedge_cpu_utilization

Viewing Function App Logs

There are a couple of approaches for viewing logs from a Function inside a Function App.

Recent Invocations

To view recent invocations, navigate to the function you are interested in: Resource Group -> select Function App -> select Function (shown on overview page) -> Invocations

Real Time Log Streaming

To view real-time logs from a function as it executes, navigate to the function you are interested in: Resource Group -> select Function App -> select Function (shown on overview page) -> Logs

Viewing through Application Insights

A more complex but more powerful approach for viewing logs is to use Application Insights, which lets you run queries to view specific log messages, executions, timeframes, etc. The basic example below shows logs from the Health Monitor function for iterations where it found that no instances needed to be terminated. Note that the query filters on a specific message; this lets you refine your search instead of manually going through each invocation or continuously watching the real-time stream.

Navigate to: Resource Group -> Application Insights -> Logs (tab on left), then use the following query:

union traces
| union exceptions
| where timestamp > ago(1d)
| where customDimensions['Category'] == 'Function.healthMonitor.User' or customDimensions['Category'] == 'Function.healthMonitor'
| where message contains "No instances to terminate on this iteration."
| order by timestamp asc
| project
    timestamp,
    message = iff(message != '', message, iff(innermostMessage != '', innermostMessage, customDimensions.['prop__{OriginalFormat}']))

FAQs

When is a Cloud Connector considered to be unhealthy and should be replaced?

Each Cloud Connector broadcasts its health to the Azure Application Insights instance in the Resource Group (to view these metrics, refer to Debugging Tips->Viewing Cloud Connector Health Metrics). The health relates to the data plane's health and correlates to the active/inactive state you will see in the Cloud Connector Group on the Zscaler Connector Portal. This health is evaluated by a process in the Cloud Connector and a value is published to this metric every 1 minute: 0 indicates unhealthy and 100 indicates healthy. An instance should be replaced in either of the following two scenarios:

  1. The Cloud Connector reports unhealthy 5 times in a row. This indicates the Cloud Connector is down and should be replaced.
  2. The Cloud Connector reports unhealthy 7 out of 10 times. This indicates the Cloud Connector is flapping and should be replaced.

The Health Monitoring Function in the Function App will perform this evaluation every 1 minute and will determine if any instances should be replaced. When an instance is replaced, it will be terminated and the Health Monitoring Function will ensure a new one is brought up to replace it.
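
The two replacement rules can be expressed as a small predicate over the most recent per-minute samples. This is a minimal sketch and not the Function App's actual code: the function name is hypothetical, "5 times in a row" is assumed to mean the five newest samples, and "7 out of 10" is assumed to mean at least seven of the ten newest samples.

```python
CONSECUTIVE_LIMIT = 5   # "unhealthy 5 times in a row"
WINDOW_SIZE = 10        # "7 out of 10 times"
WINDOW_LIMIT = 7
UNHEALTHY = 0           # metric value 0 = unhealthy, 100 = healthy


def should_replace(samples):
    """Decide from per-minute health samples (oldest first) whether to replace the CC."""
    recent = list(samples)[-WINDOW_SIZE:]
    last_five = recent[-CONSECUTIVE_LIMIT:]
    # Rule 1: the five newest samples are all unhealthy (the CC is down).
    down = (len(last_five) == CONSECUTIVE_LIMIT
            and all(s == UNHEALTHY for s in last_five))
    # Rule 2: at least seven of the ten newest samples are unhealthy (the CC is flapping).
    flapping = (len(recent) == WINDOW_SIZE
                and sum(s == UNHEALTHY for s in recent) >= WINDOW_LIMIT)
    return down or flapping


print(should_replace([100] * 5 + [0] * 5))                        # True (down)
print(should_replace([0, 100, 0, 0, 100, 0, 0, 100, 0, 0]))      # True (flapping)
print(should_replace([0, 0, 100, 0, 0, 100, 100, 100, 100, 0]))  # False
```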

I am seeing unhealthy instances not being replaced in my Scale Set, what could be the issue?

In this scenario, first check whether the metrics published by the unhealthy instance have a value of 0, which indicates unhealthy (100 indicates healthy). Please refer to the Debugging Tips->Viewing Cloud Connector Health Metrics section. If this metric stays at 0 for a long period of time (refer to FAQs->When is a Cloud Connector considered to be unhealthy and should be replaced?), next check whether the Function App is running. During creation of the Function App, a manual sync trigger needs to be successfully invoked for the Function App to start (refer to Caveats/Considerations->Function App Manual Sync); if the Function App is not running, unhealthy instances will not be replaced. Navigate to the Function App on the Azure Console to invoke the Manual Sync, then view the invocations (Debugging Tips->Viewing Function App Logs->Recent Invocations) to see whether it has been running.

How can I stop the Health Monitoring Function from terminating unhealthy instances?

This can be configured through modifying the following terraform variable and then applying the change:

terminate_unhealthy_instances = false

It can also be configured manually on the Azure Portal by navigating to the environment variables of the Function App: Resource Group -> select Function App -> Environment variables. Then select TERMINATE_UNHEALTHY_INSTANCES, set the value to false, and apply the change.

Can I just use one Managed Identity for both the Cloud Connectors and Azure Function App?

Yes, this can be done with terraform by not setting the following variables: function_app_managed_identity_name and function_app_managed_identity_rg.
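
Conceptually, leaving those two variables at their empty-string defaults means the Function App falls back to the Cloud Connector identity. The sketch below illustrates that fallback only; the function name is hypothetical, the example identity names are placeholders, and the actual selection logic inside the Terraform modules may differ.

```python
def resolve_function_app_identity(cc_name, cc_rg,
                                  function_app_name="", function_app_rg=""):
    """Return the (name, resource group) of the identity the Function App would use."""
    # Assumption: a dedicated identity is used only when both values are set.
    if function_app_name and function_app_rg:
        return function_app_name, function_app_rg
    # Otherwise reuse the Cloud Connector Managed Identity.
    return cc_name, cc_rg


# Placeholder names, for illustration only.
print(resolve_function_app_identity("cc-mi", "cc-mi-rg"))
# ('cc-mi', 'cc-mi-rg')
print(resolve_function_app_identity("cc-mi", "cc-mi-rg", "fa-mi", "fa-mi-rg"))
# ('fa-mi', 'fa-mi-rg')
```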

How can I find the Mgmt IP address of a Cloud Connector in a Scale Set?

The Mgmt IP address is not printed after Terraform executes because, given the dynamic nature of a Scale Set, it is not known in advance. Therefore, if you wish to SSH into one of the Cloud Connectors, you will need to find the instance you are interested in on the Azure Portal to get the IP address to use for the connection.

To find this Mgmt IP, navigate to: Resource Group -> select Scale Set -> Instances (tab on left) -> select Instance -> Network Settings (tab on left). Once here, make sure you are looking at the mgmt interface; this can be confirmed by seeing "mgmt" in the interface name. From there you can copy the IP address.

Requirements

Name Version
terraform >= 0.13.7, < 2.0.0
azurerm >= 3.108.0, <= 3.116
local ~> 2.5.0
null ~> 3.1.0
random ~> 3.3.0
tls ~> 3.4.0

Providers

Name Version
local ~> 2.5.0
random ~> 3.3.0
tls ~> 3.4.0

Modules

Name Source Version
bastion ../../modules/terraform-zscc-bastion-azure n/a
cc_functionapp ../../modules/terraform-zscc-function-app-azure n/a
cc_identity ../../modules/terraform-zscc-identity-azure n/a
cc_lb ../../modules/terraform-zscc-lb-azure n/a
cc_nsg ../../modules/terraform-zscc-nsg-azure n/a
cc_vmss ../../modules/terraform-zscc-ccvmss-azure n/a
network ../../modules/terraform-zscc-network-azure n/a
workload ../../modules/terraform-zscc-workload-azure n/a

Resources

Name Type
local_file.private_key resource
local_file.testbed resource
local_file.user_data_file resource
random_string.suffix resource
tls_private_key.key resource

Inputs

Name Description Type Default Required
accelerated_networking_enabled Enable/Disable accelerated networking support on all Cloud Connector service interfaces bool true no
arm_location The Azure Region where resources are to be deployed string "westus2" no
azure_vault_url Azure Vault URL string n/a yes
bastion_nsg_source_prefix user input for locking down SSH access to bastion to a specific IP or CIDR range string "*" no
cc_subnets Cloud Connector Subnets to create in VNet. This is only required if you want to override the default subnets that this code creates via network_address_space variable. list(string) null no
cc_vm_managed_identity_name Azure Managed Identity name to attach to the CC VM. E.g zspreview-66117-mi string n/a yes
cc_vm_managed_identity_rg Resource Group of the Azure Managed Identity name to attach to the CC VM. E.g. edgeconnector_rg_1 string n/a yes
cc_vm_prov_url Zscaler Cloud Connector Provisioning URL string n/a yes
ccvm_image_offer Azure Marketplace Cloud Connector Image Offer string "zia_cloud_connector" no
ccvm_image_publisher Azure Marketplace Cloud Connector Image Publisher string "zscaler1579058425289" no
ccvm_image_sku Azure Marketplace Cloud Connector Image SKU string "zs_ser_gen1_cc_01" no
ccvm_image_version Azure Marketplace Cloud Connector Image Version string "latest" no
ccvm_instance_type Cloud Connector Image size string "Standard_D2s_v3" no
ccvm_source_image_id Custom Cloud Connector Source Image ID. Set this value to the path of a local subscription Microsoft.Compute image to override the Cloud Connector deployment instead of using the marketplace publisher string null no
encryption_at_host_enabled User input for enabling or disabling host encryption bool true no
env_subscription_id Azure Subscription ID where resources are to be deployed in string n/a yes
environment Customer defined environment tag. ie: Dev, QA, Prod, etc. string "Development" no
existing_log_analytics_workspace Set to True if you wish to use an existing Log Analytics Workspace to associate with the AppInsights Instance. Default is false meaning Terraform module will create a new one bool false no
existing_log_analytics_workspace_id ID of existing Log Analytics Workspace to associate with the AppInsights Instance. string "" no
existing_storage_account Set to True if you wish to use an existing Storage Account to associate with the Function App. Default is false meaning Terraform module will create a new one bool false no
existing_storage_account_name Name of existing Storage Account to associate with the Function App. string "" no
existing_storage_account_rg Resource Group of existing Storage Account to associate with the Function App. string "" no
function_app_managed_identity_name Azure Managed Identity name to attach to the Function App. E.g zspreview-66117-mi string "" no
function_app_managed_identity_rg Resource Group of the Azure Managed Identity name to attach to the Function App. E.g. edgeconnector_rg_1 string "" no
health_check_interval The interval, in seconds, for how frequently to probe the endpoint for health status. Typically, the interval is slightly less than half the allocated timeout period (in seconds) which allows two full probes before taking the instance out of rotation. The default value is 15, the minimum value is 5 number 15 no
http_probe_port Port number for Cloud Connector cloud init to enable listener port for HTTP probe from Azure LB number 50000 no
load_distribution Azure LB load distribution method string "Default" no
managed_identity_subscription_id Azure Subscription ID where the User Managed Identity resource exists. Only required if this Subscription ID is different than env_subscription_id string null no
name_prefix The name prefix for all your resources string "zscc" no
network_address_space VNet IP CIDR Range. All subnet resources that might get created (public, workload, cloud connector) are derived from this /16 CIDR. If you require creating a VNet smaller than /16, you may need to explicitly define all other subnets via public_subnets, workload_subnets, cc_subnets, and route53_subnets variables string "10.1.0.0/16" no
number_of_probes The number of probes that, if unanswered, will result in stopping further traffic from being delivered to the endpoint. This value allows endpoints to be taken out of rotation faster or slower than the typical times used in Azure number 1 no
owner_tag Customer defined owner tag value. ie: Org, Dept, username, etc. string "zscc-admin" no
path_to_scripts Path to script_directory string "" no
probe_threshold The number of consecutive successful or failed probes in order to allow or deny traffic from being delivered to this endpoint. After failing the number of consecutive probes equal to this value, the endpoint will be taken out of rotation and require the same number of successful consecutive probes to be placed back in rotation. number 2 no
public_subnets Public/Bastion Subnets to create in VNet. This is only required if you want to override the default subnets that this code creates via network_address_space variable. list(string) null no
run_manual_sync Set to True if you would like terraform to run the manual sync operation to start the Function App after creation. The alternative is to navigate to the Function App on the Azure Portal UI or to manually invoke the script yourself. bool true no
scale_in_threshold Metric threshold for determining scale in. number 50 no
scale_out_threshold Metric threshold for determining scale out. number 70 no
scheduled_scaling_days_of_week Days of the week to apply scheduled scaling profile. list(string) ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] no
scheduled_scaling_enabled Enable scheduled scaling on top of metric scaling. bool false no
scheduled_scaling_end_time_hour Hour to end scheduled scaling profile. number 17 no
scheduled_scaling_end_time_min Minute to end scheduled scaling profile. number 0 no
scheduled_scaling_start_time_hour Hour to start scheduled scaling profile. number 9 no
scheduled_scaling_start_time_min Minute to start scheduled scaling profile. number 0 no
scheduled_scaling_timezone Timezone the times for the scheduled scaling profile are specified in. string "Pacific Standard Time" no
scheduled_scaling_vmss_min_ccs Minimum number of CCs in vmss for scheduled scaling profile. number 2 no
support_access_enabled If Network Security Group is being configured, enable a specific outbound rule for Cloud Connector to be able to establish connectivity for Zscaler support access. Default is true bool true no
terminate_unhealthy_instances Indicate whether detected unhealthy instances are terminated or not. bool true no
tls_key_algorithm algorithm for tls_private_key resource string "RSA" no
upload_function_app_zip By default, this Terraform will create a new Storage Account/Container/Blob to upload the zip file. The function app will pull from the blob URL to run. Setting this value to false will prevent creation/upload of the blob file bool true no
vmss_default_ccs Default number of CCs in vmss. number 2 no
vmss_max_ccs Maximum number of CCs in vmss. number 16 no
vmss_min_ccs Minimum number of CCs in vmss. number 2 no
workload_count The number of Workload VMs to deploy number 1 no
workloads_subnets Workload Subnets to create in VNet. This is only required if you want to override the default subnets that this code creates via network_address_space variable. list(string) null no
zones Specify which availability zone(s) to deploy VM resources in if zones_enabled variable is set to true list(string) ["1"] no
zones_enabled Determine whether to provision Cloud Connector VMs explicitly in defined zones (if supported by the Azure region provided in the location variable). If left false, Azure will automatically choose a zone and module will create an availability set resource instead for VM fault tolerance bool false no
zscaler_cc_function_public_url Publicly accessible URL path where Function App can pull its zip file build from. This is only required when var.upload_function_app_zip is set to false string "" no

Outputs

Name Description
testbedconfig Azure Testbed results