Typha autoscaler's autoscaling profile to be configurable #3095

Open · consideRatio opened this issue Jan 9, 2024 · 0 comments
Expected Behavior

I expect that the logic that maps cluster node count to typha replicas wouldn't be hardcoded, as it currently is here:

// GetExpectedTyphaScale will return the number of Typhas needed for the number of nodes.
//
//    Nodes    Replicas
//    1-2      1
//    3-4      2
//    <200     3
//    >400     4
//    >600     5
//    >800     6
//    >1000    7
//    ...
//    >2000    12
//    ...
//    >3600    20
func GetExpectedTyphaScale(nodes int) int {
    var maxNodesPerTypha int = 200
    // This gives a count of how many 200s so we need 1+ this number to get at least
    // 1 typha for every 200 nodes.
    typhas := (nodes / maxNodesPerTypha) + 1
    // We add one more to ensure there is always 1 extra for high availability purposes.
    typhas += 1
    // We have a couple special cases for small clusters. We want to ensure that we run one fewer
    // Typha instances than there are nodes, so that there is room for rescheduling. We also want
    // to ensure we have at least two, where possible, so that we have redundancy.
    if nodes <= 2 {
        // For one and two node clusters, we only need a single typha.
        typhas = 1
    } else if nodes <= 4 {
        // For three and four node clusters, we can run an additional typha.
        typhas = 2
    } else if typhas < 3 {
        // For clusters with more than 4 nodes, make sure we have a minimum of three for redundancy.
        typhas = 3
    }
    return typhas
}
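
To make the mapping concrete, here is a small usage sketch (assuming GetExpectedTyphaScale is in scope) showing the values the code above returns for a few node counts:

package main

import "fmt"

func main() {
    // Sample outputs of GetExpectedTyphaScale, computed from the code above.
    fmt.Println(GetExpectedTyphaScale(2))   // 1: one- and two-node clusters
    fmt.Println(GetExpectedTyphaScale(4))   // 2: three- and four-node clusters
    fmt.Println(GetExpectedTyphaScale(5))   // 3: bumped to the minimum of three
    fmt.Println(GetExpectedTyphaScale(399)) // 3: 399/200 + 1 + 1
    fmt.Println(GetExpectedTyphaScale(450)) // 4: 450/200 + 1 + 1
}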

Practically, it could be made configurable with a nodesToReplicas ladder, as is done in GKE's managed Calico deployment via a ConfigMap:

# ...
data:
  ladder: |-
    {
      "coresToReplicas": [],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [100, 3],
        [250, 4],
        [500, 5],
        [1000, 6],
        [1500, 7],
        [2000, 8]
      ]
    }

Current Behavior

The autoscaling profile is fixed and can't be influenced.

Possible Solution

Provide a nodesToReplicas configuration for the typha autoscaler, nested somewhere under the Installation resource, where the default value of this configuration mimics the current implementation.
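
As a rough sketch of what this could look like (the names and structure below are hypothetical, not the operator's actual API): the autoscaler could walk a sorted [minNodes, replicas] ladder, with a default ladder chosen to reproduce the hardcoded mapping above.

// Hypothetical sketch: a configurable nodesToReplicas ladder whose default
// reproduces the current GetExpectedTyphaScale behaviour.
type ladderStep struct {
    MinNodes int // smallest node count at which this step applies
    Replicas int // typha replicas to run at or above MinNodes
}

// Default ladder mimicking the hardcoded mapping; past 400 nodes it would
// continue adding one replica per additional 200 nodes (600, 800, 1000, ...).
var defaultNodesToReplicas = []ladderStep{
    {1, 1},
    {3, 2},
    {5, 3},
    {400, 4},
    {600, 5},
}

// typhaReplicasForNodes returns the replica count of the highest ladder step
// whose MinNodes is <= nodes. The ladder is assumed sorted by MinNodes.
func typhaReplicasForNodes(nodes int, ladder []ladderStep) int {
    replicas := 1
    for _, step := range ladder {
        if nodes >= step.MinNodes {
            replicas = step.Replicas
        }
    }
    return replicas
}

A cluster admin could then override just the small-cluster steps (for example capping replicas at 1 or 2) without changing how larger clusters scale.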

Context

I'd like this feature to avoid forcing additional nodes onto a small cluster just to house these pods, which can't be scheduled next to each other; that incurs cloud cost, hogs available compute in the cloud, and wastes energy.

In a small k8s cluster with for example just four nodes, where three of them are reserved for other things and only one is available to run calico-typha, 2 of the 3 calico-typha pods will fail to schedule (# node(s) didn't have free ports for the requested pod ports). When this happens, a cluster-autoscaler could end up creating additional nodes even though the admin has determined that just one or two calico-typha pods would have sufficed.

Your Environment

  • AKS 1.28.3 using tigera operator tigera/operator:v1.28.13
  • We have "core nodes" and "user nodes", where we typically have just one or possibly two core nodes on which workloads like calico-typha should run, and often a few additional "user nodes" where such workloads are forbidden to run via taints.