Commit
add power manager
Signed-off-by: HermioneKT <[email protected]>
HermioneKT committed Mar 31, 2024
1 parent 5143a83 commit 4c33ae6
Showing 15 changed files with 756 additions and 1,153 deletions.
115 changes: 115 additions & 0 deletions docs/power_manager.md
@@ -0,0 +1,115 @@
# K8s Power Manager

## Components
1. **PowerManager Controller**: ensures that the actual state of the cluster matches the desired state.
2. **PowerConfig Controller**: watches for the PowerConfig created by the user and deploys Power Node Agents onto each selected node using a DaemonSet.
   - powerNodeSelector: a key/value map of node labels that a node must satisfy for the operator's node agent to be deployed.
   - powerProfiles: the list of PowerProfiles that the user wants available on the nodes.
3. **Power Node Agent**: a containerized application that communicates with the node's Kubelet PodResources endpoint to discover the exact CPUs allocated per container and tunes the frequency of those cores as requested.
4. **Power Profile**: a predefined configuration that specifies how the system should manage power consumption for components such as CPUs and GPUs. It includes host-level settings such as CPU frequency and governor (see the sketch after this list).
5. **Power Workload**: the object used to define the lists of CPUs configured with a particular PowerProfile. A PowerWorkload is created for each PowerProfile on each node with the Power Node Agent deployed. In the Intel Power Optimization Library, a PowerWorkload is represented by a Pool; the Pool holds the PowerProfile used, its frequencies, and the CPUs to be configured. Creating the Pool, and any subsequent additions to it, carries out the changes.
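
For concreteness, here is a minimal sketch of a PowerConfig and a PowerProfile. The shape follows the Intel Kubernetes Power Manager CRDs, but the apiVersion, namespace, names, and label key/value are illustrative assumptions, not taken from this commit:
```yaml
# Hypothetical PowerConfig: deploy node agents to labeled nodes and make a
# "performance" profile available (label key/value are placeholders).
apiVersion: power.intel.com/v1
kind: PowerConfig
metadata:
  name: power-config
  namespace: intel-power
spec:
  powerNodeSelector:
    feature.node.kubernetes.io/power-node: "true"
  powerProfiles:
    - "performance"
---
# Hypothetical PowerProfile: pin the selected cores to a fixed frequency
# under the performance governor (values mirror the experiments below).
apiVersion: power.intel.com/v1
kind: PowerProfile
metadata:
  name: performance
  namespace: intel-power
spec:
  name: performance
  max: 2400
  min: 2400
  governor: "performance"
```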

## Setup

Execute the following **as a non-root user with sudo rights** using **bash**:
1. Follow the [quickstart guide](docs/quickstart_guide.md) to set up a Knative cluster for running the experiments.

2. Four vSwarm benchmarks (Spinning, Sleeping, AES, Auth) are used in the experiments. On the master node, deploy them onto the Knative cluster:
```bash
git clone --depth=1 https://github.com/vhive-serverless/vSwarm.git

cd $HOME/vSwarm/tools/test-client && go build ./test-client.go

kn service apply -f $HOME/vSwarm/benchmarks/auth/yamls/knative/kn-auth-python.yaml
kn service apply -f $HOME/vSwarm/benchmarks/aes/yamls/knative/kn-aes-python.yaml
kn service apply -f $HOME/vSwarm/benchmarks/sleeping/yamls/knative/kn-sleeping-go.yaml
kn service apply -f $HOME/vSwarm/benchmarks/spinning/yamls/knative/kn-spinning-go.yaml
```
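
Once the services report Ready, you can smoke-test one of them with the test-client built above. The flag names follow the vSwarm benchmark READMEs, and the hostname is a placeholder; substitute the URL printed by `kn service list`:
```bash
# Placeholder URL; use the one reported by `kn service list`.
$HOME/vSwarm/tools/test-client/test-client \
    --addr spinning-go.default.192.168.1.240.sslip.io:80 \
    --name "spinning smoke test"
```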

### Experiment 1: Workload sensitivity
A 2-node cluster is needed.
1. On the master node, run the node setup script:
```bash
./scripts/power_manager/setup_power_manager.sh;
```
Then run the experiment:
```bash
go run $HOME/vhive/examples/power_manager/workload_sensitivity/main.go;
```

### Experiment 2: Internode scaling
A 3-node cluster is needed. Three scenarios are evaluated (see the inspection sketch after this list):
- Scenario 1: all worker nodes run at low frequency.
- Scenario 2: all worker nodes run at high frequency.
- Scenario 3: one worker node runs at high frequency, the other at low frequency (tune the frequencies manually, as in steps 4 and 5 of Experiment 3 below).
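
To see what the operator actually applied in each scenario, the power objects can be inspected with the standard kubectl verbs; the `intel-power` namespace is an assumption based on the Intel Kubernetes Power Manager defaults:
```bash
# Namespace is an assumption (Intel Kubernetes Power Manager default).
kubectl get powerprofiles -n intel-power
kubectl get powerworkloads -n intel-power
```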

1. On the master node, run the node setup script:
```bash
./scripts/power_manager/setup_power_manager.sh;
```
Then run the experiment:
```bash
go run $HOME/vhive/examples/power_manager/internode_scaling_exp/main.go;
```

### Experiment 3: Class Assignment
A 3-node cluster is needed. We need to be able to assign each workload to a specific node.

1. To do so, on the master node, enable the nodeSelector feature in Knative:
```bash
kubectl patch configmap config-features -n knative-serving -p '{"data": {"kubernetes.podspec-nodeselector": "enabled"}}'
```
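
To confirm the flag took effect, read the ConfigMap back:
```bash
kubectl get configmap config-features -n knative-serving \
    -o jsonpath='{.data.kubernetes\.podspec-nodeselector}'
# Expected output: enabled
```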

2. Before deployment, modify the Knative YAML files of these benchmarks in vSwarm to add a nodeSelector (Spinning and AES to the worker-high class, Sleeping and Auth to the worker-low class). For example, for the Sleeping benchmark:
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sleeping-go
  namespace: default
spec:
  template:
    spec:
      nodeSelector:
        loader-nodetype: worker-low
      containers:
        - image: docker.io/vhiveease/relay:latest
          ports:
            - name: h2c
              containerPort: 50000
          args:
            - --addr=0.0.0.0:50000
            - --function-endpoint-url=0.0.0.0
            - --function-endpoint-port=50051
            - --function-name=sleeping-go
        - image: docker.io/kt05docker/sleeping-go:latest
          args:
            - --addr=0.0.0.0:50051
```
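
After editing the YAML files, redeploy the affected services with the same commands used during setup, e.g.:
```bash
kn service apply -f $HOME/vSwarm/benchmarks/sleeping/yamls/knative/kn-sleeping-go.yaml
```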

3. On the master node, label the worker nodes:
```bash
kubectl label node node-1.kt-cluster.ntu-cloud-pg0.utah.cloudlab.us loader-nodetype=worker-low
kubectl label node node-2.kt-cluster.ntu-cloud-pg0.utah.cloudlab.us loader-nodetype=worker-high
```
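Verify the labels before proceeding:
```bash
kubectl get nodes -L loader-nodetype
```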
Run the node setup script:
```bash
./scripts/power_manager/setup_power_manager.sh;
```
4. On worker node 1, manually set the frequency of every CPU core to 1.2 GHz, i.e., run the following for each core (a loop covering all cores is sketched after step 5):
```bash
echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1200000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 1200000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
```
5. On worker node 2, manually set the frequency of every CPU core to 2.4 GHz:
```bash
echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 2400000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 2400000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
```
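
Running the commands above once per core quickly gets tedious. The following loop is a sketch over the same sysfs files that applies the settings to every core; set FREQ to 1200000 on worker node 1 and 2400000 on worker node 2:
```bash
FREQ=1200000   # worker node 1; use 2400000 on worker node 2
for cpufreq in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
    echo performance | sudo tee "$cpufreq/scaling_governor"
    echo "$FREQ" | sudo tee "$cpufreq/scaling_min_freq"
    echo "$FREQ" | sudo tee "$cpufreq/scaling_max_freq"
done
```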
122 changes: 122 additions & 0 deletions examples/power_manager/assign_exp/main.go
@@ -0,0 +1,122 @@
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"os/exec"
	"sync"
	"time"

	powermanager "github.com/vhive-serverless/vhive/examples/power_manager"
)

// serviceAssignment tracks which services have already been pinned to a class.
var (
	serviceAssignment = map[string]bool{
		"spinning-go": false,
		"sleeping-go": false,
		"aes-python":  false,
		"auth-python": false,
	}
)

// processLatencies classifies a service by the spread of its latencies: a wide
// gap between the 90th and 5th percentiles sends it to the high-performance
// class, a narrow gap to the low-performance class.
func processLatencies(records []int64, serviceName string) {
	if len(records) == 0 {
		fmt.Println("No data to process")
		return
	}

	fifthPercentile := powermanager.GetDataAtPercentile(records, 5)
	ninetiethPercentile := powermanager.GetDataAtPercentile(records, 90)
	difference := float64(ninetiethPercentile-fifthPercentile) / float64(fifthPercentile)
	if difference >= 0.40 && !serviceAssignment[serviceName] { // Assign to high-performance class
		fmt.Println("Assigning to high performance class")
		command := fmt.Sprintf("kubectl patch service.serving.knative.dev %s --type merge --patch '{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"loader-nodetype\":\"worker-high\"}}}}}' --namespace default", serviceName)
		cmd := exec.Command("bash", "-c", command)
		if _, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("Error assigning to high performance class: %+v\n", err)
			return
		}
		serviceAssignment[serviceName] = true
	}
	if difference < 0.10 && !serviceAssignment[serviceName] { // Assign to low-performance class
		fmt.Println("Assigning to low performance class")
		command := fmt.Sprintf("kubectl patch service.serving.knative.dev %s --type merge --patch '{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"loader-nodetype\":\"worker-low\"}}}}}' --namespace default", serviceName)
		cmd := exec.Command("bash", "-c", command)
		if _, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("Error assigning to low performance class: %+v\n", err)
			return
		}
		serviceAssignment[serviceName] = true
	}
}

// assignWorkload collects latencies from ch_latency and re-evaluates the
// service's class assignment once per minute, and once more on shutdown.
func assignWorkload(ch_latency <-chan int64, serviceName string, wg *sync.WaitGroup) {
	defer wg.Done()

	ticker := time.NewTicker(1 * time.Minute)
	defer ticker.Stop()

	var records []int64

	for {
		select {
		case record, ok := <-ch_latency:
			if !ok {
				// Channel is closed, process remaining data
				processLatencies(records, serviceName)
				return
			}
			records = append(records, record)
		case <-ticker.C:
			// Time to process the data
			processLatencies(records, serviceName)
		}
	}
}

func main() {
	file, err := os.Create("metrics2.csv")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	err = writer.Write([]string{"startTime", "endTime", "spinningLatency", "sleepingLatency"})
	if err != nil {
		fmt.Printf("Error writing metrics to the CSV file: %v\n", err)
	}

	ch := make(chan []string)
	ch_latency_spinning := make(chan int64)
	ch_latency_sleeping := make(chan int64)

	// Three goroutines: the CSV writer plus one class assigner per service.
	var wg sync.WaitGroup
	wg.Add(3)
	go powermanager.WriteToCSV(writer, ch, &wg)
	go assignWorkload(ch_latency_spinning, "spinning-go", &wg)
	go assignWorkload(ch_latency_sleeping, "sleeping-go", &wg)

	// Invoke both benchmarks concurrently once per second for 5 minutes.
	now := time.Now()
	for time.Since(now) < (time.Minute * 5) {
		go powermanager.InvokeConcurrently(5, powermanager.SleepingURL, ch, ch_latency_spinning, ch_latency_sleeping, false)
		go powermanager.InvokeConcurrently(5, powermanager.SpinningURL, ch, ch_latency_spinning, ch_latency_sleeping, true)

		time.Sleep(1 * time.Second) // Wait for 1 second before invoking again
	}
	close(ch)
	close(ch_latency_spinning)
	close(ch_latency_sleeping)
	wg.Wait()

	err = writer.Write([]string{"-", "-", "-", "-"})
	if err != nil {
		fmt.Printf("Error writing metrics to the CSV file: %v\n", err)
	}
	fmt.Println("done")
}
3 changes: 3 additions & 0 deletions examples/power_manager/go.mod
@@ -0,0 +1,3 @@
module github.com/vhive-serverless/vhive/examples/power_manager

go 1.19
Empty file added examples/power_manager/go.sum
68 changes: 68 additions & 0 deletions examples/power_manager/internode_scaling_exp/main.go
@@ -0,0 +1,68 @@
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sync"
	"time"

	powermanager "github.com/vhive-serverless/vhive/examples/power_manager"
)

func main() {
	file, err := os.Create("metrics1.csv")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	err = writer.Write([]string{"startTime", "endTime", "spinningLatency", "sleepingLatency"})
	if err != nil {
		fmt.Printf("Error writing metrics to the CSV file: %v\n", err)
	}

	ch := make(chan []string)
	ch_latency_spinning := make(chan int64)
	ch_latency_sleeping := make(chan int64)

	// One goroutine to wait on: the CSV writer.
	var wg sync.WaitGroup
	wg.Add(1)
	go powermanager.WriteToCSV(writer, ch, &wg)

	frequencies := map[string]int64{
		powermanager.LowFrequencyPowerProfile:  1200,
		powermanager.HighFrequencyPowerProfile: 2400,
	} // for 50/50, need to manually tune the frequency of the individual node
	for powerProfile, freq := range frequencies {
		// Apply the same profile to both worker nodes, then drive load for 5 minutes.
		err := powermanager.SetPowerProfileToNode(powerProfile, powermanager.Node1Name, freq)
		if err != nil {
			fmt.Printf("Error setting up power profile for node1: %+v\n", err)
		}
		err = powermanager.SetPowerProfileToNode(powerProfile, powermanager.Node2Name, freq)
		if err != nil {
			fmt.Printf("Error setting up power profile for node2: %+v\n", err)
		}

		now := time.Now()
		for time.Since(now) < (time.Minute * 5) {
			go powermanager.InvokeConcurrently(5, powermanager.SleepingURL, ch, ch_latency_spinning, ch_latency_sleeping, false)
			go powermanager.InvokeConcurrently(5, powermanager.SpinningURL, ch, ch_latency_spinning, ch_latency_sleeping, true)

			time.Sleep(1 * time.Second) // Wait for 1 second before invoking again
		}
	}
	// Close the channels only after all scenarios have run; closing them inside
	// the loop would panic on the second iteration.
	close(ch)
	close(ch_latency_spinning)
	close(ch_latency_sleeping)
	wg.Wait()

	err = writer.Write([]string{"-", "-", "-", "-"})
	if err != nil {
		fmt.Printf("Error writing metrics to the CSV file: %v\n", err)
	}
	fmt.Println("done")
}