Training models on GPU in a K8s Pod (K8s Device Plugin + PyTorch) takes 6% longer than on bare metal #1101

lyon-v commented Dec 16, 2024

Here is the code:

import time

import torch
import torch.nn as nn
import torch.optim
import torch.utils.data
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torchvision import models

device = torch.device("cuda:0")
model = models.resnet50()
model.cuda(device)

train_dataset = datasets.FakeData(51246, (3, 224, 224), 1000, transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=256, shuffle=False,  # FakeData is already random, so no shuffling needed
    num_workers=4, pin_memory=True)

criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()
total_epochs = 2

# Profiler scheduling parameters (used only by the commented-out CUDA profiler hooks below)
wait = 1
warmup = 1
active = 4
skip = wait + warmup

# Training loop

for i in range(total_epochs):
    totaltime = 0.0
    step_start_time = time.time()  # start timing the first iteration
    for step, data in enumerate(train_loader, 0):

        # 1. Move the batch to the GPU and measure data-loading time
        inputs, labels = data[0].to(device=device), data[1].to(device=device)
        data_loading_time = time.time() - step_start_time

        # Optional CUDA profiler hooks:
        # if step > skip + active:
        #     torch.cuda.cudart().cudaProfilerStop()
        #     print("break out")
        #     break
        # if step == skip:
        #     torch.cuda.cudart().cudaProfilerStart()

        # 2. Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # 3. Backward pass and optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        iteration_time = time.time() - step_start_time
        totaltime += iteration_time
        print(f"Step {step}: Iteration Time: {iteration_time:.4f}s, Data Loading Time: {data_loading_time:.4f}s")

        # Start timing the next iteration (includes the next batch's data-loading time)
        step_start_time = time.time()

    print(f"epoch {i}: Total Time: {totaltime:.4f}s, Avg Iteration Time: {totaltime/step:.4f}s")

print("Training Finished")

Under the same configuration (same CPU, memory, and GPU), training one epoch inside a K8s Pod takes about 6% longer than running the same workload on bare metal (in a Docker container):

k8s-pod: epoch 1: Total Time: 79.8202s, Avg Iteration Time: 0.3991s
bare metal: epoch 1: Total Time: 73.0051s, Avg Iteration Time: 0.3650s
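
For reference, the numbers above are wall-clock times taken around asynchronous CUDA calls, so individual per-step figures can be skewed even though the epoch total is meaningful. Below is a minimal sketch of a synchronized per-step timer, assuming the same model, criterion, and optimizer as in the script above (timed_step is a hypothetical helper, not part of the original code):

import time
import torch

def timed_step(model, criterion, optimizer, inputs, labels, device):
    # Drain any previously queued kernels so the timer starts from an idle GPU
    torch.cuda.synchronize(device)
    start = time.time()

    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Wait for this step's kernels to finish before reading the clock
    torch.cuda.synchronize(device)
    return time.time() - start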

So I need your help figuring out where this overhead comes from.

chipzoller (Contributor) commented:

I don't think this would have anything to do with the device plugin.
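
Since the device plugin only handles GPU allocation, one thing worth verifying is whether the Pod really sees the same CPU resources as the bare-metal run, since CFS throttling or a reduced CPU set would slow down the DataLoader workers. A rough sketch for comparing the two environments, assuming a Linux host and that the cgroup files below exist at these paths (they differ between cgroup v1 and v2):

import os
from pathlib import Path

def describe_cpu_limits():
    # CPUs the current process is actually allowed to run on
    info = {"visible_cpus": len(os.sched_getaffinity(0))}

    cgv2 = Path("/sys/fs/cgroup/cpu.max")                      # cgroup v2
    cgv1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")   # cgroup v1
    cgv1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")

    if cgv2.exists():
        info["cpu.max"] = cgv2.read_text().strip()             # "max <period>" means unthrottled
    elif cgv1_quota.exists() and cgv1_period.exists():
        info["cfs_quota_us"] = cgv1_quota.read_text().strip()  # -1 means unthrottled
        info["cfs_period_us"] = cgv1_period.read_text().strip()

    return info

if __name__ == "__main__":
    # Run once inside the Pod and once on bare metal, then compare the output
    print(describe_cpu_limits())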
