Using a GPU to train models in a K8s Pod (K8s Device Plugin, PyTorch), training time is ~6% longer than on bare metal.
#1101
import time

import torch
import torch.nn as nn
import torch.optim
import torch.utils.data
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torchvision import models
device = torch.device("cuda:0")
model = models.resnet50()
model.cuda(device)

train_dataset = datasets.FakeData(51246, (3, 224, 224), 1000, transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=256, shuffle=False,
    num_workers=4, pin_memory=True)

criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

total_epochs = 2
wait = 1    # steps to skip before profiling
warmup = 1  # profiler warmup steps
active = 4  # steps to profile
sumstep = wait + warmup + active
skip = wait + warmup
# Training loop
for i in range(total_epochs):
    totaltime = 0
    step_start_time = time.time()  # start of the first iteration
    for step, data in enumerate(train_loader, 0):
        inputs, labels = data[0].to(device=device), data[1].to(device=device)
        data_loading_time = time.time() - step_start_time
        # if step > skip + active:
        #     torch.cuda.cudart().cudaProfilerStop()
        #     print("break out")
        #     break
        # if step == skip:
        #     torch.cuda.cudart().cudaProfilerStart()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        iteration_time = time.time() - step_start_time
        totaltime += iteration_time
        print(f"Step {step}: Iteration Time: {iteration_time:.4f}s, Data Loading Time: {data_loading_time:.4f}s")
        step_start_time = time.time()  # reset so the next iteration measures its data-loading time
    # enumerate starts at 0, so the step count is step + 1
    print(f"epoch {i}: Total Time: {totaltime:.4f}s, Avg Iteration Time: {totaltime/(step + 1):.4f}s")
print("Training Finished")
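One thing worth noting: the script defines `wait`, `warmup`, and `active` counters but the reported average still includes the first steps, which are inflated by CUDA initialization and cache warmup. A minimal stdlib sketch of averaging only the steady-state steps (the helper name `average_step_time` and the toy timings are assumptions for illustration):

```python
import time


def average_step_time(step_times, wait=1, warmup=1):
    """Average iteration time, skipping the first wait + warmup steps
    (mirrors the wait/warmup/active counters in the script above)."""
    skip = wait + warmup
    measured = step_times[skip:]
    return sum(measured) / len(measured)


# Toy usage: the first "steps" are slow (CUDA context init, allocator warmup)
times = [1.20, 0.55, 0.40, 0.39, 0.41, 0.40]
print(f"avg steady-state step: {average_step_time(times):.4f}s")  # → 0.4000s
```

Averaging only the steady-state steps makes the pod-vs-bare-metal comparison less sensitive to one-time startup costs, which can differ between the two environments for unrelated reasons (image pull, container filesystem, etc.).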
Under the same configuration (same CPU, memory, and GPU), training one epoch in a K8s Pod takes about 6% longer than on bare metal (using Docker):
k8s-pod: epoch 1: Total Time: 79.8202s, Avg Iteration Time: 0.3991s
bare metal: epoch 1: Total Time: 73.0051s, Avg Iteration Time: 0.3650s
So I need your help.
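A common cause of this kind of pod-vs-bare-metal gap is CPU throttling from the pod's CPU limit (cgroup CFS quota), which slows the `num_workers=4` DataLoader processes even when the configured CPU count matches. A quick stdlib check to run both inside the pod and on the host (the cgroup path is an assumption for a cgroup-v2 Linux host):

```python
import os

# CPUs actually schedulable for this process (Linux-only call);
# a cpuset-restricted pod may see fewer than the host does.
affinity = os.sched_getaffinity(0)
print(f"usable CPUs: {len(affinity)}")

# cgroup-v2 CFS quota; cgroup-v1 hosts expose
# /sys/fs/cgroup/cpu/cpu.cfs_quota_us instead.
try:
    with open("/sys/fs/cgroup/cpu.max") as f:
        print("cpu.max:", f.read().strip())  # "max 100000" means no quota
except FileNotFoundError:
    print("no cgroup-v2 cpu.max file found")
```

If the pod shows a tighter quota or fewer usable CPUs than bare metal, the per-step "Data Loading Time" printed by the script should reveal where the extra 6% goes.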