I'm sorry to bother you, but I'm really confused.
I used the graph.txt file you generated on a single GPU. I swept the bandwidths (B1, B2) from 1GB to 30GB in steps of 500MB, but I can't get the same partitioning as your model.
I expected that after trying so many bandwidths I would get a model partitioned the same way as yours, but I can only get models with 3 stages. The key problem is that almost none of these models work. Could you share the specific parameters you used to divide the model into 2 stages for vgg16, gpus=16?
I just read your paper again and found your bandwidths: 25Gb between machines, and NVLINK (20-25GB) inside a machine. The results you get are as follows:
model:VGG16
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 17) 0.5142380000000001 8
(17, 23) 0.10629900000000003 5
(23, 29) 0.040430999999999995 2
(29, 40) 0.011566999999999772 1
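For anyone following along, here is a minimal sketch of how the replication factors in the listing above map onto the --stage_to_num_ranks value used in the next command. It only parses the four lines as printed; the names parse_stages and stage_to_num_ranks are hypothetical helpers for illustration and are not part of the repository.

```python
import re

# Partition summary lines as printed above:
# "(split start, split end) compute_time replication_factor"
PARTITION_OUTPUT = """\
(0, 17) 0.5142380000000001 8
(17, 23) 0.10629900000000003 5
(23, 29) 0.040430999999999995 2
(29, 40) 0.011566999999999772 1
"""

LINE_RE = re.compile(r"\((\d+),\s*(\d+)\)\s+(\S+)\s+(\d+)")


def parse_stages(text):
    """Hypothetical helper: return (start, end, compute_time, replication) per stage."""
    stages = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a stage line
        start, end, compute_time, replication = match.groups()
        stages.append((int(start), int(end), float(compute_time), int(replication)))
    return stages


def stage_to_num_ranks(stages):
    """Hypothetical helper: format replication factors as 'stage:ranks,...'."""
    return ",".join(f"{index}:{stage[3]}" for index, stage in enumerate(stages))


stages = parse_stages(PARTITION_OUTPUT)
ranks_arg = stage_to_num_ranks(stages)
total_ranks = sum(stage[3] for stage in stages)

print(ranks_arg)    # 0:8,1:5,2:2,3:1
print(total_ranks)  # 16
```

The replication factors add up to 16, which matches gpus=16.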
When I execute
python convert_graph_to_model.py -f vgg16_partitioned/gpus=16.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=16 --stage_to_num_ranks 0:8,1:5,2:2,3:1
it throws an error:
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 337, in train
    r.train(n)
  File "../runtime.py", line 395, in train
    num_iterations, forward_only=False)
  File "../communication.py", line 274, in start_helper_threads
    num_iterations=num_iterations)
  File "../communication.py", line 237, in num_iterations_for_helper_threads
    assert forward_num_iterations % self.num_ranks_in_next_stage == 0
AssertionError
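To make the failing condition concrete, here is a rough sketch of the arithmetic behind the assertion: an iteration count has to be a multiple of the number of ranks in the next stage. This is an illustration only. The function failing_boundaries is hypothetical, and it assumes the same iteration count is checked at every stage boundary; the actual runtime may derive forward_num_iterations differently for each stage.

```python
def failing_boundaries(num_iterations, ranks_per_stage):
    """Hypothetical check: list stage boundaries where num_iterations is not
    a multiple of the next stage's rank count.

    Assumes one shared iteration count for all stages, which may not match
    how the real runtime computes forward_num_iterations.
    """
    failures = []
    for stage in range(len(ranks_per_stage) - 1):
        next_ranks = ranks_per_stage[stage + 1]
        if num_iterations % next_ranks != 0:
            failures.append((stage, stage + 1, next_ranks))
    return failures


# Replication factors from --stage_to_num_ranks 0:8,1:5,2:2,3:1
ranks = [8, 5, 2, 1]

print(failing_boundaries(100, ranks))  # [] since 100 is a multiple of 5, 2 and 1
print(failing_boundaries(64, ranks))   # [(0, 1, 5)] since 64 % 5 == 4
```

Under these assumptions, a stage replicated 5 times is the likely trouble spot, because many common iteration counts are not multiples of 5.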
The optimizer command I ran, with B1 and B2 set to each bandwidth value I tried:
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b B1 B2