I can't get the same result as your model after segmentation #41

ADAM-CT · 2020-01-19T09:00:41Z

I'm sorry to bother you, but I'm really confused.
I used the graph.txt file generated on your single GPU and swept the bandwidths (B1, B2) from 1 GB to 30 GB in 500 MB steps, but I can't get the same result as your partitioned model.

I expected that, after trying so many bandwidths, at least one would give the same partitioning as yours, but I can only get models with 3 stages. The key problem is that almost none of these models work. Could you share the specific parameters you used to obtain the model divided into 2 stages for vgg16 with gpus=16?

`python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b B1 B2`
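The sweep described above can be sketched as a small script that enumerates the (B1, B2) grid and builds one command per pair. This is only an illustration: it assumes that `-b` takes values in bytes and that 1 GB means 10^9 bytes, neither of which is confirmed in this thread, and the per-run output directory name is hypothetical. It builds the argument lists without running anything.

```python
import itertools

GB = 10**9  # assumption: decimal gigabytes, values passed to -b in bytes
MB = 10**6


def bandwidth_grid(start=1 * GB, stop=30 * GB, step=500 * MB):
    """Return every bandwidth from start to stop (inclusive) in the given step."""
    return list(range(start, stop + 1, step))


def build_command(b1, b2):
    """Build the optimizer command from the issue for one (B1, B2) pair."""
    return [
        "python", "optimizer_graph_hierarchical.py",
        "-f", "../profiler/image_classification/profiles/vgg16/graph.txt",
        "-n", "8", "2",
        "--activation_compression_ratio", "1",
        # Hypothetical naming so that runs do not overwrite each other.
        "-o", f"vgg16_partitioned_b1={b1}_b2={b2}",
        "-b", str(b1), str(b2),
    ]


grid = bandwidth_grid()
commands = [build_command(b1, b2) for b1, b2 in itertools.product(grid, grid)]
print(len(grid), len(commands))
```

Each entry in `commands` could then be passed to `subprocess.run` from the optimizer directory, keeping the resulting partitions in separate folders for comparison.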


ADAM-CT · 2020-01-19T09:43:24Z

I just read your paper again and found your bandwidths: 25 Gb between machines, and NVLink (20-25 GB) within a machine. The results you get are as follows:
model: VGG16

(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 17) 0.5142380000000001 8
(17, 23) 0.10629900000000003 5
(23, 29) 0.040430999999999995 2
(29, 40) 0.011566999999999772 1

Total number of stages: 4

ADAM-CT · 2020-01-19T10:02:45Z

When I execute
`python convert_graph_to_model.py -f vgg16_partitioned/gpus=16.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=16 --stage_to_num_ranks 0:8,1:5,2:2,3:1`

it threw an error:
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 337, in train
    r.train(n)
  File "../runtime.py", line 395, in train
    num_iterations, forward_only=False)
  File "../communication.py", line 274, in start_helper_threads
    num_iterations=num_iterations)
  File "../communication.py", line 237, in num_iterations_for_helper_threads
    assert forward_num_iterations % self.num_ranks_in_next_stage == 0
AssertionError
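The failing assertion requires `forward_num_iterations` to be divisible by `num_ranks_in_next_stage`. A minimal sketch of that check for the `--stage_to_num_ranks` value above follows. The helper names are hypothetical, the iteration counts 100 and 101 are made up for illustration, and how the runtime actually derives `forward_num_iterations` is not shown in this thread, so this only mirrors the one condition visible in the traceback.

```python
from typing import List, Tuple


def parse_stage_to_num_ranks(spec: str) -> List[int]:
    """Turn a string like '0:8,1:5' into rank counts ordered by stage id."""
    pairs = [item.split(":") for item in spec.split(",")]
    ranks_by_stage = {int(stage): int(ranks) for stage, ranks in pairs}
    return [ranks_by_stage[stage] for stage in sorted(ranks_by_stage)]


def failing_boundaries(num_iterations: int,
                       ranks: List[int]) -> List[Tuple[int, int]]:
    """Return (stage, ranks_in_next_stage) pairs that violate divisibility.

    Assumption: every stage checks the same iteration count against the
    rank count of the stage that follows it.
    """
    failures = []
    for stage in range(len(ranks) - 1):
        next_ranks = ranks[stage + 1]
        if num_iterations % next_ranks != 0:
            failures.append((stage, next_ranks))
    return failures


ranks = parse_stage_to_num_ranks("0:8,1:5,2:2,3:1")
print(sum(ranks))                      # total ranks across all stages
print(failing_boundaries(100, ranks))  # divisible by 5, 2 and 1
print(failing_boundaries(101, ranks))  # not divisible by 5 or 2
```

Under this assumption, a next-stage replication factor of 5 makes the check fail for any iteration count that is not a multiple of 5, which is one plausible reason this configuration trips the assertion.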
