I can't get the same result as your model after segmentation #41

Open
ADAM-CT opened this issue Jan 19, 2020 · 2 comments

ADAM-CT commented Jan 19, 2020

I'm sorry to bother you, but I'm really confused.
I am using the graph.txt file generated by your single-GPU run. I swept the bandwidths (B1, B2) from 1GB to 30GB in steps of 500MB, but I can't get the same partitioning as your model.

I expected that, after trying so many bandwidths, at least one would give a model partitioned the same way as yours, but I only ever get models with 3 stages. The key problem is that almost none of these models work. So I would like to ask for the specific parameters you used to divide the model into 2 stages for vgg16, gpus=16.

python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b B1 B2
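
The sweep described above can be sketched as follows. This is only an illustration of enumerating the (B1, B2) grid and building one command line per pair; it does not run the optimizer. It uses only the flags that appear in the command above. The unit passed to -b (bytes per second here, with 1GB taken as 10**9 bytes) and the output directory naming are assumptions, not something confirmed by the project.

```python
from itertools import product

GB = 10**9  # ASSUMPTION: decimal gigabytes; the optimizer's expected unit is not confirmed
MB = 10**6


def bandwidth_grid(start=1 * GB, stop=30 * GB, step=500 * MB):
    """Return bandwidth values from start to stop (inclusive) in fixed steps."""
    return list(range(start, stop + 1, step))


def build_command(b1, b2, out_dir):
    """Build the argv list for one optimizer run with bandwidths b1 and b2."""
    return [
        "python", "optimizer_graph_hierarchical.py",
        "-f", "../profiler/image_classification/profiles/vgg16/graph.txt",
        "-n", "8", "2",
        "--activation_compression_ratio", "1",
        "-o", out_dir,  # hypothetical naming scheme, one directory per pair
        "-b", str(b1), str(b2),
    ]


grid = bandwidth_grid()
commands = [
    build_command(b1, b2, f"vgg16_partitioned_b1={b1}_b2={b2}")
    for b1, b2 in product(grid, grid)
]
print(len(grid), "values per bandwidth,", len(commands), "runs in total")
```

Each entry in commands could be passed to subprocess.run to execute the corresponding optimizer run and collect the resulting partition files.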


ADAM-CT commented Jan 19, 2020

I just read your paper again and found your bandwidths: the bandwidth between your machines is 25Gb, and the bandwidth inside each machine is NVLink (20-25GB). The results you get are as follows:
Model: VGG16

(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 17) 0.5142380000000001 8
(17, 23) 0.10629900000000003 5
(23, 29) 0.040430999999999995 2
(29, 40) 0.011566999999999772 1

Total number of stages: 4
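
As a sanity check on the listing above, the replication factors should add up to the 16 GPUs, and they map directly onto a stage:ranks string of the form used by --stage_to_num_ranks. A minimal sketch, assuming the summary lines always have the shape "(start, end) time replication"; the helper name parse_stages is made up for this example:

```python
stage_summary = """\
(0, 17) 0.5142380000000001 8
(17, 23) 0.10629900000000003 5
(23, 29) 0.040430999999999995 2
(29, 40) 0.011566999999999772 1
"""


def parse_stages(text):
    """Return a list of (start, end, compute_time, replication) tuples."""
    stages = []
    for line in text.strip().splitlines():
        # Split "(0, 17) 0.51 8" into the span "(0, 17" and the rest " 0.51 8".
        span, rest = line.split(")", 1)
        start, end = (int(x) for x in span.lstrip("(").split(","))
        compute_time, replication = rest.split()
        stages.append((start, end, float(compute_time), int(replication)))
    return stages


stages = parse_stages(stage_summary)

# Total GPUs used is the sum of the per-stage replication factors.
total_gpus = sum(stage[3] for stage in stages)

# Stage index to replication factor, comma separated.
stage_to_num_ranks = ",".join(
    f"{index}:{stage[3]}" for index, stage in enumerate(stages)
)

print(total_gpus)
print(stage_to_num_ranks)
```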


ADAM-CT commented Jan 19, 2020

When I execute
python convert_graph_to_model.py -f vgg16_partitioned/gpus=16.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=16 --stage_to_num_ranks 0:8,1:5,2:2,3:1

it throws an error:
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 337, in train
    r.train(n)
  File "../runtime.py", line 395, in train
    num_iterations, forward_only=False)
  File "../communication.py", line 274, in start_helper_threads
    num_iterations=num_iterations)
  File "../communication.py", line 237, in num_iterations_for_helper_threads
    assert forward_num_iterations % self.num_ranks_in_next_stage == 0
AssertionError
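
The failing assertion requires forward_num_iterations to be a multiple of num_ranks_in_next_stage. With a stage layout of 8, 5, 2, 1 ranks, an iteration count that is not divisible by 5 would trip it at the first stage boundary. The sketch below only illustrates that divisibility condition. It assumes forward_num_iterations is the plain iteration count at the point of the check, which the traceback does not show, and the helper names are made up for this example; it is not the runtime's own logic.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def failing_next_stage_ranks(num_iterations, ranks_per_stage):
    """Return the next-stage rank counts that do not divide num_iterations."""
    next_stage_ranks = ranks_per_stage[1:]  # stage i checks against stage i + 1
    return [r for r in next_stage_ranks if num_iterations % r != 0]


def round_down_iterations(num_iterations, ranks_per_stage):
    """Largest count <= num_iterations divisible by every next-stage rank count."""
    multiple = reduce(lcm, ranks_per_stage[1:], 1)
    return (num_iterations // multiple) * multiple


ranks = [8, 5, 2, 1]  # replication factors from --stage_to_num_ranks 0:8,1:5,2:2,3:1
print(failing_next_stage_ranks(1001, ranks))
print(round_down_iterations(1001, ranks))
```

Under these assumptions, any iteration count that is a multiple of 10 would satisfy the check at every stage boundary of this layout.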
