Skip to content

Commit

Permalink
Feat general basic communication (#8437)
Browse files Browse the repository at this point in the history
* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* warp dim util (#8382)

* warp dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input(jobname make it  it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py )

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* pulish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and boardcast_addOp conversion

* add matmulOp conversion

* support converting normailzation op to tosa(partically)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add indentity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename inferface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* nccl send/recv support different placement

* refine

* auto format by CI

* rm out ctrl

* auto format by CI

Co-authored-by: guo-ran <[email protected]>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: ZZK <[email protected]>
Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: Juncheng <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Yao Zihang <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>
Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>

* Support different hierarchy

* Merge branch 'master' into feat-general_basic_communication (#8477)

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* warp dim util (#8382)

* warp dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input(jobname make it  it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py )

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* pulish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and boardcast_addOp conversion

* add matmulOp conversion

* support converting normailzation op to tosa(partically)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add indentity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename inferface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* Fix eval error in FusedMLP (#8413)

Fix eval error

* Init NCCL communicator in graph mode unifiedly (#8263)

* centralized comm init

* address review

* revert

* rename

* ref nccl logical send recv

* fix cpu only

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix dim_scatter 0-dim tensor bug (#8418)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* target based external libraries (#8421)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine hardcoded attr setting/getting in ir (#8420)

* use names in trait static func

* more changes on op name attr

* use wrapped func

* Replace cu115 with cu116 in nightly (#8423)

update workflows

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix repeat interleave 0-size tensor bug (#8414)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Autotest support print input in ci (#8383)

* support print tensor value in autotest to provide more details in ci

* revert

* refine

* auto format by CI

* control precision to 1e-5 when record

* fix bug

* auto format by CI

* relax tensor_size_mb

* fix bug

* fix bug

* refine

* releax

* refinew

* refine

* fix bug

* relax

* refine

* restruct

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Modify sbp.split()'s karg: axis to dim (#8411)

* Modify sbp.split()'s axis karg to dim

* Refine

* Refine

* Refine

* Refine

* Feat/graph logical op debug repr (#8131)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* add module config

* save nn.Module info in job.proto for better debugging

* add new line

* add ModuleBlock.ops_proto() API

* zero use stage 2

* print operators' info when print ModuleBlock

* handle VariableOpConf

* update

* update

* fix

* move operators repr method to graph util

* add limit consumer api

* add new api

* refine zero s select

* add module block

* fix

* refact for rm op in module conf

* fix

* add sbp debug

* add sbp repr

* add shape

* refine

* add sys op in repr

* add full op debug

* fix index out of range

* rm zero limit on device type

* add no scope op to graph

* zero test with activation checkpointing

* fix order

* add indentity when dp sequence len is 1

* add debug repr

* refine repr of op

* refine and fix

* rm useless log

* move to base with master

* fix

* fix

* fix

* fix proto

* refine test

* fix type

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* refine

* restore test

* refine pass and mem debug

* merge master

* repr dtype

* add placement

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

* fix merge

* auto format by CI

* auto format by CI

* refine get job api

* refine graph util import order

* auto format by CI

* fix static check

* auto format by CI

* fix special case

* refine level print and add full dtype repr

* rm useless

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rm some test case in test_fused_dot_feature_interaction_pooling_sum (#8425)

rm some case in test

* Remove unused linkages (#8426)

remove unused linkages

* refactor stride (#8402)

* Stride inherits DimVector

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix argument type of OFStrideToNumpyStride

Signed-off-by: daquexian <[email protected]>

Co-authored-by: oneflow-ci-bot <[email protected]>

* Move Tensor.__setitem__  and global related api to Python/C api (#8375)

* add local_to_global, global_to_global, to_global. global_to_global still have bugs

* fix bug of global_to_global

* remove python api

* add setitem

* remove local_to_global sbp pack, format code

* format code

* remove redundant code

* add error msg, refine check of to_global

* fix bug of check

* add error msg

* fix clang static check error

* remove useless api in tensor.py, remove redundant code, remove useless CHECK

* add to_local

* fix wrong exception type in unittest for to_local exception message

* cuda add default error msg (#8427)

default error

Co-authored-by: Shenghang Tsai <[email protected]>

* Refactor ShapeView (#8422)

* update

Signed-off-by: daquexian <[email protected]>

* update and add docs

Signed-off-by: daquexian <[email protected]>

* turn on view slice (#8302)

* turn_on_view_slice

* inplace scalar math hnandle non-contiguous input

* fix clang check

* add docs

* refactor

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Add flow env init rdma api (#8415)

* add_flow_env_init_rdma_api

* adjust persistent_workers logic for RDMA support

* adjust persistent_workers logic for RDMA support

* add rmda_inited api

* minro fix

* add docs

* Update python/oneflow/utils/data/dataloader.py

Co-authored-by: daquexian <[email protected]>

* fix typo

* refine

* fix RDMAIsInitialized

* minor fix

* refine

* rename InitRdma to InitRDMA

* refine

Co-authored-by: Flowingsun007 <[email protected]>
Co-authored-by: daquexian <[email protected]>

* add 1d send recv in nccl logical (#8355)

* add 1d send recv in nccl logical

* Update insert_nccl_logical_op_pass.cpp

* auto format by CI

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support iree ci (#8419)

* create mlir cpu and modify build gcc 7 shell script

* fix the bug of test_iree_resnet.py cuda test in cpu version error

* fix constant folding tests

* suport oneflow_test_cpu_only

* pub

* build script add flag

* modify test yml

* add python3 into \PATH

* don't use pretrain model

* install flowvision

Co-authored-by: mosout <[email protected]>
Co-authored-by: jackalcooper <[email protected]>

* Feat straighten task nodes (#8347)

* Add a fast topological traversal

* Add an initial implementation of straighen nodes

* Add the straighen nodes algorithm

* Change algorithm structure

* Remove some debug information

* Finalize the straighten algorithm after
deciding the parameters by experiments

* Notify the usage of straighten algorithm

* Of format

* Update oneflow/core/graph/straighten_nodes.cpp

Of format

Co-authored-by: daquexian <[email protected]>

* Of format

* Stop using visual string before we find a better key

* Remove magic numbers and Of format

* Remove starts

* Of format

* Fix a bug of using GetMaxVal<int32_t>() as an
initial number for comparing

* Refactor add straighten algo interface (#8435)

* feat(*): export straighten nodes algorithm inferface

* export documentation

* Update python/oneflow/nn/graph/graph_config.py

Co-authored-by: Yipeng Li <[email protected]>

Co-authored-by: Yipeng Li <[email protected]>

* Use TopoForEachNodeFast as default. (#8436)

* Use TopoForEachNodeFast as default.
Rename the original one as TopoForEachNodeDynamic

* Speed up TopoForEachNodeFast when traversing a subgraph

* Rename the switch and code clean up

* Hide the class TopoStruct

* Hide all the other functions

* Grammar

* Of format

Co-authored-by: daquexian <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>

* Refactor NLLLoss to support split class dim (#8380)

* refactor

* RuntimeError

* avoid atomic add

* test

* fixes

* update test

* update test

* update test

* fix kernel

* improve backward

* update test

* out_weight to be required

* address static analysis errer

* fix static analysis error

* fix static analysis error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Strict ordering in memory reuse algorithm (#8441)

* Support broadcast in fused_softmax kernel (#8321)

* support broadcast

* refine

* Remove shape check

* fix sbp when broadcast

* rollback softmax grad threshold

* increase threshold of test conv bn folding

* tol to 1e-2

* check error msg of fuse softmax ops

* add more dispatch

* remove double datatype test and add broadcast test

Co-authored-by: cheng cheng <[email protected]>

* Merge slice and logical slice (#8416)

* remove Slice, SliceUpdate, SliceGrad op

* rename logical_slice to slice and logical_slice_assign to slice_update

* move gradient_func logical_slice.cpp to slice.cpp

* fix some bug and refine local test

* feat(SliceUpdate): support 0size tensor

* test(Slice): refine consistent slice test

* test(SliceUpdate): refine consistent slice_update test

* not export slice_update's inplace parameter

* auto format by CI

* recovery slice_grad_op

* fix slice_view bug

* add error message and attr judgement

* modified old test

* auto format by CI

* update test README

* update tensor_string code

* fix test bug

* auto format by CI

* fix(hsplit): hsplit functor bug

* fix vsplit doc test bug

* refine

* fix test

* fix pin_memory bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Graph block.config.set_stage() for recommended Pipeline api. (#8442)

* Graph block.config.set_stage() for recommended Pipeline api.

* revert diff

* refine api doc

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Update PolynomialLR's doc and paramater (#8430)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* * update PolynomialLR doc, current_batch = min(decay_batch, current_batch)
* rename the steps to decay_batch in parameters

* update PolynomialLR test case

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add mv op (#8445)

* add mv op with bug that Int is incompatible

* add test

* update test_mv.py

* fix based on comments

* fix based on comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable oneflow_iree(python package) and corresponding test works in ci (#8431)

* update test.yml

* add pytest for oneflow_iree examples

* add oneflow frontend test

* Dev tensor is pinned api (#8447)

* support tensor.is_pinned

* add test case

* add docs

* auto format by CI

* refine

* auto format by CI

* refine

* auto format by CI

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>

* Nd sbp tensor str (#8458)

* nd sbp tensor str

* add nd sbp tensor str test

* bigger input size

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Patch sbp cost (#8378)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Add the slight penalty for eager

* Consider B -> (B, B) for a scalar

* Do not consider parallel description in priority ratio

* Of format

* Fix a bug in the old version group boxing with 2D SBP (#8448)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Fix bugs of patch-sbp_cost (#8456)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Reduce to uniform B for 1 device.
Use the actual parallel description for each tensor

* Fix a bug of fix-group_boxing-bug

* Group boxing reduce [2, 2]: (S0, S0) to [4]: S0,
then we might infer a 1D SBP from a 2D SBP hint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: cheng cheng <[email protected]>

* Decouple stream and instruction (#7607)

* remove deprecated python api

* backup code

* backup code

* fix compiler complaints

* fix typo in refactoring

* kMockDevice

* add unit test test_mock.py

* revert mock kernels

* vert DEVICE_TYPE_SEQ

* mock placement

* address pr comments

* register device kCriticalSectionDevice and kLazyJobLauncher

* kControlDevice

* Stream::vm_stream_

* fix compiler complaints

* backup code

* rename StreamIsTransport to IsCommNetStream

* decouple vm::StreamType and vm::InstructionType

* fix compiler complaints

* remove 'gpu' related code

* address static analyzer complaints

* address static analyzer complaints

* remove unused module in test_mock.py

* the Env is never destroyed.

* export Env into python

* more unittests

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* fix oneflow.placement.__str__

* revert GlobalSync

* init_producer_stream in oneflow.from_numpy

* debug code for vm

* init disable_vm_threads_ in VirtualMachine::VirtualMachine

* Update oneflow/core/vm/virtual_machine.h

Co-authored-by: daquexian <[email protected]>

* create stream in forked subprocesses.

* refactor StreamRoleSwitch to StreamRoleVisistor

* ThreadLocalGuard

* auto format by CI

* fix compiler complaints

* fix static analyzer complaints

* VirtualMachine::GetVmStream

* fix static analyzer complaints

* reimplement AddAndReadVector by std::deque

* reimplement AddAndReadVector

* merge master

* increase atol for test_consistent_rnn_cell.py

* StreamRole::AsyncLaunchedCommNet is bound to EventRecordedCudaStreamType

* auto format by CI

* remove StreamRoleVisitor<T>::VisitInvalid

* no copy in AddAndReadVector

* fix bug of AddAndReadVector::size_

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix AddAndReadVector::GetGranularity

* remove bad unittest

* auto format by CI

* rename CallInstructionType to OpCallInstructionType

* sta…
  • Loading branch information
1 parent 146288e commit 73f84df
Show file tree
Hide file tree
Showing 29 changed files with 1,768 additions and 175 deletions.
270 changes: 216 additions & 54 deletions oneflow/core/auto_parallel/boxing_collector.cpp

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions oneflow/core/auto_parallel/boxing_collector.h
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,15 @@ class BoxingCollector final {
BoxingCollector* boxing_collector_producer,
BoxingCollector* boxing_collector_consumer,
const std::vector<std::vector<int32_t>>& diag_nodes);
// Ask for sbp combination for general basic communication
Maybe<void> AskSbpCombination4GeneralBasicCommunication(
const NdSbp& sbp_producer, const NdSbp& sbp_consumer, const BlobDesc& logical_blob_desc,
const ParallelDesc& producer_parallel_desc, const ParallelDesc& consumer_parallel_desc,
std::vector<NdSbp>& middle_sbps, int32_t* diag_node_pos);
// Ask for a all-split sbp which is closed to the original one
Maybe<void> AskCloseAllSplitSbp(const NdSbp& nd_sbp, const ParallelDesc& parallel_desc,
const BlobDesc& logical_blob_desc,
std::vector<NdSbp>& middle_sbps);
// Stores all the possible SbpParallel.
HashMap<::oneflow::SbpParallel, int32_t> sbp_parallel_universe_;
// Relationship between id and Sbp Parallel
Expand All @@ -154,6 +163,11 @@ class BoxingCollector final {
std::vector<int32_t> id_1d_2_nd_;
// The sbp size in the combination table
int32_t hierarchy_num_;
// How the boxing collector is initialized
int32_t init_type_ = -1;
// Enable general basic communication or not
const bool enable_general_basic_communication =
ParseBooleanFromEnv("ONEFLOW_BOXING_ENABLE_GENERAL_BASIC_COMMUNICATION", false);
}; // class BoxingCollector

} // namespace oneflow
Expand Down
177 changes: 173 additions & 4 deletions oneflow/core/framework/sbp_infer_util.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,15 @@ limitations under the License.
#include "oneflow/core/framework/sbp_infer_util.h"
#include "oneflow/core/auto_parallel/boxing_collector.h"
#include "oneflow/core/boxing/eager_boxing_interpreter_mgr.h"
#include "oneflow/core/common/device_type.pb.h"
#include "oneflow/core/common/nd_index_offset_helper.h"
#include "oneflow/core/common/util.h"
#include "oneflow/core/job/global_for.h"
#include "oneflow/core/job/lazy_mode.h"
#include "oneflow/core/job/nd_sbp_util.h"
#include "oneflow/core/job/parallel_desc.h"
#include "oneflow/core/job/resource_desc.h"
#include "oneflow/core/job/sbp_parallel.pb.h"

namespace oneflow {

Expand Down Expand Up @@ -55,6 +61,15 @@ double Penalty4PartialInConsumer(double logical_blob_size, int32_t producer_para
}
}

int32_t Ratio4Sbp(const NdSbp& nd_sbp, const ParallelDesc& parallel_desc,
const std::function<bool(const SbpParallel&)>& classifier) {
int32_t ratio = 1;
for (int32_t sbp_id = 0; sbp_id < nd_sbp.sbp_parallel_size(); sbp_id++) {
if (classifier(nd_sbp.sbp_parallel(sbp_id))) { ratio *= parallel_desc.hierarchy()->At(sbp_id); }
}
return ratio;
}

Maybe<double> ComputCopyCostBetweenTwoSbpParallel(const SbpParallel& producer_sbp_parallel,
const SbpParallel& consumer_sbp_parallel,
const BlobDesc& logical_blob_desc,
Expand Down Expand Up @@ -409,6 +424,16 @@ void CollaborativeParallelDimReduce(const ParallelDesc& in_parallel_desc,

} // namespace

int32_t PartialRatio4Producer(const NdSbp& sbp_producer,
const ParallelDesc& producer_parallel_desc) {
return Ratio4Sbp(sbp_producer, producer_parallel_desc, &SbpParallel::has_partial_sum_parallel);
}

int32_t BroadcastRatio4Consumer(const NdSbp& sbp_consumer,
const ParallelDesc& consumer_parallel_desc) {
return Ratio4Sbp(sbp_consumer, consumer_parallel_desc, &SbpParallel::has_broadcast_parallel);
}

void NdSbpDimReduce(const ParallelDesc& parallel_desc, const NdSbp& nd_sbp,
ParallelDesc* reduced_parallel_desc, NdSbp* reduced_nd_sbp) {
const auto& hierarchy = parallel_desc.hierarchy();
Expand Down Expand Up @@ -496,14 +521,31 @@ Maybe<double> ComputeLazyCopyCostBetweenNdSbp(const NdSbp& producer_sbp_parallel
reduced_in_nd_sbp.sbp_parallel(0), reduced_out_nd_sbp.sbp_parallel(0),
logical_blob_desc, reduced_in_parallel_desc, reduced_out_parallel_desc));
}
// Not supporting different hierarchy
// TODO: Support it in the future

#ifdef WITH_CUDA
static const bool enable_general_basic_communication =
ParseBooleanFromEnv("ONEFLOW_BOXING_ENABLE_GENERAL_BASIC_COMMUNICATION", false);
// Use a general basic communication if no P in the consumer
if ((((Singleton<ResourceDesc, ForSession>::Get()->nccl_use_compute_stream()
&& producer_parallel_desc == consumer_parallel_desc)
|| enable_general_basic_communication)
&& !NdSbpHasPartialParallel(consumer_sbp_parallel))
&& producer_parallel_desc.device_type() == DeviceType::kCUDA
&& consumer_parallel_desc.device_type() == DeviceType::kCUDA) {
return Cost4GeneralBasicCommunication(producer_sbp_parallel, consumer_sbp_parallel,
logical_blob_desc, producer_parallel_desc,
consumer_parallel_desc)
+ GetTransferCost();
}
#endif // WITH_CUDA

// Not supporting different hierarchy without general basic communication
if (in_hierarchy->elem_cnt() != out_hierarchy->elem_cnt()) { return kUnsupportedBoxing; }

double logical_blob_size =
logical_blob_desc.shape().elem_cnt() * GetSizeOfDataType(logical_blob_desc.data_type());
bool on_same_devices =
reduced_in_parallel_desc.EqualsIgnoringHierarchy(reduced_out_parallel_desc);
double logical_blob_size =
logical_blob_desc.shape().elem_cnt() * GetSizeOfDataType(logical_blob_desc.data_type());

if (in_dim == 2 && out_dim == 2) {
// Not supporting different hierarchy
Expand Down Expand Up @@ -629,6 +671,39 @@ Maybe<double> ComputeCopyCostWithMiddleNodes(const NdSbp& producer_sbp_parallel,
const ParallelDesc& producer_parallel_desc,
const ParallelDesc& consumer_parallel_desc,
bool requires_same_sbp) {
// Reduce before cost computation
ParallelDesc reduced_in_parallel_desc = producer_parallel_desc;
NdSbp reduced_in_nd_sbp;
NdSbpDimReduce(producer_parallel_desc, producer_sbp_parallel, &reduced_in_parallel_desc,
&reduced_in_nd_sbp);

ParallelDesc reduced_out_parallel_desc = consumer_parallel_desc;
NdSbp reduced_out_nd_sbp;
NdSbpDimReduce(consumer_parallel_desc, consumer_sbp_parallel, &reduced_out_parallel_desc,
&reduced_out_nd_sbp);
// In 90% of the transfer, we would have the same parallel description for producer and consumer
// We need to speed it up and give an approximation of the cost
if (reduced_in_parallel_desc == reduced_out_parallel_desc
&& reduced_in_nd_sbp == reduced_out_nd_sbp) {
return 0.0;
}
#ifdef WITH_CUDA
static const bool enable_general_basic_communication =
ParseBooleanFromEnv("ONEFLOW_BOXING_ENABLE_GENERAL_BASIC_COMMUNICATION", false);
// Use a general basic communication if no P in the consumer
if ((((Singleton<ResourceDesc, ForSession>::Get()->nccl_use_compute_stream()
&& producer_parallel_desc == consumer_parallel_desc)
|| enable_general_basic_communication)
&& !NdSbpHasPartialParallel(consumer_sbp_parallel))
&& producer_parallel_desc.device_type() == DeviceType::kCUDA
&& consumer_parallel_desc.device_type() == DeviceType::kCUDA) {
return Cost4GeneralBasicCommunication(producer_sbp_parallel, consumer_sbp_parallel,
logical_blob_desc, producer_parallel_desc,
consumer_parallel_desc)
+ GetTransferCost();
}
#endif // WITH_CUDA

// Initialize boxing collector
constexpr int32_t kRegularMaxSplitAxes = 6;
static thread_local BoxingCollector boxing_collector(kRegularMaxSplitAxes);
Expand Down Expand Up @@ -727,4 +802,98 @@ double ComputeSbpInferPriority(const NdSbp& producer_nd_sbp, const NdSbp& consum
}
}

// The transfer ratio for general basic communication
// Cost = ratio * data amount
// When we get the this function, either producer_sbp_parallel != consumer_sbp_parallel
// or producer_parallel_desc != consumer_parallel_desc
double Cost4GeneralBasicCommunication(const NdSbp& producer_sbp_parallel,
const NdSbp& consumer_sbp_parallel,
const BlobDesc& logical_blob_desc,
const ParallelDesc& producer_parallel_desc,
const ParallelDesc& consumer_parallel_desc) {
// The upper bound of the amount of the transferred data
int32_t producer_partial_ratio =
PartialRatio4Producer(producer_sbp_parallel, producer_parallel_desc);
int32_t consumer_broadcast_ratio =
BroadcastRatio4Consumer(consumer_sbp_parallel, consumer_parallel_desc);
// More intersection on the same devices
bool on_same_devices = producer_parallel_desc.EqualsIgnoringHierarchy(consumer_parallel_desc);
// approximate intersection ratio
double intersection_ratio = 1.0;
// (?, P, ?)->(Si, Sj)->(?, B, ?), two-step transfer
if (producer_partial_ratio > 1 && consumer_broadcast_ratio > 1) {
if (on_same_devices) {
// Pure P in the producer or B in the consumer
// (P, P, P) -> ? or ? -> (B, B)
if (producer_partial_ratio == producer_parallel_desc.parallel_num()
|| consumer_broadcast_ratio == consumer_parallel_desc.parallel_num()) {
// There some cases which is not applicable to this ratio
// We just take the one with the largest possibility
// For example: (P, S0) -> (B, B) for 1-D blob with machine hierarchy [n, m]
// The path should be (P, S0) -> (S0, S0) -> (B, B)
// true intersection ratio = 1/m + 1
intersection_ratio = 2.0;
} else {
// sbp_consumer = (B, Si) or (Si, B)
for (int32_t sbp_id = 0; sbp_id < std::min(producer_sbp_parallel.sbp_parallel_size(),
consumer_sbp_parallel.sbp_parallel_size());
sbp_id++) {
if (consumer_sbp_parallel.sbp_parallel(sbp_id).has_split_parallel()) {
const auto& producer_sbp4sbp_id = producer_sbp_parallel.sbp_parallel(sbp_id);
// (B, P) or (Si, P) -> (Si, B)
// (P, B) or (P, Si) -> (B, Si)
if (producer_sbp4sbp_id.has_broadcast_parallel()
|| producer_sbp4sbp_id == consumer_sbp_parallel.sbp_parallel(sbp_id)) {
intersection_ratio = 2.0;
break;
}
}
}
// Judge whether the intersection ratio is given a value (2.0)
if (intersection_ratio == 1.0) {
// The true intersection ratio range from 0 to 2,
// we just take a middle point of the range as the approximation
// For example: (P, S0) -> (S0, B), Path: (P, S0) -> (S1, S0) -> (S0, B)
// true intersection ratio = 1 + 1/m
// For example: (P, S0) -> (S1, B), Path: (P, S0) -> (S1, S0) -> (S1, B)
// true intersection ratio = 1 + 1
// For example: (P, S0) -> (B, S0), with a 1D blob
// true intersection ratio = (n+p-1)/nm + (n+p-1)/nm
// For example: (S0, P) -> (B, S0), Path: (S0, P) -> (S0, S1) -> (B, S0)
// true intersection ratio = 1 + 1/n

// We use the approximation 1 + (1/n + 1/m)/2
intersection_ratio = 1.0 + 0.5 / producer_parallel_desc.hierarchy()->At(0)
+ 0.5 / producer_parallel_desc.hierarchy()->At(1);
}
}
}
// Otherwise, on different devices
// intersection_ratio = 1.0;
} else {
// No P in the producer or no B in the consumer, one-step transfer
if (on_same_devices) {
// We use simulation for nD sbp with n=1,2,3,...
TensorSliceView in_second_slice =
GetTensorSliceView4ParallelId(*producer_parallel_desc.hierarchy(), producer_sbp_parallel,
logical_blob_desc.shape(), /*parallel_id=*/1);
TensorSliceView out_second_slice =
GetTensorSliceView4ParallelId(*consumer_parallel_desc.hierarchy(), consumer_sbp_parallel,
logical_blob_desc.shape(), /*parallel_id=*/1);
const TensorSliceView& intersection = in_second_slice.Intersect(out_second_slice);
// The intersection ratio is design for two steps.
// However, we only have one step here, we would increase the ratio by 1.0
// to eliminate the unused step
intersection_ratio += std::min(
1.0, (double)(intersection.shape().elem_cnt() * producer_parallel_desc.parallel_num())
/ logical_blob_desc.shape().elem_cnt());
}
// Otherwise, on different devices
// intersection_ratio = 1.0;
}
// Subtract the intersection part
return (producer_partial_ratio + consumer_broadcast_ratio - intersection_ratio)
* logical_blob_desc.shape().elem_cnt() * GetSizeOfDataType(logical_blob_desc.data_type());
}

} // namespace oneflow
18 changes: 18 additions & 0 deletions oneflow/core/framework/sbp_infer_util.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,16 @@ enum Penalty4PartialInConsumerTag : int {
kStrict = 3 // Not allow a transfer to P
};

// [2, 3, 4, 5, 9, 100, 8]: (P, S0, P, P, B, S1, P)
// partial ratio = 2 * 4 * 5 * 8
int32_t PartialRatio4Producer(const NdSbp& sbp_producer,
const ParallelDesc& producer_parallel_desc);

// [2, 3, 4, 5, 9, 100, 8]: (P, S0, B, P, B, S1, P)
// broadcast ratio = 4 * 9
int32_t BroadcastRatio4Consumer(const NdSbp& sbp_consumer,
const ParallelDesc& consumer_parallel_desc);

void NdSbpDimReduce(const ParallelDesc& parallel_desc, const NdSbp& nd_sbp,
ParallelDesc* reduced_parallel_desc, NdSbp* reduced_nd_sbp);

Expand Down Expand Up @@ -96,6 +106,14 @@ double ComputeSbpInferPriority(const NdSbp& producer_sbp_parallel,
const ParallelDesc& producer_parallel_desc,
const ParallelDesc& consumer_parallel_desc, bool requires_same_sbp);

// The transfer ratio for general basic communication
// Cost = ratio * data amount
double Cost4GeneralBasicCommunication(const NdSbp& producer_sbp_parallel,
const NdSbp& consumer_sbp_parallel,
const BlobDesc& logical_blob_desc,
const ParallelDesc& producer_parallel_desc,
const ParallelDesc& consumer_parallel_desc);

} // namespace oneflow

#endif // ONEFLOW_CORE_FRAMEWORK_SBP_INFER_UTIL_H_
Loading

0 comments on commit 73f84df

Please sign in to comment.