[just for review]Release/mt5 opt #9318

Closed
wants to merge 87 commits into from
87 commits
972dfee
default nccl use compute stream in grad acc
chengtbf Aug 11, 2022
f27e2ec
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
chengtbf Aug 16, 2022
5c19afa
rm sharable mem block graph
chengtbf Aug 16, 2022
e08a79a
half implement of LogicalChains
chengtbf Aug 17, 2022
d9a5c82
part-0 : Logical Chain
chengtbf Aug 18, 2022
3d996b0
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
chengtbf Aug 18, 2022
d3c9b09
fix compile
chengtbf Aug 18, 2022
6f7ed2c
logical chain runnable
chengtbf Aug 19, 2022
dccee3f
fix bug of logical chain dp
chengtbf Aug 19, 2022
2e750d3
Part 1 : AfterGradAccChain
chengtbf Aug 20, 2022
42e6f86
fix bug of crash in acc chain infer
chengtbf Aug 22, 2022
7c35e1a
AccCtrlTick Op/Task/Actor/Pass
chengtbf Aug 26, 2022
23c7721
tmp
chengtbf Aug 26, 2022
f32d247
AccCtrlTick runnable
chengtbf Aug 26, 2022
d6c1760
rename group boxing identity and model diff scale op name
chengtbf Aug 26, 2022
c15cbf0
strict order by acc tick
chengtbf Aug 26, 2022
98e7856
Merge branch 'dev_cc_acc_mem_by_acc_ctrl_tick' of https://github.com/…
chengtbf Aug 26, 2022
c4ce8fb
merge mem block by logical chain id group
chengtbf Aug 30, 2022
adad5c6
fix conflict
chengtbf Sep 4, 2022
e4087e3
fix user op register
chengtbf Sep 4, 2022
10a9755
fix GLOG error when no grad acc
chengtbf Sep 4, 2022
a1c71bc
Inplace repeat variable
chengtbf Sep 4, 2022
cb31e53
Inplace repeat support consumed/produced ctrl regst
chengtbf Sep 6, 2022
f05b963
Part-4: merge acc op in to chain for reuse memory acc input (#9071)
chengtbf Sep 8, 2022
f5f852d
find first source/sink op in acc chain which can be insert ctrl
chengtbf Sep 9, 2022
5d98216
TryMergeAfterAccLogicalChainToFirstLogicalChain
chengtbf Sep 9, 2022
55db36d
remove debug log
chengtbf Sep 9, 2022
317e4a5
rm old version repeat kernel
chengtbf Sep 9, 2022
2e56c11
fix format
chengtbf Sep 9, 2022
5d83933
MergeChainByLogicalChainId/PhysicalTaskGraph
chengtbf Sep 9, 2022
0e7eb72
IsValidChainId
chengtbf Sep 9, 2022
0459319
rm useless file
chengtbf Sep 9, 2022
e554b89
remove note
chengtbf Sep 9, 2022
3d5e919
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
chengtbf Sep 9, 2022
db24bad
fix clang-tidy
chengtbf Sep 10, 2022
7fa2aaf
Merge branch 'master' into dev_cc_acc_mem_v5
chengtbf Sep 13, 2022
1062ac2
fix conflict
chengtbf Sep 14, 2022
8b97608
more IsValidChainId
chengtbf Sep 14, 2022
679d1ac
Merge branch 'master' into dev_cc_acc_mem_v5
chengtbf Sep 14, 2022
448f5e9
rm debug log
chengtbf Sep 14, 2022
5a2ff2b
rm note
chengtbf Sep 14, 2022
680cb0d
fix bug of cpu repeat inplace var bug
chengtbf Sep 14, 2022
70b7e6c
fix bug of memory reuse for 0-size regst in time line algo
chengtbf Sep 15, 2022
2f3b2ae
fix bug of acc chain merge mem guard
chengtbf Oct 9, 2022
4aae0e7
fix conflicts and merge master
chengtbf Oct 9, 2022
724fe49
reuse cast to tick op
chengtbf Oct 11, 2022
54cb129
fix bug of acc different stream hint cause sync backward compute
chengtbf Oct 11, 2022
0527ce1
actor name log
chengtbf Oct 11, 2022
685948a
fix for review
chengtbf Oct 11, 2022
fe49a47
remove log
chengtbf Oct 11, 2022
635453f
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
chengtbf Oct 11, 2022
54c7e4a
fix note
chengtbf Oct 11, 2022
ff85c5b
fix bug of connect to cast to tick op
chengtbf Oct 17, 2022
c8ea5f9
merge master
chengtbf Oct 17, 2022
ee1f717
refactor(RanddomOp): refactor random op with consistent data
wyg1997 Oct 24, 2022
d7c11bc
Add a GetSbpSignature with use parallel num
Yipeng1994 Oct 24, 2022
f30f29d
Get sbp_sig_list for each dimension of hierarchy
Yipeng1994 Oct 24, 2022
fdc7ee8
Add test script and print out information
Yipeng1994 Oct 24, 2022
e1b4a96
Remove parallel description in GetSbpSignature()
Yipeng1994 Oct 24, 2022
dc23ff7
Fix small bug
Yipeng1994 Oct 24, 2022
195b0ea
Disable InferNdSbp for reshape op
Yipeng1994 Oct 24, 2022
3e39ce6
test(RandomOp): add data consistent test
wyg1997 Oct 25, 2022
0b41745
Merge branch 'master' into refactor-random_source_op
wyg1997 Oct 25, 2022
3f6d981
Merge branch 'master' into refactor-GetSbpSignature
Yipeng1994 Oct 25, 2022
f7d29d1
Revert "Add test script and print out information"
Yipeng1994 Oct 25, 2022
4f08466
refactor(Initializer): refactor normal with oneflow kernel
wyg1997 Oct 25, 2022
d63a59b
fix(RandomSeed): fix parallel_num==1
wyg1997 Oct 26, 2022
f6f91c0
Merge remote-tracking branch 'origin/refactor-random_source_op' into …
wyg1997 Oct 26, 2022
45080d4
merge master
chengtbf Oct 26, 2022
5f37956
Add hierarchy value
Yipeng1994 Oct 26, 2022
f92e330
Address comments
Yipeng1994 Oct 26, 2022
0c0954c
parallel num j-> hierarchy value for reshape op
Yipeng1994 Oct 26, 2022
16973bb
Static analysis
Yipeng1994 Oct 26, 2022
ccd8c57
refine
Yipeng1994 Oct 26, 2022
f471774
Update user_op.cpp
Yipeng1994 Oct 26, 2022
64832e4
Update operator.cpp
Yipeng1994 Oct 26, 2022
657c79e
Merge branch 'master' into refactor-GetSbpSignature
Yipeng1994 Oct 26, 2022
582d20a
auto format by CI
oneflow-ci-bot Oct 26, 2022
778da76
test(initializer): add initializer data test
wyg1997 Oct 27, 2022
2d24db4
format code
wyg1997 Oct 27, 2022
b967432
Merge branch 'dev_cc_acc_mem_v5' of https://github.com/Oneflow-Inc/on…
strint Oct 27, 2022
752fce0
Merge branch 'refactor-GetSbpSignature' of https://github.com/Oneflow…
strint Oct 27, 2022
0404d1d
Merge branch 'refactor-normal_initializer' of https://github.com/Onef…
strint Oct 27, 2022
8f7ca2f
Revert Update operator.cpp
Yipeng1994 Oct 27, 2022
5b3a585
Merge branch 'refactor-GetSbpSignature' of https://github.com/Oneflow…
strint Oct 27, 2022
4a7a6b1
boxing to cpu first in flow.save
daquexian Oct 28, 2022
2d080aa
Merge branch 'fix_extra_memory_in_save' of https://github.com/Oneflow…
strint Oct 28, 2022
3 changes: 3 additions & 0 deletions oneflow/core/common/env_var/debug_mode.h
@@ -25,6 +25,9 @@ DEFINE_ENV_BOOL(ONEFLOW_DEBUG, false);
 
 inline bool IsInDebugMode() { return EnvBool<ONEFLOW_DEBUG_MODE>() || EnvBool<ONEFLOW_DEBUG>(); }
 
+DEFINE_ENV_BOOL(ENABLE_LOGICAL_CHAIN, true);
+inline bool EnableLogicalChain() { return EnvBool<ENABLE_LOGICAL_CHAIN>(); }
+
 }  // namespace oneflow
 
 #endif  // ONEFLOW_CORE_COMMON_ENV_VAR_DEBUG_MODE_H_
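The hunk above adds a boolean switch, `ENABLE_LOGICAL_CHAIN`, that defaults to `true`. As a rough illustration of how an environment-backed flag with a default can behave, here is a minimal sketch. `ReadEnvBool` and `EnableLogicalChainSketch` are hypothetical names, the accepted spellings are an assumption, and none of this is OneFlow's actual `DEFINE_ENV_BOOL` / `EnvBool` implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper, NOT OneFlow's EnvBool: reads an environment variable
// and interprets it as a boolean, falling back to default_value when the
// variable is unset or holds an unrecognized spelling.
inline bool ReadEnvBool(const char* name, bool default_value) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr) { return default_value; }
  const std::string value(raw);
  // Assumed spellings; the real macro may accept a different set.
  if (value == "1" || value == "true" || value == "TRUE" || value == "ON") { return true; }
  if (value == "0" || value == "false" || value == "FALSE" || value == "OFF") { return false; }
  return default_value;
}

// Mirrors the shape of EnableLogicalChain() in the diff: default is true.
inline bool EnableLogicalChainSketch() { return ReadEnvBool("ENABLE_LOGICAL_CHAIN", true); }
```

Under these assumptions, exporting `ENABLE_LOGICAL_CHAIN=0` before launching the process would turn the feature off, and leaving the variable unset keeps it on.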
1 change: 0 additions & 1 deletion oneflow/core/framework/nn_graph.cpp
@@ -339,7 +339,6 @@ Maybe<void> NNGraph::CompileAndInitRuntime() {
   // TODO(chengcheng): test collective boxing for multi-job.
   PlanUtil::GenCollectiveBoxingPlan(&job_, &plan_);
   sub_compile_tc->Count("[GraphCompile]" + name_ + " GenCollectiveBoxingPlan", 1);
-  // PlanUtil::SetForceInplaceMemBlock(&plan_); NOTE(chengcheng): only for ssp.
   PlanUtil::DumpCtrlRegstInfoToPlan(&plan_);
   sub_compile_tc->Count("[GraphCompile]" + name_ + " DumpCtrlRegstInfoToPlan", 1);
   PlanUtil::PlanMemoryLog(&plan_, name_);
12 changes: 12 additions & 0 deletions oneflow/core/functional/impl/common.cpp
@@ -17,6 +17,8 @@ limitations under the License.
 #include "oneflow/core/functional/impl/common.h"
 #include "oneflow/core/autograd/autograd_mode.h"
 #include "oneflow/core/common/wrap_dim_utils.h"
+#include "oneflow/core/ccl/ccl.h"
+#include "oneflow/core/job/rank_group.h"
 
 namespace oneflow {
 namespace one {
@@ -220,6 +222,16 @@ Maybe<std::tuple<Shape, bool, bool>> InferUnifiedShapeForBroadcasting(const Shap
   return std::make_tuple(target, need_to_broadcast.first, need_to_broadcast.second);
 }
 
+Maybe<void> BroadcastSeedToAllRanks(uint64_t* seed, int64_t root) {
+  CHECK_NOTNULL_OR_RETURN(seed) << "seed is not allowed to be nullptr";
+  const auto& rank_group = JUST(RankGroup::DefaultRankGroup());
+  const auto& parallel_desc = JUST(RankGroup::GetDefaultParallelDesc(DeviceType::kCPU, rank_group));
+  const auto& meta_transport_token =
+      JUST(TransportToken::NewTransportToken(kTransportTokenTypeMeta));
+  JUST(ccl::CpuBroadcast(seed, seed, sizeof(*seed), root, parallel_desc, meta_transport_token));
+  return Maybe<void>::Ok();
+}
+
 }  // namespace functional
 }  // namespace one
 }  // namespace oneflow
2 changes: 2 additions & 0 deletions oneflow/core/functional/impl/common.h
@@ -43,6 +43,8 @@ Maybe<Shape> InferShapeUnspecifiedDim(const int64_t& elem_count, const Shape& sh
 Maybe<std::tuple<Shape, bool, bool>> InferUnifiedShapeForBroadcasting(const Shape& input_shape,
                                                                       const Shape& other_shape);
 
+Maybe<void> BroadcastSeedToAllRanks(uint64_t* seed, int64_t root = 0);
+
 }  // namespace functional
 }  // namespace one
 }  // namespace oneflow
21 changes: 17 additions & 4 deletions oneflow/core/functional/impl/nn_functor.cpp
@@ -26,6 +26,8 @@ limitations under the License.
 #include "oneflow/user/kernels/dropout_kernel.h"
 #include "oneflow/core/common/container_util.h"
 #include "oneflow/user/kernels/distributions/common.h"
+#include "oneflow/user/kernels/random_seed_util.h"
+#include "oneflow/core/rpc/include/global_process_ctx.h"
 
 namespace oneflow {
 namespace one {
@@ -2105,20 +2107,31 @@ class GlobalNormalFunctor {
       dtype = output_tensor_dtype;
     }
 
-    const auto gen = optional_generator.value_or(JUST(one::DefaultAutoGenerator()));
     auto& attrs = THREAD_CACHED_MUTABLE_ATTR_MAP("mean", "std", "shape", "dtype", "seed", "nd_sbp");
 
-    const auto& distribution_state = std::make_shared<DistributionKernelState>(gen);
     const auto& nd_sbp = JUST(GetNdSbp(sbp_tuple));
+
+    std::shared_ptr<Generator> gen = optional_generator.value_or(JUST(one::DefaultAutoGenerator()));
+    uint64_t init_seed = JUST(gen->Get<CPUGeneratorImpl>(0))->engine()();
+
     if (LazyMode::is_enabled()) {
       attrs.SetAllAttrs(static_cast<double>(mean), static_cast<double>(std), shape,
-                        dtype->data_type(), static_cast<int64_t>(gen->current_seed()),
+                        dtype->data_type(), static_cast<int64_t>(init_seed),
                         *JUST(GetNdSbpStrList(nd_sbp)));
     } else {
+      uint64_t rank_seed = 0;
+      {
+        JUST(BroadcastSeedToAllRanks(&init_seed, /*root=*/0));
+        rank_seed =
+            JUST(GetRandomSeedForRank(*placement, *nd_sbp, init_seed, GlobalProcessCtx::Rank()));
+      }
       attrs.SetAllAttrs(static_cast<double>(mean), static_cast<double>(std), shape,
-                        dtype->data_type(), static_cast<int64_t>(gen->current_seed()), NullOpt);
+                        dtype->data_type(), static_cast<int64_t>(rank_seed), NullOpt);
+      gen = JUST(MakeGenerator(placement->device_type()));
+      gen->set_current_seed(rank_seed);
     }
+    const auto& distribution_state = std::make_shared<DistributionKernelState>(gen);
 
     if (out.has_value()) {
       std::shared_ptr<TensorTuple> outputs = std::make_shared<TensorTuple>(1);
       (*outputs)[0] = JUST(out);
1 change: 0 additions & 1 deletion oneflow/core/graph/plan_task_graph.cpp
@@ -19,7 +19,6 @@ namespace oneflow {
 
 int64_t PlanTaskNode::chain_id() const {
   int64_t chain_id = task_proto_->task_set_info().chain_id();
-  CHECK_NE(chain_id, -1);
   return chain_id;
 }
 
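With the `CHECK_NE(chain_id, -1)` removed, `PlanTaskNode::chain_id()` can return an id that does not name a chain, so callers have to test validity themselves (the commit list mentions an `IsValidChainId` helper). A minimal sketch of that calling pattern follows. It assumes that any negative id means "no chain"; `IsValidChainIdSketch`, `ChainGroups`, and `GroupTasksByChain` are hypothetical names, not OneFlow's API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Assumption: negative ids (such as the -1 that the removed check rejected)
// mean the task does not belong to any chain.
inline bool IsValidChainIdSketch(int64_t chain_id) { return chain_id >= 0; }

// Tasks grouped by chain id, with chainless tasks kept separately.
struct ChainGroups {
  std::map<int64_t, std::vector<int64_t>> by_chain;
  std::vector<int64_t> unchained;
};

// Each input pair is (task_id, chain_id). Instead of aborting on an invalid
// chain id, the task is routed to the `unchained` list.
inline ChainGroups GroupTasksByChain(
    const std::vector<std::pair<int64_t, int64_t>>& task_and_chain) {
  ChainGroups groups;
  for (const auto& entry : task_and_chain) {
    const int64_t task_id = entry.first;
    const int64_t chain_id = entry.second;
    assert(task_id >= 0);
    if (IsValidChainIdSketch(chain_id)) {
      groups.by_chain[chain_id].push_back(task_id);
    } else {
      groups.unchained.push_back(task_id);
    }
  }
  return groups;
}
```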
97 changes: 0 additions & 97 deletions oneflow/core/graph/sharable_mem_block_graph.cpp

This file was deleted.

64 changes: 0 additions & 64 deletions oneflow/core/graph/sharable_mem_block_graph.h

This file was deleted.

1 change: 1 addition & 0 deletions oneflow/core/graph/straighten_nodes.cpp
@@ -126,6 +126,7 @@ bool ShouldRunASAP(TaskType task_type) {
     case TaskType::kAcc:                        // 0
     case TaskType::kSourceTick:                 // 0
     case TaskType::kAccTick:                    // 0
+    case TaskType::kAccCtrlTick:                // ?
     case TaskType::kCase:                       // 0
     case TaskType::kEsac:                       // 0
     case TaskType::kReentrantLock: return true; // 0