fix_bug_in_broadcast_min_max_grad_and_broadcast_like #8379

Merged

Conversation

Contributor

@clackhan clackhan commented Jun 7, 2022

Fixes the bug described in Oneflow-Inc/libai#288 (comment). Minimal reproduction:

import oneflow as flow

x = flow.rand(1, 1, 4, requires_grad=True)
y = flow.rand(1, 4, requires_grad=True)
z = flow.max(x, y)
loss = z.sum()
loss.backward()

Running this script aborts with the following check failure:

F20220607 10:56:17.827364 39752 shape.cpp:184] Check failed: !broadcast_axis_vec.empty() 
*** Check failure stack trace: ***
    @     0x7f4f74f2ff9a  google::LogMessage::Fail()
    @     0x7f4f74f30282  google::LogMessage::SendToLog()
    @     0x7f4f74f2fb07  google::LogMessage::Flush()
    @     0x7f4f74f32679  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f4f6be5ac9d  oneflow::Shape::Axes4BroadcastTo()
    @     0x7f4f6bc1c4ef  oneflow::one::BroadcastMinMax::Apply()
    @     0x7f4f6bc1d5d1  oneflow::one::OpExprGradFunction<>::ApplyIf()
    @     0x7f4f6d61c609  _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEERKNS0_3one11TensorTupleEPS4_bEZNKS3_19AutogradInterpreter5ApplyERKNS3_6OpExprES6_S7_RKNS3_19OpExprInterpContextEEUlS6_S7_bE0_E9_M_invokeERKSt9_Any_dataS6_OS7_Ob
    @     0x7f4f6bbd5407  oneflow::one::FunctionNode::Apply()
    @     0x7f4f6bbd9158  oneflow::one::GraphTask::Apply()
    @     0x7f4f6bbd9fb8  oneflow::one::GraphAutogradEngine::RunBackwardAndSaveGrads4LeafTensor()
    @     0x7f4f6bbd3ef5  oneflow::one::AutogradEngine::RunBackwardAndSaveGrads4LeafTensorIf()
    @     0x7f50283c48e9  oneflow::autograd::Backward()
    @     0x7f50283bc21f  (unknown)
    @     0x7f50285ddc79  (unknown)
    @     0x55ade7f25348  PyCFunction_Call
    @     0x55ade7f14dbc  _PyObject_MakeTpCall.localalias.6
    @     0x55ade7f9c545  _PyEval_EvalFrameDefault
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a61  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a40  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7fff543  PyEval_EvalCode
    @     0x55ade7fff5e4  run_eval_code_obj
    @     0x55ade8025854  run_mod
    @     0x55ade7ee6390  pyrun_file
    @     0x55ade7ee90d2  PyRun_SimpleFileExFlags.localalias.16
    @     0x55ade7ee9bf0  Py_RunMain.cold.2953
    @     0x55ade8028a09  Py_BytesMain
Aborted

}
}
JUST(attrs.SetAttr<std::vector<int32_t>>("broadcast_axes", broadcast_axes));
Contributor Author

The main change is here: broadcast_axes is only valid within this scope, so the attr has to be set inside the if.

Comment on lines 272 to 278
if (left_extended_x_shape == out_shape) {
broad_x_ = JUST(functional::ReshapeLike(x, out_grads.at(0)));
} else {
const AxisVector& broadcast_axis_vec = left_extended_x_shape.Axes4BroadcastTo(out_shape);
const std::vector<int32_t> x_axis =
std::vector<int32_t>{broadcast_axis_vec.begin(), broadcast_axis_vec.end()};
broad_x_ = JUST(functional::BroadcastLike(x, out_grads.at(0), x_axis));
Contributor Author

This is the root cause of the bug: if left_extended_x_shape is identical to out_shape, no BroadcastLike is needed and a plain reshape suffices.
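
The root cause described in the comment above can be illustrated with a small shape-only sketch. It is plain Python rather than OneFlow's C++, and `left_extend`, `axes_for_broadcast_to`, and `plan_broadcast` are hypothetical names chosen for illustration. `axes_for_broadcast_to` only approximates what the stack trace implies about Shape::Axes4BroadcastTo, namely that it rejects an empty axis list. Under those assumptions, it shows why the repro's (1, 4) operand against a (1, 1, 4) output needs the reshape branch:

```python
from typing import List, Sequence, Tuple


def left_extend(shape: Sequence[int], ndim: int) -> List[int]:
    """Pad `shape` with leading 1s until it has `ndim` dimensions."""
    return [1] * (ndim - len(shape)) + list(shape)


def axes_for_broadcast_to(shape: Sequence[int], target: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for Shape::Axes4BroadcastTo (an assumption, not the real code).

    Returns the axes where `shape` is 1 but `target` is not, and fails when
    there are none, mirroring the check seen in the stack trace.
    """
    assert len(shape) == len(target)
    axes = [i for i, (s, t) in enumerate(zip(shape, target)) if s == 1 and t != 1]
    if not axes:
        raise ValueError("Check failed: !broadcast_axis_vec.empty()")
    return axes


def plan_broadcast(x_shape: Sequence[int], out_shape: Sequence[int]) -> Tuple[str, List[int]]:
    """Choose between a plain reshape and a broadcast, as the fix does."""
    extended = left_extend(x_shape, len(out_shape))
    if extended == list(out_shape):
        # Same shape after left-extension: nothing to broadcast.
        return ("reshape_like", [])
    return ("broadcast_like", axes_for_broadcast_to(extended, out_shape))


print(plan_broadcast((1, 4), (1, 1, 4)))  # ('reshape_like', [])
print(plan_broadcast((1, 4), (3, 2, 4)))  # ('broadcast_like', [0, 1])
```

With the repro's shapes, the left-extended shape already equals the output shape, so the sketch takes the reshape branch and never reaches the empty-axes check.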

@clackhan clackhan requested review from wyg1997 and removed request for oneflow-ci-bot June 8, 2022 01:37
@github-actions
Contributor

github-actions bot commented Jun 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8379/

<< " must match the existing size (" << prepend_shape[i]
<< ") at non-singleton dimension " << i
<< ". Target sizes: " << like_shape.ToString()
<< ". Tensor sizes: " << x_shape.ToString();
Contributor

Is there an extra space here?

Contributor Author

Is there an extra space here?

Yes, fixed.
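
For context, the error message under review belongs to a shape check for broadcast_like. A minimal Python sketch of such a check follows. The function name `check_broadcast_like` is hypothetical, and the opening words of the message ("The expanded size of the tensor") are an assumption modeled on PyTorch's equivalent error, since the snippet above only shows the tail of the message.

```python
from typing import Sequence


def check_broadcast_like(x_shape: Sequence[int], like_shape: Sequence[int]) -> None:
    """Raise ValueError if `x_shape` cannot be broadcast to `like_shape`."""
    if len(x_shape) > len(like_shape):
        raise ValueError(
            f"Tensor has {len(x_shape)} dimensions but the target has only {len(like_shape)}"
        )
    # Left-pad with 1s so both shapes have the same number of dimensions.
    prepend_shape = [1] * (len(like_shape) - len(x_shape)) + list(x_shape)
    for i, (have, want) in enumerate(zip(prepend_shape, like_shape)):
        # A size-1 dimension can always be expanded; any other size must match.
        if have != 1 and have != want:
            raise ValueError(
                f"The expanded size of the tensor ({want})"
                f" must match the existing size ({have})"
                f" at non-singleton dimension {i}"
                f". Target sizes: {tuple(like_shape)}"
                f". Tensor sizes: {tuple(x_shape)}"
            )


check_broadcast_like((1, 4), (1, 1, 4))  # passes silently
try:
    check_broadcast_like((3, 4), (2, 4))
except ValueError as err:
    print(err)
```

The second call fails at dimension 0, because 3 is neither 1 nor equal to the target size 2.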

@github-actions
Contributor

github-actions bot commented Jun 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.8ms (= 12977.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.2ms (= 14319.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 143.2ms / 129.8ms)

OneFlow resnet50 time: 75.9ms (= 7592.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.6ms (= 8357.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.10 (= 83.6ms / 75.9ms)

OneFlow resnet50 time: 49.6ms (= 9919.9ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.7ms (= 11742.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 58.7ms / 49.6ms)

OneFlow resnet50 time: 40.9ms (= 8181.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 41.5ms (= 8306.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.02 (= 41.5ms / 40.9ms)

OneFlow resnet50 time: 35.2ms (= 7039.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.6ms (= 7122.8ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.01 (= 35.6ms / 35.2ms)

OneFlow swin dataloader time: 0.243s (= 48.672s / 200, num_workers=1)
PyTorch swin dataloader time: 0.147s (= 29.408s / 200, num_workers=1)
Relative speed: 0.604 (= 0.147s / 0.243s)

OneFlow swin dataloader time: 0.065s (= 13.016s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.485s / 200, num_workers=4)
Relative speed: 0.652 (= 0.042s / 0.065s)

OneFlow swin dataloader time: 0.035s (= 6.969s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.384s / 200, num_workers=8)
Relative speed: 0.629 (= 0.022s / 0.035s)

❌ OneFlow resnet50 time: 145.8ms (= 14579.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 170.6ms (= 17063.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 170.6ms / 145.8ms)

OneFlow resnet50 time: 95.8ms (= 9579.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.5ms (= 11147.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 111.5ms / 95.8ms)

OneFlow resnet50 time: 71.6ms (= 14324.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.0ms (= 17405.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 87.0ms / 71.6ms)

OneFlow resnet50 time: 60.2ms (= 12049.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.5ms (= 14894.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 74.5ms / 60.2ms)

OneFlow resnet50 time: 54.2ms (= 10839.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.2ms (= 13843.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 69.2ms / 54.2ms)
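
The "Relative speed" figures in the report above appear to be the PyTorch per-iteration time divided by the OneFlow per-iteration time, so values above 1 mean OneFlow was faster. A minimal sketch of that arithmetic, using the totals from the first resnet50 row (the function names are hypothetical and not part of the CI script):

```python
def per_iteration(total: float, iterations: int) -> float:
    """Average time per iteration, in the same unit as `total`."""
    return total / iterations


def relative_speed(pytorch_total: float, oneflow_total: float, iterations: int) -> float:
    """PyTorch time over OneFlow time; > 1 means OneFlow is faster."""
    return per_iteration(pytorch_total, iterations) / per_iteration(oneflow_total, iterations)


# First resnet50 row: 14319.6ms and 12977.6ms over 100 iterations.
print(f"{relative_speed(14319.6, 12977.6, 100):.2f}")  # 1.10
```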

clackhan added 2 commits June 8, 2022 10:59
…fix_bug_in_broadcast_min_max_grad_and_broadcast_like
@clackhan clackhan requested a review from oneflow-ci-bot June 8, 2022 03:00
@github-actions
Contributor

github-actions bot commented Jun 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8379/

@github-actions
Contributor

github-actions bot commented Jun 8, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.0ms (= 12999.4ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 145.8ms (= 14584.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.12 (= 145.8ms / 130.0ms)

OneFlow resnet50 time: 76.1ms (= 7613.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.4ms (= 8535.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 85.4ms / 76.1ms)

OneFlow resnet50 time: 51.8ms (= 10369.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.1ms (= 11811.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.14 (= 59.1ms / 51.8ms)

OneFlow resnet50 time: 42.4ms (= 8473.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 8951.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.06 (= 44.8ms / 42.4ms)

OneFlow resnet50 time: 38.7ms (= 7731.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 41.2ms (= 8245.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.07 (= 41.2ms / 38.7ms)

OneFlow swin dataloader time: 0.243s (= 48.557s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.292s / 200, num_workers=1)
Relative speed: 0.624 (= 0.151s / 0.243s)

OneFlow swin dataloader time: 0.065s (= 12.922s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 7.990s / 200, num_workers=4)
Relative speed: 0.618 (= 0.040s / 0.065s)

OneFlow swin dataloader time: 0.056s (= 11.170s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.428s / 200, num_workers=8)
Relative speed: 0.396 (= 0.022s / 0.056s)

❌ OneFlow resnet50 time: 146.4ms (= 14644.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 171.3ms (= 17125.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 171.3ms / 146.4ms)

OneFlow resnet50 time: 95.9ms (= 9585.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.2ms (= 11218.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 112.2ms / 95.9ms)

OneFlow resnet50 time: 71.0ms (= 14207.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 89.0ms (= 17791.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 89.0ms / 71.0ms)

OneFlow resnet50 time: 58.8ms (= 11752.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.0ms (= 15596.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 78.0ms / 58.8ms)

OneFlow resnet50 time: 54.1ms (= 10819.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.7ms (= 13945.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.29 (= 69.7ms / 54.1ms)

@clackhan clackhan requested a review from oneflow-ci-bot June 8, 2022 06:05
@github-actions
Contributor

github-actions bot commented Jun 8, 2022

Static analysis with clang failed. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label Jun 8, 2022
@github-actions
Contributor

github-actions bot commented Jun 8, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8379/

@mergify mergify bot merged commit c10a30c into master Jun 9, 2022
@mergify mergify bot deleted the fix_bug_in_broadcast_min_max_grad_and_broadcast_like branch June 9, 2022 04:16
Yipeng1994 added a commit that referenced this pull request Jun 23, 2022
* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* warp dim util (#8382)

* warp dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input(jobname make it  it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py )

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* pulish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and boardcast_addOp conversion

* add matmulOp conversion

* support converting normailzation op to tosa(partically)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add indentity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename inferface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b7, reversing
changes made to baadf60.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* nccl send/recv support different placement

* refine

* auto format by CI

* rm out ctrl

* auto format by CI

Co-authored-by: guo-ran <[email protected]>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: ZZK <[email protected]>
Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: Juncheng <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Yao Zihang <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>
Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Yipeng1994 added a commit that referenced this pull request Jun 23, 2022
* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic because the CI GPU is not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b7, reversing
changes made to baadf60.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* Fix eval error in FusedMLP (#8413)

Fix eval error

* Init NCCL communicator in graph mode unifiedly (#8263)

* centralized comm init

* address review

* revert

* rename

* ref nccl logical send recv

* fix cpu only

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix dim_scatter 0-dim tensor bug (#8418)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* target based external libraries (#8421)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine hardcoded attr setting/getting in ir (#8420)

* use names in trait static func

* more changes on op name attr

* use wrapped func

* Replace cu115 with cu116 in nightly (#8423)

update workflows

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix repeat interleave 0-size tensor bug (#8414)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Autotest support print input in ci (#8383)

* support print tensor value in autotest to provide more details in ci

* revert

* refine

* auto format by CI

* control precision to 1e-5 when record

* fix bug

* auto format by CI

* relax tensor_size_mb

* fix bug

* fix bug

* refine

* relax

* refine

* refine

* fix bug

* relax

* refine

* restructure

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Modify sbp.split()'s karg: axis to dim (#8411)

* Modify sbp.split()'s axis karg to dim

* Refine

* Refine

* Refine

* Refine

* Feat/graph logical op debug repr (#8131)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* add module config

* save nn.Module info in job.proto for better debugging

* add new line

* add ModuleBlock.ops_proto() API

* zero use stage 2

* print operators' info when print ModuleBlock

* handle VariableOpConf

* update

* update

* fix

* move operators repr method to graph util

* add limit consumer api

* add new api

* refine zero s select

* add module block

* fix

* refact for rm op in module conf

* fix

* add sbp debug

* add sbp repr

* add shape

* refine

* add sys op in repr

* add full op debug

* fix index out of range

* rm zero limit on device type

* add no scope op to graph

* zero test with activation checkpointing

* fix order

* add identity when dp sequence len is 1

* add debug repr

* refine repr of op

* refine and fix

* rm useless log

* move to base with master

* fix

* fix

* fix

* fix proto

* refine test

* fix type

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* refine

* restore test

* refine pass and mem debug

* merge master

* repr dtype

* add placement

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

* fix merge

* auto format by CI

* auto format by CI

* refine get job api

* refine graph util import order

* auto format by CI

* fix static check

* auto format by CI

* fix special case

* refine level print and add full dtype repr

* rm useless

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rm some test case in test_fused_dot_feature_interaction_pooling_sum (#8425)

rm some case in test

* Remove unused linkages (#8426)

remove unused linkages

* refactor stride (#8402)

* Stride inherits DimVector

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix argument type of OFStrideToNumpyStride

Signed-off-by: daquexian <[email protected]>

Co-authored-by: oneflow-ci-bot <[email protected]>

* Move Tensor.__setitem__  and global related api to Python/C api (#8375)

* add local_to_global, global_to_global, to_global. global_to_global still has bugs

* fix bug of global_to_global

* remove python api

* add setitem

* remove local_to_global sbp pack, format code

* format code

* remove redundant code

* add error msg, refine check of to_global

* fix bug of check

* add error msg

* fix clang static check error

* remove useless api in tensor.py, remove redundant code, remove useless CHECK

* add to_local

* fix wrong exception type in unittest for to_local exception message

* cuda add default error msg (#8427)

default error

Co-authored-by: Shenghang Tsai <[email protected]>

* Refactor ShapeView (#8422)

* update

Signed-off-by: daquexian <[email protected]>

* update and add docs

Signed-off-by: daquexian <[email protected]>

* turn on view slice (#8302)

* turn_on_view_slice

* inplace scalar math handle non-contiguous input

* fix clang check

* add docs

* refactor

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Add flow env init rdma api (#8415)

* add_flow_env_init_rdma_api

* adjust persistent_workers logic for RDMA support

* adjust persistent_workers logic for RDMA support

* add rdma_inited api

* minor fix

* add docs

* Update python/oneflow/utils/data/dataloader.py

Co-authored-by: daquexian <[email protected]>

* fix typo

* refine

* fix RDMAIsInitialized

* minor fix

* refine

* rename InitRdma to InitRDMA

* refine

Co-authored-by: Flowingsun007 <[email protected]>
Co-authored-by: daquexian <[email protected]>

* add 1d send recv in nccl logical (#8355)

* add 1d send recv in nccl logical

* Update insert_nccl_logical_op_pass.cpp

* auto format by CI

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support iree ci (#8419)

* create mlir cpu and modify build gcc 7 shell script

* fix the bug of test_iree_resnet.py cuda test in cpu version error

* fix constant folding tests

* support oneflow_test_cpu_only

* pub

* build script add flag

* modify test yml

* add python3 into \PATH

* don't use pretrain model

* install flowvision

Co-authored-by: mosout <[email protected]>
Co-authored-by: jackalcooper <[email protected]>

* Feat straighten task nodes (#8347)

* Add a fast topological traversal

* Add an initial implementation of straighten nodes

* Add the straighten nodes algorithm

* Change algorithm structure

* Remove some debug information

* Finalize the straighten algorithm after
deciding the parameters by experiments

* Notify the usage of straighten algorithm

* Of format

* Update oneflow/core/graph/straighten_nodes.cpp

Of format

Co-authored-by: daquexian <[email protected]>

* Of format

* Stop using visual string before we find a better key

* Remove magic numbers and Of format

* Remove starts

* Of format

* Fix a bug of using GetMaxVal<int32_t>() as an
initial number for comparing

* Refactor add straighten algo interface (#8435)

* feat(*): export straighten nodes algorithm interface

* export documentation

* Update python/oneflow/nn/graph/graph_config.py

Co-authored-by: Yipeng Li <[email protected]>

Co-authored-by: Yipeng Li <[email protected]>

* Use TopoForEachNodeFast as default. (#8436)

* Use TopoForEachNodeFast as default.
Rename the original one as TopoForEachNodeDynamic

* Speed up TopoForEachNodeFast when traversing a subgraph

* Rename the switch and code clean up

* Hide the class TopoStruct

* Hide all the other functions

* Grammar

* Of format

Co-authored-by: daquexian <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>

* Refactor NLLLoss to support split class dim (#8380)

* refactor

* RuntimeError

* avoid atomic add

* test

* fixes

* update test

* update test

* update test

* fix kernel

* improve backward

* update test

* out_weight to be required

* address static analysis error

* fix static analysis error

* fix static analysis error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Strict ordering in memory reuse algorithm (#8441)

* Support broadcast in fused_softmax kernel (#8321)

* support broadcast

* refine

* Remove shape check

* fix sbp when broadcast

* rollback softmax grad threshold

* increase threshold of test conv bn folding

* tol to 1e-2

* check error msg of fuse softmax ops

* add more dispatch

* remove double datatype test and add broadcast test

Co-authored-by: cheng cheng <[email protected]>

* Merge slice and logical slice (#8416)

* remove Slice, SliceUpdate, SliceGrad op

* rename logical_slice to slice and logical_slice_assign to slice_update

* move gradient_func logical_slice.cpp to slice.cpp

* fix some bug and refine local test

* feat(SliceUpdate): support 0size tensor

* test(Slice): refine consistent slice test

* test(SliceUpdate): refine consistent slice_update test

* not export slice_update's inplace parameter

* auto format by CI

* recover slice_grad_op

* fix slice_view bug

* add error message and attr judgement

* modified old test

* auto format by CI

* update test README

* update tensor_string code

* fix test bug

* auto format by CI

* fix(hsplit): hsplit functor bug

* fix vsplit doc test bug

* refine

* fix test

* fix pin_memory bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Graph block.config.set_stage() for recommended Pipeline api. (#8442)

* Graph block.config.set_stage() for recommended Pipeline api.

* revert diff

* refine api doc

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Update PolynomialLR's doc and parameter (#8430)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* rename the steps to decay_batch in parameters

* update PolynomialLR test case

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
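
The clamp `current_batch = min(decay_batch, current_batch)` mentioned in the PolynomialLR entries above can be illustrated with a minimal sketch. It assumes the usual polynomial decay formula; the function name and the `end_lr` and `power` parameters are illustrative, and this is not the OneFlow implementation.

```python
def polynomial_lr(base_lr, current_batch, decay_batch, end_lr=0.0001, power=1.0):
    """Hypothetical helper: polynomial decay from base_lr down to end_lr."""
    # Clamp the step so the rate stays at end_lr once decay_batch is reached.
    current_batch = min(decay_batch, current_batch)
    # Fraction of the decay that is still left, raised to the given power.
    factor = (1.0 - current_batch / decay_batch) ** power
    return (base_lr - end_lr) * factor + end_lr


if __name__ == "__main__":
    for step in (0, 50, 100, 250):
        print(step, polynomial_lr(0.1, step, decay_batch=100))
```

Without the clamp, a step beyond `decay_batch` would make the base of the power negative, so the rate could rise again or become complex for fractional powers.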

* Add mv op (#8445)

* add mv op with bug that Int is incompatible

* add test

* update test_mv.py

* fix based on comments

* fix based on comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable oneflow_iree(python package) and corresponding test works in ci (#8431)

* update test.yml

* add pytest for oneflow_iree examples

* add oneflow frontend test

* Dev tensor is pinned api (#8447)

* support tensor.is_pinned

* add test case

* add docs

* auto format by CI

* refine

* auto format by CI

* refine

* auto format by CI

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>

* Nd sbp tensor str (#8458)

* nd sbp tensor str

* add nd sbp tensor str test

* bigger input size

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Patch sbp cost (#8378)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Add the slight penalty for eager

* Consider B -> (B, B) for a scalar

* Do not consider parallel description in priority ratio

* Of format

* Fix a bug in the old version group boxing with 2D SBP (#8448)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Fix bugs of patch-sbp_cost (#8456)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Reduce to uniform B for 1 device.
Use the actual parallel description for each tensor

* Fix a bug of fix-group_boxing-bug

* Group boxing reduce [2, 2]: (S0, S0) to [4]: S0,
then we might infer a 1D SBP from a 2D SBP hint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: cheng cheng <[email protected]>

* Decouple stream and instruction (#7607)

* remove deprecated python api

* backup code

* backup code

* fix compiler complaints

* fix typo in refactoring

* kMockDevice

* add unit test test_mock.py

* revert mock kernels

* revert DEVICE_TYPE_SEQ

* mock placement

* address pr comments

* register device kCriticalSectionDevice and kLazyJobLauncher

* kControlDevice

* Stream::vm_stream_

* fix compiler complaints

* backup code

* rename StreamIsTransport to IsCommNetStream

* decouple vm::StreamType and vm::InstructionType

* fix compiler complaints

* remove 'gpu' related code

* address static analyzer complaints

* address static analyzer complaints

* remove unused module in test_mock.py

* the Env is never destroyed.

* export Env into python

* more unittests

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* fix oneflow.placement.__str__

* revert GlobalSync

* init_producer_stream in oneflow.from_numpy

* debug code for vm

* init disable_vm_threads_ in VirtualMachine::VirtualMachine

* Update oneflow/core/vm/virtual_machine.h

Co-authored-by: daquexian <[email protected]>

* create stream in forked subprocesses.

* refactor StreamRoleSwitch to StreamRoleVisitor

* ThreadLocalGuard

* auto format by CI

* fix compiler complaints

* fix static analyzer complaints

* VirtualMachine::GetVmStream

* fix static analyzer complaints

* reimplement AddAndReadVector by std::deque

* reimplement AddAndReadVector

* merge master

* increase atol for test_consistent_rnn_cell.py

* StreamRole::AsyncLaunchedCommNet is bound to EventRecordedCudaStreamType

* auto format by CI

* remove StreamRoleVisitor<T>::VisitInvalid

* no copy in AddAndReadVector

* fix bug of AddAndReadVector::size_

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix AddAndReadVector::GetGranularity

* remove bad unittest

* auto format by CI

* rename CallInstructionType to OpCallInstructionType

* static variable GlobalSingletonPtr is a unique_ptr

* replace ++atomic_cnt with atomic_cnt.fetch_add(1, std::memory_order_relaxed)

* AddAndReadVector::operator[]

* change comments 'lock free' to 'thread safe'

* rename StatefulLocalOpKernel to StatefulOpKernel

* rename VirtualMachine::vm_ to VirtualMachine::engine_

* mark VirtualMachine::NoMoreErasedInstructions private

* mark VirtualMachine::FindOrCreateScheduleLocalDepObject private

* remove unused version of VirtualMachineEngine::Receive

* rename argname for VirtualMachineEngine::Receive

* rename unused PendingInstructionList

* rename AddAndReadVector to SteadyVector

* optimize SteadyVector::operator[] by __builtin_clzll

* refactor SteadyVector::granularity2vector_ to SteadyVector::granularity2data_

* reduce usage of steady_vector::size_

* rename unused anonymous namespace

* greater atol for test_consistent_tensordot.py

* fix BarrierInstructionType::ComputeInFuseMode

* revert container_util.h

* run AccessBlobByCallback in default stream of tensor->device

* resolve static check

* resolve static check

* SteadyVector::MutableOrAdd

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: binbinHan <[email protected]>
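
The entries above mention `SteadyVector`, per-granularity storage (`granularity2data_`), and locating an element with `__builtin_clzll`. The actual layout is not shown in this log. The sketch below assumes one layout consistent with those names: block `g` holds `2**g` slots, so elements never move once stored, and the block for an index follows from the bit length of `index + 1`. The class name `SteadyList` and its methods are hypothetical; Python's `int.bit_length()` stands in for the count-leading-zeros builtin.

```python
class SteadyList:
    """Hypothetical append-only list whose stored elements never move."""

    def __init__(self):
        self._blocks = []  # block g is a fixed list with 2**g slots
        self._size = 0

    @staticmethod
    def _locate(index):
        # Block number is the position of the highest set bit of index + 1.
        g = (index + 1).bit_length() - 1
        start = (1 << g) - 1  # first global index stored in block g
        return g, index - start

    def append(self, value):
        g, offset = self._locate(self._size)
        if g == len(self._blocks):
            # Allocate a new block; existing blocks are left untouched.
            self._blocks.append([None] * (1 << g))
        self._blocks[g][offset] = value
        self._size += 1

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        g, offset = self._locate(index)
        return self._blocks[g][offset]
```

Because growth only appends a new block, a reference to an existing slot stays valid, which is the property a plain growable array does not give.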

* fix_tensor_numpy_to_avoid_gpu_mem_increase (#8449)

* fix_tensor_numpy_to_avoid_gpu_mem_increase

* Update tensor.py

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Rename user op tensor shape to shape view (#8433)

* ThreadLocalGuard

* rename user_op::Tensor::shape to user_op::Tensor::shape_view

* auto format by CI

* fix static analyzer complaints

* more verbose code for HobDataType

* larger timeout

* larger timeout

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: jackalcooper <[email protected]>
Co-authored-by: binbinHan <[email protected]>

* speedup global test (#8468)

* speedup global test

* Test refine slice ops test (#8471)

* refine consistent_slice test from 112s -> 30s in 4 device

* test(SliceUpdate): refine test from 119s -> 28s in 4 device

* delete useless code

* auto format by CI

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: wyg1997 <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Set the minimum mtu value for IB communication connection (#8451)

* Set the minimum mtu value for IB communication connection

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Merge branch 'master' into feat-general_basic_communication

Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: ZZK <[email protected]>
Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: Juncheng <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Yao Zihang <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>
Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: leaves-zwx <[email protected]>
Co-authored-by: Li Xiang <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Jia <[email protected]>
Co-authored-by: Shanshan Zhong <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: wyg1997 <[email protected]>
Co-authored-by: Yu OuYang <[email protected]>
mergify bot added a commit that referenced this pull request Jul 25, 2022
* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error
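
The crash this PR addresses occurs when the backward pass of a broadcast `max`/`min` has to map the output gradient back to an input whose shape differs only by leading size-1 dimensions, so the list of broadcast axes is empty. The sketch below is not the C++ fix itself. It is a plain-Python illustration, with a hypothetical helper name, of how the axes to sum over can be computed and why an empty result is a valid case that then only needs a reshape.

```python
def broadcast_reduce_axes(in_shape, out_shape):
    """Hypothetical helper: axes of out_shape to sum over to undo broadcasting."""
    if len(in_shape) > len(out_shape):
        raise ValueError("input rank exceeds output rank")
    # Left-pad the input shape with 1s so both shapes have the same rank.
    padded = (1,) * (len(out_shape) - len(in_shape)) + tuple(in_shape)
    axes = []
    for axis, (i, o) in enumerate(zip(padded, out_shape)):
        if i == o:
            continue  # this axis was not broadcast
        if i != 1:
            raise ValueError(f"cannot broadcast {in_shape} to {out_shape}")
        axes.append(axis)
    return axes


if __name__ == "__main__":
    # Shapes from the reproduction script: no axis needs a reduction,
    # so the gradient for y only has to be reshaped from (1, 1, 4) to (1, 4).
    print(broadcast_reduce_axes((1, 4), (1, 1, 4)))
    print(broadcast_reduce_axes((3, 1), (2, 3, 4)))
```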

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable using nn graph multiple times with different inputs (the job name makes it possible)

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* nccl send/recv support different placement

* refine

* auto format by CI

* rm out ctrl

* auto format by CI

Co-authored-by: guo-ran <[email protected]>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: ZZK <[email protected]>
Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: Juncheng <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Yao Zihang <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>
Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>

* Support different hierarchy

* Merge branch 'master' into feat-general_basic_communication (#8477)

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic because the ci gpu is not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple nn graph usage with different input

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: daquexian <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: liufengwei0103 <[email protected]>
Co-authored-by: binbinHan <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: Shijie <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Juncheng <[email protected]>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <[email protected]>

* Fix eval error in FusedMLP (#8413)

Fix eval error

* Init NCCL communicator in graph mode unifiedly (#8263)

* centralized comm init

* address review

* revert

* rename

* ref nccl logical send recv

* fix cpu only

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix dim_scatter 0-dim tensor bug (#8418)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* target based external libraries (#8421)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine hardcoded attr setting/getting in ir (#8420)

* use names in trait static func

* more changes on op name attr

* use wrapped func

* Replace cu115 with cu116 in nightly (#8423)

update workflows

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix repeat interleave 0-size tensor bug (#8414)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Autotest support print input in ci (#8383)

* support print tensor value in autotest to provide more details in ci

* revert

* refine

* auto format by CI

* control precision to 1e-5 when record

* fix bug

* auto format by CI

* relax tensor_size_mb

* fix bug

* fix bug

* refine

* relax

* refine

* refine

* fix bug

* relax

* refine

* restruct

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Modify sbp.split()'s kwarg: axis to dim (#8411)

* Modify sbp.split()'s axis kwarg to dim

* Refine

* Refine

* Refine

* Refine

* Feat/graph logical op debug repr (#8131)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* add module config

* save nn.Module info in job.proto for better debugging

* add new line

* add ModuleBlock.ops_proto() API

* zero use stage 2

* print operators' info when print ModuleBlock

* handle VariableOpConf

* update

* update

* fix

* move operators repr method to graph util

* add limit consumer api

* add new api

* refine zero s select

* add module block

* fix

* refact for rm op in module conf

* fix

* add sbp debug

* add sbp repr

* add shape

* refine

* add sys op in repr

* add full op debug

* fix index out of range

* rm zero limit on device type

* add no scope op to graph

* zero test with activation checkpointing

* fix order

* add identity when dp sequence len is 1

* add debug repr

* refine repr of op

* refine and fix

* rm useless log

* move to base with master

* fix

* fix

* fix

* fix proto

* refine test

* fix type

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* refine

* restore test

* refine pass and mem debug

* merge master

* repr dtype

* add placement

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

* fix merge

* auto format by CI

* auto format by CI

* refine get job api

* refine graph util import order

* auto format by CI

* fix static check

* auto format by CI

* fix special case

* refine level print and add full dtype repr

* rm useless

Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: Cijie Xia <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rm some test case in test_fused_dot_feature_interaction_pooling_sum (#8425)

rm some case in test

* Remove unused linkages (#8426)

remove unused linkages

* refactor stride (#8402)

* Stride inherits DimVector

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix argument type of OFStrideToNumpyStride

Signed-off-by: daquexian <[email protected]>

Co-authored-by: oneflow-ci-bot <[email protected]>

* Move Tensor.__setitem__  and global related api to Python/C api (#8375)

* add local_to_global, global_to_global, to_global; global_to_global still has bugs

* fix bug of global_to_global

* remove python api

* add setitem

* remove local_to_global sbp pack, format code

* format code

* remove redundant code

* add error msg, refine check of to_global

* fix bug of check

* add error msg

* fix clang static check error

* remove useless api in tensor.py, remove redundant code, remove useless CHECK

* add to_local

* fix wrong exception type in unittest for to_local exception message

* cuda add default error msg (#8427)

default error

Co-authored-by: Shenghang Tsai <[email protected]>

* Refactor ShapeView (#8422)

* update

Signed-off-by: daquexian <[email protected]>

* update and add docs

Signed-off-by: daquexian <[email protected]>

* turn on view slice (#8302)

* turn_on_view_slice

* inplace scalar math handle non-contiguous input

* fix clang check

* add docs

* refactor

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Add flow env init rdma api (#8415)

* add_flow_env_init_rdma_api

* adjust persistent_workers logic for RDMA support

* adjust persistent_workers logic for RDMA support

* add rdma_inited api

* minor fix

* add docs

* Update python/oneflow/utils/data/dataloader.py

Co-authored-by: daquexian <[email protected]>

* fix typo

* refine

* fix RDMAIsInitialized

* minor fix

* refine

* rename InitRdma to InitRDMA

* refine

Co-authored-by: Flowingsun007 <[email protected]>
Co-authored-by: daquexian <[email protected]>

* add 1d send recv in nccl logical (#8355)

* add 1d send recv in nccl logical

* Update insert_nccl_logical_op_pass.cpp

* auto format by CI

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support iree ci (#8419)

* create mlir cpu and modify build gcc 7 shell script

* fix the bug of test_iree_resnet.py cuda test in cpu version error

* fix constant folding tests

* support oneflow_test_cpu_only

* pub

* build script add flag

* modify test yml

* add python3 into $PATH

* don't use pretrain model

* install flowvision

Co-authored-by: mosout <[email protected]>
Co-authored-by: jackalcooper <[email protected]>

* Feat straighten task nodes (#8347)

* Add a fast topological traversal

* Add an initial implementation of straighten nodes

* Add the straighten nodes algorithm

* Change algorithm structure

* Remove some debug information

* Finalize the straighten algorithm after deciding the parameters by experiments

* Notify the usage of straighten algorithm

* Of format

* Update oneflow/core/graph/straighten_nodes.cpp

Of format

Co-authored-by: daquexian <[email protected]>

* Of format

* Stop using visual string before we find a better key

* Remove magic numbers and Of format

* Remove starts

* Of format

* Fix a bug of using GetMaxVal<int32_t>() as an initial number for comparing

* Refactor add straighten algo interface (#8435)

* feat(*): export straighten nodes algorithm interface

* export documentation

* Update python/oneflow/nn/graph/graph_config.py

Co-authored-by: Yipeng Li <[email protected]>

Co-authored-by: Yipeng Li <[email protected]>

* Use TopoForEachNodeFast as default. (#8436)

* Use TopoForEachNodeFast as default.
Rename the original one as TopoForEachNodeDynamic

* Speed up TopoForEachNodeFast when traversing a subgraph

* Rename the switch and code clean up

* Hide the class TopoStruct

* Hide all the other functions

* Grammar

* Of format

Co-authored-by: daquexian <[email protected]>
Co-authored-by: Yinggang Wang <[email protected]>

* Refactor NLLLoss to support split class dim (#8380)

* refactor

* RuntimeError

* avoid atomic add

* test

* fixes

* update test

* update test

* update test

* fix kernel

* improve backward

* update test

* out_weight to be required

* address static analysis error

* fix static analysis error

* fix static analysis error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Strict ordering in memory reuse algorithm (#8441)

* Support broadcast in fused_softmax kernel (#8321)

* support broadcast

* refine

* Remove shape check

* fix sbp when broadcast

* rollback softmax grad threshold

* increase threshold of test conv bn folding

* tol to 1e-2

* check error msg of fuse softmax ops

* add more dispatch

* remove double datatype test and add broadcast test

Co-authored-by: cheng cheng <[email protected]>

* Merge slice and logical slice (#8416)

* remove Slice, SliceUpdate, SliceGrad op

* rename logical_slice to slice and logical_slice_assign to slice_update

* move gradient_func logical_slice.cpp to slice.cpp

* fix some bug and refine local test

* feat(SliceUpdate): support 0size tensor

* test(Slice): refine consistent slice test

* test(SliceUpdate): refine consistent slice_update test

* not export slice_update's inplace parameter

* auto format by CI

* recovery slice_grad_op

* fix slice_view bug

* add error message and attr judgement

* modified old test

* auto format by CI

* update test README

* update tensor_string code

* fix test bug

* auto format by CI

* fix(hsplit): hsplit functor bug

* fix vsplit doc test bug

* refine

* fix test

* fix pin_memory bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Graph block.config.set_stage() for recommended Pipeline api. (#8442)

* Graph block.config.set_stage() for recommended Pipeline api.

* revert diff

* refine api doc

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Update PolynomialLR's doc and parameter (#8430)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch), and rename the steps to decay_batch in parameters

* update PolynomialLR test case

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add mv op (#8445)

* add mv op with bug that Int is incompatible

* add test

* update test_mv.py

* fix based on comments

* fix based on comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable oneflow_iree(python package) and corresponding test works in ci (#8431)

* update test.yml

* add pytest for oneflow_iree examples

* add oneflow frontend test

* Dev tensor is pinned api (#8447)

* support tensor.is_pinned

* add test case

* add docs

* auto format by CI

* refine

* auto format by CI

* refine

* auto format by CI

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>

* Nd sbp tensor str (#8458)

* nd sbp tensor str

* add nd sbp tensor str test

* bigger input size

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Patch sbp cost (#8378)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Add the slight penalty for eager

* Consider B -> (B, B) for a scalar

* Do not consider parallel description in priority ratio

* Of format

* Fix a bug in the old version group boxing with 2D SBP (#8448)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Fix bugs of patch-sbp_cost (#8456)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Reduce to uniform B for 1 device.
Use the actual parallel description for each tensor

* Fix a bug of fix-group_boxing-bug

* Group boxing reduce [2, 2]: (S0, S0) to [4]: S0,
then we might infer a 1D SBP from a 2D SBP hint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: cheng cheng <[email protected]>

* Decouple stream and instruction (#7607)

* remove deprecated python api

* backup code

* backup code

* fix compiler complaints

* fix typo in refactoring

* kMockDevice

* add unit test test_mock.py

* revert mock kernels

* revert DEVICE_TYPE_SEQ

* mock placement

* address pr comments

* register device kCriticalSectionDevice and kLazyJobLauncher

* kControlDevice

* Stream::vm_stream_

* fix compiler complaints

* backup code

* rename StreamIsTransport to IsCommNetStream

* decouple vm::StreamType and vm::InstructionType

* fix compiler complaints

* remove 'gpu' related code

* address static analyzer complaints

* address static analyzer complaints

* remove unused module in test_mock.py

* the Env is never destroyed.

* export Env into python

* more unittests

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* fix oneflow.placement.__str__

* revert GlobalSync

* init_producer_stream in oneflow.from_numpy

* debug code for vm

* init disable_vm_threads_ in VirtualMachine::VirtualMachine

* Update oneflow/core/vm/virtual_machine.h

Co-authored-by: daquexian <[email protected]>

* create stream in forked subprocesses.

* refactor StreamRoleSwitch to StreamRoleVisitor

* ThreadLocalGuard

* auto format by CI

* fix compiler complaints

* fix static analyzer complaints

* VirtualMachine::GetVmStream

* fix static analyzer complaints

* reimplement AddAndReadVector by std::deque

* reimplement AddAndReadVector

* merge master

* increase atol for test_consistent_rnn_cell.py

* StreamRole::AsyncLaunchedCommNet is bound to EventRecordedCudaStreamType

* auto format by CI

* remove StreamRoleVisitor<T>::VisitInvalid

* no copy in AddAndReadVector

* fix bug of AddAndReadVector::size_

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix AddAndReadVector::GetGranularity

* remove bad unittest

* auto format by CI

* rename CallInstructionType to OpCallInstructionType

* sta…
Ikkyu321 added a commit to ZJLabDubhe/oneflow-zj that referenced this pull request Aug 23, 2022
* Multi Tensor apply Optimizer (#8373)

* Add optim_cast and modify sgd

* Remove

* try to add fuseUpdatecast pass logic

* use pass

* still have bug in inplace

* ban inplace and fix sgd update

* fix regst num

* add env var

* remove cuda graph wrong use

* add support for graph

* initialize

* add functional impl

* add simple job rewrite

* delete redundant sgd update kernel

* support half

* add kernel

* use single loop kernel

* refine

* when in eval mode, we turn off multi tensor update

* refine format

* use juncheng kernel

* Refine

* group multi tensor op by some attr

* add parallel conf to key

* refine

* Add unroll logic

* fix bug

* restructure

* use pointer list

* add adam kernel

* support multi tensor adam update

* Remove cpu

* support skip if and scale by tensor

* support sgd adam unittest

* add more check

* Remove config

* Restructure tensorparams

* support fused cast in multi tensor update

* support cast in multi tensor

* fix bug in model update cast pass

* fix multi tensor sgd update with cast Pass check logic

* refine

* support multi tensor adam update with cast

* refine format

* Remove redundant template args

* merge modify for fused cast

* only allow fused cast in train mode

* only support data parallel in multi tensor update

* rewrite fuse update cast pass logic

* remove redundant if

* fix format

* add new line

* rename

* Remove print

* rename and add LOG

* Add more type and test

* still have bug in multi tensor adam

* Fix multi tensor adam update bug

* add multi tensor adam update with cast test

* simplify code

* fix format

* Add model diff datatype in optimizer key

* remove random seed

* fix comment

* fix comment

* fix to use model copy

* use for loop

* Fix comment

* use hashcombine

* fix clang analysis error

* add with cuda macro

* fix env var in unittest

* remove redundant unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix doc and ops template auto gen (#8546)

* fix doc and add op calculator

* fix bug

* fix gen_ops

* fix diag 0size tensor shape infer bug (#8557)

* fix diag 0size tensor shape infer bug

* refine

* refine

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Format tensor on cpu (#8548)

* Format tensor on cpu

* use tensor.detach

* Remove useless WITH_CUDAs (#8562)

* unique identity (#8509)

* unique identity

* fix

* add identity name

* rm debug log

* mv identity from class to graph

* auto format by CI

* fix unique iden with having multiple stage

* auto format by CI

* Update block.py

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add GenericStreamContext (#8560)

* Modify some file and add test (#8556)

* Modify some file and add test

* modify the content

* modify the format and test function name

* modify the format and aligned with pytorch

* delete print

* modify the function name

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Move some op into amp gray list (#8545)

enlarge gray list

Co-authored-by: cheng cheng <[email protected]>

* Refine inplace expand runtime_error (#8561)

* Refine inplace expand runtime_error

* Opt

* Refine

* Add Note

* OneEmbedding use malloc async (#8543)

* in out ptrs

* ops and test

* test pass

* prefetch tmp buffer

* embedding shuffle tmp buffer

* gradient shuffle

* tmp buffer size

* mem pool

* cuda 11.2

* add id_shuffle to setNumunique in update tests

* default not use dynamic alloc

* fix of_tidy

* add fused op

* address review

* init tmp_buffer

* mv memset

* fix

* one_embedding fused_lookup_init_cast and fused_update_put (#8564)

* add fused op

* mv memset

* fix

* address review

* rm fullcache n_missing check

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cpu aligned_alloc size (#8569)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add flow norm (#8535)

* add flow norm

* rm import

* rm  doctest.testmod

* fix pad_packed_sequence method input requires_grad==True (#8574)

* fix pad_packed_sequence method input requires_grad==True

* fix append error when batch_first=True

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix embedding manager tmp buffer (#8585)

* fix embedding manager

* format

* fix reduce_ops 0size bug (#8551)

* fix reduce_ops 0size bug

* fix comment

* auto format by CI

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Align Momentum Optimizer (#8549)

* fix momentum update

* align momentum

* fix bug and finish eager unittest

* Support Graph optimizer

* fix momentum bug

* refine beta

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fill GetSbp bug and consistent test bug (#8576)

fix(FillOp): fill GetSbp bug and consistent test bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev Fully fused MLP Grad[OneEmbedding] (#8462)

* support fully fused mlp grad in eager

* support lazy backward

* fix output size

* add fallback to tmp_buf logic when ones buffer is not enough

* build sbp

* overlap allreduce

* fix overlap order

* fix format

* CUDA Graphs delayed capture

* Add ifcomm create for graph

* insert weight event roughly

* fix dbias allreduce error

* simplify code

* Add 11060 limit

* Remove print

* Rename

* fix fill bug and remove comm to cache

* Rename variable and add debug code for cache

* Use kernel state and fix bug

* remove print

* fix allreduce dbias bug

* fix header file

* fix comment

* remove redundant headerfile

* fix userops build error

* refine

* init nccl comm before execute kernel

* fix comment

Co-authored-by: liujuncheng <[email protected]>

* rename mirrored to local (#8503)

* rename mirrored to local

* rename files

* rename files

* auto format by CI

* revert change of package_mirror.py

* rename LocalObject to Dependence

* rename fn LocalObject to Dependence

* merge master

* handle clang check

* fix

* refine

* rename local_object to dependence

Co-authored-by: oneflow-ci-bot <[email protected]>

* Implement BroadcastElementwiseUnary primitive (#8384)

* Add code skeleton for broadcast unary primitive

* first try

* finish impl

* finish impl

* format

* fix build error

* address review

* refine

* address review comments

* use broadcast unary primitive in fill_tensor_ kernel

* handle pack tail statically

* fix

* address review

* address review

* Fix SimplifyBroadcastDims

* fix

* revert fill_kernel

Co-authored-by: Juncheng <[email protected]>

* skip cpu autotest for graph global (#8593)

* TODO

* skip cpu autotest for graph global

* Refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add function_library.h Exception (#8241)

* add RuntimeError for checking

* add RuntimeError to CHECK_EQ

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Refactor shrink (#8573)

* caching allocator

* auto format by CI

* Update ep_device_context.h

* EpDeviceCtx with CachingAllocator

* rm RawAllocator typename

* auto format by CI

* specific allo in EpDeviceCtx

* auto format by CI

* rm outdated alloc

* simplify thread safe guard

* auto format by CI

* avoid return mutex

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Speed up SliceKernel (#8589)

* perf(SliceKernel): decrease the number of cuda kernels and speed up

* perf(SliceKernel): use old kernel when small tensor is all fullslice

* use std::copy to copy contiguous memory

* fix cpu kernel bug

* Update readme and vsn for 0.8.0 (#8600)

* update version

* remove py3.6

* modify some file and improve error message (#8592)

* modify some file and improve error message

* modify scalar_by_tensor_op.cpp

* Update scalar_by_tensor_op.cpp

* Update slice_op.cpp

* Update test_slice_op.py

* Update test_slice_op.py

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rename consistent to global (#8505)

* rename consistent to global

* rename consistent to global

* rename files

* rename files

* refine

* auto format by CI

* refine

* fix clang check

* fix

* fix

* fix

* rm to_consistent docs

* auto format by CI

* refine

* fix

* fix

* revert changes

* auto format by CI

* revert changes

* revert changes

* rename

* rename

Co-authored-by: oneflow-ci-bot <[email protected]>

* add module related container docs (#8580)

* add module related container docs

* auto format by CI

* fix comment

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix rnn util extra memory usage when requires_grad=False (#8603)

* fix rnn util extra memory usage when requires_grad=False

* add comments

* refine comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* use bracket format slice in tensor str (#8489)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Perf TensorInfo constructor (#8606)

* perf(Autograd): perf TensorInfo constructor

* rename consistent to global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* print operators' python location when print nn_graph (#8558)

1. add a flag in nn.Graph.debug() named print_op_loc for printing operator locations.
2. add a flag in nn.Graph.debug() named only_print_user_code_loc for printing only the locations in users' code

* Add randint like (#8598)

* add randint_like op

* add docs for random

* refine

* auto format by CI

* add randint_like global test

* refine doc

* refine randint_like docs

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add full_like api (#8595)

* add full_like_op api

* refine

* add test

* refine

* refine docs

* refine

* add consistent_full test

* add full_like op

* fix docs comment

* change scalar sbp return value from list to tuple

* auto format by CI

* merge conflict

* revert

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cumsum GenBackwardOpConfFn (#8604)

* fix cumsum GenBackwardOpConfFn

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* revert change (#8613)

* fix test graph optimization conf CI bug (#8617)

* restore resource config after random tests

* refine

* refine

* Release pod tensor (#8552)

* ThreadLocalGuard

* split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor.

* rename

Co-authored-by: luyang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add param group for optimizer (#8611)

* add add_param_group interface for Optimize

* add test for add_param_group

* revert

* fix comment

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix broadcast_elementwise_binary cpu (#8625)

fix broadcast_elementwise_binary_cpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* align exception msg to torch (#8627)

* align exception msg to torch

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* skip unstable global test in ci, reduce failure rate (#8635)

* fuse embedding interaction (#8586)

* fuse embedding interaction

* fix of_tidy

* refine

* fix

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix flip gen backward opconf (#8605)

* fix flip gen backward opconf

* use new opconf api

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597)

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED

* refine

* use MAP_POPULATE

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Profiling main thread (#8601)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

Co-authored-by: binbinHan <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fully Memory Log V2 with more details (#8565)

* Fully Memory Log V2 with more details

* refine log and long op name

* fix clang tidy

* fix test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Stream policy (#8590)

* ThreadLocalGuard

* refactor signature of StreamType::InitDeviceCtx

* refactor hint

* add StreamPolicy

* remove DeviceCtx args

* refine OpCallInstructionUtil::Prepare & Compute

* merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy

* minor fix

* minor fix

* del useless code

* fix error

* fix merge error

* fix segment fault bug

* fix compile error

* del methods belong to Subclass

* resolve comment

Co-authored-by: binbinHan <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fully support for broadcast matmul (#6937)

* fix arange bug

* fully support broadcast matmul

* add more check

* remove check

* add fully sbp

* fix full sbp

* Fix broadcast matmul grad

* remove old broadcast matmul grad

* add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce

* add lazy backward

* Add restrict when transpose_a is false we can use bmatmul_grad_b

* revert

* fix broadcast matmul backward

* fix single client dispatch matmul logic

* revert old bcast matmul grad b kernel

* fix eager functional matmul backward

* add more test case

* remove redundant code

* add more special case

* when b num axes is 2, we only save tensor a

* fix annotation

* fix conflict and format

* remove single client matmul code

* Fix eval error

* fix conflict

* fix unittest

* Add init value

* support matrix vector matmul

* add vector matrix product

* Use matmul primitive to rewrite matrix vector product forward and backward

* Add fullllllllly support for vector matrix product

* Fix sbp

* fix bug

* add unittest

* Add consistent test for broadcast matmul

* Remove redundant code

* fix userops annotation

* fix

* refine

* Fix clang static analysis

* fix clang analysis

* set check graph as false

* fix

* fix for unittest

* fix broadcast sbp bug

* try to fix unittest

* Fix consistent test

* fix multiplier to 4 for unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert "skip cpu autotest for graph global" (#8608)

* Revert "skip cpu autotest for graph global (#8593)"

This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0.

* cherry pick from master

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add tmp_buffer allocator (#8588)

* fix embedding manager

* format

* refine embedding_manager tmp_buffer allocator

* fix

* format

* refine

* refine

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* refine error msg for some user ops (#8579)

* refine error msg for some user ops

* refine error msg for some user ops

* optimize

* optimize the writing

* optimize the writing

* optimize the writing

* auto format by CI

* optimize writing

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add tril fill value (#8655)

add tril fill value

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix_non_pod_data_allocate_bug (#8657)

Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix norm (#8629)

* fix norm

* add doc

* add bool &

* update math_functor.cpp

* add note

* fix_decorate_mem_leak_bug_in_eager_boxing (#8661)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for leaky_relu and negative op (#8643)

* add higher derivative for leakyrelu and negative

* fix a typo

* remove functor

* add initialize alpha

* fix incorrect dim size in global test

* fix incorrect dim size in global test

* optimize testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* update oneflow intro to show the difference (#8669)

* update oneflow intro

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine oneflow intro

* Stacked error (#8671)

* ThreadLocalGuard

* StackedError

* StackedError

Co-authored-by: Shenghang Tsai <[email protected]>

* Refactor tensor initializer (#8626)

* fix(*): fix xavier_initializer

* refactor(Initializer): refactor initializer

* fix function name

* auto format by CI

* refine

* fix interface in tensor.py

* fix(trunc_normal_): fix init bug and add test

* auto format by CI

* fix bug

* add oneflow.nn.init.normal_ test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Fix nn doc (#8650)

* fix hsplit doc

* add doc for module

* fix dtype

* fix formula

* add ref

* fix row length

* Fix reduce max min bool dtype bug (#8651)

* fix reduce_max_min_bool_dtype

* fix bug

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Remove redundant exception wrapper (#8631)

* remove redundant ExceptionWrapper

* refine KeyErrorMessage

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Refactor MemoryCase to eliminate determine statements of device_type (#7727)

* ref memory_case_util

* ref BlobObject::CheckMemCase

* ref mem_case using

* address review

* address review

* namespace memcase -> memory

* fix conflict

* address review

* address static analysis

* rm check

* cpu device_id is always 0

* fix conflict

* timeout-minutes: 50

* revert change

* increase thrd limit in container

* skip 2x2 TestEinsumConsistent

* skip failed case of distributed test

* auto format by CI

* fix_non_pod_data_allocate_bug

Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: tsai <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: clackhan <[email protected]>

* fix some data races in c++ api and SteadyVector (#8654)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <[email protected]>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Fix sin/cos higher order derivative (#8648)

* fix(GradGrad): fix sin/cos higher order derivative

* fix(GradGrad): fix calculate error

* refine autograd global test

* auto format by CI

* refine sin/cos grad_grad calculate

* fix static analysis

* merge conflict

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Ping Zhu <[email protected]>
Co-authored-by: Zhu, Ping <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_eager_boxing_to_adapt_ep (#8568)

* refine_eager_boxing_to_adapt_ep

* fix typo

* refine

* refine symmetric-acyclic-nd-sbp-to-nd-sbp

* refine

* fix error

* fix static check

* add NOLINT

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix repeat bug (#8645)

* make result contiguous

* add test case

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Instruction policy (#8583)

* ThreadLocalGuard

* vm::InstructionPolicy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* rm include stream type

* fix stream type

* auto format by CI

Co-authored-by: Yu OuYang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* handle non-contiguous input (#8665)

* handle non-contiguous input

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* rename define CONSISTENT to GLOBAL (#8652)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine naive interpret (#8672)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* explicit scalar initialization

Co-authored-by: clackhan <[email protected]>

* Rebuild Docs V0.8.0 (#8392)

* rebuild for 5 module

* fix bug

* fix for doctree and content  in nn and

* fix

* fix

* fix

* add some

* fix for oneflow.rst

* update oneflow oneflow.nn

* update tensor

* update tensor module

* update

* test

* update

* update

* fix for undone desc

* docs: oneflow.utils.data (#8485)

* feat(utils.data): add oneflow.utils.data

* docs(dataloader): change the docstring of DataLoader

* docs(tensor): add methods to oneflow.Tensor document

* docs(optim): change docstring of optimizer and add a note to the document

* nn.graph

* fix for graph

* fix bug

* review nn and linalg document (#8515)

* docs(nn): add contents to oneflow.nn document

* docs(linalg): refactor oneflow.linalg document

* change attributes.rst and review nn.functional.rst (#8514)

* change attributes.rst and review nn.functional.rst

* reconstruction oneflow.cuda

* fix cuda and rebuild comm demo (#8582)

* update image

* add distributed

* oneembedding & refine graph

* update for distributed one_embedding

* fix rnn.py (#8616)

* Refactor oneflow.nn.init documentation (#8622)

docs(nn.init): refactor nn.init document

* docs(nn.init): remove the comments

* docs(utils.data): remove the comments

* update and fix bug

* docs(review): refine the documents (#8646)

* docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules

* docs(optim): modify the code examples

* docs(tensor): edit note

* Refactor oneflow.autograd documentation (#8594)

* docs(autograd): refactor oneflow.autograd

* docs(autograd): edit "Default gradient layouts".

* docs(autograd): reedit "Default gradient layouts"

* docs(autograd): add comment

* docs(autograd): add reference

* update

* docs(tensor): change autoclass to autosummary

* update

* update

* add oneflow.linalg.diagonal (#8653)

* docs(linalg): add oneflow.linalg.diagonal

* update environment variable

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* update environment variable

* update for ev & distributed

* update distributed

* update ev

* update distribute desc

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* update

* Revise docstring descriptions (#8656)

* docs: move pytorch reference to end

* docs: add some docstring

* docs(refs): add refs

* Update docs/source/distributed.rst

* update for distributed details and environment_variable

* docs(docstring): Modify all reference links to version 1.10 (#8663)

* fix bug

* fix bug

* fix all warning

Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: liu xuan <[email protected]>
Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: laoliu97 <[email protected]>
Co-authored-by: Yao Chi <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>

* Fix zeros like and ones_like api (#8632)

* fix zeros_like and ones_like bug

* refine

* revert

* refine

* fix tensor_slice_view infer physic_shape bug

* add test

* refine

* auto format by CI

* fix bug

* refine

* auto format by CI

* fix import error

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix sbp print bug (#8689)

* Add a normal priority with no transfer but different sbp

* Fix the bug for printing no boxing edge

* Do not use P for weights

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* eager_local_interpreter_with_infer_cache (#8619)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gelu nn.Module bug and support tanh mode. (#8693)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless headfile

* format

* rm doc of tensor.gelu (#8696)

Co-authored-by: Shanshan Zhong <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug in CrossFeatureInteraction LazyBackward (#8677)

fix bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix floating-point scalar tensor in arange (#8673)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn functional fold (#8667)

* add fold

* update fold.py

* add test

* fix doc

* fix comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify some file and improve the error message (#8566)

* modify some file and improve the error message

* modify the content

* modify the content

* auto format by CI

* Update roi_align_op.cpp

* Update roi_align_op.cpp

* Update reshape_user_op_util.cpp

* auto format by CI

* Update roi_align_op.cpp

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* [OneEmbedding] add id_shuffle_copy_out (#8683)

add id_shuffle_copy_out

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix add_param_group step key not match error (#8698)

* fix add_param_group step key not match error

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix for docsv0.8 (#8710)

* fix repeat op 0-size related bug (both in FW and AD) (#8707)

* fix repeat op 0-size related bug (both in FW and AD)

* refine

* refine static check

* refine

* fix comment

* fix comment

* refine

* fix test

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633)

* support alpha list

* Remove redundant modify

* remove redundant alpha set

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug of Tensor.type (#8697)

* fix bug of tensor.type(flow.Tensor)

* fix bug of tensor.type(flow.Tensor) about device

* Fix tensor type doc (#8699)

fix doc of tensor.type

* add test for tensor.type(flow.Tensor)

* move PyTensorMetaCls_CheckExact to header file

Co-authored-by: Shanshan Zhong <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706)

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS

* auto format by CI

Co-authored-by: liujuncheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add qat conv modules (#8368)

* add qat conv modules

* add quantization related modules to doc

* refine qatconv modules doc

* add qat conv module tests

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add unsqueeze_multiple_op (#8714)

* add unsqueeze_multiple_op

* modify the format

* Update functional_api.yaml

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify broadcast_like_op.cpp and add test (#8720)

* modify broadcast_like_op.cpp and add test

* modify broadcast_like_op.cpp

* Update broadcast_like_op.cpp

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* JIT LR (#8500)

* add example code

* Update cosine_annealing_lr.py

* enable self params transformer

* enable pass ast to c++ api

* enable jit backend for lr

* enable jit global register and invoke

* convert Global to Singleton for new merge

* enable pybind11 walk on python ast

* enable test all existent get_lr of oneflow in python

* enable py_ast_wrapper pass ast from python to mlir

* switch all ast to ast-wrapper in mlir scope

* define python ast partially

* partial python ast definition

* trim asdl of python ast

* mlir gen

* add symbol table

* from ast to jit done

* switch llvm::errs() to mlir::emitError and convert switch to typeSwitch

* trim duplicate namespace use

* fix LIT header

* add some docs

* enable compare with or_else, if with return seamless in branch and mutable variable

* trim code and refine struct

* register pybind11 ast node for shared_ptr

* enable cpp class in python

* go through python to mlir to llvm to jit to run

* add addf subf op

* work well on stepLR linearLR exponentialLR cosineDecayLR cosineAnnealingLR constantLR

* enable maxf minf conversion to llvm ir

* rename LR_JIT to LRJITRegister

* remove LR_JIT_Engine and switch Invoke to std::function ret by lookup

* refine struct

* enable bisect_right and python register api have dump option arg

* add bisect_left and bisect_transformer specially, delete former test python script

* remove c++17 standard

* restore double hash to iterator

* publish

* publish

* publish

* use llvm classof and typeswitch rightly

* trim

* commit

* commit

* commit

* commit

* commit

* commit

* auto format by CI

* Update ir.cpp

* Update OneFlowLRJITRegistry.h

* auto format by CI

* Update AstMlirGen.h

* Update lr_jit.cpp

* auto format by CI

* Naming conventions

* auto format by CI

* auto format by CI

* deploy _ behind

Co-authored-by: leaves-zwx <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add logspace (#8599)

* add logspace

* add global test

* restore rand

* fix doc

* rename consistent to global

* adjust import order

* add todo

* Add hann_window (#8615)

* add hann_window

* rm useless include

* add check

* adjust import order

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730)

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE

* add environment to vm.h

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix as strided bool type and view bug (#8713)

* fix as_stride bug

* refine

* refine

* refine

* delete useless head file

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add functional binary cross entropy (#8708)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless headfile

* format

* rm doc of tensor.gelu

* add functional binary cross entropy

Co-authored-by: BBuf <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support map_location in flow.load (#8666)

* support map_location in flow.load

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix tests

Signed-off-by: daquexian <[email protected]>

* fix bug when map_location is None

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Add addcdiv (#8581)

* add addcdiv

* fix tensor_functions

* fix inplace

* add test number

* rename consistent to global

* Inner most dim case for cumsum cumprod op (#8403)

* cumsum use cub scansum in some case

* prod use cub scan

* refine name

* refine

* optimize cum op

* format

* fix

* get device properties by cuda stream class

* revert useless code

* refine

* outer dim use parallel sweep algo

* refine

* fix a fraction of threads hit __syncthreads

* revert

* refine kernel define

* refine

* refine

* refine

* refine

* move comment

* fix

* fix

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output dtype and mut output is dynamic in infer ctx (#8716)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev refactor fuse instruction policy (#8624)

* ThreadLocalGuard

* vm::InstructionPolicy

* refactor fuse instruction policy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* add instruction policy util

* add instruction policy util

* remove include

* add include

* rm fuse instruction type

* Modifying variable properties

* add stream_sequential_dependence_ to instruction_policy

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723)

add condition for assign num_batches_tracked

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add launch master port limit (#8563)

* add launch master port limit

* Update python/oneflow/distributed/launch.py

Co-authored-by: daquexian <[email protected]>

Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix docs import distance (#8691)

* fix import distance

* add functional apis

* add smooth_l1_loss docs

* refine activation.py

* add deleted api

* review

* Add APIs missing from the documentation of oneflow, nn, and other modules (#8704)

* docs: add api

* docs(nn): refactor nn

* review

Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: ChenQiaoling <[email protected]>

* refactor control stream type (#8647)

* refactor control stream type

* auto format by CI

* Add method implementation

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output tensor desc (#8717)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Symbolic local tensor meta (#8662)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* Symbolic LocalTensorMeta

* check shape in critical_section

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* fix split cache and symbolic local tensor meta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

* fix error

* define MutOutputShape and MutOutputStride in InferContext

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* fix static check error

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* split const and mut func in LocalTensorMeta

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* split MutTensorMeta and MutLocalTensorMeta

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

* resolve comment

* refine

* fix typo

Co-authored-by: Houjiang Chen <[email protected]>

* fix typo

* use OpArgsVector

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>

* Feat general basic communication (#8437)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <[email protected]>
Co-authored-by: cheng cheng <[email protected]>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug inform…
Ikkyu321 added a commit to ZJLabDubhe/oneflow-zj that referenced this pull request Aug 26, 2022
* edit tanh to a closure op (#5)

Co-authored-by: yoonlee888 <[email protected]>

* Dev sin loop grad (#7)

* edit tanh to a closure op

* add grad-looped sin_cos_negative

* add test case

Co-authored-by: yoonlee888 <[email protected]>
Co-authored-by: Zhenhua <[email protected]>

* add log_grad_grad (#12)

* Add exp_grad_grad (#11)

* Revert "Dev sin loop grad (#7)" (#13)

This reverts commit c256a5a326d7e04c2ad4af802318661d18f72441.

* fix bugs (#16)

* fix ScalarSub param

* Add test case

* code format

* fix

* add higher order derivative Interface draft (#6)

* add higher order derivative Interface draft

* solve bugs of no Tensor.is_sparse attrs

* rm some Interface comments

* fix & format

Co-authored-by: Zhenhua <[email protected]>
Co-authored-by: Huang Zhenhua <[email protected]>

* add Higher derivative vjp (#9)

* add Higher derivative vjp

* add autotest code

* add autograd.functional.vhp and modified functional

* Merge Testcase

* Rm chinese chars

Co-authored-by: Zhenhua <[email protected]>
Co-authored-by: Huang Zhenhua <[email protected]>

* merge Master into zj/develop (#21)

* Multi Tensor apply Optimizer (#8373)

* Add optim_cast and modify sgd

* Remove

* try to add fuseUpdatecast pass logic

* use pass

* still have bug in inplace

* ban inplace and fix sgd update

* fix regst num

* add env var

* remove cuda graph wrong use

* add support for graph

* initialize

* add functional impl

* add simple job rewrite

* delete redundant sgd update kernel

* support half

* add kernel

* use single loop kernel

* refine

* when in eval mode, we turn off multi tensor update

* refine format

* use juncheng kernel

* Refine

* group multi tensor op by some attr

* add parallel conf to key

* refine

* Add unroll logic

* fix bug

* restruct

* use pointer list

* add adam kernel

* support multi tensor adam update

* Remove cpu

* support skip if and scale by tensor

* support sgd adam unittest

* add more check

* Remove config

* Restruct tensorparams

* support fused cast in multi tensor update

* support cast in multi tensor

* fix bug in model update cast pass

* fix multi tensor sgd update with cast Pass check logic

* refine

* support multi tensor adam update with cast

* refine format

* Remove redundant template args

* merge modify for fused cast

* only allow fused cast in train mode

* only support data parallel in multi tensor update

* rewrite fuse update cast pass logic

* remove redundant if

* fix format

* add new line

* rename

* Remove print

* rename and add LOG

* Add more type and test

* still have bug in multi tensor adam

* Fix multi tensor adam update bug

* add multi tensor adam update with cast test

* simplify code

* fix format

* Add model diff datatype in optimizer key

* remove random seed

* fix comment

* fix comment

* fix to use model copy

* use for loop

* Fix comment

* use hashcombine

* fix clang analysis error

* add with cuda macro

* fix env var in unittest

* remove redundant unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix doc and ops template auto gen (#8546)

* fix doc and add op calculator

* fix bug

* fix gen_ops

* fix diag 0size tensor shape infer bug (#8557)

* fix diag 0size tensor shape infer bug

* refine

* refine

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Format tensor on cpu (#8548)

* Format tensor on cpu

* use tensor.detach

* Remove useless WITH_CUDAs (#8562)

* unique identity (#8509)

* unique identity

* fix

* add identity name

* rm debug log

* mv identity from class to graph

* auto format by CI

* fix unique iden with having multiple stage

* auto format by CI

* Update block.py

Co-authored-by: cheng cheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add GenericStreamContext (#8560)

* Modify some file and add test (#8556)

* Modify some file and add test

* modify the content

* modify the format and test function name

* modify the format and aligned with pytorch

* delete print

* modify the function name

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Move some op into amp gray list (#8545)

enlarge gray list

Co-authored-by: cheng cheng <[email protected]>

* Refine inplace expand runtime_error (#8561)

* Refine inplace expand runtime_error

* Opt

* Refine

* Add Note

* OneEmbedding use malloc async (#8543)

* in out ptrs

* ops and test

* test pass

* prefetch tmp buffer

* embedding shuffle tmp buffer

* gradient shuffle

* tmp buffer size

* mem pool

* cuda 11.2

* add id_shuffle to setNumunique in update tests

* default not use dynamic alloc

* fix of_tidy

* add fused op

* address review

* init tmp_buffer

* mv memset

* fix

* one_embedding fused_lookup_init_cast and fused_update_put (#8564)

* add fused op

* mv memset

* fix

* address review

* rm fullcache n_missing check

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cpu aligned_alloc size (#8569)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add flow norm (#8535)

* add flow norm

* rm import

* rm  doctest.testmod

* fix pad_packed_sequence method input requires_grad==True (#8574)

* fix pad_packed_sequence method input requires_grad==True

* fix append error when batch_first=True

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix embedding manager tmp buffer (#8585)

* fix embedding manager

* format

* fix reduce_ops 0size bug (#8551)

* fix reduce_ops 0size bug

* fix comment

* auto format by CI

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Align Momentum Optimizer (#8549)

* fix momentum update

* align momentum

* fix bug and finish eager unittest

* Support Graph optimizer

* fix momentum bug

* refine beta

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fill GetSbp bug and consistent test bug (#8576)

fix(FillOp): fill GetSbp bug and consistent test bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev Fully fused MLP Grad[OneEmbedding] (#8462)

* support fully fused mlp grad in eager

* support lazy backward

* fix output size

* add fallback to tmp_buf logic when ones buffer is not enough

* build sbp

* overlap allreduce

* fix overlap order

* fix format

* CUDA Graphs delayed capture

* Add ifcomm create for graph

* insert weight event roughly

* fix dbias allreduce error

* simplify code

* Add 11060 limit

* Remove print

* Rename

* fix fill bug and remove comm to cache

* Rename variable and add debug code for cache

* Use kernel state and fix bug

* remove print

* fix allreduce dbias bug

* fix header file

* fix comment

* remove redundant headerfile

* fix userops build error

* refine

* init nccl comm before execute kernel

* fix comment

Co-authored-by: liujuncheng <[email protected]>

* rename mirrored to local (#8503)

* rename mirrored to local

* rename files

* rename files

* auto format by CI

* revert change of package_mirror.py

* rename LocalObject to Dependence

* rename fn LocalObject to Dependence

* merge master

* handle clang check

* fix

* refine

* rename local_object to dependence

Co-authored-by: oneflow-ci-bot <[email protected]>

* Implement BroadcastElementwiseUnary primitive (#8384)

* Add code skeleton for broadcast unary primitive

* first try

* finish impl

* finish impl

* format

* fix build error

* address review

* refine

* address review comments

* use broadcast unary primitive in fill_tensor_ kernel

* handle pack tail statically

* fix

* address review

* address review

* Fix SimplifyBroadcastDims

* fix

* revert fill_kernel

Co-authored-by: Juncheng <[email protected]>

* skip cpu autotest for graph global (#8593)

* TODO

* skip cpu autotest for graph global

* Refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add function_library.h Exception (#8241)

* add RuntimeError for checking

* add RuntimeError to CHECK_EQ

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Refactor shrink (#8573)

* caching allocator

* auto format by CI

* Update ep_device_context.h

* EpDeviceCtx with CachingAllocator

* rm RawAllocator typename

* auto format by CI

* specific allo in EpDeviceCtx

* auto format by CI

* rm outdated alloc

* simplify thread safe guard

* auto format by CI

* avoid return mutex

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Speed up SliceKernel (#8589)

* perf(SliceKernel): decrease number of cuda kernels and speed up

* perf(SliceKernel): use old kernel when small tensor is all fullslice

* use std::copy to copy contiguous memory

* fix cpu kernel bug

* Update readme and vsn for 0.8.0 (#8600)

* update version

* remove py3.6

* modify some file and improve error message (#8592)

* modify some file and improve error message

* modify scalar_by_tensor_op.cpp

* Update scalar_by_tensor_op.cpp

* Update slice_op.cpp

* Update test_slice_op.py

* Update test_slice_op.py

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rename consistent to global (#8505)

* rename consistent to global

* rename consistent to global

* rename files

* rename files

* refine

* auto format by CI

* refine

* fix clang check

* fix

* fix

* fix

* rm to_consistent docs

* auto format by CI

* refine

* fix

* fix

* revert changes

* auto format by CI

* revert changes

* revert changes

* rename

* rename

Co-authored-by: oneflow-ci-bot <[email protected]>

* add module related container docs (#8580)

* add module related container docs

* auto format by CI

* fix comment

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix rnn util extra memory usage when requires_grad=False (#8603)

* fix rnn util extra memory usage when requires_grad=False

* add comments

* refine comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* use bracket format slice in tensor str (#8489)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Perf TensorInfo constructor (#8606)

* perf(Autograd): perf TensorInfo constructor

* rename consistent to global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* print operators' python location when print nn_graph (#8558)

1. add a flag named print_op_loc to nn.Graph.debug() for printing operator locations.
2. add a flag named only_print_user_code_loc to nn.Graph.debug() for printing only the user's code locations.

* Add randint like (#8598)

* add randint_like op

* add docs for random

* refine

* auto format by CI

* add randint_like global test

* refine doc

* refine randint_like docs

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add full_like api (#8595)

* add full_like_op api

* refine

* add test

* refine

* refine docs

* refine

* add consistent_full test

* add full_like op

* fix docs comment

* change scalar sbp return value from list to tuple

* auto format by CI

* merge conflict

* revert

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cumsum GenBackwardOpConfFn (#8604)

* fix cumsum GenBackwardOpConfFn

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* revert change (#8613)

* fix test graph optimization conf CI bug (#8617)

* restore resource config after random tests

* refine

* refine

* Release pod tensor (#8552)

* ThreadLocalGuard

* split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor.

* rename

Co-authored-by: luyang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add param group for optimizer (#8611)

* add add_param_group interface for Optimize

* add test for add_param_group

* revert

* fix comment

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix broadcast_elementwise_binary cpu (#8625)

fix broadcast_elementwise_binary_cpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* align exception msg to torch (#8627)

* align exception msg to torch

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* skip unstable global test in ci, reduce failure rate (#8635)

* fuse embedding interaction (#8586)

* fuse embedding interaction

* fix of_tidy

* refine

* fix

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix flip gen backward opconf (#8605)

* fix flip gen backward opconf

* use new opconf api

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597)

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED

* refine

* use MAP_POPULATE

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Profiling main thread (#8601)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

Co-authored-by: binbinHan <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fully Memory Log V2 with more details (#8565)

* Fully Memory Log V2 with more details

* refine log and long op name

* fix clang tidy

* fix test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Stream policy (#8590)

* ThreadLocalGuard

* refactor signature of StreamType::InitDeviceCtx

* refactor hint

* add StreamPolicy

* remove DeviceCtx args

* refine OpCallInstructionUtil::Prepare & Compute

* merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy

* minor fix

* minor fix

* del useless code

* fix error

* fix merge error

* fix segment fault bug

* fix compile error

* del methods belong to Subclass

* resolve comment

Co-authored-by: binbinHan <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fully support for broadcast matmul (#6937)

* fix arange bug

* fully support broadcast matmul

* add more check

* remove check

* add fully sbp

* fix full sbp

* Fix broadcast matmul grad

* remove old broadcast matmul grad

* add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce

* add lazy backward

* Add restrict when transpose_a is false we can use bmatmul_grad_b

* revert

* fix broadcast matmul backward

* fix single client dispatch matmul logic

* revert old bcast matmul grad b kernel

* fix eager functional matmul backward

* add more test case

* remove redundant code

* add more special case

* when b num axes is 2, we only save tensor a

* fix annotation

* fix conflict and format

* remove single client matmul code

* Fix eval error

* fix conflict

* fix unittest

* Add init value

* support matrix vector matmul

* add vector matrix product

* Use matmul primitive to rewrite matrix vector product forward and backward

* Add fullllllllly support for vector matrix product

* Fix sbp

* fix bug

* add unittest

* Add consistent test for broadcast matmul

* Remove redundant code

* fix userops annotation

* fix

* refine

* Fix clang static analysis

* fix clang analysis

* set check graph as false

* fix

* fix for unittest

* fix broadcast sbp bug

* try to fix unittest

* Fix consistent test

* fix multiplier to 4 for unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert "skip cpu autotest for graph global" (#8608)

* Revert "skip cpu autotest for graph global (#8593)"

This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0.

* cherry pick from master

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add tmp_buffer allocator (#8588)

* fix embedding manager

* format

* refine embedding_manager tmp_buffer allocator

* fix

* format

* refine

* refine

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* refine error msg for some user ops (#8579)

* refine error msg for some user ops

* refine error msg for some user ops

* optimize

* optimize the writing

* optimize the writing

* optimize the writing

* auto format by CI

* optimize writing

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add tril fill value (#8655)

add tril fill value

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix_non_pod_data_allocate_bug (#8657)

Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix norm (#8629)

* fix norm

* add doc

* add bool &

* update math_functor.cpp

* add note

* fix_decorate_mem_leak_bug_in_eager_boxing (#8661)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for leaky_relu and negative op (#8643)

* add higher derivative for leakyrelu and negative

* fix a typo

* remove functor

* add initialize alpha

* fix incorrect dim size in global test

* fix incorrect dim size in global test

* optimize testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* update oneflow intro to show the difference (#8669)

* update oneflow intro

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine oneflow intro

* Stacked error (#8671)

* ThreadLocalGuard

* StackedError

* StackedError

Co-authored-by: Shenghang Tsai <[email protected]>

* Refactor tensor initializer (#8626)

* fix(*): fix xavier_initializer

* refactor(Initializer): refactor initializer

* fix function name

* auto format by CI

* refine

* fix interface in tensor.py

* fix(trunc_normal_): fix init bug and add test

* auto format by CI

* fix bug

* add oneflow.nn.init.normal_ test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Fix nn doc (#8650)

* fix hsplit doc

* add doc for module

* fix dtype

* fix formula

* add ref

* fix row length

* Fix reduce max min bool dtype bug (#8651)

* fix reduce_max_min_bool_dtype

* fix bug

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Remove redundant exception wrapper (#8631)

* remove redundant ExceptionWrapper

* refine KeyErrorMessage

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Refactor MemoryCase to eliminate determine statements of device_type (#7727)

* ref memory_case_util

* ref BlobObject::CheckMemCase

* ref mem_case using

* address review

* address review

* namespace memcase -> memory

* fix conflict

* address review

* address static analysis

* rm check

* cpu device_id is always 0

* fix conflict

* timeout-minutes: 50

* revert change

* increase thrd limit in container

* skip 2x2 TestEinsumConsistent

* skip failed case of distributed test

* auto format by CI

* fix_non_pod_data_allocate_bug

Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: tsai <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: clackhan <[email protected]>

* fix some data races in c++ api and SteadyVector (#8654)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <[email protected]>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Fix sin/cos higher order derivative (#8648)

* fix(GradGrad): fix sin/cos higher order derivative

* fix(GradGrad): fix calculate error

* refine autograd global test

* auto format by CI

* refine sin/cos grad_grad calculate

* fix static analysis

* merge conflict

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Ping Zhu <[email protected]>
Co-authored-by: Zhu, Ping <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_eager_boxing_to_adapt_ep (#8568)

* refine_eager_boxing_to_adapt_ep

* fix typo

* refine

* refine symmetric-acyclic-nd-sbp-to-nd-sbp

* refine

* fix error

* fix static check

* add NOLINT

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix repeat bug (#8645)

* make result contiguous

* add test case

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* Instruction policy (#8583)

* ThreadLocalGuard

* vm::InstructionPolicy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* rm include stream type

* fix stream type

* auto format by CI

Co-authored-by: Yu OuYang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* handle non-contiguous input (#8665)

* handle non-contiguous input

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* rename define CONSISTENT to GLOBAL (#8652)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine naive interpret (#8672)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* explicit scalar initialization

Co-authored-by: clackhan <[email protected]>

* Rebuild Docs V0.8.0 (#8392)

* rebuild for 5 module

* fix bug

* fix for doctree and content  in nn and

* fix

* fix

* fix

* add some

* fix for oneflow.rst

* update oneflow oneflow.nn

* update tensor

* update tensor module

* update

* test

* update

* update

* fix for undone desc

* docs: oneflow.utils.data (#8485)

* feat(utils.data): add oneflow.utils.data

* docs(dataloader): change the docstring of DataLoader

* docs(tensor): add methods to oneflow.Tensor document

* docs(optim): change docstring of optimizer and add a note to the document

* nn.graph

* fix for graph

* fix bug

* review nn and linalg document (#8515)

* docs(nn): add contents to oneflow.nn document

* docs(linalg): refactor oneflow.linalg document

* change attributes.rst and review nn.functional.rst (#8514)

* change attributes.rst and review nn.functional.rst

* reconstruction oneflow.cuda

* fix cuda and rebuild comm demo (#8582)

* update image

* add distributed

* oneembedding & refine graph

* update for distributed one_embedding

* fix rnn.py (#8616)

* Refactor oneflow.nn.init documentation (#8622)

docs(nn.init): refactor nn.init document

* docs(nn.init): remove the comments

* docs(utils.data): remove the comments

* update and fix bug

* docs(review): refine the documents (#8646)

* docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules

* docs(optim): modify the code examples

* docs(tensor): edit note

* Refactor oneflow.autograd documentation (#8594)

* docs(autograd): refactor oneflow.autograd

* docs(autograd): edit "Default gradient layouts".

* docs(autograd): reedit "Default gradient layouts"

* docs(autograd): add comment

* docs(autograd): add reference

* update

* docs(tensor): change autoclass to autosummary

* update

* update

* add oneflow.linalg.diagonal (#8653)

* docs(linalg): add oneflow.linalg.diagonal

* update environment variable

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* update environment variable

* update for ev & distributed

* update distributed

* update ev

* update distribute desc

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <[email protected]>

* update

* Modify docstring descriptions (#8656)

* docs: move pytorch reference to end

* docs: add some docstring

* docs(refs): add refs

* Update docs/source/distributed.rst

* update for distributed details and environment_variable

* docs(docstring): Modify all reference links to version 1.10 (#8663)

* fix bug

* fix bug

* fix all warning

Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: liu xuan <[email protected]>
Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: laoliu97 <[email protected]>
Co-authored-by: Yao Chi <[email protected]>
Co-authored-by: Houjiang Chen <[email protected]>

* Fix zeros like and ones_like api (#8632)

* fix zeros_like and ones_like bug

* refine

* revert

* refine

* fix tensor_slice_view infer physic_shape bug

* add test

* refine

* auto format by CI

* fix bug

* refine

* auto format by CI

* fix import error

* fix bug

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix sbp print bug (#8689)

* Add a normal priority with no transfer but different sbp

* Fix the bug for printing no boxing edge

* Do not use P for weights

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>

* eager_local_interpreter_with_infer_cache (#8619)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gelu nn.Module bug and support tanh mode. (#8693)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless header file

* format

* rm doc of tensor.gelu (#8696)

Co-authored-by: Shanshan Zhong <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug in CrossFeatureInteraction LazyBackward (#8677)

fix bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix floating-point scalar tensor in arange (#8673)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn functional fold (#8667)

* add fold

* update fold.py

* add test

* fix doc

* fix comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify some file and improve the error message (#8566)

* modify some file and improve the error message

* modify the content

* modify the content

* auto format by CI

* Update roi_align_op.cpp

* Update roi_align_op.cpp

* Update reshape_user_op_util.cpp

* auto format by CI

* Update roi_align_op.cpp

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* [OneEmbedding] add id_shuffle_copy_out (#8683)

add id_shuffle_copy_out

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix add_param_group step key not match error (#8698)

* fix add_param_group step key not match error

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix for docsv0.8 (#8710)

* fix repeat op 0-size related bug (both in FW and AD) (#8707)

* fix repeat op 0-size related bug (both in FW and AD)

* refine

* refine static check

* refine

* fix comment

* fix comment

* refine

* fix test

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633)

* support alpha list

* Remove redundant modify

* remove redundant alpha set

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug of Tensor.type (#8697)

* fix bug of tensor.type(flow.Tensor)

* fix bug of tensor.type(flow.Tensor) about device

* Fix tensor type doc (#8699)

fix doc of tensor.type

* add test for tensor.type(flow.Tensor)

* move PyTensorMetaCls_CheckExact to header file

Co-authored-by: Shanshan Zhong <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706)

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS

* auto format by CI

Co-authored-by: liujuncheng <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add qat conv modules (#8368)

* add qat conv modules

* add quantization related modules to doc

* refine qatconv modules doc

* add qat conv module tests

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add unsqueeze_multiple_op (#8714)

* add unsqueeze_multiple_op

* modify the format

* Update functional_api.yaml

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify broadcast_like_op.cpp and add test (#8720)

* modify broadcast_like_op.cpp and add test

* modify broadcast_like_op.cpp

* Update broadcast_like_op.cpp

Co-authored-by: Yinggang Wang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* JIT LR (#8500)

* add example code

* Update cosine_annealing_lr.py

* enable self params transformer

* enable pass ast to c++ api

* enable jit backend for lr

* enable jit global register and invoke

* convert Global to Singleton for new merge

* enable pybind11 walk on python ast

* enable test all existent get_lr of oneflow in python

* enable py_ast_wrapper pass ast from python to mlir

* switch all ast to ast-wrapper in mlir scope

* define python ast partially

* partial python ast definition

* trim asdl of python ast

* mlir gen

* add symbol table

* from ast to jit done

* switch llvm::errs() to mlir::emitError and convert switch to typeSwitch

* trim duplicate namespace use

* fix LIT header

* add some docs

* enable compare with or_else, if with return seamless in branch and mutable variable

* trim code and refine struct

* register pybind11 ast node for shared_ptr

* enable cpp class in python

* go through python to mlir to llvm to jit to run

* add addf subf op

* work well on stepLR linearLR exponentialLR cosineDecayLR cosineAnnealingLR constantLR

* enable maxf minf conversion to llvm ir

* rename LR_JIT to LRJITRegister

* remove LR_JIT_Engine and switch Invoke to std::function returned by lookup

* refine struct

* enable bisect_right and make the python register api take a dump option arg

* add bisect_left and bisect_transformer specially, delete former test python script

* remove c++17 standard

* restore double hash to iterator

* publish

* publish

* publish

* use llvm classof and typeswitch rightly

* trim

* commit

* commit

* commit

* commit

* commit

* commit

* auto format by CI

* Update ir.cpp

* Update OneFlowLRJITRegistry.h

* auto format by CI

* Update AstMlirGen.h

* Update lr_jit.cpp

* auto format by CI

* Naming conventions

* auto format by CI

* auto format by CI

* deploy _ behind

Co-authored-by: leaves-zwx <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: yuhao <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add logspace (#8599)

* add logspace

* add global test

* restore rand

* fix doc

* rename consistent to global

* adjust import order

* add todo

* Add hann_window (#8615)

* add hann_window

* rm useless include

* add check

* adjust import order

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730)

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE

* add environment to vm.h

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix as strided bool type and view bug (#8713)

* fix as_stride bug

* refine

* refine

* refine

* delete useless header file

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add functional binary cross entropy (#8708)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless header file

* format

* rm doc of tensor.gelu

* add functional binary cross entropy

Co-authored-by: BBuf <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support map_location in flow.load (#8666)

* support map_location in flow.load

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* fix tests

Signed-off-by: daquexian <[email protected]>

* fix bug when map_location is None

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Add addcdiv (#8581)

* add addcdiv

* fix tensor_functions

* fix inplace

* add test number

* rename consistent to global

* Inner most dim case for cumsum cumprod op (#8403)

* cumsum use cub scansum in some case

* prod use cub scan

* refine name

* refine

* optimize cum op

* format

* fix

* get device properties by cuda stream class

* revert useless code

* refine

* outer dim use parallel sweep algo

* refine

* fix a fraction of threads hit __syncthreads

* revert

* refine kernel define

* refine

* refine

* refine

* refine

* move comment

* fix

* fix

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output dtype and mut output is dynamic in infer ctx (#8716)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev refactor fuse instruction policy (#8624)

* ThreadLocalGuard

* vm::InstructionPolicy

* refactor fuse instruction policy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* add instruction policy util

* add instruction policy util

* remove include

* add include

* rm fuse instruction type

* Modifying variable properties

* add stream_sequential_dependence_ to instruction_policy

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723)

add condition for assign num_batches_tracked

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add launch master port limit (#8563)

* add launch master port limit

* Update python/oneflow/distributed/launch.py

Co-authored-by: daquexian <[email protected]>

Co-authored-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix docs import distance (#8691)

* fix import distance

* add functional apis

* add smooth_l1_loss docs

* refine activation.py

* add deleted api

* review

* Add APIs missing from the documentation of oneflow, nn, and other modules (#8704)

* docs: add api

* docs(nn): refactor nn

* review

Co-authored-by: Guoliang Cheng <[email protected]>
Co-authored-by: ChenQiaoling <[email protected]>

* refactor control stream type (#8647)

* refactor control stream type

* auto format by CI

* Add method implementation

* refine

* refine

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output tensor desc (#8717)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Symbolic local tensor meta (#8662)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* Symbolic LocalTensorMeta

* check shape in critical_sectio

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* fix split cache and symbolic local tensor meta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

* fix error

* define MutOutputShape and MutOutputStride in InferContext

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* fix static check error

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* split const and mut func in LocalTensorMeta

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* split MutTensorMeta and MutLocalTensorMeta

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

* resolve comment

* refine

* fix typo

Co-authored-by: Houjiang Chen <[email protected]>

* fix typo

* use OpArgsVector

Co-authored-by: lixinqi <[email protected]>
Co-authored-by: Li Xinqi <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>

* Feat general basic communication (#8437)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <[email protected]>

* print bandwidth

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <[email protected]>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple uses of nn graph with different inputs (via jobname)

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* pulish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <[email protected]>

* refine cmake

Signed-off-by: daquexian <[email protected]>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <[email protected]>

* fix reviews

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <[email protected]>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* update cmake

Signed-off-by: daquexian <[email protected]>

* update cmake

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <[email protected]>

* override some methods to set is_initialized_

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <[email protected]>

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <[email protected]>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* do not run test_shutting_down.py in distributed mode

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <[email protected]>
Co-authored-by: chengtbf <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Xiaoyu Xu <[email protected]>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support constructing np array from oneflow tensor (#7970)

* support constructing np array from oneflow tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advanced indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 support for advanced indexing

* auto format by CI

Co-authored-by: Wang Yi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <[email protected]>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <[email protected]>

* rename eager.multi_client to eager

Signed-off-by: daquexian <[email protected]>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* …
3 participants