Abnormal fusion time #136

Open
wly2020-robot opened this issue Mar 5, 2021 · 109 comments

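@wly2020-robot

Hello. Following your earlier tuning suggestions, I adjusted the parameters const int tileRows = 32; const int tileCols = 256; const int num_threads_ = 4; in an arm_linux environment. In the detection timings, the fusion time becomes abnormal every so often: the deviation reaches 60-100 ms, whereas it is normally 10-30 ms. I have been tuning for quite a while and the problem is still not solved. Is there anything else that could be optimized? Thanks.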

@meiqua
Owner

meiqua commented Mar 5, 2021

Is the abnormal timing in the fusion stage or in the matching stage?

@wly2020-robot
Author

It is the first fusion time measurement in the fusion stage.

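@meiqua
Owner

meiqua commented Mar 6, 2021

Thinking about it, one place is quite likely to cause fluctuation: memory allocation.
This part is multithreaded, and constantly freeing and reallocating memory may be more than an embedded-class platform like this can handle. You could try allocating one fixed block of memory up front.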

@meiqua
Owner

meiqua commented Mar 6, 2021

This can be done with C++ placement new.
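To make the suggestion concrete, here is a minimal sketch of constructing objects inside a block of memory that is allocated once and reused. It is not code from the repository: `FixedArena` and `TileBuffer` are hypothetical names invented for illustration, and the real buffer_0/buffer_1 types may need a different layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-tile scratch object; not a library type.
struct TileBuffer {
    int rows;
    int cols;
    TileBuffer(int r, int c) : rows(r), cols(c) {}
};

// Owns one block of raw storage that is allocated once and reused.
class FixedArena {
public:
    explicit FixedArena(std::size_t bytes) : storage_(bytes), used_(0) {}

    // Constructs a T inside the preallocated storage with placement new.
    // Returns nullptr when the arena is full. No heap allocation happens here.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(storage_.data());
        const std::uintptr_t mask = std::uintptr_t(alignof(T)) - 1;
        const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset + sizeof(T) > storage_.size()) return nullptr;
        used_ = offset + sizeof(T);
        return new (storage_.data() + offset) T(std::forward<Args>(args)...);
    }

    // Makes the whole block available again, e.g. at the start of a frame.
    // Objects with non-trivial destructors must be destroyed by the caller first.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

A caller would create one arena per worker thread before the loop, call `create<TileBuffer>(...)` inside it, and call `reset()` between frames, so the hot path never touches the system allocator.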

@meiqua
Owner

meiqua commented Mar 6, 2021

It may also help the overall run time a little.

@wly2020-robot
Author

Hmm, I am finding it hard to see where to start in using placement new to give buffer_0 and buffer_1 fixed memory.

@wly2020-robot
Author

I rarely use placement new. As I understand the description, it constructs an object in memory that has already been allocated.

@meiqua
Owner

meiqua commented Mar 6, 2021

It needs some code changes. I'll take a look when I have time.

@wly2020-robot
Author

OK, thank you very much. Looking forward to it.

@meiqua
Owner

meiqua commented Mar 7, 2021

I made a quick change. Try the fix_memo branch.

@wly2020-robot
Author

I merged it in, debugged and tested it. There is almost no change: the first fusion time is still unstable, with sudden jumps.

@wly2020-robot
Author

Debug parameter settings:
const int tileRows = 32;
const int tileCols = 256;
const int num_threads_ = 4;

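@meiqua
Owner

meiqua commented Mar 8, 2021

It is not running on the GPU. In that case a finer-grained profile is needed. You could do this:

  1. Turn OpenMP off first and check, to confirm whether OpenMP is what causes the problem
  2. Time each stage of fusion in more detail, and narrow down which part is jumping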
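One way to time each stage, sketched under assumptions: `StageProfiler` is a hypothetical helper written for this illustration (it is not the timing code in the fix_memo branch), and it is not thread-safe, so it fits step 1 above where OpenMP is switched off.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates wall-clock time per named stage. Single-threaded use only.
class StageProfiler {
public:
    // RAII guard: measures from construction to destruction.
    class Scope {
    public:
        Scope(StageProfiler& profiler, std::string stage)
            : profiler_(profiler),
              stage_(std::move(stage)),
              start_(std::chrono::steady_clock::now()) {}

        ~Scope() {
            const auto end = std::chrono::steady_clock::now();
            const double ms =
                std::chrono::duration<double, std::milli>(end - start_).count();
            profiler_.total_ms_[stage_] += ms;
            profiler_.calls_[stage_] += 1;
        }

        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

    private:
        StageProfiler& profiler_;
        std::string stage_;
        std::chrono::steady_clock::time_point start_;
    };

    // Total milliseconds recorded for a stage, or 0 if it was never timed.
    double total_ms(const std::string& stage) const {
        const auto it = total_ms_.find(stage);
        return it == total_ms_.end() ? 0.0 : it->second;
    }

    // Number of times a stage was timed.
    int calls(const std::string& stage) const {
        const auto it = calls_.find(stage);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, double> total_ms_;
    std::map<std::string, int> calls_;
};
```

Wrapping a block in `{ StageProfiler::Scope s(prof, "update"); ... }` records its duration under that label. `steady_clock` is used because it is monotonic, which matters when the goal is to compare small differences between runs.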

@wly2020-robot
Author

wly2020-robot commented Mar 8, 2021 via email

@wly2020-robot
Author

On Windows, enabling OpenMP makes a clear difference. On Linux the result is about the same with or without it. Could it be that the GPU is not being used?

@meiqua
Owner

meiqua commented Mar 8, 2021

On Linux it is enabled by default. The GPU is not used at all in the first place.

@wly2020-robot
Author

I can basically confirm that the abnormal timing comes from the process function inside the match function.

@wly2020-robot
Author

The process function contains several checks on the _OPENMP macro. It looks like none of the code under _OPENMP is actually running.

@meiqua
Owner

meiqua commented Mar 8, 2021

With OpenMP turned off, does the time still fluctuate?

@meiqua
Owner

meiqua commented Mar 8, 2021

process is the main computation function, so if something fluctuates it is certainly this one. You could time the individual parts inside it.

@wly2020-robot
Author

On my work PC it does not. On the arm-linux development board it still does. I build the code with Qt. How do I configure Qt so that the code under _OPENMP actually runs?

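@meiqua
Owner

meiqua commented Mar 8, 2021

Normally, adding -fopenmp defines this macro automatically. As for the fluctuation, measure the time with OpenMP off first; with a single thread it is easier to pin down which part is responsible.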
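A quick way to check whether the _OPENMP branches are compiled in is to print the macro from the same build. The helper below is a hypothetical sketch, not part of the library. With qmake, the flag is commonly passed via `QMAKE_CXXFLAGS += -fopenmp` and `LIBS += -fopenmp`, though the exact setup for this project and toolchain is an assumption.

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Reports whether this translation unit was compiled with OpenMP support.
std::string openmp_status() {
#ifdef _OPENMP
    // _OPENMP expands to the yyyymm date of the supported OpenMP specification.
    return "OpenMP enabled, _OPENMP=" + std::to_string(_OPENMP) +
           ", max threads=" + std::to_string(omp_get_max_threads());
#else
    return "OpenMP disabled (_OPENMP is not defined)";
#endif
}
```

Printing `openmp_status()` once at startup shows which branch the build took, which separates "OpenMP code is not compiled in" from "OpenMP is on but does not help".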

@wly2020-robot
Author

wly2020-robot commented Mar 8, 2021 via email

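@wly2020-robot
Author

Hello. Through debugging I can basically determine which code segment makes the run time unstable. The segment is:
// update one by one
for(int i=0; i<nodes_private.size(); i++) nodes_private[i]->update();
For the first fusion time, the big for loop in the process function runs 16 times, and the time taken by the segment above differs on each iteration, typically fluctuating between 2 and 38 ms. For the second fusion time, the big for loop runs 4 times, and each execution of the segment is fairly stable at about 5 ms, with little fluctuation.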
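The per-iteration spread described above can be captured with a small helper that records one sample per call and then summarizes minimum, maximum, and mean. This is a sketch with hypothetical names (`time_each_call`, `summarize`, `JitterStats`); the callable stands in for the update loop.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

struct JitterStats {
    double min_ms;
    double max_ms;
    double mean_ms;
};

// Summarizes a list of per-iteration durations in milliseconds.
JitterStats summarize(const std::vector<double>& samples_ms) {
    JitterStats stats{0.0, 0.0, 0.0};
    if (samples_ms.empty()) return stats;
    const auto range =
        std::minmax_element(samples_ms.begin(), samples_ms.end());
    stats.min_ms = *range.first;
    stats.max_ms = *range.second;
    stats.mean_ms =
        std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0) /
        static_cast<double>(samples_ms.size());
    return stats;
}

// Calls fn `iterations` times and returns the duration of each call.
template <typename Fn>
std::vector<double> time_each_call(Fn&& fn, int iterations) {
    std::vector<double> samples_ms;
    samples_ms.reserve(static_cast<std::size_t>(std::max(iterations, 0)));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto end = std::chrono::steady_clock::now();
        samples_ms.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    return samples_ms;
}
```

Comparing `max_ms - min_ms` between the 16-iteration pass and the 4-iteration pass gives a single number for how much worse the first pass jitters.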

@wly2020-robot
Author

The debugging above was done with OpenMP turned off.

@meiqua
Owner

meiqua commented Mar 9, 2021

This update is the main computation. The fix_memo branch now includes timing code; you can use it to see which step inside update fluctuates the most.

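@meiqua
Owner

meiqua commented Mar 12, 2021

What I said earlier was a bit off: the finer the tile partition, the more computation fusion repeats, but the better it parallelizes.
Tiles generally should not be this large. In the extreme, the largest possible tile is just the ordinary algorithm. Try making cols smaller.
In my testing experience, going beyond twice the number of CPU cores in threads gives little further improvement, and beyond that performance drops slightly.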
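The rule of thumb about threads can be expressed as a small clamp. This is an illustrative sketch: `clamp_thread_count` is a hypothetical helper, and the factor of two is taken from the comment above rather than from any setting in the library.

```cpp
#include <algorithm>
#include <thread>

// Limits a requested thread count to the range [1, 2 * cores].
// A reported core count of 0 (unknown) is treated as 1.
int clamp_thread_count(int requested, unsigned hardware_cores) {
    const int cores =
        hardware_cores == 0 ? 1 : static_cast<int>(hardware_cores);
    return std::max(1, std::min(requested, 2 * cores));
}

// Same clamp, using the core count reported by the standard library.
int default_thread_count(int requested) {
    return clamp_thread_count(requested, std::thread::hardware_concurrency());
}
```

On a 4-core board, a request for 16 threads would come back as 8, while the num_threads_ = 4 used earlier in this thread would pass through unchanged.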

@wly2020-robot
Author

wly2020-robot commented Mar 12, 2021 via email

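@meiqua
Owner

meiqua commented Mar 12, 2021

OpenMP only offers environment variables for this. If you want to bind threads in code, you would probably have to write your own thread pool and then set it up following the link mentioned earlier.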
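For reference, binding a thread to a core in code on Linux can look like the sketch below. It assumes glibc's `pthread_getaffinity_np` and `pthread_setaffinity_np`, which are non-portable extensions, and it is not necessarily what the link mentioned above describes. The function name is hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros and the *_np functions
#endif
#include <pthread.h>
#include <sched.h>

// Pins the calling thread to the first CPU it is currently allowed to use.
// Returns that CPU index, or -1 if the affinity could not be read or set.
int pin_current_thread_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        if (pthread_setaffinity_np(pthread_self(), sizeof(target), &target) != 0) {
            return -1;
        }
        return cpu;
    }
    return -1;
}
```

In a hand-written thread pool, each worker would call a variant of this at startup with its own core index, so that a worker keeps hitting the same cache between frames.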

@wly2020-robot
Author

wly2020-robot commented Mar 13, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 13, 2021 via email

@meiqua
Owner

meiqua commented Mar 13, 2021

You can refer to the Windows configuration. As for the fluctuation, wasn't it running reasonably well on Windows before?

@wly2020-robot
Author

wly2020-robot commented Mar 15, 2021 via email

@meiqua
Owner

meiqua commented Mar 16, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 16, 2021 via email

@meiqua
Owner

meiqua commented Mar 17, 2021

How about adding the environment variable OMP_PROC_BIND=true?

@wly2020-robot
Author

wly2020-robot commented Mar 18, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 18, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 20, 2021 via email

@meiqua
Owner

meiqua commented Mar 26, 2021

Interesting behavior. Under the same conditions, does it still happen with a single thread restricted to one core?

@meiqua
Owner

meiqua commented Mar 26, 2021

OMP_PROC_BIND binds threads to cores, which is friendlier to the cache. You can set OMP_DISPLAY_ENV=true to check whether the earlier setting actually took effect.

@meiqua
Owner

meiqua commented Mar 27, 2021

How did you confirm on Windows that it was bound to a single core? You could try this link.

@wly2020-robot
Author

wly2020-robot commented Mar 27, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 27, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 27, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Mar 27, 2021 via email

@wly2020-robot
Author

wly2020-robot commented Apr 6, 2021 via email

@meiqua
Owner

meiqua commented Apr 6, 2021

So, judging from the test results: on the 4th-generation i5 CPU, Windows fluctuates while Linux is normal, and on the 9th-generation i5 both are normal?

@meiqua
Owner

meiqua commented Apr 7, 2021

The weaker CPU fluctuates more, so it looks like an OS scheduling issue? You could try setting the process priority.

@wly2020-robot
Author

wly2020-robot commented Apr 7, 2021 via email

@paul070701

There is a line use_simd = true in fusion.h; just change it to false.

--- Original message ---
From: wly2020-robot
Sent: Wednesday, Mar 10, 2021, 4:00 PM
Subject: Re: [meiqua/shape_based_matching] Abnormal fusion time (#136)
Hello, does commenting out QMAKE_CXXFLAGS_RELEASE += -O3-Wno-sign-compare in the Qt project's .pro file turn SIMD off? I'm not quite sure.

--- Original message ---
From: meiqua/shape_based_matching
Sent: Wednesday, Mar 10, 2021, 2:37 PM
Subject: Re: [meiqua/shape_based_matching] Abnormal fusion time (#136)
Under the same conditions, what happens if SIMD is temporarily turned off as well?

--- Original message ---
Sent: Wednesday, Mar 10, 2021, 2:05 PM
Subject: Re: [meiqua/shape_based_matching] Abnormal fusion time (#136)
One more point: with OpenMP off, the update_simd time above is the cumulative result of the 16 loop iterations of the first fusion time.

After changing to use_simd = false, I found the inference time is basically the same as with use_simd = true: about 200 ms for a 5-megapixel image.
