DDF.py is stuck at Evolve Pop. #10

Open
p-jacquot opened this issue Jul 8, 2019 · 3 comments

@p-jacquot

While running DDF.py, the run gets stuck at this step:

 - 10:04:02 - ClassImageDeconvMa | Deconvolve large islands (>1000 pixels) (parallelised per island)
   Evolve pop...................   0/5[                                                  ]  00%  - 0'00" 

Using htop, I noticed that no CPU core was busy at that point. (RAM was not fully used either, so memory does not appear to be the bottleneck.)

Tracing syscalls with strace shows that the program continuously issues a select syscall that always returns on timeout. The following line appears again and again in the terminal:

select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000}) = 0 (Timeout)

I suspect this is some kind of deadlock, but I am not sure, because I could not find the lines in the source code that are involved.

Setting the --GAClean-NCPU parameter to 1 seems to resolve the problem for me, though, so I think it really is a coordination problem between cores and subprocesses.
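A repeated 50 ms select timeout is consistent with a parent process that polls for a worker's result and never checks whether the worker is still alive. The sketch below is not DDFacet code; `worker`, `wait_for_result` and the 0.05 s poll interval are hypothetical names and values chosen for illustration, and the "fork" start method assumes a Linux/Unix host. It shows how a polling loop can turn a dead worker into an error instead of waiting forever.

```python
import multiprocessing
import os
import queue
import signal


def worker(result_queue, crash):
    """Hypothetical worker: either dies abruptly or returns a result."""
    if crash:
        # Simulate a hard failure (e.g. the kernel OOM killer).
        os.kill(os.getpid(), signal.SIGKILL)
    result_queue.put("done")


def wait_for_result(crash, poll_interval=0.05):
    """Poll for a result, but give up as soon as the worker is dead."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix host
    result_queue = ctx.Queue()
    proc = ctx.Process(target=worker, args=(result_queue, crash))
    proc.start()
    try:
        while True:
            try:
                return result_queue.get(timeout=poll_interval)
            except queue.Empty:
                if proc.is_alive():
                    continue  # still working, keep polling
                # The worker may have queued its result just before exiting.
                try:
                    return result_queue.get(timeout=poll_interval)
                except queue.Empty:
                    raise RuntimeError(
                        "worker exited with code %s" % proc.exitcode
                    )
    finally:
        proc.join()
```

Without the `proc.is_alive()` check, the loop would keep timing out every 50 ms with no CPU activity, which is the kind of behaviour seen in the strace output.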

DDF.py --Output-Name=/tmp/dest/image_DI --Cache-Dir=/tmp/cache --Data-MS mslist.txt --Data-ColName DATA --Deconv-MaxMajorIter=3 --Deconv-CycleFactor=2.5 --Data-Sort=1 --Cache-HMP=1 --Output-Also onNeds --GAClean-NCPU=1 --Deconv-Mode SS --Mask-Auto=1 --Mask-SigTh=5.00

Maybe this error comes from a parameter that I have omitted? Or is this a known problem?

@p-jacquot
Author

It turns out that even with the GAClean-NCPU parameter set to 1, the same problem appears during the next major cycle, at the same step.

@o-smirnov
Contributor

The remaining problem with GAClean is threefold:

  • @cyriltasse uses standard multiprocessing and not APP like the rest of the code

  • Even with --GAClean-NCPU=1, it still runs (one) worker process, rather than processing islands serially in the main process.

  • If a worker process falls over with a SEGV or an out-of-memory condition (perhaps an island that is too big?), the result (with standard multiprocessing) is that the main process hangs indefinitely and doesn't get an error code. I haven't discovered a way around this with the standard multiprocessing, which is one of the reasons I wrote APP in the first place. (Although I think there's a fix for this in latest Python 3 versions...)

Anyway, I'll bet you a beer this is exactly what you're seeing here. Now all you need is another beer to bribe @cyriltasse to fix this.
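As an illustration of the Python 3 behaviour mentioned above: `concurrent.futures.ProcessPoolExecutor` raises `BrokenProcessPool` when a worker dies abruptly, so the main process gets an exception instead of hanging. This is a minimal sketch, not the GAClean implementation; `deconvolve_island` and `run_islands` are hypothetical stand-ins, island 2 is made to crash on purpose, and the "fork" start method assumes a Linux/Unix host.

```python
import concurrent.futures
import multiprocessing
import os
import signal
from concurrent.futures.process import BrokenProcessPool


def deconvolve_island(island_id):
    """Hypothetical per-island job; island 2 simulates a hard crash."""
    if island_id == 2:
        # Abrupt death, similar in effect to a SEGV or an OOM kill.
        os.kill(os.getpid(), signal.SIGKILL)
    return island_id * 10


def run_islands(island_ids, ncpu):
    """Return (results, error); error is None if every island finished."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix host
    results = {}
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=ncpu, mp_context=ctx
    ) as pool:
        futures = {pool.submit(deconvolve_island, i): i for i in island_ids}
        try:
            for fut in concurrent.futures.as_completed(futures):
                results[futures[fut]] = fut.result()
        except BrokenProcessPool as exc:
            # A worker died; report it instead of waiting forever.
            return results, exc
    return results, None
```

Calling `run_islands([2], ncpu=1)` returns a `BrokenProcessPool` error rather than blocking, which lets the caller log the failing island and exit with a non-zero status.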

@p-jacquot
Author

Well, I misread my logs: it appears at the third iteration on two of my .ms files.

I don't think the problem comes from island size, because I had more than 600 GB of RAM available when it happened.

Is this problem specific to only a few datasets?
