fpart and parsyncfp mismatching parts #30
Comments
Hmm - that's a steaming big chunk of data. For my own info, what kind of network are
you using and what kind of bandwidth are you seeing? And is this the first time you've
used it or is this an unexpected glitch in an otherwise routine use of it?
Also, thank you for providing such a nicely informative question. As someone who spent
most of his life answering poorly worded questions, I appreciate it.
There are a few reasons this might be the case.
1 - parsyncfp (pfp) does check whether it has overrun fpart and will wait for it - you may
have seen the message indicating this. It shows up most often on mostly-rsync'ed transfers,
because there are NP rsync processes waiting to consume the output of a single fpart. There
could be a bug in that check, but I haven't seen it yet.
2 - if it's an active filesystem and something happened to the files during the fpart
recursive descent, fpart may have gotten confused and coughed up a digital hairball
instead of the final chunkfile. The files that were supposed to go in the final file were
deleted or moved as the file was being prepped. I have seen one case where this
(supposedly) happened. This isn't a bug per se, but a result of file jitter.
3 - there was an off-by-one error in an older version of pfp where this might have
been the case. You're using version 1.60; I have it up to 1.61.
Just in case, I'm including the latest version I have. Try that and see what happens.
Because it's rsync-based, you shouldn't have lost any time except for the (long) fpart
recursion, i.e. it should be 25/26 finished.
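If it helps to narrow down which of these it is, here's a rough diagnostic sketch (not part of pfp - it just reads the fpcache dir your strace shows, /root/.parsyncfp-TMC/fpcache; adjust the path for your setup) that compares the last part fpart logged against the chunkfiles actually on disk:
[[
#!/usr/bin/perl -w
# Rough diagnostic sketch (not part of parsyncfp): compare the highest
# "Filled part #N" that fpart logged against the f.N chunkfiles on disk.
use strict;

my $cache = "/root/.parsyncfp-TMC/fpcache";   # your --altcache fpcache dir
my ($log) = glob("$cache/fpart.log.*");       # fpart log written via pfp

my $last_filled = -1;                         # highest part fpart says it filled
open(my $L, '<', $log) or die "can't read $log: $!";
while (<$L>) { $last_filled = $1 if /^Filled part #(\d+):/ && $1 > $last_filled; }
close $L;

my $last_chunk = -1;                          # highest f.N chunkfile actually present
for my $f (glob("$cache/f.*")) {
    $last_chunk = $1 if $f =~ /f\.(\d+)$/ && $1 > $last_chunk;
}

print "fpart log reports parts 0..$last_filled; chunkfiles on disk go up to f.$last_chunk\n";
]]
If the two numbers agree (25, say) while pfp is stat()ing f.26, that points at reason 1 or 3 rather than a half-written final chunkfile.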
Best harry
|
Hi Harry,
Thank you for your fast reply! :)
Getting back to the issue, I tried to check the logs, but I don't see anything wrong. (I used tac instead of cat, so the output is backwards - the log file was just that big.)
Regarding the chunk size, maybe some advice would be good. We have ~400TB of data to move around, therefore we did 400TB / 1000 (number of parts). Do you think that's the right approach? At 2k parts we get the warning, so better not to risk it?
Another thing: we actually have the two shares mounted locally, therefore we don't use any network.
Meanwhile, I am trying another run using version 1.61 - hope this works! |
Short update:
I just restarted using 1.61 and the outcome didn't change. But changing the chunk size to ~440TB / 2000 instead of 1000 seemed to improve the situation. Any idea why this could happen? |
The size and number of chunk files should be in a range that allows pfp to start up fast
and keep going, but not so many that you end up with a bazillion chunks and huge logs.
It depends on the size of your data and the number of files; it also has implications for
the number of stat()s that have to be done to track the chunk files, so that argues for a
lower number as well.
Obviously, if you define few chunk files over a large data source, it will take a long time
for fpart to fill each chunk, leading to delayed startup.
So I somewhat arbitrarily defined 2K as the limit, for reasons of efficiency in scanning the
single fpcache dir.
But changing the # of chunks to 2K from 1K (if that's what you meant) should not have
made a big change.
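Just to make that concrete with your numbers (back-of-the-envelope only):
[[
# back-of-the-envelope: per-chunk size for 1000 vs 2000 chunks of ~440TB
use strict; use warnings;
my $total = 440 * 1024**4;    # ~440TB of data, in bytes
printf "1000 chunks -> %.0f GiB each\n", $total / 1000 / 1024**3;   # ~451 GiB
printf "2000 chunks -> %.0f GiB each\n", $total / 2000 / 1024**3;   # ~225 GiB
]]
Either way each chunk is hundreds of GiB, which is why I wouldn't expect the 1K -> 2K change by itself to matter much.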
Harry Mangalam
Info: http://moo.nac.uci.edu/~hjm/hjm.sig.html
|
So your job (using 1.61) still hung at the end of the run? That's very odd. I'm
traveling and short of time today, but after I get home, I'd like to take this up again.
Harry Mangalam
Info: http://moo.nac.uci.edu/~hjm/hjm.sig.html
|
I'm seeing something a little similar, and it appears to have only started in the last couple of weeks. I can't recall if I upgraded recently or not; I should have kept track. Here's my fpart.log:
***@***.*** fpcache]# cat fpart.log.23.01.14_2020-04-21
Examining filesystem...
Filled part #0: size = 11033969843194, 137378 file(s)
Filled part #1: size = 11010341095020, 96042 file(s)
Filled part #2: size = 10995456710513, 99478 file(s)
Filled part #3: size = 10997042249668, 57445 file(s)
Filled part #4: size = 10996205514131, 83592 file(s)
Filled part #5: size = 11235699669681, 72848 file(s)
Filled part #6: size = 10999193108455, 78821 file(s)
Filled part #7: size = 10999539432427, 35181 file(s)
Filled part #8: size = 11006485913550, 81898 file(s)
Filled part #9: size = 10995307950334, 109058 file(s)
Filled part #10: size = 11003985383714, 151601 file(s)
Filled part #11: size = 11123773319829, 99136 file(s)
Filled part #12: size = 10999398939451, 63104 file(s)
Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values:
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.
I'm not sure I checked this file the previous times, so I can't tell you if the problem correlates with the above error. It seems to me I've seen the latter without the former, but I'm not sure. Then I'm seeing this in the parsyncfp output:
| Elapsed | 1m | [ ens6] MB/s | Running || Susp'd | Chunks [2020-04-21]
Time | time(m) | Load | TCP / RDMA out | PIDs || PIDs | [UpTo] of [ToDo]
05.34.05 422.72 1.72 5.00 / 0.00 3 <> 0 [16] of [16]
INFO: Waiting [31502]s for next chunkfile..
In my experience, it will never finish waiting, and will increment the time forever. I've been just killing off the parsyncfp script at the end when there are no more rsyncs, but obviously it would be better to avoid that. |
Hi Ryan,
Apologies for the delay. I think that error is coming from fpart, not
parsyncfp, and as such I'm unfamiliar with the conditions that generated
it. I'll get in touch with Ganael (fpart's author) and see what he says.
There are filename conditions that will cause even rsync to hack up a
hairball occasionally, so it's plausible that a similar situation could
cause fpart to stutter. The fact that it started a little while ago and
keeps repeating seems to imply that you have a file or dir somewhere that's
causing fpart to die on it. The log entry you included (thank you):
[[
Filled part #13: size = 10997135371653, 47733 file(s)
Filled part #14: size = 10995837809948, 111756 file(s)
error parsing input values: <<<<<
Filled part #15: size = 7276234302256, 105608 file(s)
1430679 file(s) found.
]]
at the '<<<<<' marker indicates a problem, but I'll have to debug with Ganael to
see what could be causing it.
The log does imply that fpart continued to run and fill another
chunkfile correctly, but I'm not sure whether the errors are being emitted and
captured synchronously or not.
The last chunkfile filled correctly (14 or 15 in the log above) should give
an indication of which dir was being traversed when it died. Can you
examine that dir and see if there's an oddball filename or damaged file
lurking there?
Best harry
|
Thanks, Harry -- I didn't consider that a new file could be causing the problem. It really does seem to be parsyncfp that is having the problem, though. This counter keeps incrementing:
INFO: Waiting [239372]s for next chunkfile..
...and fpart is no longer running:
***@***.*** fpcache]# ps -ef | grep fpart
root 25927 4342 0 00:39 pts/1 00:00:00 grep --color=auto fpart
...or am I misunderstanding what you were saying? |
Hmm - well since pfp called fpart, pfp should deal with fpart's unexpected
death smoothly. It probably does not.
I'll look into that. I thought it checked on every loop that the fpart PID
was still running but I might have buggered that check. It's too late for
coding for a geezer like me. I'll check after morning coffee.
Again, thanks for using it and especially taking the time to write up the
bug.
Best
Harry
|
Thank you! We use this software a lot (Bill Abbott is a colleague of mine -- I know you and he worked on reading from a list vs. walking the FS). |
Some additional info, post-coffee. Meandering thru the parsyncfp code..
That pfp warning:
[[
INFO: Waiting [31502]s for next chunkfile..
]]
is only emitted if the specific fpart PID is still running:
[[
$FPART_RUNNING = `ps ux | grep fpar[t] | grep $FPART_PID | wc -l`;
chomp $FPART_RUNNING;
while (! -e $CUR_FP_FLE && $FPART_RUNNING eq '1'){
if ($VERBOSE >= 2) {INFO("Waiting [$cfw]s for next chunkfile..\r");
sleep 2; $cfw += 2;}
}
]]
The next time you run pfp, you should be able to verify this by 'ps'. If
that's the case, it may be an fpart-specific issue - fpart still running,
but hung for some problem. You could check this by doing a strace on the
fpart PID.
Also, checking the fpart code, that error is supposed to name the file that
it's having trouble with, but in your case, there's nothing, suggesting
that possibly your file mistakenly got named as whitespace (space, tab,
newline, etc). Is that a possibility?
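If you want to hunt for that, here's a quick sketch (just point it at your source dir - the path argument is a placeholder) that flags names that are all whitespace or contain embedded control characters:
[[
#!/usr/bin/perl -w
# Quick hunt for suspicious filenames: all-whitespace names or names with
# embedded newlines/CRs/tabs, which can confuse line-oriented tools.
use strict;
use File::Find;

my $top = shift @ARGV or die "usage: $0 <source_dir>\n";
find(sub {
    my $name = $_;                               # basename within the current dir
    if ($name =~ /^\s+$/ || $name =~ /[\n\r\t]/) {
        (my $shown = $File::Find::name) =~ s/([\n\r\t])/sprintf("\\x%02x", ord $1)/ge;
        print "suspect: $shown\n";               # full path, oddities made visible
    }
}, $top);
]]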
hjm
|
I'm still running the copy that I wrote about (this is a large transfer), so I can check whatever else you want without waiting till next time. There are still rsyncs running, so it doesn't hold me up any. You might have seen that I mentioned above that nothing even containing FPART is running. I assume $FPART_PID can also be found in the file called FP_PIDFILE*? If so:
total 129464
drwxr-xr-x 2 root root 300 Apr 21 23:42 .
drwxr-xr-x 3 root root 4096 Apr 21 23:42 ..
-rw-r--r-- 1 root root 12754535 Apr 21 23:01 f.0
-rw-r--r-- 1 root root 8882982 Apr 21 23:01 f.1
-rw-r--r-- 1 root root 13988256 Apr 21 23:01 f.10
-rw-r--r-- 1 root root 9335152 Apr 21 23:01 f.11
-rw-r--r-- 1 root root 5728710 Apr 21 23:42 f.12
-rw-r--r-- 1 root root 4113466 Apr 21 23:42 f.13
-rw-r--r-- 1 root root 10384947 Apr 21 23:42 f.14
-rw-r--r-- 1 root root 9944264 Apr 21 23:42 f.15
-rw-r--r-- 1 root root 9261208 Apr 21 23:01 f.2
-rw-r--r-- 1 root root 5349997 Apr 21 23:01 f.3
-rw-r--r-- 1 root root 7775218 Apr 21 23:01 f.4
-rw-r--r-- 1 root root 6749460 Apr 21 23:01 f.5
-rw-r--r-- 1 root root 7301965 Apr 21 23:01 f.6
-rw-r--r-- 1 root root 3237238 Apr 21 23:01 f.7
-rw-r--r-- 1 root root 7605602 Apr 21 23:01 f.8
-rw-r--r-- 1 root root 10107389 Apr 21 23:01 f.9
-rw-r--r-- 1 root root 935 Apr 21 23:01 fpart.log.23.01.14_2020-04-21
-rw-r--r-- 1 root root 5 Apr 21 23:01 FP_PIDFILE23.01.14_2020-04-21
-rw-r--r-- 1 root root 84 Apr 21 23:42 rsync-PIDs-23.01.14_2020-04-21
***@***.*** fpcache]# ps ux | grep fpar[t] | grep $(cat FP_PIDFILE23.01.14_2020-04-21) | wc -l
0
***@***.*** fpcache]# ps aux | grep fpar
root 15263 0.0 0.0 112708 976 pts/1 S+ 18:57 0:00 grep --color=auto fpar
***@***.*** fpcache]# ps aux | grep fpar[t]
Just for fun, I looked at what parsyncfp is doing:
strace: Process 5439 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffee5b72460) = 0
stat("/root/.parsyncfp-backups/fpcache/f.16", 0x25f5138) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Looks like about what you'd expect, but why it thinks there's going to be an f.16 at this point, I don't know. |
I think I found the reason you're seeing that bug - the check on the fpart PID
wasn't refreshed inside the wait loop, so the old (dead) PID remained 'valid'
indefinitely, hence the endless waiting.
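For the record, the gist of the fix is just re-testing the fpart PID inside the wait loop instead of once before it - roughly this (a sketch of the idea, not the literal 1.64 diff):
[[
# sketch of the idea (not the literal 1.64 code): re-check the fpart PID on
# every pass, so a dead/finished fpart breaks the wait instead of hanging.
my $cfw = 0;
while (! -e $CUR_FP_FLE) {
    my $FPART_RUNNING = `ps ux | grep fpar[t] | grep $FPART_PID | wc -l`;
    chomp $FPART_RUNNING;
    last if $FPART_RUNNING ne '1';    # fpart is gone; stop waiting for f.N
    INFO("Waiting [$cfw]s for next chunkfile..\r") if $VERBOSE >= 2;
    sleep 2;
    $cfw += 2;
}
]]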
Try the new version (1.64 / Covid) to see if it solves your problem. It is
odd that you are the only person to report this - I've never seen this
problem before, so it may be a function of the particular set of files and
dirs that you have that allowed this bug to manifest.
Let me know..
Harry
|
Hi all,
Sorry to chime in, but I think this might (maybe) help.
We've been seeing the same behavior when copying many TB where most of the data
was in a very large number of small files. I think that, for some reason,
this breaks fpart.
I know this isn't much, but it's all I remember for now :)
Wish you all a pleasant day!
Regards,
Andrea
|
Hi Andrea,
Thanks very much for chiming in. (In reading back thru the thread, I see
that I neglected to answer some of your points, which may have addressed
this bug as well - I'll take another look at them today or tomorrow).
In talking to Ganael about the possible fpart bug, I think it's more
likely a bug in the pfp code - specifically the one I referred to yesterday
that could introduce a 'wait-forever' bug. I think I fixed it, but because it
tends to show up on very large transfers, I'm still trying to test it in
situations that would otherwise fail. If you want to try the latest version
on your large transfers, that would help as well. The only significant
change was that fix.
hjm
|
Hi All,
Pretty sure (touch wood) that this last set of changes has addressed this
'failure to terminate' bug. I've tested it on some multi-TB, ~8M-file dirs and it
seems to be behaving itself. In examining the code, I've also identified a
number of changes that will speed up startup and especially increase the
speed of syncing dirs that have already been substantially rsync'ed. Those will
require some fairly substantial changes, so I've not included them in the current 1.65.
Please give it a try and see if this fixes the bugs you identified.
Again, thanks for your help in making it better.
Harry
|
I haven't had a chance to try the newer version yet, which I should be able to try for tomorrow's backup. FYI, this week running 1.61, all three campuses eventually wound up in the endless "waiting for chunkfile" state:
As you can see, this one ran for a really long time before that happened. I think that this problem appeared between 1.55 and 1.61, but it's hard to be sure, since the content could have changed instead and caused the problem (maybe a little less likely given that it happened on all campuses). |
Hi all,
We are facing a weird problem. We have our parsync job running for 18h (which might be OK, considering the amount of data we're trying to back up: 440TB).
Since it looked suspicious, I ran an strace on parsync and this is what I noticed:
[root@filler002 fpcache]# strace -ff -tt -s 4096 -p 2858510
strace: Process 2858510 attached
09:05:02.861749 restart_syscall(<... resuming interrupted nanosleep ...>) = 0
09:05:04.483017 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:04.483408 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:04.483648 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:04.483803 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:04.483928 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:06.484248 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:06.484480 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:06.484641 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:06.484823 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:06.484931 nanosleep({2, 0}, 0x7fffffffbfe0) = 0
09:05:08.485355 stat("/root/.parsyncfp-TMC/fpcache/f.26", 0x603138) = -1 ENOENT (No such file or directory)
09:05:08.485584 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
09:05:08.485735 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
09:05:08.485849 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:05:08.485963 nanosleep({2, 0}, strace: Process 2858510 detached
So I said, OK, maybe it's waiting for the generation of the parts. Then I checked the fpart processes and logs:
[root@filler002 fpcache]# pwd
/root/.parsyncfp-TMC/fpcache
[root@filler002 fpcache]# ps aux | grep -i fpart
root 2992692 0.0 0.0 112716 984 pts/1 S+ 09:05 0:00 grep --color=auto -i fpart
[root@filler002 fpcache]# ls -lah
total 35M
drwxr-xr-x 2 root root 4.0K Nov 11 14:05 .
drwxr-xr-x 3 root root 4.0K Nov 11 14:05 ..
-rw-r--r-- 1 root root 351K Nov 11 13:54 f.0
-rw-r--r-- 1 root root 930K Nov 11 13:54 f.1
-rw-r--r-- 1 root root 1.1M Nov 11 13:58 f.10
-rw-r--r-- 1 root root 2.4M Nov 11 13:59 f.11
-rw-r--r-- 1 root root 2.0M Nov 11 13:59 f.12
-rw-r--r-- 1 root root 1.6M Nov 11 13:59 f.13
-rw-r--r-- 1 root root 2.7M Nov 11 13:59 f.14
-rw-r--r-- 1 root root 1.9M Nov 11 14:00 f.15
-rw-r--r-- 1 root root 1.2M Nov 11 14:00 f.16
-rw-r--r-- 1 root root 3.0M Nov 11 14:00 f.17
-rw-r--r-- 1 root root 1.8M Nov 11 14:00 f.18
-rw-r--r-- 1 root root 803K Nov 11 14:01 f.19
-rw-r--r-- 1 root root 12K Nov 11 13:55 f.2
-rw-r--r-- 1 root root 1.9M Nov 11 14:02 f.20
-rw-r--r-- 1 root root 567K Nov 11 14:02 f.21
-rw-r--r-- 1 root root 1.8M Nov 11 14:03 f.22
-rw-r--r-- 1 root root 2.7M Nov 11 14:03 f.23
-rw-r--r-- 1 root root 1.1M Nov 11 14:04 f.24
-rw-r--r-- 1 root root 1.6M Nov 11 14:05 f.25
-rw-r--r-- 1 root root 1.2M Nov 11 13:56 f.3
-rw-r--r-- 1 root root 493K Nov 11 13:57 f.4
-rw-r--r-- 1 root root 721K Nov 11 13:57 f.5
-rw-r--r-- 1 root root 14K Nov 11 13:57 f.6
-rw-r--r-- 1 root root 64K Nov 11 13:58 f.7
-rw-r--r-- 1 root root 452K Nov 11 13:58 f.8
-rw-r--r-- 1 root root 2.6M Nov 11 13:58 f.9
-rw-r--r-- 1 root root 1.4K Nov 11 14:06 fpart.log.13.54.12_2019-11-11
-rw-r--r-- 1 root root 8 Nov 11 13:54 FP_PIDFILE13.54.12_2019-11-11
-rw-r--r-- 1 root root 208 Nov 11 14:05 rsync-PIDs-13.54.12_2019-11-11
[root@filler002 fpcache]# cat fpart.log.13.54.12_2019-11-11
Examining filesystem...
Filled part #0: size = 488118230349, 119714 file(s)
Filled part #1: size = 488126329044, 117869 file(s)
Filled part #2: size = 488119026333, 329636 file(s)
Filled part #3: size = 488195328885, 189742 file(s)
Filled part #4: size = 489718855986, 126551 file(s)
Filled part #5: size = 523602512613, 92303 file(s)
Filled part #6: size = 507470117001, 26582 file(s)
Filled part #7: size = 564502186104, 115597 file(s)
Filled part #8: size = 552879767274, 66996 file(s)
Filled part #9: size = 488970973281, 39277 file(s)
Filled part #10: size = 506558236575, 120349 file(s)
Filled part #11: size = 488124669226, 60410 file(s)
Filled part #12: size = 507594913183, 64456 file(s)
Filled part #13: size = 488121789469, 71924 file(s)
Filled part #14: size = 488117693970, 65408 file(s)
Filled part #15: size = 546032339031, 66658 file(s)
Filled part #16: size = 493759970422, 73142 file(s)
Filled part #17: size = 489034017302, 154668 file(s)
Filled part #18: size = 492438162214, 178504 file(s)
Filled part #19: size = 488127491874, 109397 file(s)
Filled part #20: size = 488210804913, 136003 file(s)
Filled part #21: size = 488125298794, 98261 file(s)
Filled part #22: size = 505097570795, 113938 file(s)
Filled part #23: size = 488117458189, 227673 file(s)
Filled part #24: size = 577134074354, 234131 file(s)
Filled part #25: size = 386946098762, 157991 file(s)
3157180 file(s) found.
It looks like there's a mismatch, since parsyncfp is waiting for chunkfile f.26 but that file is never created.
Have you ever faced this kind of situation before?
Something about our environment:
[root@filler002 fpcache]# ps aux | grep -i parsyncfp
root 2858510 0.0 0.0 137324 6100 ? S Nov11 0:04 perl /usr/bin/parsyncfp --verbose=2 --nowait --NP=8 --ro -aAX --chunksize=-488116876476 --altcache=/root/.parsyncfp-TMC --startdir=/mnt/beegfs/data TMC /mnt/cephfs/backup/data
root 2992869 0.0 0.0 112716 984 pts/1 S+ 09:06 0:00 grep --color=auto -i parsyncfp
[root@filler002 fpcache]# uname -r
3.10.0-1062.4.1.el7.x86_64
[root@filler002 fpcache]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@filler002 fpcache]# parsyncfp -V
Option v requires an argument
parsyncfp version 1.60 (Dullnig)
June 24, 2019
by Harry Mangalam [email protected]
How do you suggest we move forward from this point?
Thank you in advance!