
Commit

Fix potential TEST-E-TIMEOUT failure on slow systems waiting for REORG.TXT* to be removed

On a slow system where a MUPIP SIZE operation takes more than half an hour, it is possible
that the backgrounded online_reorg.csh script takes more than 1-1/2 hours to finish the
3 MUPIP SIZE operations (in case it chose MUPIP SIZE randomly) before recognizing that
the close_reorg.csh script invoked by the foreground test script had asked it to stop
running. This could cause the close_reorg.csh script to time out, since its timeout is
currently 5400 seconds (1-1/2 hours).

The fix is to rework the online_reorg.csh script so it checks for the stop signal (the
presence of a REORG.END file) between each of the 3 MUPIP SIZE operations instead of only
after all 3. This should reduce the maximum time it can take for a pending stop signal to be seen.
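A rough worst-case estimate makes the effect concrete. It assumes (a simplification, not something the script measures) that each of the 3 MUPIP SIZE operations takes about the same time T. It also ignores the rest of each round and takes the stop signal to arrive just after a check has been done:

$$
\begin{aligned}
\text{check after all 3 operations:}\quad & t_{\text{wait}} \approx 3T, \qquad 3T > 5400\,\text{s} \iff T > 1800\,\text{s} \\
\text{check between operations:}\quad & t_{\text{wait}} \approx T, \qquad \text{timeout only if } T > 5400\,\text{s}
\end{aligned}
$$

With the old ordering, MUPIP SIZE taking just over half an hour is enough to hit the 5400-second close_reorg.csh timeout. With the new ordering, a single operation would have to exceed the full 1-1/2 hours on its own.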

While at it, a few other changes were made.

a) I noticed that in one place the script overwrote the online_reorg_*.outx file instead
   of appending to it. This would erase the output of prior rounds of the reorg from the log
   file and make analysis of a test failure difficult.

b) Enhanced the script to record what choice it made and the actual MUPIP SIZE or MUPIP
   REORG commands it ran.
nars1 committed May 14, 2018
1 parent 0c9905d commit eea3e7c
Showing 1 changed file (com/online_reorg.csh) with 19 additions and 7 deletions.
@@ -49,17 +49,29 @@ while (1)
     # This is the case for multisrv_crash test case which deliberately kills online_reorg sub processes
     set tmpoutput = "online_reorg_$$.outx"
     echo "# `date` : ===== Begin round $cnt of mupip size/reorg ($tmpoutput) ====="
-    echo "# `date` : cnt = $cnt ; ff = $ff ; inff = $inff" >&! $tmpoutput
+    echo "# `date` : cnt = $cnt ; ff = $ff ; inff = $inff" >>&! $tmpoutput
     if ($cnt % 5 == 2) then
-        $try_nice $MUPIP size -heuristic="arsample,samples=100000" >>&! $tmpoutput && \
-        $try_nice $MUPIP size -heuristic="impsample,samples=100000" >>&! $tmpoutput && \
-        $try_nice $MUPIP size -heuristic="scan,level=1" >>&! $tmpoutput
-        set ret = $status
+        foreach heuristic ("arsample,samples=100000" "impsample,samples=100000" "scan,level=1")
+            echo "# `date` : $try_nice $MUPIP size -heuristic=\"$heuristic\"" >>& $tmpoutput
+            $try_nice $MUPIP size -heuristic="$heuristic" >>& $tmpoutput
+            set ret = $status
+            if ($ret) then
+                break
+            endif
+            # If test has asked us to stop, check for it in between iterations of for loop
+            # or else test could wait for too long (as much as 2 hours) for 3 MUPIP SIZE operations to finish
+            # causing test to fail with a TEST-E-TIMEOUT error timing out waiting for REORG.TXT* to be removed.
+            if (-f REORG.END) then
+                break
+            endif
+        end
     else if ($cnt < 5) then
-        $try_nice $MUPIP reorg -fill=$ff -index=$inff -truncate >>&! $tmpoutput
+        echo "# `date` : $try_nice $MUPIP reorg -fill=$ff -index=$inff -truncate" >>&! $tmpoutput
+        $try_nice $MUPIP reorg -fill=$ff -index=$inff -truncate >>&! $tmpoutput
         set ret = $status
     else
-        $try_nice $MUPIP reorg -fill=$ff -index=$inff >>&! $tmpoutput
+        echo "# `date` : $try_nice $MUPIP reorg -fill=$ff -index=$inff" >>&! $tmpoutput
+        $try_nice $MUPIP reorg -fill=$ff -index=$inff >>&! $tmpoutput
         set ret = $status
     endif
