Skip to content

Commit

Permalink
gnsyncbot: add incident report for today's linux builder issue
Browse files Browse the repository at this point in the history
  • Loading branch information
nico committed Aug 20, 2020
1 parent 3ac627f commit a0d78f8
Showing 1 changed file with 71 additions and 0 deletions.
71 changes: 71 additions & 0 deletions llvmgnsyncbot/incidents/2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@

Wrong crontab entry, causing over 40000 corrupted build logs
============================================================

Date: 2020-08-20T09:02:00Z

What
----

I had put this in the crontab of the linux builder:

@reboot tmux new-session -d -c src/llvm-project 'bash --init-file ../hack/llvmgnsyncbot/syncbot.sh'

The plan was to auto-start syncbot after a forced reboot. However, this instead
caused `syncbot.sh` to immediately die with
`./syncbot.py: No such file or directory`. This only took fractions of a second,
and by the time I noticed (16 hours after the reboot), over 40000 bad build logs
were generated.

Since they didn't contain any of the expected log files, they made the html
generation script crash, which is how I ultimately noticed.


Why
===

The crontab entry doesn't work, and there's no easy way to test crontab entries
as far as I know.


Now what
========

I removed the crontab entry for now. I should run it manually before adding it
next time and figure out what exactly went wrong.


Manual cleanup
==============

To clean up, I signed in to the linux box, killed the bad tmux process,
used `who -b` to find the last boot time, and the deleted all build logs that
were newer than that:

cd src/llvm-project/
rm buildlogs/{26040..65083}.txt
rm buildlogs/{26040..65083}.meta.json

I edited buildlog.state to make the next build start at 26040:

echo 26040 > syncbot.state

Then I ssh'd into the web server and removed them there too, together with
the corresponding cache entries:

rm buildlog/linux/{26040..65083}.txt
rm buildlog/linux/{26040..65083}.meta.json
rm buildlog_cache/linux/{26040..65083}.json

Then I manually checked that the html generation now works:

hack/llvmgnsyncbot.py buildlogs html

Sadly, I had manually started llvmgnsyncbot.py once before, and so the local
git state wasn't as behind as it should have been. I tried to fix this by
running

git update-ref refs/remotes/origin/master 31adc28d24b

but that didn't reset enough state. Now build 26040 claims that it only contains
2 commits, which isn't correct. Oh well.

0 comments on commit a0d78f8

Please sign in to comment.