-
Notifications
You must be signed in to change notification settings - Fork 9
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
gnsyncbot: add incident report for today's linux builder issue
- Loading branch information
Showing
1 changed file
with
71 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
|
||
Wrong crontab entry, causing over 40000 corrupted build logs | ||
============================================================ | ||
|
||
Date: 2020-08-20T09:02:00Z | ||
|
||
What | ||
---- | ||
|
||
I had put this in the crontab of the linux builder: | ||
|
||
@reboot tmux new-session -d -c src/llvm-project 'bash --init-file ../hack/llvmgnsyncbot/syncbot.sh' | ||
|
||
The plan was to auto-start syncbot after a forced reboot. However, this instead | ||
caused `syncbot.sh` to immediately die with | ||
`./syncbot.py: No such file or directory`. This only took fractions of a second, | ||
and by the time I noticed (16 hours after the reboot), over 40000 bad build logs | ||
were generated. | ||
|
||
Since they didn't contain any of the expected log files, they made the html | ||
generation script crash, which is how I ultimately noticed. | ||
|
||
|
||
Why | ||
=== | ||
|
||
The crontab entry doesn't work, and there's no easy way to test crontab entries | ||
as far as I know. | ||
|
||
|
||
Now what | ||
======== | ||
|
||
I removed the crontab entry for now. I should run it manually before adding it | ||
next time and figure out what exactly went wrong. | ||
|
||
|
||
Manual cleanup | ||
============== | ||
|
||
To clean up, I signed in to the linux box, killed the bad tmux process, | ||
used `who -b` to find the last boot time, and the deleted all build logs that | ||
were newer than that: | ||
|
||
cd src/llvm-project/ | ||
rm buildlogs/{26040..65083}.txt | ||
rm buildlogs/{26040..65083}.meta.json | ||
|
||
I edited buildlog.state to make the next build start at 26040: | ||
|
||
echo 26040 > syncbot.state | ||
|
||
Then I ssh'd into the web server and removed them there too, together with | ||
the corresponding cache entries: | ||
|
||
rm buildlog/linux/{26040..65083}.txt | ||
rm buildlog/linux/{26040..65083}.meta.json | ||
rm buildlog_cache/linux/{26040..65083}.json | ||
|
||
Then I manually checked that the html generation now works: | ||
|
||
hack/llvmgnsyncbot.py buildlogs html | ||
|
||
Sadly, I had manually started llvmgnsyncbot.py once before, and so the local | ||
git state wasn't as behind as it should have been. I tried to fix this by | ||
running | ||
|
||
git update-ref refs/remotes/origin/master 31adc28d24b | ||
|
||
but that didn't reset enough state. Now build 26040 claims that it only contains | ||
2 commits, which isn't correct. Oh well. |