Edit for clarity and unpublish some unhelpful posts
kencx committed Jan 22, 2024
1 parent 72432b2 commit 1f9faa9
Showing 7 changed files with 103 additions and 90 deletions.
8 changes: 4 additions & 4 deletions content/posts/auto-generate-resume.md
@@ -1,7 +1,7 @@
---
title: "Automating my Resume"
date: 2023-08-21
lastmod: 2023-08-21
lastmod: 2024-01-22
draft: false
toc: true
tags:
@@ -20,12 +20,12 @@ There's a [bunch](https://github.com/xitanggg/open-resume)
[of](https://github.com/AmruthPillai/Reactive-Resume)
[resume](https://github.com/topics/resume-builder) building sites out there,
mostly catered for people in Tech, but many involve creating an account on their
sites and using one of their templates. No thank you.
sites and using one of their templates. That's not really my thing.

I also have an existing resume in LaTeX, which I could just use to build a PDF
automatically with Github Actions, and call it a day, but there's no fun in
that. Instead, I *have* to over-engineer an entire resume pipeline that
automatically builds it in three different formats:
that. Instead, I wanted to over-engineer a full pipeline that automatically
builds a resume in three different formats:

1. pdf
2. html to host a static site (because why not?)
60 changes: 34 additions & 26 deletions content/posts/automated-testing-of-restic-backups.md
@@ -1,7 +1,7 @@
---
title: "Automated Testing of Restic Backups"
date: 2023-08-09
lastmod: 2023-08-09
lastmod: 2024-01-22
draft: false
toc: true
tags:
@@ -17,22 +17,21 @@ rule](https://www.backblaze.com/blog/the-3-2-1-backup-strategy/) by being
automatic, redundant and offsite.

However, a backup is only as good as its ability to be restored successfully. It
would be disastrous if you tried to restore a backup snapshot to find that your
files have been unknowningly corrupted or loss during the backup process.
can be disastrous to attempt a restore after data loss, only to realise that the
data was unknowingly corrupted or lost during the backup process.

Which is why my paranoid self also runs automated restore testing as part of the
daily backup process.

After restic back ups any new data and prunes old snapshots, we run some
restoration tests:
A good safeguard is to run automated restore tests as part of the
backup process. After restic backs up any new data and prunes old snapshots, it
also performs the following:

- Check a subset of all data with [restic
check](https://restic.readthedocs.io/en/stable/045_working_with_repos.html#checking-integrity-and-consistency)
(1% in my case)
- Restore a series of test files and compare them with the original

While complete checks and restores would be more representative of the integrity
of the backups, they are also very unfeasible for obvious reasons[^1].
of the backups, they are largely infeasible[^1].

## Restic check

@@ -55,39 +54,48 @@ $ autorestic exec -av -- check --read-data-subset=1%
## Restoring Test Files

In addition to restic's native integrity checks, we also run explicit checks by
restoring a test file after the backup process.

Before every backup, a `generate-restore-test-files` script is executed to
create a test file with random contents in a specified test directory. The test
directory stores the last 5 generated test files and is included in the backup.
restoring a test file after the backup process. This involves creating a test
file with random content in a specified test directory before the backup:

```bash
#!/bin/bash

# generate-restore-test-files.sh
dd if=/dev/random of="$RESTORE_DIR/test-$(date +%Y-%m-%d)" count=10 >/dev/null 2>&1
TEST_DIR="$HOME/restore-test"
dd if=/dev/random of="$TEST_DIR/test-$(date +%Y-%m-%d)" count=10 >/dev/null 2>&1

# delete any files older than 5 days
cd "$TEST_DIR" && \
find . -type f ! -newerct "$(date --date='-5 days' '+%Y/%m/%d %H:%M:%S')" -delete
```

After every backup, all files in the test directory are restored from the latest
backup snapshot to a separate temporary directory. These restored files are `diff`-ed
with the originals found in the test directory. The backup will fail if any of
the files are different.
Test files older than 5 days are discarded, so the directory holds roughly the
last 5 generated files. During the backup, restic restores the files in this
directory to a temporary directory.

```bash
# backup.sh

...
RESTORE_DIR="$HOME/restore-test"
TMP_DIR="$(mktemp -d)"
autorestic restore -v --include "$RESTORE_DIR" --to "$TMP_DIR"
```
These restored files are `diff`-ed with the originals found in the test
directory. The backup will fail if any of the files are different.

```bash
RESTORED_FILES="$(cd "$RESTORE_DIR" && find . -type f -printf '%f\n')"

for file in $RESTORED_FILES; do
diff "$RESTORE_DIR/$file" "${TMP_DIR}${RESTORE_DIR}/$file"
done
```

The backup process is run with systemd timers. An extract of the
If there are any differences, `diff` returns an exit code of `1`, causing the
script to fail. Otherwise, the backup passes and the script cleans up any
temporary directories.
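
For completeness, here is a rough sketch of how that failure handling and
cleanup could be wired up — the `set -euo pipefail`, the explicit `exit` and the
`trap`-based cleanup are assumptions for illustration, not the exact contents of
my `backup.sh`:

```bash
# Hypothetical sketch: variables reuse the snippets above
set -euo pipefail

cleanup() {
    rm -rf "$TMP_DIR"
}
# always remove the temporary restore directory, pass or fail
trap cleanup EXIT

for file in $RESTORED_FILES; do
    if ! diff -q "$RESTORE_DIR/$file" "${TMP_DIR}${RESTORE_DIR}/$file"; then
        echo "restore test failed: $file differs" >&2
        exit 1
    fi
done
```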

## Systemd Timer

The backup process is scheduled with systemd timers. An extract of the
`backup.service` file is as follows:

```
@@ -108,6 +116,6 @@ role.
## References
- [Preparing for the worst](https://tomm.org/2022/preparing-for-the-worst)

[^1]: High bandwidth costs for checks on remote backup repositories, the need
for disk space to perform the restores to, very time-consuming depending on
your network speeds etc.
[^1]: Full checks and restores incur high bandwidth costs on remote backup
repositories and require enough local disk space to restore to, making them
expensive and time-consuming.
11 changes: 3 additions & 8 deletions content/posts/hubble.md
@@ -1,7 +1,7 @@
---
title: "Hubble Homelab"
date: 2022-07-25T16:30:00+08:00
lastmod: 2022-09-01
lastmod: 2024-01-22
draft: false
toc: true
images:
@@ -10,9 +10,6 @@ tags:
- selfhosted
---

After the [Planck]({{< ref "/posts/selfhosting.md" >}})[^1], I wanted a dedicated
server for learning and working with DevOps concepts and tools.

## Hubble

Hubble is an Intel HP Elitedesk 800 G2 Mini NUC (i5-6500T, 8GB DDR4). It has more than 1
@@ -183,7 +180,7 @@ were still ongoing since they began five hours ago. Not good.

Without thinking, I decided to go for the easiest solution: turn it off and on again and
hope for the best (In hindsight, NEVER DO THIS). On boot, I checked for data loss.
Everything seemed normal[^2] and all files were supposedly there. It was then I also
Everything seemed normal[^1] and all files were supposedly there. It was then I also
realised I needed a better way to check for data loss and backup integrity.

Next, I tried to identify the root cause:
@@ -241,7 +238,5 @@ Let's see where we'll be in another six months.
>At the time of writing (Jul 2022), Hubble has remained online and stable for
>more than two months without maintenance, while I took a break.
[^1]: Not to be confused with the [Planck]({{< ref "/posts/keyboards/planck.md" >}})
keyboard that I use.
[^2]: Except that I discovered that the static route from the Proxmox host to NFS server
[^1]: Except that I discovered that the static route from the Proxmox host to the NFS server
disappeared on reboot. I had forgotten to set up a permanent route.
2 changes: 1 addition & 1 deletion content/posts/hugo-serve.md
@@ -2,7 +2,7 @@
title: "hugo serve"
date: 2021-11-18T16:30:33+08:00
lastmod: 2021-11-18
draft: false
draft: true
toc: false
images:
tags:
29 changes: 14 additions & 15 deletions content/posts/keyboards/planck.md
@@ -1,6 +1,7 @@
---
title: "Keyboards - Planck"
date: 2022-01-21T17:10:11+08:00
lastmod: 2024-01-22
draft: false
toc: true
tags:
@@ -9,35 +10,32 @@ tags:

{{< figure src="https://imgs.xkcd.com/comics/borrow_your_laptop.png" caption="relevant xkcd 1806" link="https://xkcd.com/1806" class="center" >}}

I have been using the Planck keyboard for almost a year now, at this time of writing.

{{< figure src="/posts/keyboards/images/planck.png" caption="The Planck Rev 6" alt="The Planck Rev 6" class="center" width="350px">}}

Specs:
At the time of writing, I have been using the [Planck
keyboard](https://olkb.com/collections/planck) for almost a year. Specs ([bill of materials](#bill-of-materials)):
- 67g Tangerines linear switches
- Black DSA blank keycaps
- Lubed and filmed with Krytox 205g0 and Deskeys

A [bill of materials](#bill-of-materials) is included at the end of this post.

{{< figure src="/posts/keyboards/images/planck.png" caption="The Planck Rev 6" alt="The Planck Rev 6" class="center" width="350px">}}

## Features

I got the Planck because I wanted to try out a 40% ortholinear keyboard. Why? I
just thought it might be fun.

The Planck has a 4x12 layout with a maximum of just 48 keys. It is fully
programmable with [QMK firmware](https://github.com/qmk/qmk_firmware) and to top
it off, its fully hotswappable. Of course, soldering is fun too, but there are
other opportunities for that.
programmable with [QMK firmware](https://github.com/qmk/qmk_firmware) and fully
hotswappable[^1].

As far as 40% keyboards go, the Planck is a classic choice. *Layers*[^1] make up
As far as 40% keyboards go, the Planck is a classic choice. Layers[^2] make up
for the lack of number and function rows, and you can create some cool key
combos based on your workflow.

However, the ortholinear layout does take some time getting used to, as opposed
to the staggered layout. My WPM fell sharply in my first 3 weeks, but I
adapted quickly as I was already practicing touch typing. I also switched to
the Planck around the time I was fully writing my undergraduate thesis helped me
to the staggered layout. My WPM fell sharply in my first 3 weeks, but I adapted
quickly as I was already practicing touch typing. I also switched to the Planck
around the time I was writing my undergraduate thesis, which helped me
practice.

{{< figure src="/posts/keyboards/images/monkeytype.png" caption="You can clearly see the steep drop, followed by consistently low tries. From [monkeytype.com](https://monkeytype.com)" alt="My drop in WPM" class="center" >}}
Expand All @@ -48,7 +46,7 @@ a little better now - I can pinpoint `$` as the 4th symbol, although I still
occasionally mix up the positions of `%, ^, &` and `*`.

I also discovered that I use my index finger to hit the spacebar as opposed to
my thumbs, and this is considered weird. To me, it seems natural, granted I've
my thumbs and this is considered weird. To me, it seems natural, granted I've
been doing it all my life. I did consider forcing myself to relearn this but I
didn't see a point since my keyboard was already so tiny.

@@ -124,4 +122,5 @@ You also need the following optional items:
- Switch puller
- Switch opener (or use a screwdriver)

[^1]: Layers are activated by holding down the *raise* or *lower* keys and pressing the desired key.
[^1]: Soldering is fun, but there are other opportunities for that.
[^2]: Layers are activated by holding down the *raise* or *lower* keys and pressing the desired key.
81 changes: 46 additions & 35 deletions content/posts/monitoring-backups-with-prometheus.md
@@ -1,7 +1,7 @@
---
title: "Monitoring Backups With Prometheus"
date: 2023-08-10
lastmod: 2023-08-10
lastmod: 2024-01-22
draft: false
toc: true
tags:
@@ -10,52 +10,39 @@ tags:
- prometheus
---

I previously wrote about [running automated restore tests]({{< ref "automated-testing-of-restic-backups.md" >}})
for daily restic backups. But we didn't dicuss how we will be alerted should any
backups fail. Some possible methods of sending notifications are:
I previously wrote about [running automated restore tests]({{< ref
"automated-testing-of-restic-backups.md" >}}) when performing daily restic
backups. If the backup script fails, it should send out an alert or
notification. Some possible methods of alerting include:

- systemd's `OnFailure` key to run a script on failure
- via webhooks (eg. with [uptime-kuma](https://github.com/louislam/uptime-kuma)
or Gotify)
- webhooks (eg. with [uptime-kuma](https://github.com/louislam/uptime-kuma) or
Gotify)
- Prometheus and AlertManager

I decided to go with the last option because I've never written a Prometheus
exporter before and also wanted to monitor some backup metrics anyway.
exporter before and wanted to try it out. It would also provide some backup
metrics that might be useful.

## Prometheus Exporter

The `backup-exporter` script generates metrics after every backup that are
consumed by [Node exporter's
My Prometheus exporter is a Python
[script](https://github.com/kencx/homelab/blob/master/ansible/roles/autorestic/files/backup-exporter)
that generates text metrics which are then consumed by [Node exporter's
textfile-collector](https://github.com/prometheus/node_exporter#textfile-collector).
These metrics are exposed to Prometheus, where they are consumed by Grafana and
AlertManager.
These metrics are exposed to Prometheus, where they are then consumed by Grafana
and AlertManager.
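
For context, a minimal sketch of how the textfile collector fits in — the
directory path below is an assumption for illustration, not necessarily the one
used in my setup; node_exporter simply scrapes any `*.prom` files in the
directory it is pointed at:

```bash
# Hypothetical example: node_exporter picks up *.prom files from this directory
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile

# the exporter script's output just needs to end up in that directory
cp restic.prom /var/lib/node_exporter/textfile/restic.prom
```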

Extending on the previous `backup.service`:

```
# /etc/systemd/system/backup.service
[Service]
Type=oneshot
ExecStartPre=/usr/bin/generate-restore-test-files.sh
ExecStart=/usr/bin/autorestic-backup.sh
ExecStartPost=/usr/bin/backup-exporter -l /var/log/autorestic.log -e restic.prom
```

{{< alert type="note" >}}
The complete `backup-exporter` script is found
[here](https://github.com/kencx/homelab/blob/master/ansible/roles/autorestic/files/backup-exporter).
{{< /alert >}}

Because restic does not output any metrics or logs in a machine-readable format,
`backup-exporter` is a custom Python script to parse the log output of restic:
Because restic does not output any metrics or logs in a machine-readable format
(such as `json`), the script reads and parses the log output of restic directly:

```bash
# autorestic.log
Files: 0 new, 0 changed, 11621 unmodified
Dirs: 0 new, 2 changed, 1338 unmodified
Added to the repository: 724 B (862 B stored)
processed 11621 files, 14.647 GiB in 0:02
```

```python
files = re.compile(r"Files:.*?(\d+) new.*?(\d+) changed.*?(\d+) unmodified")
dirs = re.compile(r"Dirs:.*?(\d+) new.*?(\d+) changed.*?(\d+) unmodified")
@@ -78,10 +65,20 @@ restic_repo_total_files{location="archives",backend="remote"} 11621
restic_repo_duration_seconds{location="archives",backend="remote"} 355
```

These above metrics are repeated for each separate autorestic location and
These generated metrics are repeated for each separate autorestic location and
backend. Alongside these repository-specific metrics, there are also two general
metrics that indicate whether the backup passed and when it was last run:

```python
def add_general_metrics(success):
    num = 0 if success else 1
    m = """
restic_backup_success {num}
restic_backup_latest_datetime {timestamp}
""".format(num=num, timestamp=datetime.datetime.now().timestamp())
    return m.strip()
```

```
restic_backup_success 0
restic_backup_latest_datetime 1691533984.343583
@@ -90,16 +87,30 @@ restic_backup_latest_datetime 1691533984.343583
Should a backup fail without any logs/stats to parse, the script will only
generate the general metrics.

## Systemd

This custom script runs after every backup by extending `backup.service` to
include an `ExecStartPost` entry:

```
# /etc/systemd/system/backup.service
[Service]
Type=oneshot
ExecStartPre=/usr/bin/generate-restore-test-files.sh
ExecStart=/usr/bin/autorestic-backup.sh
ExecStartPost=/usr/bin/backup-exporter -l /var/log/autorestic.log -e restic.prom
```
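
As a quick usage sketch (unit and file names are assumed from the snippets above
rather than taken from the repo), the chain can be exercised by hand:

```bash
# run the backup service once and check that it succeeded
sudo systemctl start backup.service
systemctl status backup.service

# the textfile collector directory should now contain the restic_* metrics
# (directory path is an assumption, as above)
cat /var/lib/node_exporter/textfile/restic.prom
```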

## Grafana Dashboard

{{< figure src="/posts/images/backup-grafana-dashboard.png" caption="Grafana dashboard for backups" class="center" >}}

## AlertManager

AlertManager is configured to send a Telegram notification if:
Finally, we configure AlertManager to send a Telegram notification if:

- A backup fails
- A backup has not been successfully completed in the past 26 hours (timestamp
- A backup has not been successfully completed in the past 26 hours (i.e. timestamp
metric is too old).

```yml
@@ -116,8 +127,8 @@ groups:
summary: 'Backup failed at {{ with query "restic_backup_latest_datetime" }}{{ . | first | value | humanizeTimestamp }}{{ end }}'
```
A 2 hour grace period is given to account for certain days where the backup
might take longer than the previous day's, which would cause a false-negative.
A 2-hour grace period accounts for days when a backup takes longer than the
previous day's, which would otherwise trigger a false alert.
## References