feat(authentik): Add api and library docs
[Using the API](https://docs.goauthentik.io/docs/developer-docs/api/)

There is a [python library](https://pypi.org/project/authentik-client/).

feat(badblocks#Check the health of a disk with badblocks): Check the health of a disk with badblocks

The `badblocks` command will write and read the disk with different patterns, thus overwriting the whole disk, so you will lose all the data on it. This test is fine for rotational disks, as massive writes don't degrade them, but do not use it on SSDs.

WARNING: be sure that you specify the correct disk!!

```bash
badblocks -wsv -b 4096 /dev/sde | tee disk_analysis_log.txt
```

If errors are shown, it means that all of the spare sectors of the disk are already used, so you must not use this disk anymore. Again, check `dmesg` for traces of disk errors.

feat(kubectl_commands#Get the node architecture of the pods of a deployment): Get the node architecture of the pods of a deployment

Here are a few ways to check the node architecture of pods in a deployment:

1. Get the nodes where the pods are running:

   ```bash
   kubectl get pods -l app=your-deployment-label -o wide
   ```

   This will show which nodes are running your pods.

2. Then check the architecture of those nodes:

   ```bash
   kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
   ```

Or you can combine this into a single command:

```bash
kubectl get pods -l app=your-deployment-label -o json | jq -r '.items[].spec.nodeName' | xargs -I {} kubectl get node {} -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
```

You can also check if your deployment is explicitly targeting specific architectures through node selectors or affinity rules:

```bash
kubectl get deployment your-deployment-name -o yaml | grep -A 5 nodeSelector
```

feat(dragonsweeper): Introduce dragonsweeper

[DragonSweeper](https://danielben.itch.io/dragonsweeper) is an addictive, simple, RPG-tinged take on the Minesweeper formula. You can [play it for free](https://danielben.itch.io/dragonsweeper) in your browser.

If you're lost at the beginning, start reading the [ArsTechnica blog post](https://arstechnica.com/gaming/2025/02/dragonsweeper-is-my-favorite-game-of-2025-so-far).

**Tips**

- Use `Shift` to mark numbers you already know.

**References**

- [Play](https://danielben.itch.io/dragonsweeper)
- [Home](https://danielben.itch.io/dragonsweeper)
- [ArsTechnica blog post](https://arstechnica.com/gaming/2025/02/dragonsweeper-is-my-favorite-game-of-2025-so-far)

feat(fzf_nvim#How to exclude some files from the search): How to exclude some files from the search

If anyone else comes here in the future and has the following setup:

- Using `fd` as the default command: `export FZF_DEFAULT_COMMAND='fd --type file --hidden --follow'`
- Using `:Rg` to grep in files

And wants to exclude a specific path in a git project, say `path/to/exclude` (which should not be added to `.gitignore`), from both `fd` and `rg` as used by `fzf.vim`, then the easiest way I found is to create ignore files for the respective tools and then ignore those files in the local git clone (as they are only used by me):

```bash
cd git_proj/
echo "path/to/exclude" > .rgignore
echo "path/to/exclude" > .fdignore
printf ".rgignore\n.fdignore" >> .git/info/exclude
```
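To confirm that both tools pick up the exclusions, you can list what each of them now sees. A quick check, assuming the ignore files were created at the root of `git_proj/` as above:

```bash
cd git_proj/
# fd reads .fdignore, so nothing under the excluded path should be listed
fd --type file --hidden --follow | grep path/to/exclude
# rg reads .rgignore, so the excluded path should not appear either
rg --files | grep path/to/exclude
# git should not report the ignore files as untracked
git status --short
```

If the two `grep` calls return nothing, the exclusions are working.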
feat(hacktivist_collectives): Add critical switch

[Critical Switch](https://critical-switch.org/): a non-mixed transhackfeminist collective interested in free culture, privacy and digital security. They promote a culture of security to create safer spaces within social and activist movements.

feat(hacktivist_collectives): Add México collectives

- [Sursiendo](https://sursiendo.org/quienes-somos/)
- [Tecnoafecciones](https://tecnoafecciones.net)

feat(himalaya#Configure GPG): Configure GPG

Himalaya relies on cargo features to enable gpg. You can see the default enabled features in the [Cargo.toml](https://github.com/pimalaya/himalaya/blob/master/Cargo.toml#L18) file. As of 2025-01-27 the `pgp-commands` feature is enabled, so you only need to add the next section to your config:

```ini
pgp.type = "commands"
```

And then you can use both the cli and the vim plugin with gpg. Super easy.

feat(instant_messages_management): Add interesting article to merge all protocols under matrix

You can [use bridges to merge all into matrix](https://technicallyrural.ca/2021/04/05/unify-signal-whatsapp-and-sms-in-a-personal-matrix-server-part-1-matrix/).

feat(k9#How to set a master password): How to set a master password

You can't: it's not supported and it doesn't look like it will be ([1](https://forum.k9mail.app/t/password-protection-for-launch-of-k9/6871/11), [2](https://forum.k9mail.app/t/can-i-password-protect-app-on-startup/6755/6)).

feat(zfs#Removing a disk from the pool): Removing a disk from the pool

```bash
zpool remove tank0 sda
```

This will trigger the data evacuation from the disk. Check `zpool status` to see when it finishes.

feat(zfs#Encrypting ZFS Drives with LUKS): Encrypting ZFS Drives with LUKS

**Warning: Proceed with Extreme Caution**

**IMPORTANT SAFETY NOTICE:**

- These instructions will COMPLETELY WIPE the target drive
- Do NOT attempt on production servers
- Experiment only on drives with no valuable data
- Seek professional help if anything is unclear

**Prerequisites**

- A drive you want to encrypt (will be referred to as `/dev/sdx`)
- Root access
- Basic understanding of Linux command line
- Backup of all important data

**Step 1: Create LUKS Encryption Layer**

First, format the drive with LUKS encryption:

```bash
sudo cryptsetup luksFormat /dev/sdx
```

- You'll be prompted for a sudo password
- Create a strong encryption password (mix of uppercase, lowercase, numbers, symbols)
- Note the precise capitalization in commands

**Step 2: Open the Encrypted Disk**

Open the newly encrypted disk:

```bash
sudo cryptsetup luksOpen /dev/sdx sdx_crypt
```

This creates a mapped device at `/dev/mapper/sdx_crypt`.

**Step 3: Create the ZFS Pool or the vdev**

For example, to create a ZFS pool on the encrypted device:

```bash
sudo zpool create -f -o ashift=12 \
  -O compression=lz4 \
  zpool /dev/mapper/sdx_crypt
```

Check the [create zpool section](#create-your-pool) to know which configuration flags to use.

**Step 4: Set Up Automatic Unlocking**

*Generate a Keyfile*

Create a random binary keyfile:

```bash
sudo dd bs=1024 count=4 if=/dev/urandom of=/etc/zfs/keys/sdx.key
sudo chmod 0400 /etc/zfs/keys/sdx.key
```

*Add Keyfile to LUKS*

Add the keyfile to the LUKS disk:

```bash
sudo cryptsetup luksAddKey /dev/sdx /etc/zfs/keys/sdx.key
```

- You'll be asked to enter the original encryption password
- This adds the binary file to the LUKS disk header
- Now you can unlock the drive using either the password or the keyfile

**Step 5: Configure Automatic Mounting**

*Find Drive UUID*

Get the drive's UUID:

```bash
sudo blkid
```

Look for the line with `TYPE="crypto_LUKS"`. Copy the UUID.
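If the `blkid` output is too long, you can ask it directly for the UUID of the LUKS container. A small helper, assuming the encrypted drive is still `/dev/sdx`:

```bash
# Print only the UUID of the LUKS container on /dev/sdx
sudo blkid -s UUID -o value /dev/sdx
# Or list every crypto_LUKS device on the system
sudo blkid -t TYPE=crypto_LUKS
```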
*Update Crypttab*

Edit the crypttab file:

```bash
sudo vim /etc/crypttab
```

Add an entry like:

```
sdx_crypt UUID=your-uuid-here /etc/zfs/keys/sdx.key luks,discard
```

**Final Step: Reboot**

- Reboot your system
- The drive will be automatically decrypted and imported

**Best Practices**

- Keep your keyfile and encryption password secure
- Store keyfiles with restricted permissions
- Consider backing up the LUKS header

**Troubleshooting**

- Double-check UUIDs
- Verify keyfile permissions
- Ensure cryptsetup and ZFS are installed

**Security Notes**

- This method provides full-disk encryption at rest
- Data is inaccessible without the key or password
- Protects against physical drive theft

**Disclaimer**

While these instructions are comprehensive, they come with inherent risks. Always:

- Have backups
- Test in non-critical environments first
- Understand each step before executing

**Further reading**

- [Setting up ZFS on LUKS - Alpine Linux Wiki](https://wiki.alpinelinux.org/wiki/Setting_up_ZFS_on_LUKS)
- [Decrypt Additional LUKS Encrypted Volumes on Boot](https://www.malachisoord.com/2023/11/04/decrypt-additiona-luks-encrypted-volumes-on-boot/)
- [Auto-Unlock LUKS Encrypted Drive - Dradis Support Guide](https://dradis.com/support/guides/customization/auto-unlock-luks-encrypted-drive.html)
- [How do I automatically decrypt an encrypted filesystem on the next reboot? - Ask Ubuntu](https://askubuntu.com/questions/996155/how-do-i-automatically-decrypt-an-encrypted-filesystem-on-the-next-reboot)

feat(zfs#Add a disk to an existing vdev): Add a disk to an existing vdev

```bash
zpool add tank /dev/sdx
```

feat(zfs#Add a vdev to an existing pool): Add a vdev to an existing pool

```bash
zpool add main raidz1 /dev/disk-1 /dev/disk-2 /dev/disk-3 /dev/disk-4
```

You don't need to specify the `ashift` or the `autoexpand` as they are set on zpool creation.

feat(zfs#books): Add zfs book

- [FreeBSD Mastery: ZFS by Michael W Lucas and Allan Jude](https://mwl.io/nonfiction/os#fmzfs)

feat(linux_snippets#Record the audio from your computer): Record the audio from your computer

You can record audio being played in a browser using `ffmpeg`:

1. Check your default audio source:

   ```sh
   pactl list sources | grep -E 'Name|Description'
   ```

2. Record using `ffmpeg`:

   ```sh
   ffmpeg -f pulse -i <your_monitor_source> output.wav
   ```

   Example:

   ```sh
   ffmpeg -f pulse -i alsa_output.pci-0000_00_1b.0.analog-stereo.monitor output.wav
   ```

3. Stop recording with **Ctrl+C**.

feat(nas): Suggest to look at the slimbook

I built a server pretty much the same as the [slimbook](https://slimbook.com/en/shop/product/nas-cube-1510?category=10).

feat(orgmode#Footnotes): Footnotes

A footnote is started by a footnote marker in square brackets in column 0, no indentation allowed. It ends at the next footnote definition, headline, or after two consecutive empty lines. The footnote reference is simply the marker in square brackets, inside text. Markers always start with `fn:`. For example:

```
The Org website[fn:1] now looks a lot better than it used to.
...
[fn:50] The link is: https://orgmode.org
```

Nvim-orgmode has [some basic support for footnotes](https://github.com/nvim-orgmode/orgmode/commit/4f62b7f#diff-fa091537281e07e5e58902b6484b097442300c98e115ab29f4374abbe98b8d3d).
feat(orgmode#custom-agendas): Custom agendas

You can use [custom agenda commands](https://github.com/nvim-orgmode/orgmode/blob/d62fd3cdb2958e2e76fb0af4ea64d6209703fbe0/DOCS.md#org_agenda_custom_commands) to define custom agenda views that are available through the `org_agenda` mapping. It is possible to combine multiple agenda types into a single view. An example:

```lua
require('orgmode').setup({
  org_agenda_files = {'~/org/**/*'},
  org_agenda_custom_commands = {
    -- "c" is the shortcut that will be used in the prompt
    c = {
      description = 'Combined view', -- Description shown in the prompt for the shortcut
      types = {
        {
          type = 'tags_todo', -- Type can be agenda | tags | tags_todo
          match = '+PRIORITY="A"', -- Same as providing a "Match:" for tags view <leader>oa + m, See: https://orgmode.org/manual/Matching-tags-and-properties.html
          org_agenda_overriding_header = 'High priority todos',
          org_agenda_todo_ignore_deadlines = 'far', -- Ignore all deadlines that are too far in future (over org_deadline_warning_days). Possible values: all | near | far | past | future
        },
        {
          type = 'agenda',
          org_agenda_overriding_header = 'My daily agenda',
          org_agenda_span = 'day', -- can be any value as org_agenda_span
        },
        {
          type = 'tags',
          match = 'WORK', -- Same as providing a "Match:" for tags view <leader>oa + m, See: https://orgmode.org/manual/Matching-tags-and-properties.html
          org_agenda_overriding_header = 'My work todos',
          org_agenda_todo_ignore_scheduled = 'all', -- Ignore all headlines that are scheduled. Possible values: past | future | all
        },
        {
          type = 'agenda',
          org_agenda_overriding_header = 'Whole week overview',
          org_agenda_span = 'week', -- 'week' is default, so it's not necessary here, just an example
          org_agenda_start_on_weekday = 1, -- Start on Monday
          org_agenda_remove_tags = true, -- Do not show tags only for this view
        },
      }
    },
    p = {
      description = 'Personal agenda',
      types = {
        {
          type = 'tags_todo',
          org_agenda_overriding_header = 'My personal todos',
          org_agenda_category_filter_preset = 'todos', -- Show only headlines from `todos` category. Same value provided as when pressing `/` in the Agenda view
          org_agenda_sorting_strategy = {'todo-state-up', 'priority-down'}, -- See all options available on org_agenda_sorting_strategy
        },
        {
          type = 'agenda',
          org_agenda_overriding_header = 'Personal projects agenda',
          org_agenda_files = {'~/my-projects/**/*'}, -- Can define files outside of the default org_agenda_files
        },
        {
          type = 'tags',
          org_agenda_overriding_header = 'Personal projects notes',
          org_agenda_files = {'~/my-projects/**/*'},
          org_agenda_tag_filter_preset = 'NOTES-REFACTOR', -- Show only headlines with NOTES tag that does not have a REFACTOR tag. Same value provided as when pressing `/` in the Agenda view
        },
      }
    }
  }
})
```

You can also define the `org_agenda_sorting_strategy`. The default value is `{ agenda = {'time-up', 'priority-down', 'category-keep'}, todo = {'priority-down', 'category-keep'}, tags = {'priority-down', 'category-keep'}}`.

The available sorting strategies to apply to a given view are:
- `time-up`: Sort entries by time of day. Applicable only in agenda view
- `time-down`: Opposite of time-up
- `priority-down`: Sort by priority, from highest to lowest
- `priority-up`: Sort by priority, from lowest to highest
- `tag-up`: Sort by sorted tags string, ascending
- `tag-down`: Sort by sorted tags string, descending
- `todo-state-up`: Sort by todo keyword by position (example: 'TODO, PROGRESS, DONE' has a sort value of 1, 2 and 3), ascending
- `todo-state-down`: Sort by todo keyword, descending
- `clocked-up`: Show clocked in headlines first
- `clocked-down`: Show clocked in headlines last
- `category-up`: Sort by category name, ascending
- `category-down`: Sort by category name, descending
- `category-keep`: Keep default category sorting, as it appears in org-agenda-files

You can open the custom agendas with the API too. For example, to open the agenda stored under `t`:

```lua
keys = {
  {
    "gt",
    function()
      vim.notify("Opening today's agenda", vim.log.levels.INFO)
      require("orgmode.api.agenda").open_by_key("t")
    end,
    desc = "Open orgmode agenda for today",
  },
},
```

In that case I'm configuring the `keys` section of the lazyvim plugin.

Through the API you can also configure these options:

- `org_agenda_files`
- `org_agenda_sorting_strategy`
- `org_agenda_category_filter_preset`
- `org_agenda_todo_ignore_deadlines`: Ignore all deadlines that are too far in future (over org_deadline_warning_days). Possible values: all | near | far | past | future
- `org_agenda_todo_ignore_scheduled`: Ignore all headlines that are scheduled. Possible values: past | future | all

feat(orgmode#Load different agendas with the same binding depending on the time): Load different agendas with the same binding depending on the time

I find it useful to bind `gt` to Today's agenda, but what "today" means differs between week days. Imagine that you want to load a work agenda from Monday to Friday before 18:00, and a personal agenda the rest of the time. You could then configure this function:

```lua
keys = {
  {
    "gt",
    function()
      local current_time = os.date("*t")
      local day = current_time.wday -- 1 = Sunday, 2 = Monday, etc.
      local hour = current_time.hour

      local agenda_key = "t"
      local agenda_name = "Today's" -- default

      -- Monday (2) through Friday (6)
      if day >= 2 and day <= 6 then
        if hour < 17 then
          agenda_key = "w"
          agenda_name = "Today + Work"
        end
      end
      vim.notify("Opening " .. agenda_name .. " agenda", vim.log.levels.INFO)
      require("orgmode.api.agenda").open_by_key(agenda_key)
    end,
    desc = "Open orgmode agenda for today",
  },
}
```

feat(orgmode#Better handle indentations): Better handle indentations

There is something called [virtual indents](https://github.com/nvim-orgmode/orgmode/blob/master/docs/configuration.org#org_startup_indented) that will prevent many indentation headaches. To enable them set the `org_startup_indented = true` configuration.

If you need to adjust the indentation of your document (for example after enabling the option on existent orgmode code), visually select the lines to correct the indentation (`V`) and then press `=`. You can do this with the whole file `(╥﹏╥)`.
feat(orgmode#Remove some tags when the state has changed to DONE): Remove some tags when the state has changed to DONE

For example, if you want to remove them for recurrent tasks:

```lua
local function remove_specific_tags(headline)
  local tagsToRemove = { "t", "w", "m", "q", "y" }
  local currentTags = headline:get_tags()
  local newTags = {}
  local needsUpdate = false

  -- Build new tags list excluding the ones to remove
  for _, tag in ipairs(currentTags) do
    local shouldKeep = true
    for _, removeTag in ipairs(tagsToRemove) do
      if tag == removeTag then
        shouldKeep = false
        needsUpdate = true
        break
      end
    end
    if shouldKeep then
      table.insert(newTags, tag)
    end
  end

  -- Only update if we actually removed something
  if needsUpdate then
    headline:set_tags(table.concat(newTags, ":"))
    headline:refresh()
  end
end

local EventManager = require("orgmode.events")

EventManager.listen(EventManager.event.TodoChanged, function(event)
  ---@cast event OrgTodoChangedEvent
  if event.headline then
    local current_todo, _, _ = event.headline:get_todo()
    if current_todo == "DONE" then
      remove_specific_tags(event.headline)
    end
  end
end)
```

feat(orgmode#Register the todo changes in the logbook): Register the todo changes in the logbook

You can now register the changes with events. Add this to your plugin config. If you're using lazyvim:

```lua
return {
  {
    "nvim-orgmode/orgmode",
    config = function()
      require("orgmode").setup({...})

      local EventManager = require("orgmode.events")
      local Date = require("orgmode.objects.date")

      EventManager.listen(EventManager.event.TodoChanged, function(event)
        ---@cast event OrgTodoChangedEvent
        if event.headline then
          local current_todo, _, _ = event.headline:get_todo()
          local now = Date.now()
          event.headline:add_note({
            'State "' .. current_todo .. '" from "' .. event.old_todo_state .. '" [' .. now:to_string() .. "]",
          })
        end
      end)
    end,
  },
}
```

feat(orgmode#API usage): API usage

**[Get the headline under the cursor](https://github.com/nvim-orgmode/orgmode/commit/2c806ca)**

**[Read and write files](https://github.com/nvim-orgmode/orgmode/commit/500004ff315475033e3a9247b61addd922d1f5da)**

You have information on how to do it in [this pr](https://github.com/nvim-orgmode/orgmode/commit/500004ff315475033e3a9247b61addd922d1f5da).

**[Create custom hyperlink types](https://github.com/nvim-orgmode/orgmode/commit/8cdfc8d34bd9c5993ea8f933b5f5c306081ffb97)**

Custom types can trigger functionality such as opening the terminal and pinging the provided URL. To add your own custom hyperlink type, provide a custom handler to the `hyperlinks.sources` setting. Each handler needs to have a `get_name()` method that returns a name for the handler. Additionally, the optional `follow(link)` and `autocomplete(link)` methods are available to open the link and provide the autocompletion.

**[Refile a headline to another destination](https://github.com/nvim-orgmode/orgmode/issues/471#event-16071077147)**

You can do this [with the API](https://github.com/nvim-orgmode/orgmode/blob/master/doc/orgmode_api.txt#L27).
Assuming you are in the file where your TODOs are:

```lua
local api = require('orgmode.api')
local closest_headline = api.current():get_closest_headline()
local destination_file = api.load('~/org/journal.org')
local destination_headline = vim.tbl_filter(function(headline)
  return headline.title == 'My journal'
end, destination_file.headlines)[1]
api.refile({ source = closest_headline, destination = destination_headline })
```

**[Use events](https://github.com/nvim-orgmode/orgmode/tree/master/lua/orgmode/events)**

feat(orgzly#Not adding a todo state when creating a new element by default): Not adding a todo state when creating a new element by default

The default state `NOTE` doesn't add any state.

fix(pdm): Suggest to check uv

Maybe use [uv](https://astral.sh/blog/uv) instead (although so far I'm still using `pdm`).

feat(pretalx#Import a pretalx calendar in giggity): Import a pretalx calendar in giggity

Search for a url similar to `https://pretalx.com/<conference-name>/schedule/export/schedule.xml`.

feat(renovate#Installation): Installation in gitea actions

- Create a Renovate Bot account and generate a token for the Gitea Action secret.
- Add the renovate bot account as a collaborator with write permissions to the repository you want to update.
- Create a repository to store our Renovate bot configuration, for example called renovate-config.

In renovate-config, create a file `config.js` to configure Renovate:

```js
module.exports = {
  "endpoint": "https://gitea.com/api/v1", // replace it with your actual endpoint
  "gitAuthor": "Renovate Bot <[email protected]>",
  "platform": "gitea",
  "onboardingConfigFileName": "renovate.json",
  "autodiscover": true,
  "optimizeForDisabled": true,
};
```

If you're using mysql or you see errors like `.../repository/pulls 500 internal error` you [may need to set `unicodeEmoji: false`](https://github.com/renovatebot/renovate/issues/10264).

feat(roadmap_adjustment): Adjust the month review process

To record the results of the review create the section in `pages/reviews.org` with the following template:

```org
* winter
** january review
*** work
*** personal
**** month review
***** mental dump
****** What worries you right now?
****** What drained your energy or brought you down emotionally this last month?
****** What are the little things that burden you or slow you down?
****** What do you desire right now?
****** Where is your mind these days?
****** What did you enjoy most this last month?
****** What did help you most this last month?
****** What things would you want to finish throughout the month so you can carry them to the next?
****** What things do you feel you need to do?
****** What are you most proud of this month?
***** month checks
***** analyze
***** decide
```

I'm assuming it's january's review and that you have two kinds of reviews, one personal and one for work.

**Dump your mind**

The first thing we want to do in the review is to dump all that's in our mind into our system to free up mental load.

Try not to, but if you think of decisions you want to make that address the elements you're discovering, write them down in the `Decide` section of your review document.

There are different paths to discover actionable items:

- Analyze what is in your mind: Take 10 minutes to answer the questions of the template under the "mental dump" section (you don't need to answer them all). Notice that we do not need to review our life logging tools (diary, action manager, ...) to answer these questions.
  This means that we're doing an analysis of what is in our minds right now, not throughout the month. It's flawed, but as we do this analysis often, it's probably fine. We add more importance to the latest events in our life anyway.

**Clean your notebook**

- Empty the elements you added to the `review box`. I have them in my inbox with the tag `:review:` (you have it in the month agenda view `gM`)
- Clean your life notebook by:
  - Iterating over the areas of `proyects.org`, only checking the first level of projects, don't go deeper, and for each element:
    - Move the done elements either to `archive.org` or `logbook.org`.
    - Move to `backlog.org` the elements that don't make sense to be active anymore.
  - Checking if you have any `DONE` element in `calendar.org`.
  - Emptying the `inbox.org`.
  - Emptying the `DONE` elements of `talk.org`.
  - Cleaning the elements that don't make sense anymore from `think.org`.
- Process your `month checks`. For each of them:
  - If you need, add action elements in the `mental dump` section of the review.
  - Think of whether you've met the check.

**Refresh your idea of how the month went**

- Open your `bitácora.org` agenda view to see what has been completed in the last month (`match = 'CLOSED>"<-30d>"-work-steps-done'`), ordered by name (`org_agenda_sorting_strategy = { "category-keep" }`), and change the priority of the elements according to their impact.
- Open your `recurrent.org` agenda view to see what has been done in the last month (`match = 'LAST_REPEAT>"<-30d>"-work'`).
- Check what has been left of your month objectives `+m` and refile the elements that don't make sense anymore.
- Check the reports of your weekly reviews of the month in the `reviews.org` document.

**Check your close compromises**

Check all your action management tools (in my case `orgmode` and `ikhal`) to identify:

- Arranged compromises
- Trips

feat(roadmap_adjustment#Life roadmap adjustment): Life roadmap adjustment

**Create next stage's life notebook**

After reading "The Bulletproof Journal", I was drawn to the idea of changing notebooks each year, carrying over only the necessary things. I find this to be a powerful concept since you start each stage with a clean canvas. This brings you closer to desire versus duty as it removes the commitments you made to yourself, freeing up significant mental load. From this point, it's much easier to allow yourself to dream about what you want to do in this new stage.

I want to apply this concept to my digital life notebook as I see the following advantages:

- It lightens my files, making them easier to manage and faster to process with orgmode
- It's a very easy way to clean up
- It's an elegant way to preserve what you've recorded without it becoming a hindrance
- In each stage, you can start with a different notebook structure, meaning new axes, tools, and structures. This helps avoid falling into the rigidity of a constrained system or artifacts defined by inertia rather than conscious decision
- It allows you to avoid maintaining files that follow an old scheme or having to migrate them to the new system
- Additionally, you get rid of all those actions you've been reluctant to delete in one fell swoop

The notebook change can be done in two phases:

- Notebook Construction
- Stage Closure

**Notebook Construction**

This phase spans from when you start making stage adjustments until you finally close the current stage. You can follow these steps:

- Create a directory with the name of the new stage.
  In my case, it's the number of my predominant age during the stage.
- Create a directory for the current stage's notebook within "notebooks" in your references. Here we'll move everything that doesn't make sense to maintain. It's important that this directory isn't within your agenda files.
- Quickly review the improvements you've noted that you want to implement in next year's notebook to keep them in mind. You can note the references in the "Create new notebook" action.

As you review the stage, decide if it makes sense for the file you're viewing to exist as-is in the new notebook. Remember that the idea is to migrate minimal structure and data.

- If it makes sense: create a symbolic link in the new notebook. When closing the stage, we'll replace the link with the file's final state.
- If the file no longer makes sense, move it to `references/notebooks`.

feat(smartctl): Introduce smartctl

[Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T. or SMART)](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs). Its primary function is to detect and report various indicators of drive reliability, or how long a drive can function while anticipating imminent hardware failures.

When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so action can be taken to prevent data loss, and the failing drive can be replaced before any data is lost.

**General information**

*[Accuracy](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Accuracy)*

A field study at Google covering over 100,000 consumer-grade drives from December 2005 to August 2006 found correlations between certain S.M.A.R.T. information and annualized failure rates:

- In the 60 days following the first uncorrectable error on a drive (S.M.A.R.T. attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.
- First errors in reallocations, offline reallocations (S.M.A.R.T. attributes 0xC4 and 0x05 or 196 and 5) and probational counts (S.M.A.R.T. attribute 0xC5 or 197) were also strongly correlated to higher probabilities of failure.
- Conversely, little correlation was found for increased temperature and no correlation for usage level. However, the research showed that a large proportion (56%) of the failed drives failed without recording any count in the "four strong S.M.A.R.T. warnings" identified as scan errors, reallocation count, offline reallocation, and probational count.
- Further, 36% of failed drives did so without recording any S.M.A.R.T. error at all, except the temperature, meaning that S.M.A.R.T. data alone was of limited usefulness in anticipating failures.

**[Installation](https://blog.shadypixel.com/monitoring-hard-drive-health-on-linux-with-smartmontools/)**

On Debian systems:

```bash
sudo apt-get install smartmontools
```

By default when you install it, all your drives are checked periodically by the `smartd` daemon under the `smartmontools` systemd service.
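To verify that the periodic checks are actually running, you can inspect the daemon and its configuration. A quick check, assuming the Debian defaults (the unit is called `smartmontools` and the configuration lives in `/etc/smartd.conf`):

```bash
# Check that the smartd daemon is running
sudo systemctl status smartmontools
# See what smartd is configured to monitor (DEVICESCAN by default)
grep -v '^#' /etc/smartd.conf
# Follow its logs to see the periodic checks
sudo journalctl -u smartmontools -f
```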
**Usage**

**Running the tests**

**[Test types](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Self-tests)**

S.M.A.R.T. drives may offer a number of self-tests:

- Short: Checks the electrical and mechanical performance as well as the read performance of the disk. Electrical tests might include a test of buffer RAM, a read/write circuitry test, or a test of the read/write head elements. The mechanical test includes seeking and servo on data tracks. It scans small parts of the drive's surface (the area is vendor-specific and there is a time limit on the test), checks the list of pending sectors that may have read errors, and usually takes under two minutes.
- Long/extended: A longer and more thorough version of the short self-test, scanning the entire disk surface with no time limit. This test usually takes several hours, depending on the read/write speed of the drive and its size. It is possible for the long test to pass even if the short test fails.
- Conveyance: Intended as a quick test to identify damage incurred during transporting of the device from the drive manufacturer to the computer manufacturer. Only available on ATA drives, and it usually takes several minutes.

Drives remain operable during self-test, unless a "captive" option (ATA only) is requested.

**Long test**

Start with a long self-test with `smartctl`. Assuming the disk to test is `/dev/sdd`:

```bash
smartctl -t long /dev/sdd
```

The command will respond with an estimate of how long it thinks the test will take to complete.

To check progress use:

```bash
smartctl -A /dev/sdd | grep remaining
smartctl -c /dev/sdd | grep remaining
```

Don't check too often because it can abort the test with some drives. If you receive an empty output, examine the reported status with:

```bash
smartctl -l selftest /dev/sdd
```

If errors are shown, check `dmesg` as there are usually useful traces of the error.

feat(smartctl#Understanding the tests): Understanding the tests

The output of a `smartctl` command is difficult to read:

```
smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG series
Device Model:     SAMSUNG HD502HI
Serial Number:    S1VZJ9CS712490
Firmware Version: 1AG01118
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Feb 9 15:30:42 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status: (0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (6312) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (2) minutes.
Extended self-test routine recommended polling time: (106) minutes.
Conveyance self-test routine recommended polling time: (12) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 099   099   051    Pre-fail Always  -           2376
  3 Spin_Up_Time            0x0007 091   091   011    Pre-fail Always  -           3620
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           405
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 253   253   051    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0025 100   100   015    Pre-fail Offline -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           717
 10 Spin_Retry_Count        0x0033 100   100   051    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x0012 100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           405
 13 Read_Soft_Error_Rate    0x000e 099   099   000    Old_age  Always  -           2375
183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0033 100   100   000    Pre-fail Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           2375
188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 084   074   000    Old_age  Always  -           16 (Lifetime Min/Max 16/16)
194 Temperature_Celsius     0x0022 084   071   000    Old_age  Always  -           16 (Lifetime Min/Max 16/16)
195 Hardware_ECC_Recovered  0x001a 100   100   000    Old_age  Always  -           3558
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 098   098   000    Old_age  Always  -           81
198 Offline_Uncorrectable   0x0030 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           1
200 Multi_Zone_Error_Rate   0x000a 100   100   000    Old_age  Always  -           0
201 Soft_Read_Error_Rate    0x000a 253   253   000    Old_age  Always  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

**Checking overall health**

Somewhere in your report you'll see something like:

```
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
```

If it doesn't return PASSED, you should immediately back up all your data. Your hard drive is probably failing. That message can also be shown with `smartctl -H /dev/sda`.

**[Checking the SMART attributes](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes)**

Each drive manufacturer defines a set of attributes and sets threshold values beyond which attributes should not pass under normal operation. But they do not agree on precise attribute definitions and measurement units, so the following list of attributes is a general guide only.

If one or more attributes have the "prefailure" flag, and the "current value" of such a prefailure attribute is smaller than or equal to its "threshold value" (unless the "threshold value" is 0), that will be reported as a "drive failure". In addition, utility software can send the SMART RETURN STATUS command to the ATA drive, which may report three statuses: "drive OK", "drive warning" or "drive failure".
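To check this on a live system without reading the whole report, something like the following helps. A minimal sketch: `/dev/sda` and the selected attribute names are just examples, adjust them to your drives:

```bash
# Overall health verdict only
sudo smartctl -H /dev/sda
# Print the attribute table and keep only the ones most correlated with failures
sudo smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Spin_Retry_Count|Current_Pending_Sector|Offline_Uncorrectable'
```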
**[SMART attributes columns](https://ma.juii.net/blog/interpret-smart-attributes)**

Every SMART attribute has several columns, as shown by `smartctl -a <device>`:

- ID: The ID number of the attribute, good for comparing with other lists like [Wikipedia: S.M.A.R.T.: Known ATA S.M.A.R.T. attributes](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes), because the attribute names sometimes differ.
- Name: The name of the SMART attribute.
- Value: The current, normalized value of the attribute. Higher values are always better (except for temperature for hard disks of some manufacturers). The range is normally 0-100, for some attributes 0-255 (so that 100 resp. 255 is best, 0 is worst). There is no standard on how manufacturers convert their raw value to this normalized one: when the normalized value approaches the threshold, it can do so linearly, exponentially, logarithmically or any other way, meaning that a doubled normalized value does not necessarily mean "twice as good".
- Worst: The worst (normalized) value that this attribute had at any point of time while SMART was enabled. There seems to be no mechanism to reset current SMART attribute values, but this still makes sense as some SMART attributes, for some manufacturers, fluctuate over time, so keeping the worst one ever is meaningful.
- Threshold: The threshold below which the normalized value will be considered "exceeding specifications". If the attribute type is "Pre-fail", this means that SMART thinks the hard disk is just before failure. This will "trigger" SMART: setting it from "SMART test passed" to "SMART impending failure" or a similar status.
- Type: The type of the attribute. Either "Pre-fail" for attributes that are said to indicate impending failure, or "Old_age" for attributes that just indicate wear and tear. Note that one and the same attribute can be classified as "Pre-fail" by one manufacturer or for one model and as "Old_age" by another or for another model. This is the case for example for attribute Seek_Error_Rate (ID 7), which is a widespread phenomenon on many disks and not considered critical by some manufacturers, but Seagate has it as "Pre-fail".
- Raw value: The current raw value that was converted to the normalized value above. smartctl shows all of them as decimal values, but some attribute values of some manufacturers cannot be reasonably interpreted that way.

feat(smartctl#Reacting to SMART Values): Reacting to SMART Values

It is said that a drive that starts getting bad sectors (attribute ID 5) or "pending" bad sectors (attribute ID 197; they most likely are bad, too) will usually be trash in 6 months or less. The only exception would be if this does not happen: that is, the bad sector count increases, but then stays stable for a long time, like a year or more. For that reason, one normally needs a diagramming / journaling tool for SMART. Many admins will exchange the hard drive if it gets reallocated sectors (ID 5) or sectors "under investigation" (ID 197).

**[Critical SMART attributes](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes)**

Of all the attributes I'm going to analyse only the critical ones.

**Read Error Rate**

- ID: 01 (0x01)
- Ideal: Low
- Correlation with probability of failure: not clear

(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface.
The raw value has a different structure for different vendors and is often not meaningful as a decimal number. For some drives, this number may increase during normal operation without necessarily signifying errors.

**Reallocated Sectors Count**

- ID: 05 (0x05)
- Ideal: Low
- Correlation with probability of failure: Strong

Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the immediate months.

If the raw value of the 0x05 attribute is higher than its threshold value, that will be reported as a "drive warning".

**Spin Retry Count**

- ID: 10 (0x0A)
- Ideal: Low
- Correlation with probability of failure: Strong

Count of retries of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.

**Current Pending Sector Count**

- ID: 197 (0xC5)
- Ideal: Low
- Correlation with probability of failure: Strong

Count of "unstable" sectors (waiting to be remapped because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it has been successfully read.

However, some drives will not immediately remap such sectors when successfully read; instead the drive will first attempt to write to the problem sector, and if the write operation is successful the sector will then be marked as good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

If the raw value of the 0xC5 attribute is higher than its threshold value, that will be reported as a "drive warning".

**(Offline) Uncorrectable Sector Count**

- ID: 198 (0xC6)
- Ideal: Low
- Correlation with probability of failure: Strong

The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

In the 60 days following the first uncorrectable error on a drive (S.M.A.R.T. attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.

**Non critical SMART attributes**

The next attributes appear to change in the logs, but that doesn't mean that there is anything going wrong.

**Hardware ECC Recovered**

- ID: 195 (0xC3)
- Ideal: Varies
- Correlation with probability of failure: Low

(Vendor-specific raw value.) The raw value has a different structure for different vendors and is often not meaningful as a decimal number. For some drives, this number may increase during normal operation without necessarily signifying errors.
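Since what matters for the critical attributes above (5, 197, 198) is whether their raw values keep increasing over time, a simple way to journal them is to append periodic snapshots to a log file from a cron job. A minimal sketch; the device, attribute selection and log path are just examples:

```bash
#!/bin/bash
# Append a timestamped snapshot of the failure-correlated attributes to a log
disk=/dev/sda
log=/var/log/smart-trend.log
{
  date --iso-8601=seconds
  smartctl -A "$disk" | grep -E 'Reallocated_Sector_Ct|Spin_Retry_Count|Current_Pending_Sector|Offline_Uncorrectable'
} >> "$log"
```

Diffing successive entries of that log tells you whether a count is stable or climbing. The exporter described in the next section automates the same idea.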
feat(smartctl#Monitorization): Monitorization

To monitor your drive health you can use [prometheus](prometheus.md) with [alertmanager](alertmanager.md) for alerts and [grafana](grafana.md) for dashboards.

**Installing the exporter**

The prometheus community has its own [smartctl exporter](https://github.com/prometheus-community/smartctl_exporter).

**Using the binary**

You can download the latest binary from the repository [releases](https://github.com/prometheus-community/smartctl_exporter/releases) and configure the [systemd service](https://github.com/prometheus-community/smartctl_exporter/blob/master/systemd/smartctl_exporter.service):

```bash
unp smartctl_exporter-0.13.0.linux-amd64.tar.gz
sudo mv smartctl_exporter-0.13.0.linux-amd64/smartctl_exporter /usr/bin
```

Add the [service](https://github.com/prometheus-community/smartctl_exporter/blob/master/systemd/smartctl_exporter.service) to `/etc/systemd/system/smartctl-exporter.service`:

```ini
[Unit]
Description=smartctl exporter service
After=network-online.target

[Service]
Type=simple
PIDFile=/run/smartctl_exporter.pid
ExecStart=/usr/bin/smartctl_exporter
User=root
Group=root
SyslogIdentifier=smartctl_exporter
Restart=on-failure
RemainAfterExit=no
RestartSec=100ms
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Then enable it:

```bash
sudo systemctl enable smartctl-exporter
sudo service smartctl-exporter start
```

**[Using docker](https://github.com/prometheus-community/smartctl_exporter?tab=readme-ov-file#example-of-running-in-docker)**

```yaml
---
services:
  smartctl-exporter:
    container_name: smartctl-exporter
    image: prometheuscommunity/smartctl-exporter
    privileged: true
    user: root
    ports:
      - "9633:9633"
```

**Configuring prometheus**

Add the next scraping configuration:

```yaml
- job_name: smartctl_exporter
  metrics_path: /metrics
  scrape_timeout: 60s
  static_configs:
    - targets: [smartctl-exporter:9633]
      labels:
        hostname: "your-hostname"
```

**Configuring the alerts**

Taking as a reference the [awesome prometheus rules](https://samber.github.io/awesome-prometheus-alerts/rules#s.m.a.r.t-device-monitoring) and [this wired post](https://www.wirewd.com/hacks/blog/monitoring_a_mixed_fleet_of_flash_hdd_and_nvme_devices_with_node_exporter_and_prometheus) I'm using the next rules:

```yaml
---
groups:
  - name: smartctl exporter
    rules:
      - alert: SmartDeviceTemperatureWarning
        expr: smartctl_device_temperature > 60
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Smart device temperature warning (instance {{ $labels.hostname }})
          description: "Device temperature warning (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: SmartDeviceTemperatureCritical
        expr: smartctl_device_temperature > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Smart device temperature critical (instance {{ $labels.hostname }})
          description: "Device temperature critical (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: SmartCriticalWarning
        expr: smartctl_device_critical_warning > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart critical warning (instance {{ $labels.hostname }})
          description: "device has critical warning (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: SmartNvmeWearoutIndicator
        expr: smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart NVME Wearout Indicator (instance {{ $labels.hostname }})
          description: "NVMe device is wearing out (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: SmartNvmeMediaError
        expr: smartctl_device_media_errors > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Smart NVME Media errors (instance {{ $labels.hostname }})
          description: "Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: SmartSmartStatusError
        expr: smartctl_device_smart_status < 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart general status error (instance {{ $labels.hostname }})
          description: " (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: DiskReallocatedSectorsIncreased
        expr: smartctl_device_attribute{attribute_id="5", attribute_value_type="raw"} > max_over_time(smartctl_device_attribute{attribute_id="5", attribute_value_type="raw"}[1h])
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Reallocated Sectors Count Increased"
          description: "The SMART attribute 5 (Reallocated Sectors Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: DiskSpinRetryCountIncreased
        expr: smartctl_device_attribute{attribute_id="10", attribute_value_type="raw"} > max_over_time(smartctl_device_attribute{attribute_id="10", attribute_value_type="raw"}[1h])
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Spin Retry Count Increased"
          description: "The SMART attribute 10 (Spin Retry Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: DiskCurrentPendingSectorCountIncreased
        expr: smartctl_device_attribute{attribute_id="197", attribute_value_type="raw"} > max_over_time(smartctl_device_attribute{attribute_id="197", attribute_value_type="raw"}[1h])
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Current Pending Sector Count Increased"
          description: "The SMART attribute 197 (Current Pending Sector Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: DiskUncorrectableSectorCountIncreased
        expr: smartctl_device_attribute{attribute_id="198", attribute_value_type="raw"} > max_over_time(smartctl_device_attribute{attribute_id="198", attribute_value_type="raw"}[1h])
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Uncorrectable Sector Count Increased"
          description: "The SMART attribute 198 (Uncorrectable Sector Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```

**Configuring the grafana dashboards**

Of the different grafana dashboards ([1](https://grafana.com/grafana/dashboards/22604-smartctl-exporter-dashboard/), [2](https://grafana.com/grafana/dashboards/20204-smart-hdd/), [3](https://grafana.com/grafana/dashboards/22381-smartctl-exporter/)) I went for the first one. Import it with the grafana UI, make it work, and then export the json to store it in your infrastructure as code repository.
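If you prefer to pull the dashboard JSON from the command line instead of using the UI export, you can query the grafana HTTP API. A sketch: the grafana URL, the service account token and the dashboard uid are placeholders you have to adapt:

```bash
# Export a dashboard by its uid so it can be committed to the infra-as-code repository
GRAFANA_URL=https://grafana.example.org
TOKEN=changeme  # service account token with read access to dashboards
DASHBOARD_UID=your-dashboard-uid

curl -sH "Authorization: Bearer $TOKEN" \
  "$GRAFANA_URL/api/dashboards/uid/$DASHBOARD_UID" \
  | jq '.dashboard' > smartctl-dashboard.json
```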
**References**

- [Wikipedia](https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology)
- [Home](https://sourceforge.net/projects/smartmontools/)
- [Documentation](https://www.smartmontools.org/wiki/TocDoc)

feat(year_reviews#2025): Little things of 2025

**Fascism**

At trump's inauguration ceremony, elon musk made the nazi salute.



**Feminism**

- [As soon as he takes office, trump reinstates birth gender for trans people and threatens to end diversity, inclusion and equality programs](https://www.usnews.com/news/business/articles/2025-01-20/trump-orders-reflect-his-promises-to-roll-back-transgender-protections-and-end-dei-programs)

feat(zfs_storage_planning#Thoughts on adding new disks to ZFS): Thoughts on adding new disks to ZFS

When it comes to expanding an existing ZFS storage system, careful consideration is crucial. In my case, I faced a decision point with my storage cluster: after two years of reliable service from my 8TB drives, I needed more capacity. This led me to investigate the best way to integrate newly acquired refurbished 12TB drives into the system. Here's my journey through this decision-making process and the insights gained along the way.

**The Starting Point**

My existing setup consisted of 8TB drives purchased new, which had been running smoothly for two years. The need for expansion led me to consider refurbished 12TB drives as a cost-effective solution. However, mixing new and refurbished drives, especially of different capacities, raised several important considerations that needed careful analysis.

**Initial Drive Assessment**

The first step was to evaluate the reliability of all drives. Using `smartctl`, I analyzed the SMART data across both the existing and new drives:

```bash
for disk in a b c d e f g h i; do
  echo "/dev/sd$disk: old $(smartctl -a /dev/sd$disk | grep Old | wc -l) pre-fail: $(smartctl -a /dev/sd$disk | grep Pre- | wc -l)"
done
```

The results showed similar values across all drives, with "Old_Age" attributes ranging from 14-17 and "Pre-fail" attributes between 3-6. While this indicated all drives were aging, they were still functioning within acceptable parameters. However, raw SMART data doesn't tell the whole story, especially when comparing new versus refurbished drives.

**Drive Reliability Considerations**

After careful evaluation, I found myself trusting the existing 8TB drives more than the newer refurbished 12TB ones. This conclusion was based on several factors:

- The 8TB drives had a proven track record in my specific environment
- Their smaller size meant faster resilver times, reducing the window of vulnerability during recovery
- One of the refurbished 12TB drives was already showing concerning symptoms (8 reallocated sectors, although a badblocks run didn't increase that number), which reduced confidence in the entire batch
- The existing drives were purchased new, while the 12TB drives were refurbished, adding an extra layer of uncertainty

**Layout Options Analysis**

When expanding a ZFS system, there's always the temptation to simply add more vdevs to the existing pool. However, I investigated two main approaches:

1. Creating a new separate ZFS pool with the new disks
2. Adding another vdev to the existing pool

**Resilver time**

Adding the 12TB drives to the pool and redistributing the data across all 8 drives will help reduce the resilver time. Here's a detailed breakdown:
1. **Current Situation**
   - 4x 8TB drives at 95% capacity means each drive is heavily packed
   - High data density means longer resilver times
   - Limited free space for data movement and reconstruction
2. **After Adding 12TB Drives**
   - Total pool capacity increases significantly
   - ZFS will automatically start rebalancing data across all 8 drives
   - This process (sometimes called "data shuffling" or "data redistribution") has several benefits:
     - Reduces data density per drive
     - Creates more free space
     - Improves overall pool performance
     - Potentially reduces future resilver times
3. **Resilver Time Reduction Mechanism**
   - With data spread across more drives, each individual drive has less data to resilver
   - Less data per drive = faster resilver process
   - The redistribution happens gradually and in the background

**Understanding Failure Scenarios**

The key differentiator between these approaches came down to failure scenarios:

**Single Drive Failure**

Both configurations handle single drive failures similarly, though the 12TB drives' longer resilver time creates a longer window of vulnerability in the two-vdev configuration if the data load is evenly shared between the disks. This is particularly concerning with refurbished drives, where the failure probability might be higher.

However, if you rebalance the data inside zfs as soon as you add the other vdev to the pool, the 8TB drives will be less full, so until more data is added their resilver time may be reduced because they hold less data.

**Double Drive Failure**

This is where the configurations differ significantly:

- In a two-vdev pool, losing two drives from the same vdev would cause complete pool failure
- With separate pools, a double drive failure would only affect one pool, allowing the other to continue operating. This way you can store the critical data on the pool you trust more.
- Given the mixed drive origins (new vs refurbished), isolating potential failures becomes more critical

**Performance Considerations**

While investigating performance implications, I found several interesting points about IOPS and throughput:

- ZFS stripes data across vdevs, meaning more vdevs generally means better IOPS
- In RAIDZ configurations, IOPS are limited by the slowest drive in the vdev
- Multiple mirrored vdevs provide the best combined IOPS performance
- Streaming speeds scale with the number of data disks in a RAIDZ vdev
- When mixing drive sizes, ZFS tends to favor larger vdevs, which could lead to uneven wear

**Easiness of configuration**

**Cache and log**

If you already have a zpool with the cache and logs on nvme, then if you were to use two pools you'd need to reformat your nvme drives to create space for the new partitions needed for the new zpool. This would allow you to specify different cache sizes for each pool, but it comes at the cost of a more complex operation.

**New pool creation**

Adding a vdev to an existing pool is quicker and easier than creating a new zpool. You need to make sure that you initialise it with the correct configuration.

**Storage management**

Having two pools doubles the operation tasks. One of the pools will be filled soon, so you may need to manually move files and directories around to rebalance it.

**Final Decision**

After weighing all factors, if you favour reliability over ease of operation, implement two separate ZFS pools. This statement is primarily driven by:

1. **Enhanced Reliability**: By separating the pools, we can maintain service availability even if one pool fails completely
2. **Data Prioritization**: This allows placing critical application data on the more reliable pool (8TB drives), while using the refurbished drives for less critical data like media files
3. **Risk Isolation**: Keeping the proven, newly purchased drives separate from the refurbished ones minimizes the impact of potential issues with the refurbished drives
4. **Consistent Performance**: Following the best practice of keeping same-sized drives together in pools

However, I'm currently favouring ease of operation and trusting my backup solution (I hope not to read this line in the future with regret :P), so I'll go with two vdevs.

**Key Takeaways**

Through this investigation, I learned several important lessons about ZFS storage design:

1. Raw parity drive count isn't the only reliability metric - configuration matters more than simple redundancy numbers
2. Pool layout significantly impacts both performance and failure scenarios
3. Sometimes simpler configurations (like separate pools) can provide better overall reliability than more complex ones
4. Consider the full lifecycle of the storage, including maintenance operations like resilver times
5. When expanding storage, don't underestimate the value of isolating different generations or sources of hardware
6. The history and source of drives (new vs refurbished) should influence your pool design decisions

This investigation reinforced that storage design isn't just about maximizing space or performance - it's about finding the right balance of reliability, performance, and manageability for your specific needs. When dealing with mixed drive sources and different capacities, this balance becomes even more critical.

**References and further reading**

- [Truenas post](https://www.truenas.com/blog/zfs-pool-performance-2/)
- [Freebsd post](https://forums.freebsd.org/threads/when-does-it-make-more-sense-to-use-multiple-vdevs-in-a-zfs-pool.83586/)
- [Klarasystems post](https://klarasystems.com/articles/choosing-the-right-zfs-pool-layout/)

diff --git a/mkdocs.yml b/mkdocs.yml index 74292dd717..596f1506e8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -103,6 +103,8 @@ nav: - Email clients: - himalaya: himalaya.md - alot: alot.md + - k9: k9.md + - Email protocols: - Maildir: maildir.md - Instant Messages Management: @@ -371,6 +373,7 @@ nav: - File management configuration: - NeoTree: neotree.md - Telescope: telescope.md + - fzf.nvim: fzf_nvim.md - Editing specific configuration: - vim_editor_plugins.md - Vim formatters: vim_formatters.md @@ -566,7 +569,10 @@ nav: - OpenZFS storage planning: zfs_storage_planning.md - Sanoid: sanoid.md - ZFS Prometheus exporter: zfs_exporter.md - - Hard drive health: hard_drive_health.md + - Hard drive health: + - hard_drive_health.md + - Smartctl: smartctl.md + - badblocks: badblocks.md - Resilience: - linux_resilience.md - Memtest: memtest.md @@ -768,7 +774,8 @@ nav: # - Streaming channels: streaming_channels.md - Music: - Sister Rosetta Tharpe: sister_rosetta_tharpe.md - - Video Gaming: + - Videogames: + - DragonSweeper: dragonsweeper.md - King Arthur Gold: kag.md - The Battle for Wesnoth: - The Battle for Wesnoth: wesnoth.md