Minor Stability

My home server has been running steadily for the past 10 days without any issues … is what I’d say if replacing the RAM with ECC modules had fixed 100% of the problems I was encountering. The majority of the problems were indeed fixed, but new ones surfaced now that I no longer had to worry about services randomly crashing and preventing access to the web UI.

I started noticing startlingly high load averages on the hardware. I’m used to seeing load averages well under the number of logical cores in the system, but I was constantly seeing values over 34 for the 1, 5, and 15-minute averages, which is not something I want to see on a machine with 12 physical/24 logical cores. It didn’t take long after the system booted for these values to show up, either. Proxmox’s host-level Summary page showed a pretty steady CPU utilization of 15% with a constant 6-7% IO delay, which correlates with the OS’s iowait metric. This was all a bit unusual.
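
For anyone curious how I was watching these numbers, it was nothing fancier than the following (the load figures come straight from /proc/loadavg; iostat needs the sysstat package, which may or may not already be on a Proxmox host):

    # 1, 5, and 15-minute load averages plus running/total tasks
    cat /proc/loadavg

    # extended per-device stats every 5 seconds; %iowait appears in the
    # CPU summary line (requires the sysstat package)
    iostat -x 5

    # or just grab the summary lines from top, which break out wa (iowait)
    top -b -n 1 | head -n 5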

First port of call was good ol’ htop, which revealed that udisks2 was eating up a lot of CPU. I hadn’t installed it explicitly and hadn’t seen it on other Proxmox machines before, and after a little digging it turned out that it’s normally installed alongside cockpit. Cockpit is a package I tend to install on most machines because it looks pretty, but I hadn’t installed it on this host, so I had to dig a bit further.
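
If you ever need to work out why a package like udisks2 ended up on a box, something along these lines will usually answer it (apt-cache ships with Debian/Proxmox; aptitude is only there if you’ve installed it separately):

    # confirm what htop was complaining about, sorted by CPU usage
    ps -eo pid,comm,%cpu --sort=-%cpu | head

    # list the installed packages that depend on udisks2
    apt-cache rdepends --installed udisks2

    # if aptitude is available, it explains the dependency chain directly
    aptitude why udisks2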

While researching iowait issues I stumbled upon a forum post saying that high iowait can be a symptom of problems with the underlying hardware. I hadn’t seen any hardware-related alerts, and the SMART values looked healthy for both the SSDs and the spinning rust, but the output of zpool status -v revealed that the ZFS pool did indeed have one single, solitary error, and was already performing a scrub operation to clear it. The file containing the error belonged to an experimental VM that I didn’t need anymore, so I just deleted the VM and restarted the scrub. That cleared the error but had no effect on the high system load.
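
For reference, the ZFS side of that investigation boiled down to a handful of commands (the pool name rpool below is a stand-in for whatever your pool is actually called):

    # list any data errors and the files they map to
    zpool status -v rpool

    # kick off (or restart) a scrub after removing the offending file/VM
    zpool scrub rpool

    # clear the error counters once the scrub comes back clean
    zpool clear rpool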

It was at that point that I connected a couple of dots. I was running several LXC containers for things like Pi-Hole, Squid, Homebridge, Gitea, Ansible, and a Minecraft server, and all of these kept their files on the host’s file system. That is, unlike VMs, LXCs don’t have their own dedicated virtual hard disks carved out as one large VMDK (or QCOW2 or VHDX) file with all of the smaller files inside. So, if an unprivileged container was having trouble accessing files due to a permissions mismatch between the container and the host, that could also show up as high iowait. To test this theory I disabled autostart on all of the containers and rebooted the host. The VMs came up and ran happily, with host load averages peaking at 4. The Pi-Hole container was the first one I started back up, and it had no observable impact on the load; the squid and gitea containers, however, made the hardware extremely unhappy, driving load averages slightly higher than before, to over 36.
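
Disabling autostart was just a quick loop over pct on the host (the container IDs below are placeholders for my own):

    # turn off "start at boot" for a few container IDs, then stop them
    for ctid in 101 102 103; do
        pct set "$ctid" --onboot 0
        pct stop "$ctid"
    done

    # after the reboot, bring containers back one at a time and watch the load
    pct start 101
    uptime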

I quickly spun up a Debian VM for squid, copied the config from the container, and … no observable system impact. Built an Ubuntu VM for gitea (one-line installer via snap) and, again, no observable system impact. At that point I figured “well, if the VMs run with this little impact on overall system performance, I’ll just move all of my LXC roles into separate VMs,” and that’s exactly what I did.
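
The migration itself was similarly low-effort; roughly the following, with the container ID and paths as placeholders (pct pull copies a file out of a container onto the Proxmox host, and the snap one-liner assumes the gitea snap package):

    # on the Proxmox host: pull the squid config out of the old container
    pct pull 105 /etc/squid/squid.conf ./squid.conf

    # on the new Debian VM: install squid and drop the config in place
    apt install -y squid
    cp squid.conf /etc/squid/squid.conf && systemctl restart squid

    # on the new Ubuntu VM: Gitea via snap
    snap install gitea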

That brings us to a new count of 11 VMs and 0 LXC containers:

  1. Pi-Hole
  2. Squid
  3. Plex
  4. VM for Docker
  5. Tailscale endpoint
  6. Desktop Environment Playground VM
  7. Certbot
  8. Homebridge
  9. Gitea
  10. Ansible
  11. Minecraft

…and, with these VMs running for the past 6 hours or so, my load averages are currently sitting at 0.36, 0.27, and 0.35, with iowait values of less than 0.1%.

So, the past week was a bit of a mixed bag when it came to the overall stability of the system, which means it’s time to observe for a bit longer. It also raises the question: if I’m only running VMs on this box, why am I running Proxmox? Could I get away with running Rocky Linux 9 on the bare metal, with the VMs running under a standard KVM install? Or should I get fancy with something like TrueNAS Scale, effectively rendering my small Synology NAS redundant? For now I’ll stick with what I know, but there could be some interesting plans for the future.
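
If I ever do go the plain-KVM-on-Rocky route, the starting point would look something like this, purely as a sketch (the package names come from the stock virtualization stack; the VM name, ISO path, sizing, and os-variant are made up for illustration):

    # install the KVM/libvirt stack on Rocky Linux 9
    dnf install -y qemu-kvm libvirt virt-install
    systemctl enable --now libvirtd

    # carve out a small VM roughly equivalent to one of my service VMs
    virt-install \
        --name pihole \
        --memory 2048 \
        --vcpus 2 \
        --disk size=20 \
        --cdrom /var/lib/libvirt/images/debian-12.iso \
        --os-variant debian12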