My k3s cluster dies when PBS runs backups

Oct 26, 2025
I am a newbie.

I have a toy k3s cluster running on 3 Proxmox hosts (Minisforum MS-A2, so fairly beefy for a home lab). Each node has 2 Samsung 990 Pro SSDs. The issue I am having is that whenever PBS runs its backups, k3s experiences all sorts of issues.

The issues I get are:
  • k3s logs show context deadline exceeded / api timeout around backup windows
  • etcd reports “apply entries took too long” and occasional leader re-elections
  • k3s API server briefly becomes unresponsive
  • Longhorn volumes enter Degraded state and sometimes rebuild replicas
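
For reference, this is roughly how I'm spotting the issues, assuming nothing unusual about my setup (k3s with embedded etcd and its default cert paths):

```
# Grep the k3s journal for the etcd stall messages around a backup window
journalctl -u k3s --since "-2h" | grep -Ei "took too long|deadline exceeded"

# Watch etcd's WAL fsync latency directly (k3s embedded etcd, default cert paths)
curl -sk \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key  /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration
```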

I just feel that, given the spec of my hardware and the very low load of what is running on it, PBS backups shouldn't cause these pod restarts and k3s node restarts. I think there must be something fundamentally wrong somewhere.

The backups are configured as snapshots and the backup destination is a NAS drive.
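
Concretely, the job is something along these lines (the VMIDs and storage name stand in for mine):

```
# Snapshot-mode backup of the three k3s VMs to the NAS-backed PBS datastore
vzdump 101 102 103 --mode snapshot --storage pbs-nas
```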
 
Hi,

the issues indicate that your control plane is stalling during the backup.
Etcd is very sensitive to disk write latency, which might be an issue in your setup.
But maybe there are further configuration incompatibilities.
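
One quick way to check is etcd's usual fio test inside a VM, run against the filesystem that backs the etcd data directory (the often-cited target is a 99th-percentile fdatasync latency below roughly 10 ms):

```
# etcd-style fsync latency check: small sequential writes, fdatasync after each.
# Adjust --directory to wherever your etcd data actually lives.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s/server/db \
    --size=22m --bs=2300 --name=etcd-fsync-test
```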

Do you use Backup Fleecing? (https://pve.proxmox.com/wiki/Backup_and_Restore#_vm_backup_fleecing)
This could minimize the impact.
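
A rough sketch of enabling it per backup job (the storage names are just examples):

```
# Write the fleecing image to fast local storage, so slow backup storage
# doesn't stall the VM's own writes while the backup runs
vzdump 101 --mode snapshot --storage pbs-nas --fleecing enabled=1,storage=local-lvm
```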

BR, Lucas
Thanks for replying. I’ll look into the fleecing option.

I guess I just feel that if I'm running into these issues, everyone who runs k3s and uses PBS would be too. I am on decent hardware with an empty cluster. If it were just a case of tweaking knobs to get it to work, I'd expect the problem to be more commonly reported.
 
Hi,
well, that might depend a bit on the architecture. With virtualization there are several points (more than in a bare-metal setup) that can introduce latency, depending on the storage used, storage classes, etc., and then beefy hardware might not appear so beefy inside the VM (e.g. in how the VM actually accesses those beefy hardware resources).

K8s is actually made to handle HA itself, so it should be treated that way, I guess. *1
From my perspective that means backing up the configuration files and the persistent storage,
either by trying the PBS backup client or by adding individual disks as mounts in the directory tree.
This supports the approach that K8s should be used declaratively, and it would circumvent the current issues with the control plane.
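
A file-level run with the PBS backup client could look roughly like this (the repository string and paths are just examples):

```
# File-level backup from inside a k3s VM straight to a PBS datastore
export PBS_REPOSITORY='backup@pbs@192.168.1.10:datastore1'
proxmox-backup-client backup \
  k3s-config.pxar:/etc/rancher \
  k3s-state.pxar:/var/lib/rancher/k3s/server
```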

But I guess handling the issue via backup fleecing is much easier to configure, so you can try that first. :)

BR, Lucas


1: For the same reason it might be helpful to disable HA handling via the PVE cluster, or at least configure anti-affinity for the VMs hosting k3s.
 
Appreciate the responses. I set up backup fleecing and the same issues persisted.

I completely understand that k3s is designed to handle HA itself, and I do back up the k3s snapshots etc., but I would really like to just whack a VM-level backup in place too. It seems like lots of people do it and don't get the issues I am getting (or maybe they don't realise... a lot of the issues are relatively silent as k3s self-heals).

GPT is suggesting that I disable the QEMU agent on the VMs in order to stop it from requesting an fsfreeze... I don't like blindly following GPT though.
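
Before doing anything like that, I'm trying to at least confirm the freeze is the trigger by lining up timestamps (assuming the guest agent logs its freeze/thaw calls to the journal):

```
# Inside a k3s VM: when did the agent freeze/thaw the filesystems?
journalctl -u qemu-guest-agent --since "-2h"

# ...and did k3s/etcd complain in the same window?
journalctl -u k3s --since "-2h" | grep -Ei "took too long|deadline exceeded"
```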
 
Did you already try backup to a local storage just to exclude the network as potential root cause?
The ChatGPT proposal is probably (as often) ill-advised. My understanding (I'm happy to be corrected) is that fsfreeze is used to ensure the consistency of the backup. If this is true, disabling it would be a bad idea.
 
So I just tried a backup to local storage and I saw no timeouts in the k3s journalctl logs.

The hosts are connected to the NAS over a 2.5GbE link. I am upgrading to 10GbE tomorrow, but I assumed 2.5GbE would be sufficient for my low workload.
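
To sanity-check the link itself, I'll also run a quick iperf3 test between a host and the NAS (hostname is a placeholder):

```
# On the NAS (or a machine next to it):
iperf3 -s

# From a PVE host:
iperf3 -c nas.lan -t 30
```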
 
In general, network shares are not recommended as PBS datastores.
Can you host VMs on your NAS? Then a PBS VM on it will probably work better. Especially if you combine this with a PBS + local storage on your PVE host:

- Set up a PBS with a local datastore on your PVE host, and set a prune policy to keep the backups for only a minimal time (since it's just a first step before the final backup)
- Set up a PBS on the NAS and create a pull-sync job to sync the backups to it (sketch below)
- If your NAS doesn't support VMs, use one local datastore as the backup target, then sync the backups to the network share afterwards
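
The pull-sync part of the second step looks roughly like this on the NAS-side PBS (all names and the address are placeholders):

```
# Register the PVE-host PBS as a remote (self-signed certs usually need --fingerprint too)
proxmox-backup-manager remote create pve-pbs \
  --host 192.168.1.20 --auth-id sync@pbs --password 'secret'

# Pull its fast local datastore into the NAS datastore on a schedule
proxmox-backup-manager sync-job create pull-from-pve \
  --remote pve-pbs --remote-store fast-local \
  --store nas-store --schedule hourly
```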
 