My k3s cluster dies when PBS runs backups

Oct 26, 2025
I am a newbie.

I have a toy k3s cluster running on 3 Proxmox hosts (Minisforum MS-A2, so fairly beefy for a home lab). Each node has 2 Samsung 990 Pro SSDs. The issue I am having is that whenever PBS runs its backups, k3s experiences all sorts of issues.

The issues I get are:
  • k3s logs show context deadline exceeded / api timeout around backup windows
  • etcd reports “apply entries took too long” and occasional leader re-elections
  • k3s API server briefly becomes unresponsive
  • Longhorn volumes enter Degraded state and sometimes rebuild replicas
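
For reference, this is roughly how I'm spotting the issues, assuming nothing unusual about my setup (k3s with embedded etcd and its default cert paths):

```
# Grep the k3s journal for the etcd stall messages around a backup window
journalctl -u k3s --since "-2h" | grep -Ei "took too long|deadline exceeded"

# Watch etcd's WAL fsync latency directly (k3s embedded etcd, default cert paths)
curl -sk \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key  /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration
```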

I just feel that, given the spec of my hardware and the very low load of what is running on it, PBS backups shouldn't cause these pod restarts and k3s node restarts. I think there must be something fundamentally wrong somewhere.

The backups are configured as snapshots and the backup destination is a NAS drive.
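
Concretely, the job is something along these lines (the VMIDs and storage name stand in for mine):

```
# Snapshot-mode backup of the three k3s VMs to the NAS-backed PBS datastore
vzdump 101 102 103 --mode snapshot --storage pbs-nas
```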
 
Hi,

the issues indicate that your control plane is stalling during the backup.
Etcd is very sensitive to disk write latency, which might be an issue in your setup.
But maybe there are further configuration incompatibilities.
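
One quick way to check is etcd's usual fio test inside a VM, run against the filesystem that backs the etcd data directory (the often-cited target is a 99th-percentile fdatasync latency below roughly 10 ms):

```
# etcd-style fsync latency check: small sequential writes, fdatasync after each.
# Adjust --directory to wherever your etcd data actually lives.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s/server/db \
    --size=22m --bs=2300 --name=etcd-fsync-test
```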

Do you use Backup Fleecing? (https://pve.proxmox.com/wiki/Backup_and_Restore#_vm_backup_fleecing)
This could minimize the impact.
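
A rough sketch of enabling it per backup job (the storage names are just examples):

```
# Write the fleecing image to fast local storage, so slow backup storage
# doesn't stall the VM's own writes while the backup runs
vzdump 101 --mode snapshot --storage pbs-nas --fleecing enabled=1,storage=local-lvm
```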

BR, Lucas
Thanks for replying. I’ll look into the fleecing option.

I guess I just feel that if I'm running into these issues, everyone who runs k3s and uses PBS would be too. I am on decent hardware with an empty cluster. If it were just a case of tweaking knobs to get it to work, I'd expect the problem to be more commonly reported.
 
Hi,
well, that might depend a bit on the architecture. With virtualization there are several points (more than in a bare-metal setup) that can introduce latency, depending on the storage used, storage classes, etc., and then beefy hardware might not appear so beefy inside the VM (e.g. in how the VM actually accesses those beefy hardware resources).

K8s is actually made to handle HA itself, so it should be treated that way, I guess. *1
From my perspective that means backing up the configuration files and the persistent storage,
either by trying the PBS backup client or by adding individual disks as mounts in the directory tree.
This supports the approach that K8s should be used declaratively, and it would circumvent the current issues with the control plane.
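
A file-level run with the PBS backup client could look roughly like this (the repository string and paths are just examples):

```
# File-level backup from inside a k3s VM straight to a PBS datastore
export PBS_REPOSITORY='backup@pbs@192.168.1.10:datastore1'
proxmox-backup-client backup \
  k3s-config.pxar:/etc/rancher \
  k3s-state.pxar:/var/lib/rancher/k3s/server
```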

But I guess handling the issue via backup fleecing is much easier to configure, so you can try that first. :)

BR, Lucas


1: For the same reason it might be helpful to disable HA handling via the PVE cluster, or at least configure anti-affinity for the VMs hosting k3s.
 
Appreciate the responses. I set up backup fleecing and the same issues persisted.

I completely understand that k3s is designed to handle HA itself, and I do back up the k3s snapshots etc., but I would really like to just whack a VM-level backup in place too. It seems like lots of people do it and don't get the issues I am getting (or maybe they don't realise... a lot of the issues are relatively silent as k3s self-heals).

GPT is suggesting that I disable the QEMU agent on the VMs in order to stop it from requesting an fsfreeze... I don't like blindly following GPT though.
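
Before doing anything like that, I'm trying to at least confirm the freeze is the trigger by lining up timestamps (assuming the guest agent logs its freeze/thaw calls to the journal):

```
# Inside a k3s VM: when did the agent freeze/thaw the filesystems?
journalctl -u qemu-guest-agent --since "-2h"

# ...and did k3s/etcd complain in the same window?
journalctl -u k3s --since "-2h" | grep -Ei "took too long|deadline exceeded"
```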
 
Did you already try backup to a local storage just to exclude the network as potential root cause?
The ChatGPT proposal is probably (as often) ill-advised. My understanding (I'm happy to be corrected) is that fsfreeze is used to ensure the consistency of the backup. If this is true, disabling it would be a bad idea.
 
So I just tried a backup to local storage and I saw no timeouts in the k3s journalctl logs.

The hosts are connected to the NAS over a 2.5GbE link. I am upgrading to 10GbE tomorrow, but I assumed 2.5GbE would be sufficient for my low workload.
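
To sanity-check the link itself, I'll also run a quick iperf3 test between a host and the NAS (hostname is a placeholder):

```
# On the NAS (or a machine next to it):
iperf3 -s

# From a PVE host:
iperf3 -c nas.lan -t 30
```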
 
In general, network shares are not recommended as PBS datastores.
Can you host VMs on your NAS? Then a PBS VM on it will probably work better. Especially if you combine this with a PBS + local storage on your PVE host:

- Set up a PBS with a local datastore on your PVE host, and set a prune policy to keep the backups for only a minimal time (since it's just a first step before the final backup)
- Set up a PBS on the NAS and create a pull-sync job to sync the backups to it (sketch below)
- If your NAS doesn't support VMs, use one local datastore as the backup target, then sync the backups to the network share afterwards
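
The pull-sync part of the second step looks roughly like this on the NAS-side PBS (all names and the address are placeholders):

```
# Register the PVE-host PBS as a remote (self-signed certs usually need --fingerprint too)
proxmox-backup-manager remote create pve-pbs \
  --host 192.168.1.20 --auth-id sync@pbs --password 'secret'

# Pull its fast local datastore into the NAS datastore on a schedule
proxmox-backup-manager sync-job create pull-from-pve \
  --remote pve-pbs --remote-store fast-local \
  --store nas-store --schedule hourly
```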
 