Hi everyone!
Apologies if this has been brought up before.
Here's my scenario: this morning, the PVE web interface on the node went down, along with several VMs. The node itself is beefy – plenty of RAM, all SSDs, multiple ZFS pools, and even the root filesystem on ZFS. When I SSH'd in, I found zero free space on the Proxmox root filesystem and a hung backup process. Tried to kill it… no luck. Even a reboot from the shell wouldn't go through. I got lucky and managed to free up a few GB on the rpool, after which I power-cycled the node. It came back up, and everything started working again.
Digging into the backup logs, I figured out the reason. Overnight, a backup job kicked off targeting PBS with 'snapshot' mode. Since everything was on ZFS, it had been working flawlessly. However, I recently added an NFS storage, moved some mountpoints of an LXC container there, and didn't double-check the backup settings.
During the night, the backup client attempted to snapshot this LXC. That's obviously not an option on NFS, so it automatically fell back to 'suspend' mode: it froze the container and started copying its volumes from the NFS share into a local temporary folder, so the container could be resumed sooner. The default temp folder is /var/tmp. You can guess the rest – it filled up the entire root partition and crashed. PVE limped along for a few hours before starting to crumble.
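For anyone hitting the same thing: the temp directory vzdump uses is configurable node-wide in /etc/vzdump.conf (or per job). A minimal sketch – the dataset path below is just my example, point it at any location with enough space off the root filesystem:

```
# /etc/vzdump.conf -- node-wide vzdump defaults
# Redirect suspend-mode temp files away from the root filesystem.
# The path below is an example; use any dataset/mount with enough free space.
tmpdir: /rpool/vzdump-tmp
```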
Sure, that's on me for not thoroughly studying the docs on how the backup client works. But I was genuinely surprised that a backup job with near-default settings could so easily take down a PVE node.
P.S. I'm relatively new to PVE and PBS; before this, I spent a long time working with infrastructure on the Microsoft stack. I really like PVE – I'm impressed that an open-source product can be this mature and functional. I registered on the forum and wrote this post hoping it might come in handy for someone. Also, maybe in future versions the backup client could check free space in the temp folder before writing to it.
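Until something like that exists upstream, you can approximate the check yourself, e.g. from a vzdump hook script (the `script:` option in /etc/vzdump.conf). A minimal sketch – the path and the 20 GiB threshold are my own placeholders, not anything official:

```shell
#!/bin/sh
# Hypothetical pre-backup guard (not a built-in PVE feature): refuse to start
# a backup if the temp directory is low on free space. The threshold and the
# directory below are assumptions; adjust both to your setup.

check_free_gb() {
    dir=$1      # directory to check
    min_gb=$2   # minimum free space required, in GiB
    # df -P prints available space in 1K blocks (POSIX output format)
    avail_kb=$(df -P "$dir" | awk 'NR==2 {print $4}')
    avail_gb=$((avail_kb / 1024 / 1024))
    if [ "$avail_gb" -lt "$min_gb" ]; then
        echo "ERROR: only ${avail_gb} GiB free in ${dir} (need ${min_gb} GiB)" >&2
        return 1
    fi
}

# Example: abort the job unless /var/tmp has at least 20 GiB free
# check_free_gb /var/tmp 20 || exit 1
```

Wired into the job-start phase of a hook script, this would fail the backup up front instead of letting it silently fill the root partition overnight.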