Unresponsive server due to root disk full (ZFS)

Oct 7, 2019
Hello,

I had a problem yesterday with a cluster node whose ZFS root disk filled up with a bunch of snapshots, and the rpool ended up 100% full. It was monitored, but it happened so fast we couldn't react in time.

  • The VMs residing on that storage became unresponsive. Other VMs using Ceph storage kept working correctly.
  • I could SSH into the server and also reach the web interface, either directly on that node or through another node in the cluster.
  • The information on the status pages was updating properly, except for the VMs on the ZFS storage. I could not list the contents of any storage.
  • I could not issue any command to any VM on that server: start/stop, migrate, remove snapshots, etc.
Luckily there were a bunch of ISO files I could remove, which got me about 8GB back. I expected the server would (slowly?) start to recover, but after a couple of hours nothing had changed (the stuck VMs are not critical). I then shut down the working VMs from within their OS and moved them manually to another node. After waiting another hour, the node was still malfunctioning. I tried restarting the PVE services, killing processes, and even an orderly reboot. Nothing helped. In the end I had to reset the server to get it back.
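
For anyone hitting the same situation, something along these lines (a sketch; rpool and the dataset layout are the PVE defaults, yours may differ) shows what is actually eating the pool:

    # Space usage per dataset, largest first
    zfs list -r -o name,used,avail -S used rpool

    # Snapshots sorted by the space they hold
    zfs list -r -t snapshot -o name,used -S used rpool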

Is all this expected?

I've only had a similar issue once, but at that time the storage was LVM and it was a single node. That time, simply freeing a couple of GB and waiting a few minutes got the node and the VMs working again.

Thanks!
 
I had a problem yesterday with a cluster node whose ZFS root disk filled up with a bunch of snapshots, and the rpool ended up 100% full. It was monitored, but it happened so fast we couldn't react in time.
A ZFS pool shouldn't be filled beyond 80% anyway, so for the future it won't hurt to set a pool-wide quota of 90%. That way, no matter what happens, the pool will still have 10% of free space, and you can temporarily raise the quota to get the pool working again while you free things up.
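
For example (a sketch; the sizes are made up, adjust them to your pool):

    # Check the pool size first
    zpool list -o name,size,alloc,free rpool

    # Roughly 90% of a 500G pool, set as a quota on the pool's root dataset
    zfs set quota=450G rpool

    # If the pool ever fills up to the quota, raise it temporarily...
    zfs set quota=470G rpool
    # ...free up data, then put it back
    zfs set quota=450G rpool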

And don't forget an fstrim -a and a zpool trim rpool to actually release the space from those deleted ISOs.
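
That is, assuming a ZFS recent enough to support TRIM (0.8 or later):

    # Trim mounted filesystems that support it (fstrim skips those that don't)
    fstrim -a

    # Hand the freed blocks on the pool back to the underlying disks
    zpool trim rpool

    # Watch the trim progress per device
    zpool status -t rpool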
 
I know... In fact I'm really paranoid about having enough free disk space, but this time a perfect storm happened and nearly 300GB got created in less than 10 minutes.
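
For what it's worth, monitoring here can't be much more than a periodic capacity check along these lines (a sketch; the threshold and mail address are examples), and even that can't keep up with 300GB in 10 minutes:

    #!/bin/sh
    # Cron sketch: alert when the pool passes a capacity threshold
    CAP=$(zpool list -H -o capacity rpool | tr -d '%')
    if [ "$CAP" -gt 80 ]; then
        echo "rpool at ${CAP}% on $(hostname)" | mail -s "rpool filling up" root@example.com
    fi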

Anyway, the problem is not that the disk got filled up, as I know exactly what happened, but the fact that Proxmox stayed unresponsive even after free space was created.

fstrim is enabled everywhere and works as expected during normal operation. When I removed the ISO files, the freed space was reported by both zfs list and zpool list -v, and I could create files (I used dd to create a couple of 100MB files). Maybe this is a corner case where pending I/Os never recover... if that's even possible.
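
Concretely, the checks looked like this (the dd target is just an example path):

    # Free space showed up at both the dataset and the pool level
    zfs list -o name,used,avail rpool
    zpool list -v rpool

    # And plain writes succeeded again
    dd if=/dev/zero of=/root/testfile bs=1M count=100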
 
