Unresponsive server due to root disk full (ZFS)

Oct 7, 2019
Hello,

I had a problem yesterday with a cluster node whose ZFS root disk filled up with a bunch of snapshots, and the rpool ended up 100% full. It was monitored, but it happened so fast we couldn't react in time.

  • The VMs residing on that storage became unresponsive. Other VMs using Ceph storage kept working correctly.
  • I could SSH into the server and also reach the web interface, either directly on that node or through another node in the cluster.
  • The information on the status pages was updating properly, except for the VMs on the ZFS storage. I could not list the contents of any storage.
  • I could not issue any command to any VM on that server: start/stop, migrate, remove snapshots, etc.
Luckily there were a bunch of ISO files I could remove, which got me about 8GB back. I expected the server would (slowly?) start to recover, but after a couple of hours nothing had changed (the stuck VMs are not critical). I then shut down the working VMs from within their OS and moved them manually to another node. After waiting another hour, the node was still malfunctioning. I tried restarting the PVE services, killing processes, and even an orderly reboot. Nothing helped. In the end I had to reset the server to get it back.
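
For anyone hitting the same situation, something along these lines (a sketch; rpool and the dataset layout are the PVE defaults, yours may differ) shows what is actually eating the pool:

    # Space usage per dataset, largest first
    zfs list -r -o name,used,avail -S used rpool

    # Snapshots sorted by the space they hold
    zfs list -r -t snapshot -o name,used -S used rpool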

Is all this expected?

I've only had a similar issue once, but at that time the storage was LVM and it was a single node. That time, simply freeing a couple of GB and waiting a few minutes got the node and the VMs working again.

Thanks!
 
I had a problem yesterday with a cluster node whose ZFS root disk filled up with a bunch of snapshots, and the rpool ended up 100% full. It was monitored, but it happened so fast we couldn't react in time.
A ZFS pool shouldn't be filled beyond 80% anyway, so for the future it won't hurt to set a pool-wide quota of 90%. That way, no matter what happens, the pool will still have 10% of free space, and you can temporarily raise the quota to get the pool working again while you free things up.
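
For example (a sketch; the sizes are made up, adjust them to your pool):

    # Check the pool size first
    zpool list -o name,size,alloc,free rpool

    # Roughly 90% of a 500G pool, set as a quota on the pool's root dataset
    zfs set quota=450G rpool

    # If the pool ever fills up to the quota, raise it temporarily...
    zfs set quota=470G rpool
    # ...free up data, then put it back
    zfs set quota=450G rpool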

And don't forget an fstrim -a and a zpool trim rpool to actually release the space from those deleted ISOs.
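
That is, assuming a ZFS recent enough to support TRIM (0.8 or later):

    # Trim mounted filesystems that support it (fstrim skips those that don't)
    fstrim -a

    # Hand the freed blocks on the pool back to the underlying disks
    zpool trim rpool

    # Watch the trim progress per device
    zpool status -t rpool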
 
I know... In fact I'm really paranoid about having enough free disk space, but this time a perfect storm happened and nearly 300GB got created in less than 10 minutes.
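
For what it's worth, monitoring here can't be much more than a periodic capacity check along these lines (a sketch; the threshold and mail address are examples), and even that can't keep up with 300GB in 10 minutes:

    #!/bin/sh
    # Cron sketch: alert when the pool passes a capacity threshold
    CAP=$(zpool list -H -o capacity rpool | tr -d '%')
    if [ "$CAP" -gt 80 ]; then
        echo "rpool at ${CAP}% on $(hostname)" | mail -s "rpool filling up" root@example.com
    fi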

Anyway, the problem is not that the disk got filled up, as I know exactly what happened, but the fact that Proxmox stayed unresponsive even after free space was created.

fstrim is enabled everywhere and works as expected during normal operation. When I removed the ISO files, the freed space was reported by both zfs list and zpool list -v, and I could create files (I used dd to create a couple of 100MB files). Maybe this is a corner case where pending I/Os never recover... if that's even possible.
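
Concretely, the checks looked like this (the dd target is just an example path):

    # Free space showed up at both the dataset and the pool level
    zfs list -o name,used,avail rpool
    zpool list -v rpool

    # And plain writes succeeded again
    dd if=/dev/zero of=/root/testfile bs=1M count=100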
 
