ZFS full, can't remove anything to free up space, causing timeouts for PVE

brianowen111

New Member
Jan 12, 2023
Hi, I know this topic comes up a lot, but none of the solutions I've seen so far have worked for me, so I'm posting a new thread in the hope that someone can share some other insight.

We were already running into space issues, and then a bunch of VMs suddenly expanded a great deal and every VM on the ZFS pool stopped running. All other VMs stored in other places continued (and continue) to run perfectly fine, but everything related to PVE has stopped working. I can access the shell from another machine in the cluster, but all commands related to PVE time out when run.

I went into some of the pools and tried removing some VMs, but they still show up when I use zfs list. I can't destroy anything because it all says that it's still in use. There are no snapshots and no dumps, so I can't clear anything like that out to make some space.
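A minimal sketch of how the checks described above (what zfs list still shows, and whether any snapshots exist) might be scripted. The function names are hypothetical, and the pool name passed in (for example rpool) is an assumption, since the thread does not name the pool.

```shell
# Hypothetical helper: list filesystems and volumes under a pool, largest first.
# Output is tab-separated: bytes used, then dataset name.
largest_datasets() {
    pool="$1"
    # -H: no header, tab-separated; -p: exact byte values; -r: recurse into children
    zfs list -H -p -r -o used,name -t filesystem,volume "$pool" | sort -rn
}

# Hypothetical helper: count snapshots under a pool.
# A result of 0 confirms there is nothing to reclaim by destroying snapshots.
snapshot_count() {
    pool="$1"
    zfs list -H -r -t snapshot -o name "$pool" | wc -l | tr -d ' '
}
```

Usage would look like `largest_datasets rpool | head` followed by `snapshot_count rpool`. Both only read pool state, so they should be safe to run even while the pool is full.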

I don't really want to restart the node, since all the other VMs are still accessible, and if a restart doesn't fix anything they won't turn back on and even more will be down (13/38 currently). I do have backups of these VMs, so deleting them isn't a problem, but there's not really a place to move all of them, so I'd like to get this back up and running.

All help or advice is appreciated, thanks
 

Attachments

  • Screenshot 2024-03-10 114711.png
  • Screenshot 2024-03-10 114752.png
  • Screenshot 2024-03-10 114956.png
https://github.com/kneutron/ansitest/blob/master/ZFS/zfs-emergency-free-space.sh

I would strongly recommend doing a bulk shutdown of all running VMs while you're doing emergency maintenance, to avoid further writes to disk while you try to free up space by deleting things.

Thanks, I tried this first thing, but the before and after show exactly the same results. I also can't do any kind of bulk shutdown, because all qm commands time out, and I can't access any of the VMs on this node through the GUI since they all show I/O errors.

And I can't fstrim because of the same issue: no access to any of the VMs running on the ZFS pool.

Like I said, I can navigate through the command line and attempt to delete files there, but when I remove files from zvol nothing happens, and when I try to use zfs destroy or unmount commands I get an error saying that the volume is busy.

Does anyone know if I can unlock those to start removing them, or if that would even work?
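For the "volume is busy" error, one way to see what is holding a zvol open might look like the sketch below. It assumes the usual PVE naming scheme (vm-<vmid>-disk-<n>), the standard /dev/zvol/<dataset> device links, and that fuser (from the psmisc package) is installed. The function names and the example dataset are hypothetical.

```shell
# Map a zvol dataset name to its device node under /dev/zvol.
zvol_device() {
    printf '/dev/zvol/%s\n' "$1"
}

# Extract the VM ID from a PVE-style volume name such as rpool/data/vm-105-disk-0.
# Prints nothing if the name does not follow that convention.
vmid_of() {
    printf '%s\n' "$1" | sed -n 's#^.*/vm-\([0-9][0-9]*\)-disk-[0-9][0-9]*$#\1#p'
}

# Show which processes currently have the zvol's device node open.
# Assumption: fuser is available on the host.
zvol_holders() {
    fuser -v "$(zvol_device "$1")"
}
```

For example, `zvol_holders rpool/data/vm-105-disk-0` would typically list the kvm process of VM 105 if that VM is still running. While such a process holds the device, zfs destroy will keep reporting the volume as busy.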
 
You can try 'telinit 1' from the host console; that should kill everything and put you into single-user mode.

Or reboot (you may need to hard-reset it) and, at the GRUB menu, edit the boot entry and append 'init=/bin/bash' to the kernel line. That does pretty much the same thing, but IIRC you wouldn't need the root password with the GRUB method.

If this is a homelab you should be OK, but if it's a small business or larger you should be talking to your users about downtime and about not filling up the disk(s) - and possibly filing a support ticket.

As a last resort, if you have a pool of mirrors, you could add another mirror vdev of two more disks (same size OR larger) and expand the free space that way. With RAIDZx, you would need to add another set of same-sized disks that match the original vdev specs for balanced I/O.
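A hedged sketch of the mirror expansion described above, using the dry-run flag of zpool add so that nothing is changed until the layout has been reviewed. The function name, the pool name tank, and the disk paths are placeholders, not values from this thread.

```shell
# Preview adding a two-disk mirror vdev to an existing pool.
# -n prints the configuration that would result without modifying the pool.
add_mirror_dry_run() {
    if [ "$#" -ne 3 ]; then
        echo "usage: add_mirror_dry_run POOL DISK1 DISK2" >&2
        return 1
    fi
    zpool add -n "$1" mirror "$2" "$3"
}
```

Usage might be `add_mirror_dry_run tank /dev/disk/by-id/diskA /dev/disk/by-id/diskB`. Once the preview looks right, the same zpool add command without -n performs the change. Keep in mind that a vdev added to a pool with RAIDZ vdevs generally cannot be removed again, so the preview is worth reading carefully.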

It sounds like you will need to expand the pool long-term anyway.
 