[SOLVED] Node can't start pve-cluster after rpool filled to 100%

lichie
New Member
Feb 14, 2024
Today I was trying to back up one of my VMs before migrating it to a new node, and I mistakenly assumed the node it was on had enough local storage to complete the backup (all of the other nodes in the cluster have 2 TB or more of local storage, but this node only had 512 GB for some reason). The VM had a 256 GB disk, so the node ran out of space on rpool and the backup job hung. At that point I wasn't sure why this was happening. The node wouldn't let me reboot because the backup job was stuck in an uninterruptible sleep, so I disconnected the power from the machine and restarted it.
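(In hindsight, I could have confirmed what was stuck before pulling the power. A process in uninterruptible sleep shows a D state in ps, so something along these lines would have shown the hung vzdump worker:)

Code:
# List processes in uninterruptible sleep (state D); a vzdump worker
# here means the kernel is blocked on I/O and a clean reboot will hang.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'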

Now the machine is unable to connect to the cluster. I looked at the logs for the pve-cluster service, and that's when I realized the disk was full.

Code:
Feb 14 14:25:29 pve2 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] notice: resolved node name 'pve2' to '192.168.50.3' for default node IP address
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] notice: resolved node name 'pve2' to '192.168.50.3' for default node IP address
Feb 14 14:25:29 pve2 pmxcfs[1412]: [database] crit: chmod failed: No space left on device
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 14 14:25:29 pve2 pmxcfs[1412]: [database] crit: chmod failed: No space left on device
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Feb 14 14:25:29 pve2 pmxcfs[1412]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 14 14:25:29 pve2 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 14 14:25:29 pve2 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 14 14:25:29 pve2 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Feb 14 14:25:29 pve2 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 4.
Feb 14 14:25:29 pve2 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.

I'm not really sure how to proceed at this point. I'm pretty sure all my configuration files are fine; I think I just need to delete the files from the failed backup and I should be able to reconnect, but I'm no ZFS guru. Here is the output of some commands showing my storage situation:

Code:
root@pve2:~# df -h
Filesystem        Size  Used Avail Use% Mounted on
udev               32G     0   32G   0% /dev
tmpfs             6.3G  1.3M  6.3G   1% /run
rpool/ROOT/pve-1  268G  268G     0 100% /
tmpfs              32G     0   32G   0% /dev/shm
tmpfs             5.0M     0  5.0M   0% /run/lock
efivarfs          128K  8.3K  115K   7% /sys/firmware/efi/efivars
rpool             128K  128K     0 100% /rpool
rpool/ROOT        128K  128K     0 100% /rpool/ROOT
rpool/data        128K  128K     0 100% /rpool/data
rpool/vms         128K  128K     0 100% /rpool/vms
tmpfs             6.3G     0  6.3G   0% /run/user/0
root@pve2:~#
root@pve2:~# zfs list -o space,refquota,quota,volsize
NAME                     AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  REFQUOTA  QUOTA  VOLSIZE
rpool                       0B   435G        0B    104K             0B       435G      none   none        -
rpool/ROOT                  0B   268G        0B     96K             0B       268G      none   none        -
rpool/ROOT/pve-1            0B   268G        0B    268G             0B         0B      none   none        -
rpool/data                  0B    96K        0B     96K             0B         0B      none   none        -
rpool/vms                   0B   167G        0B     96K             0B       167G      none   none        -
rpool/vms/vm-102-disk-0     0B   167G        0B    167G             0B         0B         -      -     256G

As you can see, there are 0 bytes available in the pool. However, when I navigate to any of those locations (e.g. /rpool/vms/), there are no files there. How can I get to the files I need to clear out?

Thanks.
 
Never mind, I just fixed it myself.

I still don't understand the ZFS mount points at all, but I was able to find the partial backup files with find / | grep vzdump, and from there it was as simple as deleting the files associated with the botched backup and rebooting. Rough commands below.
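(A rough reconstruction for anyone who lands here with the same problem; the dump file name below is just a placeholder, yours will differ. The reason the datasets looked empty under /rpool is that rpool/ROOT/pve-1 is mounted at /, so the partial vzdump file was sitting on the root filesystem all along, and vm-102-disk-0 is a zvol, a block device rather than files.)

Code:
# Locate the leftover partial backup files (default dump dir is /var/lib/vz/dump)
find / | grep vzdump

# On a completely full ZFS pool, a plain rm can itself fail with
# "No space left on device", because copy-on-write deletion needs to
# allocate metadata first; truncating the file to zero frees its blocks.
truncate -s 0 /var/lib/vz/dump/vzdump-qemu-102-partial.vma   # placeholder name
rm /var/lib/vz/dump/vzdump-qemu-102-partial.vma

# With space freed, reboot, or just restart the cluster filesystem:
systemctl restart pve-cluster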
 
