I've run into an issue where I can't connect to a node through the Proxmox interface because it's out of space. I was going to delete a test VM to free up space, which would let me reconfigure the backups to point at my NAS instead of the small local drive on this node. However, I'm having trouble freeing up space without access to the node/backups in the Proxmox management portal. I can SSH in, but I may have borked the ZFS pool to the point that I can't access or remove files from the CLI.
I'm hoping someone could help me find a way to get back into the management portal to reconfigure my backups, and potentially nuke the node and rework the ZFS pool so it's set up correctly. I was good in that I have the backups, but bad in that I hadn't set them up to live on a different node (or multiple nodes). Ideally I'd be able to move the one VM that I care about off the node, or pull its backup over to a different node and restore it there while I rebuild this node.
Any suggestions would be very welcome.
I've checked the following resources, and below are the troubleshooting steps I've taken.
https://forum.proxmox.com/threads/e...key-key_file-or-key-at-usr-share-perl5.48943/
https://forum.proxmox.com/threads/required-command-to-remove-the-zfs-snapshot-via-cli.111704/
https://technotes.seastrom.com/asse...-a-ZFS-Filesystem-that-is-100percent-Full.pdf
https://forum.proxmox.com/threads/no-space-left-on-device.77411/
The UI is throwing:
hostname lookup 'zodiac' failed - failed to get address info for: zodiac: Name or service not known (500)
I checked this thread:
https://forum.proxmox.com/threads/e...key-key_file-or-key-at-usr-share-perl5.48943/
Locally, however, the hostname resolves correctly:
Bash:
root@zodiac:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.10.40.187 zodiac.lab.astrolab.dev zodiac
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Bash:
root@zodiac:~# cat /etc/hostname
zodiac
Bash:
root@zodiac:~# ping $(uname -n)
PING zodiac.lab.astrolab.dev (10.10.40.187) 56(84) bytes of data.
64 bytes from zodiac.lab.astrolab.dev (10.10.40.187): icmp_seq=1 ttl=64 time=0.027 ms
64 bytes from zodiac.lab.astrolab.dev (10.10.40.187): icmp_seq=2 ttl=64 time=0.022 ms
...
64 bytes from zodiac.lab.astrolab.dev (10.10.40.187): icmp_seq=7 ttl=64 time=0.037 ms
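Since the UI error and the local checks disagree, I also wanted to rule out a resolver mismatch, as ping and pveproxy don't necessarily resolve names the same way. A quick sanity check against the libc resolver (standard tools, nothing exotic):
Bash:
# Ask the libc resolver (the same getaddrinfo path pveproxy uses)
getent hosts zodiac
getent hosts zodiac.lab.astrolab.dev
# Confirm the short and fully qualified names agree with /etc/hosts
hostname
hostname --fqdn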
Checked systemctl; it was running with degraded Proxmox services:
Bash:
root@zodiac:~# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● corosync.service loaded failed failed Corosync Cluster Engine
● postfix@-.service loaded failed failed Postfix Mail Transport Agent (instance -)
● pve-cluster.service loaded failed failed The Proxmox VE cluster filesystem
● pve-firewall.service loaded failed failed Proxmox VE firewall
● pve-guests.service loaded failed failed PVE guests
● pve-ha-crm.service loaded failed failed PVE Cluster HA Resource Manager Daemon
● pve-ha-lrm.service loaded failed failed PVE Local HA Resource Manager Daemon
● pvescheduler.service loaded failed failed Proxmox VE scheduler
● pvestatd.service loaded failed failed PVE Status Daemon
● systemd-hostnamed.service loaded failed failed Hostname Service
● systemd-random-seed.service loaded failed failed Load/Save Random Seed
● systemd-update-utmp.service loaded failed failed Record System Boot/Shutdown in UTMP
All of the failures were along the lines of:
zodiac pveproxy[1321]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2025.
proxmox_firewall: error updating firewall rules: failed to read guest map from /etc/pve/.vmlist
systemctl reset-failed worked, and all of the degraded services showed as loaded correctly:
Bash:
root@zodiac:~# systemctl reset-failed
root@zodiac:~# systemctl status
● zodiac
State: running
Units: 390 loaded (incl. loaded aliases)
Jobs: 0 queued
Failed: 0 units
Since: Wed 2024-07-31 13:52:19 HDT; 26min ago
systemd: 252.26-1~deb12u2
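As far as I understand it, reset-failed only clears the failure markers; it doesn't actually start anything. So a follow-up check that the core PVE services are really running seemed worthwhile (just a sketch):
Bash:
# reset-failed clears the failed state but starts nothing;
# verify the core PVE services are actually active
systemctl is-active pve-cluster pvedaemon pveproxy pvestatd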
Checked the PVE cluster service; it had exited with an exception:
Bash:
root@zodiac:~# systemctl status -l pve-cluster
○ pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: inactive (dead) since Wed 2024-07-31 13:52:24 HDT; 30min ago
Process: 1112 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
CPU: 7ms
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Jul 31 13:52:24 zodiac systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jul 31 13:52:24 zodiac systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
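The "Start request repeated too quickly" line is systemd's restart rate limit kicking in, so even once the underlying problem is fixed, the failure counter has to be cleared before the service will start again. What I expect to run after freeing space:
Bash:
# Clear the restart-rate-limit counter, then retry the service
systemctl reset-failed pve-cluster.service
systemctl start pve-cluster.service
systemctl status pve-cluster.service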
The journal is showing the device is out of space:
Bash:
Jul 31 13:52:24 zodiac systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 31 13:52:24 zodiac systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] notice: resolved node name 'zodiac' to '10.10.40.187' for default node IP address
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] notice: resolved node name 'zodiac' to '10.10.40.187' for default node IP address
Jul 31 13:52:24 zodiac pmxcfs[1112]: [database] crit: chmod failed: No space left on device
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] notice: exit proxmox configuration filesystem (-1)
Jul 31 13:52:24 zodiac pmxcfs[1112]: [database] crit: chmod failed: No space left on device
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jul 31 13:52:24 zodiac pmxcfs[1112]: [main] notice: exit proxmox configuration filesystem (-1)
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jul 31 13:52:24 zodiac systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Jul 31 13:52:24 zodiac systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jul 31 13:52:24 zodiac systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jul 31 13:52:24 zodiac systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
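The chmod failure suggests the dataset holding /var/lib/pve-cluster is completely full rather than the database itself being damaged. A quick check of which dataset backs that path and how full it is (standard df plus the ZFS space columns):
Bash:
# Which filesystem backs /var/lib/pve-cluster, and how full is it?
df -h /var/lib/pve-cluster
# Break down where the dataset's space went (data vs snapshots vs children)
zfs list -o space rpool/ROOT/pve-1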
This leads me to believe I have a poorly configured ZFS pool. I checked a couple of threads on recovering space from a full pool:
https://forum.proxmox.com/threads/required-command-to-remove-the-zfs-snapshot-via-cli.111704/
https://technotes.seastrom.com/asse...-a-ZFS-Filesystem-that-is-100percent-Full.pdf
Looking into making space by deleting snapshots got me nowhere, as there don't seem to be any snapshots or volumes to delete:
Bash:
root@zodiac:~# zfs list -t snapshot
no datasets available
root@zodiac:~# zfs list -t volume
no datasets available
root@zodiac:~# zfs list rpool/dump
cannot open 'rpool/dump': dataset does not exist
root@zodiac:~# zfs list rpool
NAME USED AVAIL REFER MOUNTPOINT
rpool 450G 0B 104K /rpool
root@zodiac:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 450G 0B 104K /rpool
rpool/ROOT 450G 0B 96K /rpool/ROOT
rpool/ROOT/pve-1 450G 0B 450G /
rpool/data 96K 0B 96K /rpool/data
rpool/var-lib-vz 104K 0B 104K /var/lib/vz
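Since rpool/ROOT/pve-1 REFERs the whole 450G and there are no snapshots or zvols, the space has to be ordinary files somewhere in the root filesystem. My plan for hunting them down, assuming the backups ended up under / somewhere rather than on the nearly empty /var/lib/vz dataset:
Bash:
# Walk the root filesystem one directory level at a time;
# -x stays on this filesystem, sort -h puts the biggest last
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -n 15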
Checked the Samba socket issue as outlined in this thread; I guess I'm not using Samba:
https://forum.proxmox.com/threads/no-space-left-on-device.77411/
Bash:
root@zodiac:~# cd /var/lib/samba/private/msg.sock
-bash: cd: /var/lib/samba/private/msg.sock: No such file or directory
root@zodiac:~# cd /var/lib/samba/private/
root@zodiac:/var/lib/samba/private# ls -a
. ..
root@zodiac:/var/lib/samba/private#
I tried to qm destroy VM 106, as it was a testing VM, but couldn't get a connection:
Bash:
root@zodiac:~# qm destroy 106
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
root@zodiac:~#
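From what I can tell, qm talks to pmxcfs over a local IPC socket, so the "Connection refused" is expected while pve-cluster is down; destroying the VM can't work until the cluster filesystem is back up. For what it's worth, here is the sequence I'm planning to try, assuming I can reclaim even a few megabytes first (truncating a rotated log in place is reportedly safer than rm on a completely full copy-on-write filesystem; /var/log/syslog.1 is just an example candidate):
Bash:
# 1. Reclaim a little space; truncation needs less metadata than a delete
#    on a full CoW filesystem (/var/log/syslog.1 is only an example target)
truncate -s 0 /var/log/syslog.1
# 2. Clear the rate limit and bring the cluster filesystem back up
systemctl reset-failed pve-cluster.service
systemctl start pve-cluster.service
# 3. With /etc/pve mounted again, remove the test VM
qm destroy 106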