PROXMOX VE: It stopped working

Rysiu

New Member
Aug 11, 2022
I have a problem with PROXMOX VE.

The PROXMOX node has stopped working.

I have the following symptoms:

Code:
root@nodename:/etc/pve/local# /usr/bin/pmxcfs
[database] crit: found entry with duplicate name (inode = 0000000002EF9C1A, parent = 000000000000000E, name = '107.conf')
[database] crit: DB load failed
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
[main] notice: exit proxmox configuration filesystem (-1)

and

Code:
root@nodename:/etc/pve/local# qm list
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

Code:
root@nodename:/etc/pve# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-08-11 08:43:20 CEST; 6s ago
    Process: 16178 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Aug 11 08:43:20 nodename systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 08:43:20 nodename systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 11 08:43:20 nodename systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 11 08:43:20 nodename systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 11 08:43:20 nodename systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 11 08:43:20 nodename systemd[1]: Failed to start The Proxmox VE cluster filesystem.

Code:
root@nodename:/etc/pve# journalctl -xe
Aug 11 08:43:22 nodename pveproxy[16179]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Aug 11 08:43:22 nodename pveproxy[16180]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Aug 11 08:43:22 nodename pveproxy[16160]: worker exit
Aug 11 08:43:22 nodename pveproxy[1527]: worker 16160 finished
Aug 11 08:43:22 nodename pveproxy[1527]: starting 1 worker(s)
Aug 11 08:43:22 nodename pveproxy[1527]: worker 16181 started
Aug 11 08:43:22 nodename pveproxy[16181]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Aug 11 08:43:27 nodename pveproxy[16179]: worker exit

The web panel also does not work.
What could be the problem?
 
Something went wrong with the sqlite DB that stores the contents of /etc/pve, so all services that need to access files there won't work properly.

First, make a backup of the database before you attempt to fix it:
Code:
cp /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db.bkp
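
Before touching the database, it may also help to make sure nothing is still trying to use it, and to let sqlite check the file itself. This is only an optional sketch; the listed units are the usual PVE services:
Code:
# optional: stop the services that access the config DB
systemctl stop pve-cluster pvedaemon pveproxy pvestatd

# optional: let sqlite verify the database file before any manual edits
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'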

Then start investigating. It looks like there are two entries for the 107.conf file. Open the database:

Code:
sqlite3 /var/lib/pve-cluster/config.db
First, set a few parameters to make the output easier to read:
Code:
sqlite> .header on
sqlite> .mode line

Last, run the following query and post the output here in [code][/code] tags.
Code:
sqlite> select inode,version,mtime,data from tree where name = "107.conf";
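
If you want to check whether any other files are affected as well, a query along these lines (optional, just a suggestion) lists every parent/name combination that occurs more than once:
Code:
sqlite> select parent, name, count(*) from tree group by parent, name having count(*) > 1;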
 
The select query returns:

Code:
sqlite> select inode,version,mtime,data from tree where name = "107.conf";
  inode = 49257498
version = 49257500
  mtime = 1654930792
   data = bootdisk: scsi0
cores: 2
ide2: local:iso/ubuntu-20.04.2-live-server-amd64.iso,media=cdrom
memory: 2048
name: JUMP-000
net0: virtio=4E:FD:9F:51:7B:74,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-107-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=3685e841-fd51-4100-9ead-8f6959f83e71
sockets: 1
vmgenid: c790b03b-7027-4280-a4df-5d1b3e9a1acf


  inode = 49257498
version = 49257500
  mtime = 1654930792
   data = bootdisk: scsi0
cores: 2
ide2: local:iso/ubuntu-20.04.2-live-server-amd64.iso,media=cdrom
memory: 2048
name: JUMP-000
net0: virtio=4E:FD:9F:51:7B:74,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-107-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=3685e841-fd51-4100-9ead-8f6959f83e71
sockets: 1
vmgenid: c790b03b-7027-4280-a4df-5d1b3e9a1acf

What should I do next?
 
Unless I am mistaken, those two entries look exactly the same. In that case, please run the following query.
Code:
sqlite> select * from tree where name = "107.conf";
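
If the two rows still look identical, you could additionally compare the raw bytes of the data column, e.g. with sqlite's hex() function (optional):
Code:
sqlite> select inode, hex(data) from tree where name = "107.conf";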
 
The new query returns:

Code:
sqlite> select * from tree where name = "107.conf";
  inode = 49257498
 parent = 14
version = 49257500
 writer = 0
  mtime = 1654930792
   type = 8
   name = 107.conf
   data = bootdisk: scsi0
cores: 2
ide2: local:iso/ubuntu-20.04.2-live-server-amd64.iso,media=cdrom
memory: 2048
name: JUMP-000
net0: virtio=4E:FD:9F:51:7B:74,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-107-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=3685e841-fd51-4100-9ead-8f6959f83e71
sockets: 1
vmgenid: c790b03b-7027-4280-a4df-5d1b3e9a1acf


  inode = 49257498
 parent = 14
version = 49257500
 writer = 0
  mtime = 1654930792
   type = 8
   name = 107.conf
   data = bootdisk: scsi0
cores: 2
ide2: local:iso/ubuntu-20.04.2-live-server-amd64.iso,media=cdrom
memory: 2048
name: JUMP-000
net0: virtio=4E:FD:9F:51:7B:74,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-107-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=3685e841-fd51-4100-9ead-8f6959f83e71
sockets: 1
vmgenid: c790b03b-7027-4280-a4df-5d1b3e9a1acf

I see that the result is very similar to the previous one.
 
Okay, both entries are exactly the same.
Try running
Code:
delete from tree where inode = "49257498" limit 1;

After that, running the previous query should return only one entry. Please make sure you have a backup before you run the delete query.
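
As an optional sanity check afterwards, something like this should confirm that exactly one row is left before you exit sqlite and try to start the service again:
Code:
sqlite> select count(*) from tree where name = "107.conf";
sqlite> .quit
root@nodename:~# systemctl start pve-cluster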
 
Ok. After the changes, I have a different error message:

Code:
root@nodename:~# /usr/bin/pmxcfs
fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option
[main] crit: fuse_mount error: File exists
[main] notice: exit proxmox configuration filesystem (-1)
 
Check what is currently located at /etc/pve and move it somewhere else. Once the directory is empty, the pve-cluster service can hopefully start again.
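
A minimal sketch of what that could look like (the target directory below is just an example; findmnt is only there to double-check that nothing is currently mounted on /etc/pve):
Code:
# make sure nothing is mounted on /etc/pve right now
findmnt /etc/pve

# see what is left over in the directory
ls -la /etc/pve

# move the leftovers out of the way (target path is just an example)
mkdir -p /root/pve-leftovers
mv /etc/pve/* /root/pve-leftovers/

# then try the service again
systemctl start pve-cluster
systemctl status pve-cluster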