[SOLVED] pve-cluster fails to start

mike p

I recently changed the IP address and hostname of a PVE server. Apparently I did it wrong, or at least not completely, and now I can't get pve-cluster to start.

pve-cluster:
Bash:
root@DC-BS7-PM4:~# systemctl status pve-cluster -n 30
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2021-01-22 09:17:15 +0330; 22min ago
  Process: 3851 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jan 22 09:17:15 DC-BS7-PM4 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jan 22 09:17:19 DC-BS7-PM4 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jan 22 09:17:19 DC-BS7-PM4 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jan 22 09:17:19 DC-BS7-PM4 systemd[1]: Failed to start The Proxmox VE cluster filesystem.

pveproxy:

Bash:
root@DC-BS7-PM4:~# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-01-22 08:51:14 +0330; 50min ago
  Process: 1970 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
  Process: 1973 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
 Main PID: 1975 (pveproxy)
    Tasks: 4 (limit: 7372)
   Memory: 138.8M
   CGroup: /system.slice/pveproxy.service
           ├─1975 pveproxy
           ├─5129 pveproxy worker
           ├─5130 pveproxy worker
           └─5131 pveproxy worker

Jan 22 09:41:52 DC-BS7-PM4 pveproxy[5129]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[5127]: worker exit
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[5128]: worker exit
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[1975]: worker 5128 finished
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[1975]: worker 5127 finished
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[1975]: starting 2 worker(s)
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[1975]: worker 5130 started
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[1975]: worker 5131 started
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[5130]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
Jan 22 09:41:52 DC-BS7-PM4 pveproxy[5131]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.

I found this in journalctl -xe:

Bash:
....
-- The job identifier is 9521 and the job result is done.
Jan 22 09:43:22 DC-BS7-PM4 systemd[1]: Starting The Proxmox VE cluster filesystem...
-- Subject: A start job for unit pve-cluster.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pve-cluster.service has begun execution.
--
-- The job identifier is 9521.
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [database] crit: missing directory inode (inode = 0000000000054E63)
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [database] crit: missing directory inode (inode = 0000000000054E63)
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [database] crit: DB load failed
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [main] notice: exit proxmox configuration filesystem (-1)
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [database] crit: DB load failed
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jan 22 09:43:22 DC-BS7-PM4 pmxcfs[5237]: [main] notice: exit proxmox configuration filesystem (-1)
Jan 22 09:43:22 DC-BS7-PM4 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit pve-cluster.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 255.
Jan 22 09:43:22 DC-BS7-PM4 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
....

/var/lib/pve-cluster/config.db exists, but I can't open it with vim.

And:

Bash:
root@DC-BS7-PM4:~# pvecm --help
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

Any ideas?
 
I recently changed the IP address and hostname of a PVE server. Apparently I did it wrong, or at least not completely, and now I can't get pve-cluster to start.

What exact steps did you execute? Did you follow any documentation/how-to?

Can you also check the full log for this service? Use journalctl -b -u pve-cluster to get all messages it logged during the current boot, and look at the initial start in particular; there may be additional errors there.
/var/lib/pve-cluster/config.db
That is the backing database of the cluster, and it seems to have some problems. It may have been mishandled, but IMO that is unlikely on its own: the well-known SQLite software used to manage this file is one of the best-tested pieces of software I know, and we have never had any problems reported in this direction.
Another possibility is that the underlying filesystem or disk has some failures.

Check the whole boot log for sdX/SCSI device errors or filesystem errors:

journalctl -b

Also, is this node in a cluster? If so, we might reuse the .db from the other nodes if all else fails.
 
Thanks for your attention and time.
I also understand that this is my own wrongdoing and that you're doing me a favor by answering my post. :)
What exact steps did you execute? Did you follow any documentation/how-to?
If I remember correctly, I changed /etc/network/interfaces and then ran ifdown vlan0 && ifup vlan0. I also edited /etc/hosts and changed the name. Back then it didn't occur to me that the hostname is very important for PVE, so I didn't follow any documentation.


Can you also check the full log for this service, i.e., use journalctl -b -u pve-cluster
It is only a repetition of the following (probably from me trying to restart it multiple times), and nothing else:
Bash:
-- Logs begin at Fri 2021-01-22 08:51:07 +0330, end at Fri 2021-01-22 11:24:23 . --
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [database] crit: missing directory inode (inode = 0000000000054E63)
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [database] crit: missing directory inode (inode = 0000000000054E63)
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [database] crit: DB load failed
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [main] notice: exit proxmox configuration filesystem (-1)
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [database] crit: DB load failed
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Jan 22 08:51:11 DC-BS7-PM4 pmxcfs[1722]: [main] notice: exit proxmox configuration filesystem (-1)
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 1.
Jan 22 08:51:11 DC-BS7-PM4 systemd[1]: Stopped The Proxmox VE cluster filesystem.


Another possibility is that the underlying filesystem or disk has some failures?
It's an NVMe drive, and nvme smart-log /dev/nvme0 looks OK. I also didn't change anything related to the filesystem.

Check the whole boot log for any errors regarding sdX/scsi device errors or filesystem errors?
Everything seems ok... the problem indeed seems to be with config.db

Also, is this node in a cluster? If so, we might reuse the .db from the other nodes if all else fails.
Unfortunately it's not. I recently removed it from the cluster, and it was working fine after the removal.

Supposing that all VMs are healthy, can I somehow back them up now from the command line? Or is the only way to restore some old backups?
 
If I remember correctly, I changed /etc/network/interfaces and then ran ifdown vlan0 && ifup vlan0. I also edited /etc/hosts and changed the name. Back then it didn't occur to me that the hostname is very important for PVE, so I didn't follow any documentation.
I mean, that seems sensible, and even if you had forgotten to change /etc/hosts, that should never have such effects on the pve-cluster backing database file. If you remember anything else that has to do with pve-cluster or that file, please note it here; it could help to resolve this case, even if it was just a power outage or a forced poweroff.

I'd first back up the whole /var/lib/pve-cluster/ folder just to be sure. As root you can do:
tar czf pve-cluster-bak.tgz -C /var/lib/pve-cluster ./
Copy the resulting archive somewhere safe.
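
If you want to confirm that the archive is actually readable before touching the database, something like the following could work. It is only a sketch: backup_dir is a made-up helper name, and the target path in the example comment is an assumption, not anything Proxmox prescribes.

```shell
#!/bin/sh
# Sketch: archive a directory and confirm the archive can be read back.
# backup_dir is a hypothetical helper; the tar flags match the command above.
backup_dir() {
    src=$1       # directory to archive, e.g. /var/lib/pve-cluster
    archive=$2   # output file, e.g. /root/pve-cluster-bak.tgz (assumed path)
    tar czf "$archive" -C "$src" ./ || return 1
    # Listing the archive re-reads the whole gzip stream, so a truncated
    # or unreadable file makes tar exit non-zero here.
    tar tzf "$archive" >/dev/null || return 1
    echo "backup written to $archive"
}

# Example (as root, while pve-cluster is not running):
# backup_dir /var/lib/pve-cluster /root/pve-cluster-bak.tgz
```

Keeping the archive outside /var/lib/pve-cluster avoids tar trying to include its own output file.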

Then let's do a few sanity checks on the database file. Again as root, execute the following commands to see if sqlite can do something with the .db file:
Bash:
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
sqlite3 /var/lib/pve-cluster/config.db .schema
sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 0'
sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 347747 or inode = 347747'
Please post the output here.
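
The inode used in the last query appears to come from the journal: pmxcfs reports the missing directory inode in hexadecimal (0000000000054E63), while the tree table stores inodes as decimal integers. A minimal sketch of the conversion, in case a different inode shows up in your log; hex_inode_to_dec is just a hypothetical helper name:

```shell
#!/bin/sh
# Convert the hexadecimal inode from the pmxcfs log line to decimal.
# hex_inode_to_dec is a hypothetical helper name.
hex_inode_to_dec() {
    # printf accepts a 0x-prefixed integer for %d; leading zeros are harmless.
    printf '%d\n' "0x$1"
}

hex_inode_to_dec 0000000000054E63   # prints 347747
```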
 
If you remember something else which has to do with pve-cluster or that file please note here, it could help to resolve that case, even if it was just a power outage or a forced poweroff.
Well, we've had a power outage recently...

Copy the resulting archive somewhere safe.
Will do, thanks.

Please post the output here.
Bash:
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
ok
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db .schema
CREATE TABLE tree (  inode INTEGER PRIMARY KEY NOT NULL,  parent INTEGER NOT NULL CHECK(typeof(parent)=='integer'),  version INTEGER NOT NULL CHECK(typeof(version)=='integer'),  writer INTEGER NOT NULL CHECK(typeof(writer)=='integer'),  mtime INTEGER NOT NULL CHECK(typeof(mtime)=='integer'),  type INTEGER NOT NULL CHECK(typeof(type)=='integer'),  name TEXT NOT NULL,  data BLOB);
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 0'
0|1611011184|__version__
4|1607358407|user.cfg
6|1607358407|datacenter.cfg
8|1607359078|virtual-guest
9|1607359079|priv
11|1607359079|nodes
24|1607359080|pve-www.key
32|1607359082|pve-root-ca.pem
51|1607359082|ha
53|1607359082|sdn
33961|1607850857|replication.cfg
33965|1607850857|vzdump.cron
48740|1607873781|storage.cfg
344434|1610964025|authkey.pub.old
344437|1610964025|authkey.pub
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 347747 or inode = 347747'
347782|1610969161|qemu-server
 
Well, we've had a power outage recently...
That could explain some things.

Well, those outputs definitely seem encouraging. It seems that the DB is not corrupted but rather is missing an entry which pmxcfs (pve-cluster) wants to be there. Still not really ideal, but much better than I thought!
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 347747 or inode = 347747'
347782|1610969161|qemu-server
Hmm, the file pmxcfs complains about seems to be either a, well, file (which is a bit weird for that name) or an empty directory.

Can you re-run that command with the parent inode and file type added to the SQL statement? (I honestly did not really believe we would make it this far; all the better that we did!)
Bash:
sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,parent,mtime,type,name FROM tree WHERE parent = 347747 or inode = 347747'
# just to be sure it's a lost dangling entry:
sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,parent,mtime,type,name FROM tree WHERE parent = (SELECT parent FROM tree WHERE inode = 347747)'

We can probably resolve this by deleting that lost dangling entry, but I'd like to check first whether it's really just bogus or may hold data...
 
sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,parent,mtime,type,name FROM tree WHERE parent = (SELECT parent FROM tree WHERE inode = 347747)'
I ran this and got no output...


Bash:
root@DC-BS7-PM4:~# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,parent,mtime,type,name FROM tree WHERE parent = (SELECT parent FROM tree WHERE inode = 347747)'
root@DC-BS7-PM4:~#

If I'm correct that means the file with inode 347747 has no parent and therefore is dangling.
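
One possible reading of the two results, offered with some caution: the first query returned only inode 347782 (qemu-server), which can only have matched the parent = 347747 branch, and the second returned nothing because no row with inode 347747 exists at all. So the dangling row would be the child 347782, pointing at a parent directory that is missing. The sketch below lists every such row in one go. It assumes the tree schema shown earlier and that parent 0 is the root directory (as the parent = 0 query suggests); list_orphans is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list rows whose parent inode has no row of its own.
# list_orphans is a hypothetical helper; the tree columns come from the
# .schema output earlier in the thread.
list_orphans() {
    # Assumption: parent = 0 marks entries in the root directory, so those
    # rows are skipped. The statement only reads from the database.
    sqlite3 "$1" "SELECT c.inode, c.parent, c.type, c.name
                  FROM tree AS c
                  LEFT JOIN tree AS p ON p.inode = c.parent
                  WHERE c.parent != 0 AND p.inode IS NULL;"
}

# Example (as root):
# list_orphans /var/lib/pve-cluster/config.db
```

An empty result would suggest that every entry's parent exists.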
 
I ran:
Bash:
sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE parent = 347747 or inode = 347747'

t.lamprecht,
I'm very pleased to inform you that deleting the dangling entry solved the problem, and I'm able to start pve-cluster.
Many thanks. I wish you and the Proxmox project a great future. ;)
 
Your query is the same solution I also had in mind, and it was safe to execute at this point, as we had ensured that it really wouldn't have any undesired side effects.
Anyway, great to hear that you could solve it! Thanks for the wishes, I can only return them!
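
For anyone finding this thread later, a slightly more defensive way to run the same DELETE might look like this. It is a sketch under assumptions: delete_dangling is a hypothetical helper, the copy name config.db.before-delete is made up, and it should only be run while pve-cluster is stopped and after checking, as was done above, that the affected rows hold nothing you need.

```shell
#!/bin/sh
# Sketch: copy the database, then delete the rows tied to a missing inode.
# delete_dangling is a hypothetical helper; the DELETE is the one used above.
delete_dangling() {
    db=$1      # e.g. /var/lib/pve-cluster/config.db
    inode=$2   # decimal inode, e.g. 347747
    # Refuse anything that is not a plain decimal number before it reaches SQL.
    case $inode in
        ''|*[!0-9]*) echo "inode must be a decimal number" >&2; return 1 ;;
    esac
    # Keep an untouched copy next to the original (assumed file name).
    cp -p "$db" "$db.before-delete" || return 1
    # One transaction: either every matching row is removed or none is.
    # SELECT changes() prints how many rows the DELETE removed.
    sqlite3 "$db" "BEGIN;
                   DELETE FROM tree WHERE parent = $inode OR inode = $inode;
                   SELECT changes();
                   COMMIT;"
}

# Example (as root, while pve-cluster is not running):
# delete_dangling /var/lib/pve-cluster/config.db 347747
```

If the printed row count is larger than expected, the copy can be moved back into place before starting pve-cluster again.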
 
Thanks, your solution solved my problem!
 
