[SOLVED] Cluster does not come back up after a power outage of several hours

sombra3405

Greetings, I have a cluster of 5 servers running version 8.2. I know there are quite a few threads about this same issue, but I have not found a solution.
I see these errors in the logs, and the /etc/pve folder is empty:


Aug 20 08:11:40 pve201 pveproxy[3272]: worker exit
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3272 finished
Aug 20 08:11:40 pve201 pveproxy[1841]: starting 2 worker(s)
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3274 started
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3275 started
Aug 20 08:11:40 pve201 pveproxy[3274]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2025.
Aug 20 08:11:40 pve201 pveproxy[3275]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2025.


The cluster service failed with the following errors:
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x172>
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x172>
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: DB load failed
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: DB load failed
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] notice: exit proxmox configuration filesystem (-1)


I already updated and created new certificates following this guide: https://pve.proxmox.com/wiki/Proxmox_SSL_Error_Fixing

but I still can't solve this.

Thanks.
 

Do you have the same issue* on ALL of the nodes at the same time? It looks like you ended up with a corrupt database: /var/lib/pve-cluster/config.db is what holds the virtual filesystem mounted into /etc/pve at runtime (which is why yours is empty). If you have a good .db file on at least one of the nodes (highly likely), I would just copy it across manually, killall pmxcfs, and then systemctl restart pve-cluster.

EDIT: * Your other nodes might not be getting quorum on start but still have an intact config.db. You will see the difference in the log: in that case only corosync complains, not pmxcfs.

PS: Make a backup of that file, just in case.
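
A minimal sketch of that sequence, assuming pve202 holds the intact copy (the hostname is illustrative; adjust to whichever node has a good config.db):

# on the broken node
killall pmxcfs                                          # make sure nothing still holds the DB
cp /var/lib/pve-cluster/config.db /root/config.db.bak   # the backup mentioned above
scp root@pve202:/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl restart pve-cluster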
 
Thanks for the response. Looking at the other nodes' logs, they show the same error.
Now, how can I verify that a database copy is intact?

Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: DB load failed
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: DB load failed
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 02:49:57 pve202 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 02:49:57 pve202 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 02:49:57 pve202 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
 
Running the command pmxcfs -l gives me the following:
pmxcfs -l
[main] notice: resolved node name 'pve200' to '10.11.7.200' for default node IP address
[database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x1724090186) vs. B:(inode = 0x0000000007706220, parent = 0x00000000077039C7, v./mtime = 0x7706220/0x1724092847)
[database] crit: DB load failed
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
[main] notice: exit proxmox configuration filesystem (-1)



I have a compressed copy of the database from a node that I added to the cluster a few days ago and then removed. Could that database be useful to me?

Excuse my English.
 

It depends on how you "removed it": if the node itself was simply turned off, then yes, that would recover your nodes to the state from the time of that "backup".

The other option is to literally go and check the database entries (you have duplicates) and manually remove them with SQL, as per the above-linked thread. Keep a backup of the (now corrupted) file in case you cause more damage.

I just find it very strange that all the nodes (how many?) have a corrupt config.db in the same way...
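
As for verifying a given copy, a minimal inspection sketch, assuming the sqlite3 CLI is available (the tree table with its inode/parent/name/mtime columns is what pmxcfs stores):

sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
# list entries that share the same name under the same parent (the pmxcfs complaint):
sqlite3 /var/lib/pve-cluster/config.db 'SELECT parent, name, COUNT(*) FROM tree GROUP BY parent, name HAVING COUNT(*) > 1'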
 
Checking for duplicate entries gives me this:
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: DB load failed
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: DB load failed
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:03:26 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 09:03:26 pve200 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:03:26 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
root@pve200:/var/lib/pve-cluster# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 0x00000000077039C7'
124795336|1724090186|qemu-server
124801839|1724091865|pve5
124801840|1724092779|lxc
124801841|1724092779|pve-ssl.key
124801845|1724092779|pve-ssl.pem
124801847|1724092779|priv
124801848|1724092779|ssh_known_hosts
124801850|1724092779|openvz
124805664|1724092847|qemu-server
124805843|1724092891|lrm_status


Now how do I delete that duplicate entry?
 

Something like:

sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE inode = XXX'

You basically take your pick as to which of the two qemu-server entries should go; you could go for the one with the lower mtime.
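
Applied to the listing above, where the duplicate qemu-server rows are inode 124795336 (mtime 1724090186) and inode 124805664 (mtime 1724092847), a sketch that drops the one with the lower mtime would be:

cp /var/lib/pve-cluster/config.db /root/config.db.bak   # backup before touching it
sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE inode = 124795336'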
 
After deleting the entry, I restarted the service; now it gives me the following:

Aug 20 09:20:59 pve200 pmxcfs[2380]: fuse: mountpoint is not empty
Aug 20 09:20:59 pve200 pmxcfs[2380]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] crit: fuse_mount error: File exists
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] crit: fuse_mount error: File exists
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:20:59 pve200 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 09:20:59 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:20:59 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 09:21:00 pve200 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:21:00 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem
 

I think you had previously created your SSL certs at that location while it was unmounted, so you need to literally empty the /etc/pve path.
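
A sketch of that cleanup; the mountpoint check guards against wiping a live pmxcfs mount, since the stale files should only be deleted while /etc/pve is unmounted:

mountpoint -q /etc/pve || rm -rf /etc/pve/*   # clear the leftovers only if unmounted
systemctl restart pve-cluster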
 
Thank you very much, I was able to bring it back up; 2 nodes are already online, but now I don't see them as a cluster, I see them as independent.
 

Attachments

  • Captura de pantalla de 2024-08-20 10-07-00.png
  • Captura de pantalla de 2024-08-20 10-06-40.png

I wonder if you did a reboot, since you were previously attempting pmxcfs -l as well...

If you are troubleshooting nodes that "forgot" they were in a cluster, you would need to start with pvecm status and cat /etc/corosync/corosync.conf.

There are other threads about similar cases as well. But make a backup of your config.db again. ;)
 
With pvecm status it tells me that the file does not exist:
pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

and the file /etc/corosync/corosync.conf shows me the following:

cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve200
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.11.7.200
  }
  node {
    name: pve201
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.11.7.201
  }
  node {
    name: pve202
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.11.7.202
  }
  node {
    name: pve203
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.11.7.203
  }
  node {
    name: pve204
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.11.7.204
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Zumaseguros
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

 
Has there been a reboot since?

Is systemctl status corosync showing it active?

You may want to try copying /etc/corosync/corosync.conf to /etc/pve/corosync.conf and then systemctl restart pve-cluster.
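
A sketch of that, assuming /etc/pve is mounted and writable again after the DB repair:

cp /etc/corosync/corosync.conf /etc/pve/corosync.conf
systemctl restart pve-cluster
pvecm status   # should now find the config and report cluster membership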
 
I did exactly that, and it worked.
How do I remove the node that no longer belongs to the cluster from the GUI?
Thank you very much again for the help; and as you mentioned, I will keep that backup of config.db.
 
rm -rf /etc/pve/nodes/NODE-BY-NAME

But obviously, have a backup when doing these things! :) Also, you may need to refresh/clear the browser cache.

PS: Be ready to find some skeletons after this recovery of the .db file; it was in a strange state when you found it.
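
For example (the name pve5 here is only an assumption, based on the stray pve5 entry in the earlier tree listing; substitute the actual leftover directory):

cp /var/lib/pve-cluster/config.db /root/config.db.bak   # backup first, as above
ls /etc/pve/nodes/                                      # identify the leftover node directory
rm -rf /etc/pve/nodes/pve5                              # hypothetical name, for illustration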
 
  • Like
Reactions: sombra3405
If you want to help others with the same issue, you can change the thread title tag to "solved". The option is available via "edit thread" at the top right, then to the left of the title.

Have fun!
 
