[SOLVED] Cluster does not come back up after a power outage of several hours

sombra3405

Greetings, I have a cluster of 5 servers running version 8.2. I know there are quite a few threads about this same issue, but I have not found a solution.
I see these errors in the logs, and the /etc/pve folder is empty:


Aug 20 08:11:40 pve201 pveproxy[3272]: worker exit
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3272 finished
Aug 20 08:11:40 pve201 pveproxy[1841]: starting 2 worker(s)
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3274 started
Aug 20 08:11:40 pve201 pveproxy[1841]: worker 3275 started
Aug 20 08:11:40 pve201 pveproxy[3274]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2025.
Aug 20 08:11:40 pve201 pveproxy[3275]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 2025.


The cluster service failed with the following errors:
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x172>
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x172>
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: DB load failed
Aug 20 07:45:35 pve201 pmxcfs[1621]: [database] crit: DB load failed
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 07:45:35 pve201 pmxcfs[1621]: [main] notice: exit proxmox configuration filesystem (-1)


I already updated and created new certificates following this guide: https://pve.proxmox.com/wiki/Proxmox_SSL_Error_Fixing

but I still can't solve this.

Thanks.
 

Do you have the same issue* on ALL of the nodes at the same time? It looks like you ended up with a corrupt database: /var/lib/pve-cluster/config.db is what holds the virtual filesystem mounted into /etc/pve at runtime (which is why yours is empty). If you have a good .db file on at least one of the nodes (highly likely), I would just copy it across manually, killall pmxcfs, and then systemctl restart pve-cluster.

EDIT: * Your other nodes might not be getting quorum on start but still have an intact config.db. You will see the difference in the log: in that case only corosync complains, not pmxcfs.

PS: Make a backup of that file, just in case.
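
A minimal sketch of that sequence, assuming pve202 holds the intact copy (the hostname is illustrative; adjust to whichever node has a good config.db):

# on the broken node
killall pmxcfs                                          # make sure nothing still holds the DB
cp /var/lib/pve-cluster/config.db /root/config.db.bak   # the backup mentioned above
scp root@pve202:/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl restart pve-cluster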
 
Thanks for the response. Looking at the other nodes' logs, they show the same error.
Now, how can I verify that a database copy is intact?

Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: DB load failed
Aug 20 02:49:57 pve202 pmxcfs[17798]: [database] crit: DB load failed
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 02:49:57 pve202 pmxcfs[17798]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 02:49:57 pve202 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 02:49:57 pve202 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 02:49:57 pve202 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 02:49:57 pve202 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
 
Running the command pmxcfs -l gives me the following:
pmxcfs -l
[main] notice: resolved node name 'pve200' to '10.11.7.200' for default node IP address
[database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x1724090186) vs. B:(inode = 0x0000000007706220, parent = 0x00000000077039C7, v./mtime = 0x7706220/0x1724092847)
[database] crit: DB load failed
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
[main] notice: exit proxmox configuration filesystem (-1)



I have a compressed copy of the database from a node that I added to the cluster a few days ago and then removed. Could that database be useful to me?

Excuse my English.
 

It depends on how you "removed it": if the node itself was simply turned off, then yes, that would recover your nodes to the state from the time of that "backup".

The other option is to literally go and check the database entries (you have duplicates) and manually remove them with SQL, as per the above-linked thread. Keep a backup of the (now corrupted) file in case you cause more damage.

I just find it very strange that all the nodes (how many?) have a corrupt config.db in the same way...
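
As for verifying a given copy, a minimal inspection sketch, assuming the sqlite3 CLI is available (the tree table with its inode/parent/name/mtime columns is what pmxcfs stores):

sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
# list entries that share the same name under the same parent (the pmxcfs complaint):
sqlite3 /var/lib/pve-cluster/config.db 'SELECT parent, name, COUNT(*) FROM tree GROUP BY parent, name HAVING COUNT(*) > 1'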
 
Checking for duplicate entries gives me this:
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: found entry with duplicate name 'qemu-server' - A:(inode = 0x00000000077039C8, parent = 0x00000000077039C7, v./mtime = 0x77039C8/0x17>
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: DB load failed
Aug 20 09:03:26 pve200 pmxcfs[35342]: [database] crit: DB load failed
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:03:26 pve200 pmxcfs[35342]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:03:26 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 09:03:26 pve200 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 09:03:26 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:03:26 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
root@pve200:/var/lib/pve-cluster# sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 0x00000000077039C7'
124795336|1724090186|qemu-server
124801839|1724091865|pve5
124801840|1724092779|lxc
124801841|1724092779|pve-ssl.key
124801845|1724092779|pve-ssl.pem
124801847|1724092779|priv
124801848|1724092779|ssh_known_hosts
124801850|1724092779|openvz
124805664|1724092847|qemu-server
124805843|1724092891|lrm_status


Now how do I delete that duplicate entry?
 

Something like:

sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE inode = XXX'

You basically take your pick as to which of the two qemu-server entries should go; you could go for the one with the lower mtime.
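
Applied to the listing above, where the duplicate qemu-server rows are inode 124795336 (mtime 1724090186) and inode 124805664 (mtime 1724092847), a sketch that drops the one with the lower mtime would be:

cp /var/lib/pve-cluster/config.db /root/config.db.bak   # backup before touching it
sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE inode = 124795336'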
 
After deleting the entry, I restarted the service; now it gives me the following:

Aug 20 09:20:59 pve200 pmxcfs[2380]: fuse: mountpoint is not empty
Aug 20 09:20:59 pve200 pmxcfs[2380]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] crit: fuse_mount error: File exists
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] crit: fuse_mount error: File exists
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:20:59 pve200 pmxcfs[2380]: [main] notice: exit proxmox configuration filesystem (-1)
Aug 20 09:20:59 pve200 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Aug 20 09:20:59 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:20:59 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Aug 20 09:21:00 pve200 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Aug 20 09:21:00 pve200 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Aug 20 09:21:00 pve200 systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem
 

I think you had previously created your SSL certs at that location while it was unmounted, so you need to literally empty the /etc/pve path.
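
A sketch of that cleanup; the mountpoint check guards against wiping a live pmxcfs mount, since the stale files should only be deleted while /etc/pve is unmounted:

mountpoint -q /etc/pve || rm -rf /etc/pve/*   # clear the leftovers only if unmounted
systemctl restart pve-cluster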
 
Thank you very much, I was able to bring it back up; 2 nodes are already online, but now I don't see them as a cluster, I see them as independent.
 

Attachments

  • Captura de pantalla de 2024-08-20 10-07-00.png
  • Captura de pantalla de 2024-08-20 10-06-40.png

I wonder if you did a reboot, since you were previously attempting pmxcfs -l as well...

If you are troubleshooting nodes that "forgot" they were in a cluster, you would need to start with pvecm status and cat /etc/corosync/corosync.conf.

There are other threads about similar cases as well. But make a backup of your config.db again. ;)
 
With pvecm status it tells me that the file does not exist:
pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

and the file /etc/corosync/corosync.conf shows me the following:

cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve200
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.11.7.200
  }
  node {
    name: pve201
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.11.7.201
  }
  node {
    name: pve202
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.11.7.202
  }
  node {
    name: pve203
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.11.7.203
  }
  node {
    name: pve204
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.11.7.204
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Zumaseguros
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

 
Has there been a reboot since?

Is systemctl status corosync showing it active?

You may want to try copying /etc/corosync/corosync.conf to /etc/pve/corosync.conf and then systemctl restart pve-cluster.
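
A sketch of that, assuming /etc/pve is mounted and writable again after the DB repair:

cp /etc/corosync/corosync.conf /etc/pve/corosync.conf
systemctl restart pve-cluster
pvecm status   # should now find the config and report cluster membership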
 
I did exactly that, and it worked.
How do I remove the node that no longer belongs to the cluster from the GUI?
Thank you very much again for the help; and as you mentioned, I will keep that backup of config.db.
 
rm -rf /etc/pve/nodes/NODE-BY-NAME

But obviously, have a backup when doing these things! :) Also, you may need to refresh/clear the browser cache.

PS: Be ready to find some skeletons after this recovery of the .db file; it was in a strange state when you found it.
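
For example (the name pve5 here is only an assumption, based on the stray pve5 entry in the earlier tree listing; substitute the actual leftover directory):

cp /var/lib/pve-cluster/config.db /root/config.db.bak   # backup first, as above
ls /etc/pve/nodes/                                      # identify the leftover node directory
rm -rf /etc/pve/nodes/pve5                              # hypothetical name, for illustration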
 
  • Like
Reactions: sombra3405
If you want to help others with the same issue, you can change the thread title tag to "solved". The option is available via "edit thread" at the top right, then to the left of the title.

Have fun!
 
