Hey,
Here is the most urgent problem I've ever had.
I need your help!
My Proxmox VE / Ceph cluster had been running with 11 nodes for nearly a year without any problems.
I had to move the whole cluster because of a lack of power resources.
Every node is connected to two backbone switches with 2x25 Gbit/s LACP (MikroTik CS520).
I rebuilt the cluster at the destination and started it up - everything was fine.
Then I added 3 more nodes and had to upgrade the existing nodes because of a Ceph Reef version mismatch.
I went for a smoke, came back, and suddenly the mess began: all nodes grayed out, offline.
I was struggling and scared - but the problem disappeared as fast as it came.
Then it came back - and now the nodes can't talk to each other anymore.
I once had a huge problem with a firmware bug in the MikroTik switches, where the connection was lost every second - MikroTik fixed it, and the cluster had been running stable ever since.
Now I have no idea where the problem comes from - but what I do know is that I only have 24 hours left to fix this.
Is it due to a version mismatch between the PVE nodes? (All I did was: 1. apt update, 2. apt dist-upgrade.)
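To rule that out, I'd compare the relevant package versions on every node. This is only a sketch of what I have in mind (node names are the ones from my cluster, assuming they resolve via SSH; the grep pattern is just my guess at the interesting packages):
Code:
# compare the packages that matter for clustering across all nodes
for n in pve-11 pve-12 pve-13 pve-14 pve-21 pve-22 pve-24 pve-25 pve-26 pve-28; do
    echo "== $n =="
    ssh "$n" "pveversion -v | grep -E 'pve-manager|corosync|libknet'"
done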
Is it due to running corosync over the LACP bond? (That has not been an issue at all for a year.)
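If the LACP bond really is the culprit, my plan B would be a second, dedicated corosync link on a spare NIC. This is only a rough sketch of what I think the entries in /etc/pve/corosync.conf would look like; the 10.10.10.x subnet is a placeholder, and config_version would have to be bumped:
Code:
nodelist {
  node {
    name: pve-21
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.101.51
    ring1_addr: 10.10.10.51     # placeholder address on a spare, non-LACP NIC
  }
  # ...one entry per node, each with its own ring1_addr...
}
totem {
  # existing options stay untouched; config_version must be incremented (16 -> 17)
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1               # activates the second knet link
  }
}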
I asked an AI, which told me to measure packet loss - there is none, and network latency is around 0.1-0.5 ms.
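For reference, this is roughly how I'd re-verify those numbers; I assume corosync-cfgtool -s is the right way to see the link state that corosync/knet itself reports:
Code:
# packet loss / latency on the corosync network (pve-28 as an example target)
ping -q -c 500 -i 0.01 192.168.101.58

# link and connectivity state as seen by corosync/knet on the local node
corosync-cfgtool -s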
A few more details:
- VLAN 104 and 105 carry separate networks for Ceph public and cluster (internal) traffic. They shouldn't be involved; the router for those networks no longer exists after the migration, but Ceph communication is unaffected since that traffic stays inside the VLANs (see the quick check sketched right after this list).
- Out of confusion and bad experience with MikroTik firmware (even though no changes had been made), I upgraded the firmware of the backbone switches and of the management access switches - no improvement.
- The machines came back for a while, but now they stay offline.
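Here is the quick check I'd use to confirm Ceph really is unaffected and still bound to the VLAN 104/105 networks (just my assumption of where to look, since the config lives in /etc/pve/ceph.conf on a PVE-managed Ceph):
Code:
# overall ceph health
ceph -s
# which subnets ceph is configured to use for public and cluster traffic
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf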
Code:
root@pve-21:~# pvecm status
Cluster information
-------------------
Name:             pmox-cluster-01
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 12 23:19:41 2025
Quorum provider:  corosync_votequorum
Nodes:            10
Node ID:          0x00000001
Ring ID:          1.1afa5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   12
Highest expected: 12
Total votes:      10
Quorum:           7
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.101.51 (local)
0x00000002          1 192.168.101.58
0x00000003          1 192.168.101.54
0x00000004          1 192.168.101.44
0x00000005          1 192.168.101.41
0x00000006          1 192.168.101.42
0x00000007          1 192.168.101.43
0x00000008          1 192.168.101.52
0x0000000a          1 192.168.101.55
0x0000000b          1 192.168.101.56
Code:
root@pve-21:~# pvecm nodes
Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve-21 (local)
         2          1 pve-28
         3          1 pve-24
         4          1 pve-14
         5          1 pve-11
         6          1 pve-12
         7          1 pve-13
         8          1 pve-22
        10          1 pve-25
        11          1 pve-26
Code:
journalctl -xe
Jul 12 23:19:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:19:22 pve-21 pvescheduler[128229]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:19:28 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:19:37 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:19:44 pve-21 corosync[4470]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:19:44 pve-21 corosync[4470]: [TOTEM ] A new membership (1.1be7d) was formed. Members
Jul 12 23:19:51 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:00 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:03 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:04 pve-21 pvedaemon[5568]: <root@pam> successful auth for user 'root@pam'
Jul 12 23:20:04 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 10
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 20
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:07 pve-21 corosync[4470]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:07 pve-21 corosync[4470]: [TOTEM ] A new membership (1.1be91) was formed. Members
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 30
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 40
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 50
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 60
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 70
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 80
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:12 pve-21 pvescheduler[128491]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 90
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:14 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 100
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retried 100 times
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] crit: cpg_send_message failed: 6
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:22 pve-21 pvescheduler[128490]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:20:23 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:30 pve-21 corosync[4470]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:30 pve-21 corosync[4470]: [TOTEM ] A new membership (1.1bea5) was formed. Members
Jul 12 23:20:37 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 7126 ms
Jul 12 23:20:46 pve-21 corosync[4470]: [TOTEM ] Token has not been received in 15980 ms
Code:
root@pve-21:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Sat 2025-07-12 17:25:02 CEST; 5h 56min ago
Main PID: 4294 (pmxcfs)
Tasks: 13 (limit: 629145)
Memory: 65.7M
CPU: 21.798s
CGroup: /system.slice/pve-cluster.service
└─4294 /usr/bin/pmxcfs
Jul 12 23:21:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:21:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:21:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:21:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:21:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:21:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:21:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Code:
root@pve-21:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Sat 2025-07-12 17:25:03 CEST; 5h 57min ago
Main PID: 5069 (pvestatd)
Tasks: 1 (limit: 629145)
Memory: 124.6M
CPU: 2min 3.009s
CGroup: /system.slice/pvestatd.service
└─5069 pvestatd
Jul 12 21:49:13 pve-21 pvestatd[5069]: status update time (1718.559 seconds)
Jul 12 21:49:16 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: status update time (6.432 seconds)
Jul 12 22:07:15 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: status update time (1075.514 seconds)
Jul 12 22:07:21 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:24 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:25 pve-21 pvestatd[5069]: status update time (6.348 seconds)
(192.168.101.91 is PBS VM)
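Side note on the PBS errors above: before suspecting PBS itself, I'd first rule out plain reachability of that VM from the node, e.g. (nc from netcat-openbsd, if installed):
Code:
# basic reachability of the PBS VM and its API port (8007) from pve-21
ping -c 3 192.168.101.91
nc -vz -w 3 192.168.101.91 8007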
Code:
journalctl -u pvestatd
Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 22:51:32 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Consumed 53.283s CPU time.
-- Boot 609bb646465347b8908ca6077ec4436e --
Sep 12 22:55:36 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 22:55:37 pve21 pvestatd[2734]: starting server
Sep 12 22:55:37 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:01:49 pve21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:01:49 pve21 pvestatd[2734]: received signal TERM
Sep 12 23:01:49 pve21 pvestatd[2734]: server closing
Sep 12 23:01:49 pve21 pvestatd[2734]: server stopped
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:01:50 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Consumed 2.593s CPU time.
-- Boot 798d664734a34ba7a1e62e52b826def9 --
Sep 12 23:05:41 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: Unable to load access control list: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Sep 12 23:05:41 pve21 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
-- Boot 7e0205044735402d8cef9279ce5118bb --
Sep 12 23:12:15 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:12:15 pve21 pvestatd[2823]: starting server
Sep 12 23:12:15 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:18:29 pve-21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:18:30 pve-21 pvestatd[2823]: received signal TERM
Sep 12 23:18:30 pve-21 pvestatd[2823]: server closing
Sep 12 23:18:30 pve-21 pvestatd[2823]: server stopped
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:18:31 pve-21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Consumed 2.654s CPU time.
-- Boot 37cc08deefcf41698108170c20e377dd --
Sep 12 23:22:22 pve-21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:22:23 pve-21 pvestatd[2738]: starting server
Sep 12 23:22:23 pve-21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 13 21:56:33 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 14 11:27:43 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 not running
Sep 14 13:44:45 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - client closed connection
Sep 14 16:17:39 pve-21 pvestatd[2738]: status update time (5.967 seconds)
Sep 14 16:25:23 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 not running
Sep 14 16:25:41 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:41 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:25:51 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:52 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:26:01 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:26:01 pve-21 pvestatd[2738]: status update time (8.185 seconds)
Sep 14 19:56:14 pve-21 pvestatd[2738]: VM 105 qmp command failed - unable to open monitor socket
Sep 14 21:56:34 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 15 11:52:33 pve-21 pvestatd[2738]: status update time (49.481 seconds)