Urgent, need your help - all PVE nodes offline

relink

Hey,

Here is the most urgent problem I've ever had.

I need your help!

My Proxmox / Ceph cluster had been running with 11 nodes for nearly a year without problems.
I had to move the whole cluster due to a lack of power resources.

Every node is connected to two backbone switches with 2x 25 Gbit/s LACP (Mikrotik CS520).
I rebuilt the cluster at the destination and started it up - everything was fine.
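For context, on a PVE node such an LACP bond is typically defined in /etc/network/interfaces roughly like the sketch below (interface names and addresses are placeholders, not my exact config):

Code:
# /etc/network/interfaces (excerpt) - LACP bond of two 25G ports plus the bridge on top
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.101.51/24
        gateway 192.168.101.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0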

I added 3 more nodes and had to upgrade the existing nodes of the cluster due to a Ceph Reef version mismatch.

I went for a smoke, came back, and suddenly the mess began: all nodes grayed out, offline.

I was struggling and scared - but the problem disappeared as fast as it came.

Then it suddenly came back - and now the nodes can't talk to each other anymore.
I once had a huge problem with a firmware bug in the Mikrotik switches, where the connection was lost every second - they fixed it, and it had been running stable ever since.

Now I have no idea where the problem comes from - but what I do know is that I only have 24 hours left to fix this issue.

Is it due to a version mismatch between the PVE nodes? (I ran: 1. apt update, 2. apt dist-upgrade)

Is it due to LACP for Corosync? (It has been no issue at all for 1 year.)

An AI told me to measure packet loss - there is none, and network latency is about 0.1 - 0.5 ms.
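For reference, this is roughly how the versions can be compared across all nodes (a sketch only; run from any node that can SSH to the others):

Code:
# compare PVE, kernel and corosync versions on every node
for h in pve-11 pve-12 pve-13 pve-14 pve-21 pve-22 pve-24 pve-25 pve-26 pve-28; do
    echo "== $h =="
    ssh root@$h 'pveversion; uname -r; corosync -v | head -n 1'
done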

A few more details:
- VLANs 104 and 105 with separate networks carry the Ceph public and internal traffic - they shouldn't be involved. The router no longer exists after the migration, but Ceph communication is still no issue since that traffic stays inside those networks.

- Because of confusion and bad experience with Mikrotik's firmware (even though no changes had been made), I upgraded the firmware of the backbone switches and of the management access switches - with no positive result.

- The machines came back, but now they stay offline.

Code:
root@pve-21:~# pvecm status

Cluster information
-------------------
Name:             pmox-cluster-01
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 12 23:19:41 2025
Quorum provider:  corosync_votequorum
Nodes:            10
Node ID:          0x00000001
Ring ID:          1.1afa5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   12
Highest expected: 12
Total votes:      10
Quorum:           7
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.101.51 (local)
0x00000002          1 192.168.101.58
0x00000003          1 192.168.101.54
0x00000004          1 192.168.101.44
0x00000005          1 192.168.101.41
0x00000006          1 192.168.101.42
0x00000007          1 192.168.101.43
0x00000008          1 192.168.101.52
0x0000000a          1 192.168.101.55
0x0000000b          1 192.168.101.56

Code:
root@pve-21:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve-21 (local)
         2          1 pve-28
         3          1 pve-24
         4          1 pve-14
         5          1 pve-11
         6          1 pve-12
         7          1 pve-13
         8          1 pve-22
        10          1 pve-25
        11          1 pve-26

Code:
journalctl -xe

Jul 12 23:19:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:19:22 pve-21 pvescheduler[128229]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:19:28 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:19:37 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:19:44 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:19:44 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1be7d) was formed. Members
Jul 12 23:19:51 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:00 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:03 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:04 pve-21 pvedaemon[5568]: <root@pam> successful auth for user 'root@pam'
Jul 12 23:20:04 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 10
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 20
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:07 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:07 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1be91) was formed. Members
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 30
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 40
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 50
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 60
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 70
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 80
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:12 pve-21 pvescheduler[128491]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 90
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:14 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 100
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retried 100 times
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] crit: cpg_send_message failed: 6
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:22 pve-21 pvescheduler[128490]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:20:23 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:30 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:30 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1bea5) was formed. Members
Jul 12 23:20:37 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7126 ms
Jul 12 23:20:46 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15980 ms




Code:
root@pve-21:~# systemctl status pve-cluster

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Sat 2025-07-12 17:25:02 CEST; 5h 56min ago
   Main PID: 4294 (pmxcfs)
      Tasks: 13 (limit: 629145)
     Memory: 65.7M
        CPU: 21.798s
     CGroup: /system.slice/pve-cluster.service
             └─4294 /usr/bin/pmxcfs

Jul 12 23:21:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:21:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:21:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:21:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:21:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:21:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:21:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6


Code:
root@pve-21:~# systemctl status pvestatd

● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2025-07-12 17:25:03 CEST; 5h 57min ago
   Main PID: 5069 (pvestatd)
      Tasks: 1 (limit: 629145)
     Memory: 124.6M
        CPU: 2min 3.009s
     CGroup: /system.slice/pvestatd.service
             └─5069 pvestatd

Jul 12 21:49:13 pve-21 pvestatd[5069]: status update time (1718.559 seconds)
Jul 12 21:49:16 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: status update time (6.432 seconds)
Jul 12 22:07:15 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: status update time (1075.514 seconds)
Jul 12 22:07:21 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:24 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:25 pve-21 pvestatd[5069]: status update time (6.348 seconds)

(192.168.101.91 is the PBS VM)

Code:
journalctl -u pvestatd

Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 22:51:32 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Consumed 53.283s CPU time.
-- Boot 609bb646465347b8908ca6077ec4436e --
Sep 12 22:55:36 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 22:55:37 pve21 pvestatd[2734]: starting server
Sep 12 22:55:37 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:01:49 pve21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:01:49 pve21 pvestatd[2734]: received signal TERM
Sep 12 23:01:49 pve21 pvestatd[2734]: server closing
Sep 12 23:01:49 pve21 pvestatd[2734]: server stopped
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:01:50 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Consumed 2.593s CPU time.
-- Boot 798d664734a34ba7a1e62e52b826def9 --
Sep 12 23:05:41 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: Unable to load access control list: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Sep 12 23:05:41 pve21 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
-- Boot 7e0205044735402d8cef9279ce5118bb --
Sep 12 23:12:15 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:12:15 pve21 pvestatd[2823]: starting server
Sep 12 23:12:15 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:18:29 pve-21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:18:30 pve-21 pvestatd[2823]: received signal TERM
Sep 12 23:18:30 pve-21 pvestatd[2823]: server closing
Sep 12 23:18:30 pve-21 pvestatd[2823]: server stopped
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:18:31 pve-21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Consumed 2.654s CPU time.
-- Boot 37cc08deefcf41698108170c20e377dd --
Sep 12 23:22:22 pve-21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:22:23 pve-21 pvestatd[2738]: starting server
Sep 12 23:22:23 pve-21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 13 21:56:33 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 14 11:27:43 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 not running
Sep 14 13:44:45 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - client closed connection
Sep 14 16:17:39 pve-21 pvestatd[2738]: status update time (5.967 seconds)
Sep 14 16:25:23 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 not running
Sep 14 16:25:41 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:41 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:25:51 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:52 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:26:01 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:26:01 pve-21 pvestatd[2738]: status update time (8.185 seconds)
Sep 14 19:56:14 pve-21 pvestatd[2738]: VM 105 qmp command failed - unable to open monitor socket
Sep 14 21:56:34 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 15 11:52:33 pve-21 pvestatd[2738]: status update time (49.481 seconds)
 
Code:
root@pve-21:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-11
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.101.41
  }
  node {
    name: pve-12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.101.42
  }
  node {
    name: pve-13
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.101.43
  }
  node {
    name: pve-14
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.101.44
  }
  node {
    name: pve-21
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.101.51
  }
  node {
    name: pve-22
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.101.52
  }
  node {
    name: pve-23
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.101.53
  }
  node {
    name: pve-24
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.101.54
  }
  node {
    name: pve-25
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.101.55
  }
  node {
    name: pve-26
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.101.56
  }
  node {
    name: pve-27
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.101.57
  }
  node {
    name: pve-28
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.101.58
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmox-cluster-01
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Hey Steve,

Thank you for your reply!

Yes, I have plenty of spare 1 Gbit/s connections available on each node - I've heard that using them is best practice, so maybe not doing so is the reason I'm in trouble now!

I don't know how to add a second ring to Corosync, though :/

And why does it happen now, while updating the PVE nodes?

An AI says the kernel mismatch between 6.8.12-11 (newest) and 6.8.12-4 (oldest) is too big and could therefore explain my problems - is that true?
 
Okay,

It seems this was a split-brain caused by updating the PVE nodes and the version mismatch between 12-4, 12-8 and 12-11!

I ran updates on all remaining nodes via SSH (apt update, apt dist-upgrade) and rebooted them.
Slowly the machines came back.

But then I realized that one group of machines could talk to each other while missing the others - and the other group was online by itself while missing the first group.

That looked like a split brain!

Luckily I had paused Ceph for the migration (noout, nodown, pause, norebalance, nobackfill, norecover) - otherwise this would have been a massive mess...
Overnight the cluster has been stable!

Now Ceph is active again and rebalancing onto the new OSDs.
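For reference, those flags are set and cleared with the plain Ceph CLI - a sketch of the standard usage, not a copy of my shell history:

Code:
# pause Ceph activity before the move
for flag in noout nodown pause norebalance nobackfill norecover; do
    ceph osd set "$flag"
done

# ...move / upgrade the nodes...

# clear the flags again once the cluster is healthy
for flag in noout nodown pause norebalance nobackfill norecover; do
    ceph osd unset "$flag"
done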

I will implement a second Corosync ring today; that must be the source of this problem - what do you think?
 
The docs say “All nodes should have the same version.” I don’t know that kernel versions matter as much as the PVE/corosync versions do.

Using other networks for corosync would provide backup communication in case of network problems or switch failure, so it is generally a good idea.
 
Is it due to LACP for Corosync? (It has been no issue at all for 1 year.)
Do not use LACP for the corosync network: instead, use two corosync links, each over a simple NIC, and let Corosync detect link failures.
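For the second link, an excerpt of /etc/pve/corosync.conf could look roughly like the sketch below - the 192.168.102.0/24 network is purely hypothetical, a ring1_addr has to be added to every node, config_version must be bumped, and the file should be edited via a copy as the PVE docs describe:

Code:
nodelist {
  node {
    name: pve-21
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.101.51
    # second, independent network used for knet link 1
    ring1_addr: 192.168.102.51
  }
  # ...add a ring1_addr to every other node in the same way...
}

totem {
  cluster_name: pmox-cluster-01
  # must be increased with every change
  config_version: 17
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}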

It seems this was a split-brain caused by updating the PVE nodes and the version mismatch between 12-4, 12-8 and 12-11!
I doubt it. I've never had an issue with corosync related to the kernel version.

I will implement a second Corosync ring today; that must be the source of this problem - what do you think?
Impossible to ascertain from the commands you posted. If this ever happens again, you will need to run this on each and every node:

Code:
cat /etc/hosts
ip address
pvecm status
corosync-cfgtool -n
corosync-cfgtool -s
cat /proc/net/bonding/bond*