Urgent, need your help - all PVE nodes offline

relink

Hey,

Here is the most urgent problem I've ever had.

I need your help!

My Proxmox / Ceph cluster had been running with 11 nodes for nearly a year without problems.
I had to move the whole cluster due to a lack of power resources.

Every node is connected to two backbone switches with 2x 25 Gbit/s LACP (Mikrotik CS520).
I rebuilt the cluster at the destination and started it up - everything was fine.
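For context, on a PVE node such an LACP bond is typically defined in /etc/network/interfaces roughly like the sketch below (interface names and addresses are placeholders, not my exact config):

Code:
# /etc/network/interfaces (excerpt) - LACP bond of two 25G ports plus the bridge on top
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.101.51/24
        gateway 192.168.101.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0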

I added 3 more nodes and had to upgrade the existing nodes of the cluster due to a Ceph Reef version mismatch.

I went for a smoke, came back, and suddenly the mess began: all nodes grayed out, offline.

I was struggling and scared - but the problem disappeared as fast as it came.

Then it suddenly came back - and now the nodes can't talk to each other anymore.
I once had a huge problem with a firmware bug in the Mikrotik switches, where the connection was lost every second - they fixed it, and it had been running stable ever since.

Now I have no idea where the problem comes from - but what I do know is that I only have 24 hours left to fix this issue.

Is it due to a version mismatch between the PVE nodes? (I ran: 1. apt update, 2. apt dist-upgrade)

Is it due to LACP for Corosync? (It has been no issue at all for 1 year.)

An AI told me to measure packet loss - there is none, and network latency is about 0.1 - 0.5 ms.
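For reference, this is roughly how the versions can be compared across all nodes (a sketch only; run from any node that can SSH to the others):

Code:
# compare PVE, kernel and corosync versions on every node
for h in pve-11 pve-12 pve-13 pve-14 pve-21 pve-22 pve-24 pve-25 pve-26 pve-28; do
    echo "== $h =="
    ssh root@$h 'pveversion; uname -r; corosync -v | head -n 1'
done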

A few more details:
- VLANs 104 and 105 with separate networks carry the Ceph public and internal traffic - they shouldn't be involved. The router no longer exists after the migration, but Ceph communication is still no issue since that traffic stays inside those networks.

- Because of confusion and bad experience with Mikrotik's firmware (even though no changes had been made), I upgraded the firmware of the backbone switches and of the management access switches - with no positive result.

- The machines came back, but now they stay offline.

Code:
root@pve-21:~# pvecm status

Cluster information
-------------------
Name:             pmox-cluster-01
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 12 23:19:41 2025
Quorum provider:  corosync_votequorum
Nodes:            10
Node ID:          0x00000001
Ring ID:          1.1afa5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   12
Highest expected: 12
Total votes:      10
Quorum:           7
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.101.51 (local)
0x00000002          1 192.168.101.58
0x00000003          1 192.168.101.54
0x00000004          1 192.168.101.44
0x00000005          1 192.168.101.41
0x00000006          1 192.168.101.42
0x00000007          1 192.168.101.43
0x00000008          1 192.168.101.52
0x0000000a          1 192.168.101.55
0x0000000b          1 192.168.101.56

Code:
root@pve-21:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve-21 (local)
         2          1 pve-28
         3          1 pve-24
         4          1 pve-14
         5          1 pve-11
         6          1 pve-12
         7          1 pve-13
         8          1 pve-22
        10          1 pve-25
        11          1 pve-26

Code:
journalctl -xe

Jul 12 23:19:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:19:22 pve-21 pvescheduler[128229]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:19:28 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:19:37 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:19:44 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:19:44 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1be7d) was formed. Members
Jul 12 23:19:51 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:00 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:03 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:04 pve-21 pvedaemon[5568]: <root@pam> successful auth for user 'root@pam'
Jul 12 23:20:04 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 10
Jul 12 23:20:05 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 20
Jul 12 23:20:06 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:07 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:07 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1be91) was formed. Members
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 30
Jul 12 23:20:07 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 40
Jul 12 23:20:08 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 50
Jul 12 23:20:09 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 60
Jul 12 23:20:10 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 70
Jul 12 23:20:11 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 80
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:12 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:12 pve-21 pvescheduler[128491]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 90
Jul 12 23:20:13 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 10
Jul 12 23:20:14 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7127 ms
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retry 100
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] notice: cpg_send_message retried 100 times
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [status] crit: cpg_send_message failed: 6
Jul 12 23:20:14 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 20
Jul 12 23:20:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:20:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:20:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:20:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:20:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:20:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:20:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:20:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6
Jul 12 23:20:22 pve-21 pvescheduler[128490]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jul 12 23:20:23 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15981 ms
Jul 12 23:20:30 pve-21 corosync[4470]:   [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 10 11
Jul 12 23:20:30 pve-21 corosync[4470]:   [TOTEM ] A new membership (1.1bea5) was formed. Members
Jul 12 23:20:37 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 7126 ms
Jul 12 23:20:46 pve-21 corosync[4470]:   [TOTEM ] Token has not been received in 15980 ms




Code:
root@pve-21:~# systemctl status pve-cluster

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Sat 2025-07-12 17:25:02 CEST; 5h 56min ago
   Main PID: 4294 (pmxcfs)
      Tasks: 13 (limit: 629145)
     Memory: 65.7M
        CPU: 21.798s
     CGroup: /system.slice/pve-cluster.service
             └─4294 /usr/bin/pmxcfs

Jul 12 23:21:15 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 30
Jul 12 23:21:16 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 40
Jul 12 23:21:17 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 50
Jul 12 23:21:18 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 60
Jul 12 23:21:19 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 70
Jul 12 23:21:20 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 80
Jul 12 23:21:21 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 90
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retry 100
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] notice: cpg_send_message retried 100 times
Jul 12 23:21:22 pve-21 pmxcfs[4294]: [dcdb] crit: cpg_send_message failed: 6


Code:
root@pve-21:~# systemctl status pvestatd

● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2025-07-12 17:25:03 CEST; 5h 57min ago
   Main PID: 5069 (pvestatd)
      Tasks: 1 (limit: 629145)
     Memory: 124.6M
        CPU: 2min 3.009s
     CGroup: /system.slice/pvestatd.service
             └─5069 pvestatd

Jul 12 21:49:13 pve-21 pvestatd[5069]: status update time (1718.559 seconds)
Jul 12 21:49:16 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 21:49:20 pve-21 pvestatd[5069]: status update time (6.432 seconds)
Jul 12 22:07:15 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:18 pve-21 pvestatd[5069]: status update time (1075.514 seconds)
Jul 12 22:07:21 pve-21 pvestatd[5069]: LZ4_01: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:24 pve-21 pvestatd[5069]: LZ4_02: error fetching datastores - 500 Can't connect to 192.168.101.91:8007 (No route to host)
Jul 12 22:07:25 pve-21 pvestatd[5069]: status update time (6.348 seconds)

(192.168.101.91 is the PBS VM)

Code:
journalctl -u pvestatd

Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 22:51:32 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 22:51:32 pve21 systemd[1]: pvestatd.service: Consumed 53.283s CPU time.
-- Boot 609bb646465347b8908ca6077ec4436e --
Sep 12 22:55:36 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 22:55:37 pve21 pvestatd[2734]: starting server
Sep 12 22:55:37 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:01:49 pve21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:01:49 pve21 pvestatd[2734]: received signal TERM
Sep 12 23:01:49 pve21 pvestatd[2734]: server closing
Sep 12 23:01:49 pve21 pvestatd[2734]: server stopped
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:01:50 pve21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:01:50 pve21 systemd[1]: pvestatd.service: Consumed 2.593s CPU time.
-- Boot 798d664734a34ba7a1e62e52b826def9 --
Sep 12 23:05:41 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: Unable to load access control list: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[1] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[2] failed: Connection refused
Sep 12 23:05:41 pve21 pvestatd[2616]: ipcc_send_rec[3] failed: Connection refused
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Sep 12 23:05:41 pve21 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Sep 12 23:05:41 pve21 systemd[1]: Failed to start pvestatd.service - PVE Status Daemon.
-- Boot 7e0205044735402d8cef9279ce5118bb --
Sep 12 23:12:15 pve21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:12:15 pve21 pvestatd[2823]: starting server
Sep 12 23:12:15 pve21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 12 23:18:29 pve-21 systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Sep 12 23:18:30 pve-21 pvestatd[2823]: received signal TERM
Sep 12 23:18:30 pve-21 pvestatd[2823]: server closing
Sep 12 23:18:30 pve-21 pvestatd[2823]: server stopped
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Deactivated successfully.
Sep 12 23:18:31 pve-21 systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Sep 12 23:18:31 pve-21 systemd[1]: pvestatd.service: Consumed 2.654s CPU time.
-- Boot 37cc08deefcf41698108170c20e377dd --
Sep 12 23:22:22 pve-21 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 12 23:22:23 pve-21 pvestatd[2738]: starting server
Sep 12 23:22:23 pve-21 systemd[1]: Started pvestatd.service - PVE Status Daemon.
Sep 13 21:56:33 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 14 11:27:43 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 not running
Sep 14 13:44:45 pve-21 pvestatd[2738]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - client closed connection
Sep 14 16:17:39 pve-21 pvestatd[2738]: status update time (5.967 seconds)
Sep 14 16:25:23 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 not running
Sep 14 16:25:41 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:41 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:25:51 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:25:52 pve-21 pvestatd[2738]: status update time (8.189 seconds)
Sep 14 16:26:01 pve-21 pvestatd[2738]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Sep 14 16:26:01 pve-21 pvestatd[2738]: status update time (8.185 seconds)
Sep 14 19:56:14 pve-21 pvestatd[2738]: VM 105 qmp command failed - unable to open monitor socket
Sep 14 21:56:34 pve-21 pvestatd[2738]: auth key pair too old, rotating..
Sep 15 11:52:33 pve-21 pvestatd[2738]: status update time (49.481 seconds)
 
Code:
root@pve-21:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-11
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.101.41
  }
  node {
    name: pve-12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.101.42
  }
  node {
    name: pve-13
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.101.43
  }
  node {
    name: pve-14
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.101.44
  }
  node {
    name: pve-21
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.101.51
  }
  node {
    name: pve-22
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.101.52
  }
  node {
    name: pve-23
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.101.53
  }
  node {
    name: pve-24
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.101.54
  }
  node {
    name: pve-25
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.101.55
  }
  node {
    name: pve-26
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.101.56
  }
  node {
    name: pve-27
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.101.57
  }
  node {
    name: pve-28
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.101.58
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmox-cluster-01
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Hey Steve,

Thank you for your reply!

Yes, I have plenty of spare 1 Gbit/s connections available on each node - I've heard that using them is best practice, so maybe not doing so is the reason I'm in trouble now!

I don't know how to add a second ring to Corosync, though :/

And why does it happen now, while updating the PVE nodes?

An AI says the kernel mismatch between 6.8.12-11 (newest) and 6.8.12-4 (oldest) is too big and could therefore explain my problems - is that true?
 
Okay,

It seems this was a split-brain caused by updating the PVE nodes and the version mismatch between 12-4, 12-8 and 12-11!

I ran updates on all remaining nodes via SSH (apt update, apt dist-upgrade) and rebooted them.
Slowly the machines came back.

But then I realized that one group of machines could talk to each other while missing the others - and the other group was online by itself while missing the first group.

That looked like a split brain!

Luckily I had paused Ceph for the migration (noout, nodown, pause, norebalance, nobackfill, norecover) - otherwise this would have been a massive mess...
Overnight the cluster has been stable!

Now Ceph is active again and rebalancing onto the new OSDs.
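For reference, those flags are set and cleared with the plain Ceph CLI - a sketch of the standard usage, not a copy of my shell history:

Code:
# pause Ceph activity before the move
for flag in noout nodown pause norebalance nobackfill norecover; do
    ceph osd set "$flag"
done

# ...move / upgrade the nodes...

# clear the flags again once the cluster is healthy
for flag in noout nodown pause norebalance nobackfill norecover; do
    ceph osd unset "$flag"
done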

I will implement a second Corosync ring today; that must be the source of this problem - what do you think?
 
The docs say “All nodes should have the same version.” I don’t know that kernel versions matter as much as the PVE/corosync versions do.

Using other networks for corosync would provide backup communication in case of network problems or switch failure, so it is generally a good idea.
 
Is it due to LACP for Corosync? (It has been no issue at all for 1 year.)
Do not use LACP for the corosync network: instead, use two corosync links, each over a simple NIC, and let Corosync detect link failures.
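For the second link, an excerpt of /etc/pve/corosync.conf could look roughly like the sketch below - the 192.168.102.0/24 network is purely hypothetical, a ring1_addr has to be added to every node, config_version must be bumped, and the file should be edited via a copy as the PVE docs describe:

Code:
nodelist {
  node {
    name: pve-21
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.101.51
    # second, independent network used for knet link 1
    ring1_addr: 192.168.102.51
  }
  # ...add a ring1_addr to every other node in the same way...
}

totem {
  cluster_name: pmox-cluster-01
  # must be increased with every change
  config_version: 17
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}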

It seems this was a split-brain caused by updating the PVE nodes and the version mismatch between 12-4, 12-8 and 12-11!
I doubt it. I've never had an issue with corosync related to the kernel version.

I will implement a second Corosync ring today; that must be the source of this problem - what do you think?
Impossible to ascertain from the commands you posted. If this ever happens again, you will need to run this on each and every node:

Code:
cat /etc/hosts
ip address
pvecm status
corosync-cfgtool -n
corosync-cfgtool -s
cat /proc/net/bonding/bond*