[SOLVED] Lost connection with ceph cluster

sienaimaging

New Member
Mar 21, 2025
Hi folks,

I just lost communication between some Proxmox nodes and the "storage system", both Ceph and local LVM.

But I cannot understand what is going on...

Could you please help me investigate?
 
How many corosync links do you have?

Giving your logs a peek may point you in the right direction.

Code:
/var/log/ceph/
and
Code:
journalctl -u corosync
journalctl --since "10 hours ago"
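If the nodes can still reach the Ceph monitors at all, a quick health overview is also worth a look (to be run on any node that has the cluster's ceph.conf and an admin keyring):
Code:
ceph -s
ceph health detail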
 
It seems that something went wrong with the network link (maybe)?

The corosync connection uses two bonded links!

Running "journalctl -r -u corosync" on some nodes gives:

Jun 05 18:38:09 aquila corosync[3431]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 18:38:09 aquila corosync[3431]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:38:09 aquila corosync[3431]: [QUORUM] Members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:38:09 aquila corosync[3431]: [TOTEM ] A new membership (1.11b9) was formed. Members joined: 2
Jun 05 18:38:09 aquila corosync[3431]: [QUORUM] Sync joined[1]: 2
Jun 05 18:38:09 aquila corosync[3431]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:38:09 aquila corosync[3431]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:38:09 aquila corosync[3431]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 18:33:55 aquila corosync[3431]: [KNET ] host: host: 2 has no active links
Jun 05 18:33:55 aquila corosync[3431]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:33:55 aquila corosync[3431]: [KNET ] link: host: 2 link: 0 is down
Jun 05 18:33:53 aquila corosync[3431]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:33:53 aquila corosync[3431]: [QUORUM] Members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:33:53 aquila corosync[3431]: [TOTEM ] A new membership (1.11b4) was formed. Members left: 2
Jun 05 18:33:53 aquila corosync[3431]: [QUORUM] Sync left[1]: 2
Jun 05 18:33:53 aquila corosync[3431]: [QUORUM] Sync members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:33:52 aquila corosync[3431]: [CFG ] Node 2 was shut down by sysadmin
Jun 05 18:03:26 aquila corosync[3431]: [TOTEM ] Retransmit List: 49abc9
Jun 05 18:03:26 aquila corosync[3431]: [TOTEM ] Retransmit List: 49abc4
Jun 05 18:03:26 aquila corosync[3431]: [TOTEM ] Retransmit List: 49abc1
Jun 05 16:07:48 aquila corosync[3431]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 16:07:48 aquila corosync[3431]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 16:07:48 aquila corosync[3431]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 16:07:48 aquila corosync[3431]: [KNET ] rx: host: 1 link: 0 is up
Jun 05 16:07:45 aquila corosync[3431]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 16:07:44 aquila corosync[3431]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 16:07:44 aquila corosync[3431]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] host: host: 1 has no active links
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] host: host: 2 has no active links
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] link: host: 1 link: 0 is down
Jun 05 16:07:43 aquila corosync[3431]: [KNET ] link: host: 2 link: 0 is down
Jun 05 16:07:43 aquila corosync[3431]: [TOTEM ] Retransmit List: 485f7c
Jun 05 16:07:23 aquila corosync[3431]: [TOTEM ] Retransmit List: 485ea1
Jun 05 16:05:52 aquila corosync[3431]: [TOTEM ] Retransmit List: 485a5e
Jun 05 03:02:45 aquila corosync[3431]: [TOTEM ] Retransmit List: 3f55dc



Jun 05 20:48:37 pve-02 corosync[2332]: [TOTEM ] Retransmit List: 17e22
Jun 05 20:11:58 pve-02 corosync[2332]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 20:11:58 pve-02 corosync[2332]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Jun 05 20:11:58 pve-02 corosync[2332]: [KNET ] link: Resetting MTU for link 0 because host 8 joined
Jun 05 20:11:55 pve-02 corosync[2332]: [KNET ] host: host: 8 has no active links
Jun 05 20:11:55 pve-02 corosync[2332]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Jun 05 20:11:55 pve-02 corosync[2332]: [KNET ] link: host: 8 link: 0 is down
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 9 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jun 05 18:38:10 pve-02 corosync[2332]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:38:10 pve-02 corosync[2332]: [QUORUM] Members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:38:10 pve-02 corosync[2332]: [QUORUM] This node is within the primary component and will provide service.
Jun 05 18:38:10 pve-02 corosync[2332]: [TOTEM ] A new membership (1.11b9) was formed. Members joined: 1 3 4 5 6 7 8 9 10
Jun 05 18:38:10 pve-02 corosync[2332]: [QUORUM] Sync joined[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:38:10 pve-02 corosync[2332]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] rx: host: 4 link: 0 is up
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] rx: host: 1 link: 0 is up
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 18:38:10 pve-02 corosync[2332]: [KNET ] rx: host: 3 link: 0 is up





Jun 05 21:42:31 chiocciola corosync[2415]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 21:42:31 chiocciola corosync[2415]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 21:42:31 chiocciola corosync[2415]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 21:42:30 chiocciola corosync[2415]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 21:42:30 chiocciola corosync[2415]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 21:42:30 chiocciola corosync[2415]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] host: host: 2 has no active links
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] host: host: 3 has no active links
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] link: host: 2 link: 0 is down
Jun 05 21:42:29 chiocciola corosync[2415]: [KNET ] link: host: 3 link: 0 is down
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] host: host: 1 has no active links
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 20:47:46 chiocciola corosync[2415]: [KNET ] link: host: 1 link: 0 is down
Jun 05 18:37:19 chiocciola corosync[2415]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 18:37:19 chiocciola corosync[2415]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:37:19 chiocciola corosync[2415]: [QUORUM] Members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:37:19 chiocciola corosync[2415]: [TOTEM ] A new membership (1.11b9) was formed. Members joined: 2
Jun 05 18:37:19 chiocciola corosync[2415]: [QUORUM] Sync joined[1]: 2
Jun 05 18:37:19 chiocciola corosync[2415]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:37:19 chiocciola corosync[2415]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:37:19 chiocciola corosync[2415]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 18:37:19 chiocciola corosync[2415]: [KNET ] rx: host: 2 link: 0 is up
Jun 05 18:33:04 chiocciola corosync[2415]: [KNET ] host: host: 2 has no active links
Jun 05 18:33:04 chiocciola corosync[2415]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:33:04 chiocciola corosync[2415]: [KNET ] link: host: 2 link: 0 is down
Jun 05 18:33:02 chiocciola corosync[2415]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:33:02 chiocciola corosync[2415]: [QUORUM] Members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:33:02 chiocciola corosync[2415]: [TOTEM ] A new membership (1.11b4) was formed. Members left: 2
Jun 05 18:33:02 chiocciola corosync[2415]: [QUORUM] Sync left[1]: 2
Jun 05 18:33:02 chiocciola corosync[2415]: [QUORUM] Sync members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:33:02 chiocciola corosync[2415]: [CFG ] Node 2 was shut down by sysadmin
Jun 05 18:24:17 chiocciola corosync[2415]: [KNET ] pmtud: Global data MTU changed to: 1397





Jun 05 18:36:34 bruco corosync[2489]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:36:34 bruco corosync[2489]: [QUORUM] Members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:36:34 bruco corosync[2489]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 18:36:34 bruco corosync[2489]: [TOTEM ] A new membership (1.11b9) was formed. Members joined: 2
Jun 05 18:36:34 bruco corosync[2489]: [QUORUM] Sync joined[1]: 2
Jun 05 18:36:34 bruco corosync[2489]: [QUORUM] Sync members[10]: 1 2 3 4 5 6 7 8 9 10
Jun 05 18:36:34 bruco corosync[2489]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:36:34 bruco corosync[2489]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 18:32:20 bruco corosync[2489]: [KNET ] host: host: 2 has no active links
Jun 05 18:32:20 bruco corosync[2489]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 05 18:32:20 bruco corosync[2489]: [KNET ] link: host: 2 link: 0 is down
Jun 05 18:32:17 bruco corosync[2489]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 05 18:32:17 bruco corosync[2489]: [QUORUM] Members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:32:17 bruco corosync[2489]: [TOTEM ] A new membership (1.11b4) was formed. Members left: 2
Jun 05 18:32:17 bruco corosync[2489]: [QUORUM] Sync left[1]: 2
Jun 05 18:32:17 bruco corosync[2489]: [QUORUM] Sync members[9]: 1 3 4 5 6 7 8 9 10
Jun 05 18:32:17 bruco corosync[2489]: [CFG ] Node 2 was shut down by sysadmin
Jun 05 18:01:52 bruco corosync[2489]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 18:01:51 bruco corosync[2489]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 18:01:51 bruco corosync[2489]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 18:01:50 bruco corosync[2489]: [KNET ] host: host: 3 has no active links
Jun 05 18:01:50 bruco corosync[2489]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 18:01:50 bruco corosync[2489]: [KNET ] link: host: 3 link: 0 is down
Jun 05 17:46:12 bruco corosync[2489]: [TOTEM ] Retransmit List: 497e71
Jun 05 17:40:50 bruco corosync[2489]: [TOTEM ] Retransmit List: 496fc2
Jun 05 17:34:03 bruco corosync[2489]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 05 17:34:03 bruco corosync[2489]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 17:34:03 bruco corosync[2489]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 17:34:02 bruco corosync[2489]: [KNET ] host: host: 1 has no active links
Jun 05 17:34:02 bruco corosync[2489]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 17:34:02 bruco corosync[2489]: [KNET ] link: host: 1 link: 0 is down
Jun 05 17:30:33 bruco corosync[2489]: [TOTEM ] Retransmit List: 49534d
Jun 05 17:30:33 bruco corosync[2489]: [TOTEM ] Retransmit List: 49534c
Jun 05 17:30:33 bruco corosync[2489]: [TOTEM ] Retransmit List: 49534b
Jun 05 17:30:33 bruco corosync[2489]: [TOTEM ] Retransmit List: 49534a
Jun 05 17:30:31 bruco corosync[2489]: [TOTEM ] Retransmit List: 495346
 
I've found some clocks that were not synced.

I've simplified the network by removing the second link connection,

but one node says:
# systemctl restart pve-cluster
Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xeu pve-cluster.service" for details.


Jun 06 00:40:53 civetta systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
░░ Subject: A stop job for unit pve-cluster.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit pve-cluster.service has finished.
░░
░░ The job identifier is 80206 and the job result is done.
Jun 06 00:40:53 civetta systemd[1]: pve-cluster.service: Start request repeated too quickly.
Jun 06 00:40:53 civetta systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit pve-cluster.service has entered the 'failed' state with result 'exit-code'.
Jun 06 00:40:53 civetta systemd[1]: Failed to start pve-cluster.service - The Proxmox VE cluster filesystem.
░░ Subject: A start job for unit pve-cluster.service has failed
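
In case it helps anyone hitting the same "Start request repeated too quickly" error: once corosync is healthy again, clearing systemd's start limit and retrying should be roughly this (a sketch, not specific to my setup):
Code:
systemctl restart corosync
systemctl reset-failed pve-cluster
systemctl start pve-cluster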


Now the local storage and the internal Ceph seem to work.

But I also have an external Ceph RBD for testing... the external Ceph status is HEALTH_OK, but I cannot see the content of the pool.
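
To rule out the GUI, listing the storage content directly from a node should also work; something along these lines (the storage ID is a placeholder for my external RBD storage):
Code:
pvesm status
pvesm list <external-rbd-storage-id>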

(screenshot attached)
 
From the logs, you have significant network issues, at least over your corosync links. This is likely causing your cluster to lose quorum.

If possible, it is recommended to have multiple corosync links, with one preferably over a dedicated switch.

I would test connectivity between all nodes on the corosync link.

If you have HA enabled, I would disable that until you can confirm corosync is stable.

Check pvecm status to see the quorum status.

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_quorum
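
For example, something along these lines should give a quick picture from each node (the peer address is a placeholder for another node's corosync IP):
Code:
pvecm status
corosync-cfgtool -s
ping -c 100 -i 0.2 <peer-corosync-ip>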
 
Thank you for your reply.

The quorum is 'quorate'.

Using tcpdump and Wireshark, I can see about 30% of the traffic flagged as TCP retransmissions, both on the Proxmox cluster network and between Proxmox and the external Ceph cluster.
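
For reference, a rough way to get that ratio from a capture file (the file name is a placeholder):
Code:
tshark -r capture.pcap -Y "tcp.analysis.retransmission" | wc -l
tshark -r capture.pcap -Y "tcp" | wc -l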

The network bond is mode 6 (balance-alb) with two wires connected to two different switches.

I restarted both Proxmox and Ceph, and at the moment, with only one NIC, the traffic has returned to normal.

The problem started yesterday, about two hours after rebalancing began, following the addition of some NVMe drives to the external Ceph cluster.
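
For what it's worth, the active bond mode and the state of each slave can be confirmed with (bond0 is a placeholder for the bond's name):
Code:
cat /proc/net/bonding/bond0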
 
I am under the impression that running corosync over a bond is not recommended.
Two separate links would give lower latency and let corosync fail over between them on its own.


You may not see any issues until the network gets noisy.
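
As a rough sketch, a node entry in /etc/pve/corosync.conf with two separate links could look like this (names and addresses are placeholders, not taken from this cluster):
Code:
node {
  name: pve-02
  nodeid: 2
  quorum_votes: 1
  ring0_addr: 10.10.10.2   # link 0: dedicated corosync network
  ring1_addr: 10.10.20.2   # link 1: second network, used as fallback
}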
 
OK, thank you for the update.
Now that all the VMs are up, I can see the 30% retransmission again (even with one of the two NICs of the bond switched off to narrow down the problem; the cluster is still fully functional and stable).
I need to test what the best setup is.
Currently, I have one connection for management (admin connections for the IT test team) and a bond of two 25 Gb NICs connected to two different switches in balance-alb.
Several VLANs have been set up on this bond for corosync, the Ceph public network, the Ceph internal network, and a separate VLAN for the public network of the external Ceph cluster. This second, external cluster has its own network for OSD sync.
 
If feasible, keep at least one standalone Corosync link on dedicated hardware. Your backup Corosync links can share connections with other traffic, but the primary link should stay on its own hardware. And always configure more than one Corosync link.
 