What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

I have been having an issue with a VM intermittently losing internet connectivity. While checking the hosts for issues, I found this:

May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] link: host: 2 link: 0 is down
May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] rx: host: 2 link: 0 is up
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 02:09:31 pve-j-dal pveproxy[3963507]: Clearing outdated entries from certificate cache
 
this means that your network (or at least the link 0 configured for corosync usage) went down for a few seconds.
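to see which network/NIC link 0 actually maps to, you can compare the addresses in the corosync config with the live link state, e.g. (output will obviously differ on your cluster):

Code:
# addresses configured per node (ring0_addr = link 0, ring1_addr = link 1, ...)
grep -E 'name|ring[0-9]+_addr' /etc/pve/corosync.conf

# current link status as corosync sees it
corosync-cfgtool -s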
 
We have a comparable issue on one of our 3-node clusters in the datacenter.
It is the "secondary" corosync link which does this every few minutes...

May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] link: host: 2 link: 1 is down
May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)

Every node has this:
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 3 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 1 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 3 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 1 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)


May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] link: host: 2 link: 1 is down
May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)


But all at different times... We use a bond with TLB (balance-tlb) on this link, which the VMs are also connected to. The bond has 2 NICs on different switches. Maybe there is an issue?
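For reference, the bond is configured roughly like this in /etc/network/interfaces (interface names and the address are placeholders, not our real ones):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode balance-tlb
    bond-miimon 100

auto vmbr1
iface vmbr1 inet static
    address 192.0.2.10/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0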
 

@itNGO it's possible that the bond is interfering with corosync - both try to monitor the link and failover after all. the link down detection in corosync/knet is pretty simple - it sends heartbeat packets over UDP, and if no reply comes back (in time) the link gets marked as down (or if sending actual data fails ;)). a few heartbeats go through -> the link is marked as up again.
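the relevant knobs live in the totem/interface section of corosync.conf if you ever need to tune the heartbeat timing; the values below are only an illustration, not the defaults (see corosync.conf(5)):

Code:
totem {
    ...
    interface {
        linknumber: 0
        knet_ping_interval: 750    # ms between heartbeat pings
        knet_ping_timeout: 1500    # ms without a pong before the link is marked down
        knet_pong_count: 2         # valid pongs needed before the link is marked up again
    }
}

on PVE you would edit /etc/pve/corosync.conf and bump config_version so the change gets distributed to all nodes.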
 
We had a power outage yesterday. For unrelated reasons, 4 of the 5 nodes did not power up successfully. As I powered them back up, I had to fix each node, which added maybe 5 minutes to each node's boot. After this, I started seeing these messages constantly on all nodes, but _only_ for the secondary ring.

This happens to be the same interface that I use for Ceph replication, and there _was_ a lot of rebalancing activity for a few days. That has since stopped and traffic on that interface is now fairly minuscule. I can reliably ping between all nodes at <1ms with no drops, yet these messages continue, with the link just constantly going up and down.

What's the best way to continue to diagnose this? If it's losing a UDP packet here and there...how do I see that? None of the interfaces seem to have any significant dropped frames and none of them are bonded, just 2 VLANs on a 10GbE interface across a single switch.
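(For reference, this is how I'm checking the drop counters; eno1 here is just a placeholder for the 10GbE interface name:)

Code:
# per-interface RX/TX counters, including dropped frames
ip -s link show dev eno1

# driver-level counters, if the NIC exposes them
ethtool -S eno1 | grep -iE 'drop|err|miss'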
 
if you use the link for corosync and for Ceph, then it's likely that the packets just arrived too slowly / with too much delay. you can take a look at the stats cmap: corosync-cmapctl -m stats
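to make the output easier to scan, you can filter for the interesting per-link counters, e.g.:

Code:
corosync-cmapctl -m stats | grep -E 'link[0-9]\.(latency|down_count|up_count)'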
 
Right, I did think that Ceph was just saturating the link, but after the rebalancing stopped I started seeing very little traffic on the 10G link... that's when I got suspicious. I'll do some digging on how to read that stats output, but if you could help me interpret it, that would be helpful:


Code:
...
stats.knet.node1.link1.connected (u8) = 1
stats.knet.node1.link1.down_count (u32) = 1
stats.knet.node1.link1.enabled (u8) = 1
stats.knet.node1.link1.latency_ave (u32) = 108
stats.knet.node1.link1.latency_max (u32) = 665876
stats.knet.node1.link1.latency_min (u32) = 108
stats.knet.node1.link1.latency_samples (u32) = 2048
stats.knet.node1.link1.mtu (u32) = 1397
stats.knet.node1.link1.rx_data_bytes (u64) = 1370
stats.knet.node1.link1.rx_data_packets (u64) = 6
stats.knet.node1.link1.rx_ping_bytes (u64) = 440388
stats.knet.node1.link1.rx_ping_packets (u64) = 16938
stats.knet.node1.link1.rx_pmtu_bytes (u64) = 1108368
stats.knet.node1.link1.rx_pmtu_packets (u64) = 1548
stats.knet.node1.link1.rx_pong_bytes (u64) = 440466
stats.knet.node1.link1.rx_pong_packets (u64) = 16941
stats.knet.node1.link1.rx_total_bytes (u64) = 1990592
stats.knet.node1.link1.rx_total_packets (u64) = 35433
stats.knet.node1.link1.rx_total_retries (u64) = 0
stats.knet.node1.link1.tx_data_bytes (u64) = 0
stats.knet.node1.link1.tx_data_errors (u32) = 0
stats.knet.node1.link1.tx_data_packets (u64) = 0
stats.knet.node1.link1.tx_data_retries (u32) = 0
stats.knet.node1.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node1.link1.tx_ping_errors (u32) = 0
stats.knet.node1.link1.tx_ping_packets (u64) = 16941
stats.knet.node1.link1.tx_ping_retries (u32) = 0
stats.knet.node1.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node1.link1.tx_pmtu_errors (u32) = 0
stats.knet.node1.link1.tx_pmtu_packets (u64) = 774
stats.knet.node1.link1.tx_pmtu_retries (u32) = 0
stats.knet.node1.link1.tx_pong_bytes (u64) = 1355040
stats.knet.node1.link1.tx_pong_errors (u32) = 0
stats.knet.node1.link1.tx_pong_packets (u64) = 16938
stats.knet.node1.link1.tx_pong_retries (u32) = 0
stats.knet.node1.link1.tx_total_bytes (u64) = 3849648
stats.knet.node1.link1.tx_total_errors (u64) = 0
stats.knet.node1.link1.tx_total_packets (u64) = 34653
stats.knet.node1.link1.up_count (u32) = 1
...
stats.knet.node2.link1.connected (u8) = 1
stats.knet.node2.link1.down_count (u32) = 1
stats.knet.node2.link1.enabled (u8) = 1
stats.knet.node2.link1.latency_ave (u32) = 125
stats.knet.node2.link1.latency_max (u32) = 1029755
stats.knet.node2.link1.latency_min (u32) = 125
stats.knet.node2.link1.latency_samples (u32) = 2048
stats.knet.node2.link1.mtu (u32) = 1397
stats.knet.node2.link1.rx_data_bytes (u64) = 0
stats.knet.node2.link1.rx_data_packets (u64) = 0
stats.knet.node2.link1.rx_ping_bytes (u64) = 440154
stats.knet.node2.link1.rx_ping_packets (u64) = 16929
stats.knet.node2.link1.rx_pmtu_bytes (u64) = 1114060
stats.knet.node2.link1.rx_pmtu_packets (u64) = 1552
stats.knet.node2.link1.rx_pong_bytes (u64) = 440466
stats.knet.node2.link1.rx_pong_packets (u64) = 16941
stats.knet.node2.link1.rx_total_bytes (u64) = 1994680
stats.knet.node2.link1.rx_total_packets (u64) = 35422
stats.knet.node2.link1.rx_total_retries (u64) = 0
stats.knet.node2.link1.tx_data_bytes (u64) = 0
stats.knet.node2.link1.tx_data_errors (u32) = 0
stats.knet.node2.link1.tx_data_packets (u64) = 0
stats.knet.node2.link1.tx_data_retries (u32) = 0
stats.knet.node2.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node2.link1.tx_ping_errors (u32) = 0
stats.knet.node2.link1.tx_ping_packets (u64) = 16941
stats.knet.node2.link1.tx_ping_retries (u32) = 0
stats.knet.node2.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node2.link1.tx_pmtu_errors (u32) = 0
stats.knet.node2.link1.tx_pmtu_packets (u64) = 774
stats.knet.node2.link1.tx_pmtu_retries (u32) = 0
stats.knet.node2.link1.tx_pong_bytes (u64) = 1354320
stats.knet.node2.link1.tx_pong_errors (u32) = 0
stats.knet.node2.link1.tx_pong_packets (u64) = 16929
stats.knet.node2.link1.tx_pong_retries (u32) = 0
stats.knet.node2.link1.tx_total_bytes (u64) = 3848928
stats.knet.node2.link1.tx_total_errors (u64) = 0
stats.knet.node2.link1.tx_total_packets (u64) = 34644
stats.knet.node2.link1.up_count (u32) = 1
...
stats.knet.node3.link1.connected (u8) = 1
stats.knet.node3.link1.down_count (u32) = 1
stats.knet.node3.link1.enabled (u8) = 1
stats.knet.node3.link1.latency_ave (u32) = 150
stats.knet.node3.link1.latency_max (u32) = 1029885
stats.knet.node3.link1.latency_min (u32) = 150
stats.knet.node3.link1.latency_samples (u32) = 2048
stats.knet.node3.link1.mtu (u32) = 1397
stats.knet.node3.link1.rx_data_bytes (u64) = 0
stats.knet.node3.link1.rx_data_packets (u64) = 0
stats.knet.node3.link1.rx_ping_bytes (u64) = 440232
stats.knet.node3.link1.rx_ping_packets (u64) = 16932
stats.knet.node3.link1.rx_pmtu_bytes (u64) = 1105522
stats.knet.node3.link1.rx_pmtu_packets (u64) = 1546
stats.knet.node3.link1.rx_pong_bytes (u64) = 440466
stats.knet.node3.link1.rx_pong_packets (u64) = 16941
stats.knet.node3.link1.rx_total_bytes (u64) = 1986220
stats.knet.node3.link1.rx_total_packets (u64) = 35419
stats.knet.node3.link1.rx_total_retries (u64) = 0
stats.knet.node3.link1.tx_data_bytes (u64) = 0
stats.knet.node3.link1.tx_data_errors (u32) = 0
stats.knet.node3.link1.tx_data_packets (u64) = 0
stats.knet.node3.link1.tx_data_retries (u32) = 0
stats.knet.node3.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node3.link1.tx_ping_errors (u32) = 0
stats.knet.node3.link1.tx_ping_packets (u64) = 16941
stats.knet.node3.link1.tx_ping_retries (u32) = 0
stats.knet.node3.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node3.link1.tx_pmtu_errors (u32) = 0
stats.knet.node3.link1.tx_pmtu_packets (u64) = 774
stats.knet.node3.link1.tx_pmtu_retries (u32) = 0
stats.knet.node3.link1.tx_pong_bytes (u64) = 1354560
stats.knet.node3.link1.tx_pong_errors (u32) = 0
stats.knet.node3.link1.tx_pong_packets (u64) = 16932
stats.knet.node3.link1.tx_pong_retries (u32) = 0
stats.knet.node3.link1.tx_total_bytes (u64) = 3849168
stats.knet.node3.link1.tx_total_errors (u64) = 0
stats.knet.node3.link1.tx_total_packets (u64) = 34647
stats.knet.node3.link1.up_count (u32) = 1
...
stats.knet.node5.link1.connected (u8) = 1
stats.knet.node5.link1.down_count (u32) = 474
stats.knet.node5.link1.enabled (u8) = 1
stats.knet.node5.link1.latency_ave (u32) = 134
stats.knet.node5.link1.latency_max (u32) = 1029854
stats.knet.node5.link1.latency_min (u32) = 134
stats.knet.node5.link1.latency_samples (u32) = 2048
stats.knet.node5.link1.mtu (u32) = 1397
stats.knet.node5.link1.rx_data_bytes (u64) = 0
stats.knet.node5.link1.rx_data_packets (u64) = 0
stats.knet.node5.link1.rx_ping_bytes (u64) = 425048
stats.knet.node5.link1.rx_ping_packets (u64) = 16348
stats.knet.node5.link1.rx_pmtu_bytes (u64) = 616198
stats.knet.node5.link1.rx_pmtu_packets (u64) = 1380
stats.knet.node5.link1.rx_pong_bytes (u64) = 425360
stats.knet.node5.link1.rx_pong_packets (u64) = 16360
stats.knet.node5.link1.rx_total_bytes (u64) = 1466606
stats.knet.node5.link1.rx_total_packets (u64) = 34088
stats.knet.node5.link1.rx_total_retries (u64) = 0
stats.knet.node5.link1.tx_data_bytes (u64) = 0
stats.knet.node5.link1.tx_data_errors (u32) = 0
stats.knet.node5.link1.tx_data_packets (u64) = 0
stats.knet.node5.link1.tx_data_retries (u32) = 0
stats.knet.node5.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node5.link1.tx_ping_errors (u32) = 0
stats.knet.node5.link1.tx_ping_packets (u64) = 16941
stats.knet.node5.link1.tx_ping_retries (u32) = 0
stats.knet.node5.link1.tx_pmtu_bytes (u64) = 1442560
stats.knet.node5.link1.tx_pmtu_errors (u32) = 0
stats.knet.node5.link1.tx_pmtu_packets (u64) = 980
stats.knet.node5.link1.tx_pmtu_retries (u32) = 0
stats.knet.node5.link1.tx_pong_bytes (u64) = 1307840
stats.knet.node5.link1.tx_pong_errors (u32) = 0
stats.knet.node5.link1.tx_pong_packets (u64) = 16348
stats.knet.node5.link1.tx_pong_retries (u32) = 0
stats.knet.node5.link1.tx_total_bytes (u64) = 4105680
stats.knet.node5.link1.tx_total_errors (u64) = 0
stats.knet.node5.link1.tx_total_packets (u64) = 34269
stats.knet.node5.link1.up_count (u32) = 474
...
 
Update: So one of two things made this go away:

1. Rebooting all the nodes.
2. Migrating a VM off of one of the nodes.

When doing the live migration of the VM... it was going absurdly slowly, like 56k over a 10GbE link with no other traffic. After migrating that VM and rebooting, all is well. Maybe there's something wrong with that VM; I'll have to debug some more.
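If it comes back, I'll probably run a raw throughput test between two nodes to rule out the link itself, something like (10.10.10.2 being a placeholder for the target node's 10GbE address):

Code:
# on the target node
iperf3 -s

# on the source node
iperf3 -c 10.10.10.2 -t 30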
 
if you see the issue again, dumping the stats cmap on all nodes is probably a good idea. the latency seems to be quite all over the place, but with a low average.. and in your output, it's only the link to node5 that seems to flap, so possibly the stats on that node would (have?) give(n) more insight..
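something like the following collects the dumps from all nodes in one go (hostnames are placeholders):

Code:
for h in pve1 pve2 pve3 pve4 pve5; do
    ssh root@$h corosync-cmapctl -m stats > stats-$h.txt
done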
 
Sorry to dig up an old thread, but I'm having this same issue so I didn't want to make a new one.

I have a dedicated corosync network with 1 Gb interfaces on an 8-port managed switch.

One node in particular seems to be having issues with the heartbeat and occasionally drops from the cluster momentarily.

If I log into one of the other nodes and look at the cluster, ALL other nodes seem to still be talking to each other and connected.

I'm in the process of adding a secondary ring, but I need to find out why this primary node is acting up.

Code:
root@pve-bighp:~# corosync-cmapctl -m stats
stats.ipcs.global.active (u64) = 5
stats.ipcs.global.closed (u64) = 215
stats.ipcs.service0.2453.0x563c8d2f6520.dispatched (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.flow_control (u32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.flow_control_count (u64) = 2382
stats.ipcs.service0.2453.0x563c8d2f6520.invalid_request (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.overload (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.procname (str) = pmxcfs
stats.ipcs.service0.2453.0x563c8d2f6520.queued (u32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.queueing (i32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.recv_retries (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.requests (u64) = 32
stats.ipcs.service0.2453.0x563c8d2f6520.responses (u64) = 32
stats.ipcs.service0.2453.0x563c8d2f6520.send_retries (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.sent (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.dispatched (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.flow_control (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.flow_control_count (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.invalid_request (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.overload (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.procname (str) = corosync-cmapct
stats.ipcs.service0.734064.0x563c8d2df040.queued (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.queueing (i32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.recv_retries (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.requests (u64) = 54
stats.ipcs.service0.734064.0x563c8d2df040.responses (u64) = 55
stats.ipcs.service0.734064.0x563c8d2df040.send_retries (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.sent (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.dispatched (u64) = 2726
stats.ipcs.service2.2453.0x563c8d2d77e0.flow_control (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.flow_control_count (u64) = 44
stats.ipcs.service2.2453.0x563c8d2d77e0.invalid_request (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.overload (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.procname (str) = pmxcfs
stats.ipcs.service2.2453.0x563c8d2d77e0.queued (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.queueing (i32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.recv_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.requests (u64) = 904
stats.ipcs.service2.2453.0x563c8d2d77e0.responses (u64) = 2
stats.ipcs.service2.2453.0x563c8d2d77e0.send_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.sent (u32) = 2726
stats.ipcs.service2.2453.0x563c8d2ee610.dispatched (u64) = 50420
stats.ipcs.service2.2453.0x563c8d2ee610.flow_control (u32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.flow_control_count (u64) = 1126
stats.ipcs.service2.2453.0x563c8d2ee610.invalid_request (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.overload (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.procname (str) = pmxcfs
stats.ipcs.service2.2453.0x563c8d2ee610.queued (u32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.queueing (i32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.recv_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.requests (u64) = 34266
stats.ipcs.service2.2453.0x563c8d2ee610.responses (u64) = 2
stats.ipcs.service2.2453.0x563c8d2ee610.send_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.sent (u32) = 50420
stats.ipcs.service3.2453.0x563c8d2f12a0.dispatched (u64) = 1192
stats.ipcs.service3.2453.0x563c8d2f12a0.flow_control (u32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.flow_control_count (u64) = 2382
stats.ipcs.service3.2453.0x563c8d2f12a0.invalid_request (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.overload (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.procname (str) = pmxcfs
stats.ipcs.service3.2453.0x563c8d2f12a0.queued (u32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.queueing (i32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.recv_retries (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.requests (u64) = 2
stats.ipcs.service3.2453.0x563c8d2f12a0.responses (u64) = 2
stats.ipcs.service3.2453.0x563c8d2f12a0.send_retries (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.sent (u32) = 1192
stats.knet.handle.rx_compress_time_ave (u64) = 0
stats.knet.handle.rx_compress_time_max (u64) = 0
stats.knet.handle.rx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.rx_compressed_original_bytes (u64) = 0
stats.knet.handle.rx_compressed_packets (u64) = 0
stats.knet.handle.rx_compressed_size_bytes (u64) = 0
stats.knet.handle.rx_crypt_packets (u64) = 814334
stats.knet.handle.rx_crypt_time_ave (u64) = 12198
stats.knet.handle.rx_crypt_time_max (u64) = 400912
stats.knet.handle.rx_crypt_time_min (u64) = 8628
stats.knet.handle.tx_compress_time_ave (u64) = 0
stats.knet.handle.tx_compress_time_max (u64) = 0
stats.knet.handle.tx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.tx_compressed_original_bytes (u64) = 0
stats.knet.handle.tx_compressed_packets (u64) = 0
stats.knet.handle.tx_compressed_size_bytes (u64) = 0
stats.knet.handle.tx_crypt_byte_overhead (u64) = 43159215
stats.knet.handle.tx_crypt_packets (u64) = 923894
stats.knet.handle.tx_crypt_time_ave (u64) = 14335
stats.knet.handle.tx_crypt_time_max (u64) = 427560
stats.knet.handle.tx_crypt_time_min (u64) = 9275
stats.knet.handle.tx_uncompressed_packets (u64) = 0
stats.knet.node1.link0.connected (u8) = 1
stats.knet.node1.link0.down_count (u32) = 0
stats.knet.node1.link0.enabled (u8) = 1
stats.knet.node1.link0.latency_ave (u32) = 0
stats.knet.node1.link0.latency_max (u32) = 0
stats.knet.node1.link0.latency_min (u32) = 4294967295
stats.knet.node1.link0.latency_samples (u32) = 0
stats.knet.node1.link0.mtu (u32) = 65535
stats.knet.node1.link0.rx_data_bytes (u64) = 0
stats.knet.node1.link0.rx_data_packets (u64) = 0
stats.knet.node1.link0.rx_ping_bytes (u64) = 0
stats.knet.node1.link0.rx_ping_packets (u64) = 0
stats.knet.node1.link0.rx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.rx_pmtu_packets (u64) = 0
stats.knet.node1.link0.rx_pong_bytes (u64) = 0
stats.knet.node1.link0.rx_pong_packets (u64) = 0
stats.knet.node1.link0.rx_total_bytes (u64) = 0
stats.knet.node1.link0.rx_total_packets (u64) = 0
stats.knet.node1.link0.rx_total_retries (u64) = 0
stats.knet.node1.link0.tx_data_bytes (u64) = 179427381
stats.knet.node1.link0.tx_data_errors (u32) = 0
stats.knet.node1.link0.tx_data_packets (u64) = 883640
stats.knet.node1.link0.tx_data_retries (u32) = 0
stats.knet.node1.link0.tx_ping_bytes (u64) = 0
stats.knet.node1.link0.tx_ping_errors (u32) = 0
stats.knet.node1.link0.tx_ping_packets (u64) = 0
stats.knet.node1.link0.tx_ping_retries (u32) = 0
stats.knet.node1.link0.tx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.tx_pmtu_errors (u32) = 0
stats.knet.node1.link0.tx_pmtu_packets (u64) = 0
stats.knet.node1.link0.tx_pmtu_retries (u32) = 0
stats.knet.node1.link0.tx_pong_bytes (u64) = 0
stats.knet.node1.link0.tx_pong_errors (u32) = 0
stats.knet.node1.link0.tx_pong_packets (u64) = 0
stats.knet.node1.link0.tx_pong_retries (u32) = 0
stats.knet.node1.link0.tx_total_bytes (u64) = 179427381
stats.knet.node1.link0.tx_total_errors (u64) = 0
stats.knet.node1.link0.tx_total_packets (u64) = 883640
stats.knet.node1.link0.up_count (u32) = 1
stats.knet.node2.link0.connected (u8) = 1
stats.knet.node2.link0.down_count (u32) = 1405
stats.knet.node2.link0.enabled (u8) = 1
stats.knet.node2.link0.latency_ave (u32) = 1130
stats.knet.node2.link0.latency_max (u32) = 43029
stats.knet.node2.link0.latency_min (u32) = 324
stats.knet.node2.link0.latency_samples (u32) = 2048
stats.knet.node2.link0.mtu (u32) = 1397
stats.knet.node2.link0.rx_data_bytes (u64) = 77757318
stats.knet.node2.link0.rx_data_packets (u64) = 120724
stats.knet.node2.link0.rx_ping_bytes (u64) = 745888
stats.knet.node2.link0.rx_ping_packets (u64) = 28688
stats.knet.node2.link0.rx_pmtu_bytes (u64) = 1970901
stats.knet.node2.link0.rx_pmtu_packets (u64) = 2647
stats.knet.node2.link0.rx_pong_bytes (u64) = 571298
stats.knet.node2.link0.rx_pong_packets (u64) = 21973
stats.knet.node2.link0.rx_total_bytes (u64) = 81045405
stats.knet.node2.link0.rx_total_packets (u64) = 174032
stats.knet.node2.link0.rx_total_retries (u64) = 0
stats.knet.node2.link0.tx_data_bytes (u64) = 172093296
stats.knet.node2.link0.tx_data_errors (u32) = 0
stats.knet.node2.link0.tx_data_packets (u64) = 752839
stats.knet.node2.link0.tx_data_retries (u32) = 0
stats.knet.node2.link0.tx_ping_bytes (u64) = 3957280
stats.knet.node2.link0.tx_ping_errors (u32) = 0
stats.knet.node2.link0.tx_ping_packets (u64) = 49466
stats.knet.node2.link0.tx_ping_retries (u32) = 0
stats.knet.node2.link0.tx_pmtu_bytes (u64) = 2022528
stats.knet.node2.link0.tx_pmtu_errors (u32) = 0
stats.knet.node2.link0.tx_pmtu_packets (u64) = 1374
stats.knet.node2.link0.tx_pmtu_retries (u32) = 0
stats.knet.node2.link0.tx_pong_bytes (u64) = 2294960
stats.knet.node2.link0.tx_pong_errors (u32) = 0
stats.knet.node2.link0.tx_pong_packets (u64) = 28687
stats.knet.node2.link0.tx_pong_retries (u32) = 0
stats.knet.node2.link0.tx_total_bytes (u64) = 180368064
stats.knet.node2.link0.tx_total_errors (u64) = 0
stats.knet.node2.link0.tx_total_packets (u64) = 832366
stats.knet.node2.link0.up_count (u32) = 1405
stats.knet.node4.link0.connected (u8) = 1
stats.knet.node4.link0.down_count (u32) = 1406
stats.knet.node4.link0.enabled (u8) = 1
stats.knet.node4.link0.latency_ave (u32) = 1332
stats.knet.node4.link0.latency_max (u32) = 43534
stats.knet.node4.link0.latency_min (u32) = 365
stats.knet.node4.link0.latency_samples (u32) = 2048
stats.knet.node4.link0.mtu (u32) = 1397
stats.knet.node4.link0.rx_data_bytes (u64) = 111339683
stats.knet.node4.link0.rx_data_packets (u64) = 693613
stats.knet.node4.link0.rx_ping_bytes (u64) = 758940
stats.knet.node4.link0.rx_ping_packets (u64) = 29190
stats.knet.node4.link0.rx_pmtu_bytes (u64) = 1783335
stats.knet.node4.link0.rx_pmtu_packets (u64) = 2545
stats.knet.node4.link0.rx_pong_bytes (u64) = 568698
stats.knet.node4.link0.rx_pong_packets (u64) = 21873
stats.knet.node4.link0.rx_total_bytes (u64) = 114450656
stats.knet.node4.link0.rx_total_packets (u64) = 747221
stats.knet.node4.link0.rx_total_retries (u64) = 0
stats.knet.node4.link0.tx_data_bytes (u64) = 98114448
stats.knet.node4.link0.tx_data_errors (u32) = 0
stats.knet.node4.link0.tx_data_packets (u64) = 176603
stats.knet.node4.link0.tx_data_retries (u32) = 0
stats.knet.node4.link0.tx_ping_bytes (u64) = 3957280
stats.knet.node4.link0.tx_ping_errors (u32) = 0
stats.knet.node4.link0.tx_ping_packets (u64) = 49466
stats.knet.node4.link0.tx_ping_retries (u32) = 0
stats.knet.node4.link0.tx_pmtu_bytes (u64) = 2715840
stats.knet.node4.link0.tx_pmtu_errors (u32) = 0
stats.knet.node4.link0.tx_pmtu_packets (u64) = 1845
stats.knet.node4.link0.tx_pmtu_retries (u32) = 0
stats.knet.node4.link0.tx_pong_bytes (u64) = 2335200
stats.knet.node4.link0.tx_pong_errors (u32) = 0
stats.knet.node4.link0.tx_pong_packets (u64) = 29190
stats.knet.node4.link0.tx_pong_retries (u32) = 0
stats.knet.node4.link0.tx_total_bytes (u64) = 107122768
stats.knet.node4.link0.tx_total_errors (u64) = 0
stats.knet.node4.link0.tx_total_packets (u64) = 257104
stats.knet.node4.link0.up_count (u32) = 1406
stats.pg.msg_queue_avail (u32) = 0
stats.pg.msg_reserved (u32) = 2
stats.srp.avg_backlog_calc (u32) = 0
stats.srp.avg_token_workload (u32) = 0
stats.srp.commit_entered (u64) = 1364
stats.srp.commit_token_lost (u64) = 75
stats.srp.consensus_timeouts (u64) = 830
stats.srp.continuous_gather (u32) = 0
stats.srp.continuous_sendmsg_failures (u32) = 0
stats.srp.firewall_enabled_or_nic_failure (u8) = 0
stats.srp.gather_entered (u64) = 2648
stats.srp.gather_token_lost (u64) = 0
stats.srp.mcast_retx (u64) = 119
stats.srp.mcast_rx (u64) = 118496
stats.srp.mcast_tx (u64) = 78312
stats.srp.memb_commit_token_rx (u64) = 2563
stats.srp.memb_commit_token_tx (u64) = 2644
stats.srp.memb_join_rx (u64) = 118841
stats.srp.memb_join_tx (u64) = 100049
stats.srp.memb_merge_detect_rx (u64) = 65851
stats.srp.memb_merge_detect_tx (u64) = 65698
stats.srp.mtt_rx_token (u32) = 2
stats.srp.operational_entered (u64) = 1248
stats.srp.operational_token_lost (u64) = 557
stats.srp.orf_token_rx (u64) = 1190866
stats.srp.orf_token_tx (u64) = 1272
stats.srp.recovery_entered (u64) = 1280
stats.srp.recovery_token_lost (u64) = 32
stats.srp.rx_msg_dropped (u64) = 35
stats.srp.time_since_token_last_received (u64) = 288
stats.srp.token_hold_cancel_rx (u64) = 28357
stats.srp.token_hold_cancel_tx (u64) = 19077
root@pve-bighp:~#
 
can't really say, but the latency variance also seems quite high..
 
Sorry, do you mean this area in the logs:

Code:
stats.knet.node2.link0.latency_ave (u32) = 1130
stats.knet.node2.link0.latency_max (u32) = 43029
stats.knet.node2.link0.latency_min (u32) = 324

I would think any old switch would be up to the task, but any thoughts on that? I have some other old stock unmanaged switches that I can try, but I felt this one was my "best" one lol.

OK, three different switches and it's the same result. I will swap out all the cables later on today and do some more testing, but this is strange lol
 
yes, exactly. it's not a given that the switch is at fault, it could also be a scheduling issue on the node if it is overloaded, faulty cabling, .. hard to tell without more analysis. if you have monitoring in place, correlating the link down events from corosync logs with other events might be helpful.
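the link events themselves are easy to pull out of the journal for lining up with other logs/monitoring, e.g.:

Code:
journalctl -u corosync | grep -E 'link:.*is (down|up)'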
 
I've changed everything other than the network cards themselves and the variance is still the same.

The system isn't even that loaded, so this is just bizarre. But now that you mention load: I did notice the other day, when rebooting the cluster and having ALL VMs on one host auto-start, that this same host had the same issue, but only for a few minutes while all the VMs booted up.

I know the machine itself didn't go down and other servers were still running, because it also hosts part of my Ceph cluster and that never reported any warnings or issues.

I should have the parts for the second ring soon and I'll see if that helps. If the new USB-Ethernet-based ring works better, I'll swap that to the primary.
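For reference, my understanding is that the second ring ends up as an extra ringX_addr per node in /etc/pve/corosync.conf, roughly like this (names, IDs and addresses are placeholders, and config_version has to be bumped when editing):

Code:
nodelist {
  node {
    name: pve-node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
    ring1_addr: 10.1.0.1
  }
  # ... same ring1_addr addition for the other nodes
}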
 
