What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

I have been having an issue with a VM intermittently losing internet connectivity. While checking the hosts for issues, I found this:

May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] link: host: 2 link: 0 is down
May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] rx: host: 2 link: 0 is up
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 02:09:31 pve-j-dal pveproxy[3963507]: Clearing outdated entries from certificate cache
 
this means that your network (or at least the link 0 configured for corosync usage) went down for a few seconds.
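to see which network/NIC link 0 actually maps to, you can compare the addresses in the corosync config with the live link state, e.g. (output will obviously differ on your cluster):

Code:
# addresses configured per node (ring0_addr = link 0, ring1_addr = link 1, ...)
grep -E 'name|ring[0-9]+_addr' /etc/pve/corosync.conf

# current link status as corosync sees it
corosync-cfgtool -s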
 
We have a comparable issue on one of our 3-node clusters in the datacenter.
It is the "secondary" corosync link which does this every few minutes...

May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] link: host: 2 link: 1 is down
May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)

Every node has this:
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 3 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 1 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 3 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 1 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)


May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] link: host: 2 link: 1 is down
May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)


But all at different times... We use a bond with TLB (balance-tlb) on this link, which the VMs are also connected to. The bond has 2 NICs on different switches. Maybe there is an issue?
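For reference, the bond is configured roughly like this in /etc/network/interfaces (interface names and the address are placeholders, not our real ones):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode balance-tlb
    bond-miimon 100

auto vmbr1
iface vmbr1 inet static
    address 192.0.2.10/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0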
 

@itNGO it's possible that the bond is interfering with corosync - both try to monitor the link and failover after all. the link down detection in corosync/knet is pretty simple - it sends heartbeat packets over UDP, and if no reply comes back (in time) the link gets marked as down (or if sending actual data fails ;)). a few heartbeats go through -> the link is marked as up again.
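the relevant knobs live in the totem/interface section of corosync.conf if you ever need to tune the heartbeat timing; the values below are only an illustration, not the defaults (see corosync.conf(5)):

Code:
totem {
    ...
    interface {
        linknumber: 0
        knet_ping_interval: 750    # ms between heartbeat pings
        knet_ping_timeout: 1500    # ms without a pong before the link is marked down
        knet_pong_count: 2         # valid pongs needed before the link is marked up again
    }
}

on PVE you would edit /etc/pve/corosync.conf and bump config_version so the change gets distributed to all nodes.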
 
We had a power outage yesterday. For unrelated reasons, 4 of the 5 nodes did not power up successfully. As I powered them back up, I had to fix each node, which added maybe 5 minutes to each node's boot. After this, I started seeing these messages constantly on all nodes, but _only_ for the secondary ring.

This happens to be the same interface that I use for Ceph replication, and there _was_ a lot of rebalancing activity for a few days. That has since stopped and traffic on that interface is now fairly minuscule. I can reliably ping between all nodes at <1ms with no drops, yet these messages continue, with the link just constantly going up and down.

What's the best way to continue to diagnose this? If it's losing a UDP packet here and there...how do I see that? None of the interfaces seem to have any significant dropped frames and none of them are bonded, just 2 VLANs on a 10GbE interface across a single switch.
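(For reference, this is how I'm checking the drop counters; eno1 here is just a placeholder for the 10GbE interface name:)

Code:
# per-interface RX/TX counters, including dropped frames
ip -s link show dev eno1

# driver-level counters, if the NIC exposes them
ethtool -S eno1 | grep -iE 'drop|err|miss'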
 
if you use the link for corosync and for Ceph, then it's likely that the packets just arrived too slowly / with too much delay. you can take a look at the stats cmap: corosync-cmapctl -m stats
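to make the output easier to scan, you can filter for the interesting per-link counters, e.g.:

Code:
corosync-cmapctl -m stats | grep -E 'link[0-9]\.(latency|down_count|up_count)'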
 
Right, I did think that Ceph was just saturating the link, but after the rebalancing stopped I started seeing very little traffic on the 10G link... that's when I got suspicious. I'll do some digging on how to read that stats output, but if you could help me interpret it, that would be helpful:


Code:
...
stats.knet.node1.link1.connected (u8) = 1
stats.knet.node1.link1.down_count (u32) = 1
stats.knet.node1.link1.enabled (u8) = 1
stats.knet.node1.link1.latency_ave (u32) = 108
stats.knet.node1.link1.latency_max (u32) = 665876
stats.knet.node1.link1.latency_min (u32) = 108
stats.knet.node1.link1.latency_samples (u32) = 2048
stats.knet.node1.link1.mtu (u32) = 1397
stats.knet.node1.link1.rx_data_bytes (u64) = 1370
stats.knet.node1.link1.rx_data_packets (u64) = 6
stats.knet.node1.link1.rx_ping_bytes (u64) = 440388
stats.knet.node1.link1.rx_ping_packets (u64) = 16938
stats.knet.node1.link1.rx_pmtu_bytes (u64) = 1108368
stats.knet.node1.link1.rx_pmtu_packets (u64) = 1548
stats.knet.node1.link1.rx_pong_bytes (u64) = 440466
stats.knet.node1.link1.rx_pong_packets (u64) = 16941
stats.knet.node1.link1.rx_total_bytes (u64) = 1990592
stats.knet.node1.link1.rx_total_packets (u64) = 35433
stats.knet.node1.link1.rx_total_retries (u64) = 0
stats.knet.node1.link1.tx_data_bytes (u64) = 0
stats.knet.node1.link1.tx_data_errors (u32) = 0
stats.knet.node1.link1.tx_data_packets (u64) = 0
stats.knet.node1.link1.tx_data_retries (u32) = 0
stats.knet.node1.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node1.link1.tx_ping_errors (u32) = 0
stats.knet.node1.link1.tx_ping_packets (u64) = 16941
stats.knet.node1.link1.tx_ping_retries (u32) = 0
stats.knet.node1.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node1.link1.tx_pmtu_errors (u32) = 0
stats.knet.node1.link1.tx_pmtu_packets (u64) = 774
stats.knet.node1.link1.tx_pmtu_retries (u32) = 0
stats.knet.node1.link1.tx_pong_bytes (u64) = 1355040
stats.knet.node1.link1.tx_pong_errors (u32) = 0
stats.knet.node1.link1.tx_pong_packets (u64) = 16938
stats.knet.node1.link1.tx_pong_retries (u32) = 0
stats.knet.node1.link1.tx_total_bytes (u64) = 3849648
stats.knet.node1.link1.tx_total_errors (u64) = 0
stats.knet.node1.link1.tx_total_packets (u64) = 34653
stats.knet.node1.link1.up_count (u32) = 1
...
stats.knet.node2.link1.connected (u8) = 1
stats.knet.node2.link1.down_count (u32) = 1
stats.knet.node2.link1.enabled (u8) = 1
stats.knet.node2.link1.latency_ave (u32) = 125
stats.knet.node2.link1.latency_max (u32) = 1029755
stats.knet.node2.link1.latency_min (u32) = 125
stats.knet.node2.link1.latency_samples (u32) = 2048
stats.knet.node2.link1.mtu (u32) = 1397
stats.knet.node2.link1.rx_data_bytes (u64) = 0
stats.knet.node2.link1.rx_data_packets (u64) = 0
stats.knet.node2.link1.rx_ping_bytes (u64) = 440154
stats.knet.node2.link1.rx_ping_packets (u64) = 16929
stats.knet.node2.link1.rx_pmtu_bytes (u64) = 1114060
stats.knet.node2.link1.rx_pmtu_packets (u64) = 1552
stats.knet.node2.link1.rx_pong_bytes (u64) = 440466
stats.knet.node2.link1.rx_pong_packets (u64) = 16941
stats.knet.node2.link1.rx_total_bytes (u64) = 1994680
stats.knet.node2.link1.rx_total_packets (u64) = 35422
stats.knet.node2.link1.rx_total_retries (u64) = 0
stats.knet.node2.link1.tx_data_bytes (u64) = 0
stats.knet.node2.link1.tx_data_errors (u32) = 0
stats.knet.node2.link1.tx_data_packets (u64) = 0
stats.knet.node2.link1.tx_data_retries (u32) = 0
stats.knet.node2.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node2.link1.tx_ping_errors (u32) = 0
stats.knet.node2.link1.tx_ping_packets (u64) = 16941
stats.knet.node2.link1.tx_ping_retries (u32) = 0
stats.knet.node2.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node2.link1.tx_pmtu_errors (u32) = 0
stats.knet.node2.link1.tx_pmtu_packets (u64) = 774
stats.knet.node2.link1.tx_pmtu_retries (u32) = 0
stats.knet.node2.link1.tx_pong_bytes (u64) = 1354320
stats.knet.node2.link1.tx_pong_errors (u32) = 0
stats.knet.node2.link1.tx_pong_packets (u64) = 16929
stats.knet.node2.link1.tx_pong_retries (u32) = 0
stats.knet.node2.link1.tx_total_bytes (u64) = 3848928
stats.knet.node2.link1.tx_total_errors (u64) = 0
stats.knet.node2.link1.tx_total_packets (u64) = 34644
stats.knet.node2.link1.up_count (u32) = 1
...
stats.knet.node3.link1.connected (u8) = 1
stats.knet.node3.link1.down_count (u32) = 1
stats.knet.node3.link1.enabled (u8) = 1
stats.knet.node3.link1.latency_ave (u32) = 150
stats.knet.node3.link1.latency_max (u32) = 1029885
stats.knet.node3.link1.latency_min (u32) = 150
stats.knet.node3.link1.latency_samples (u32) = 2048
stats.knet.node3.link1.mtu (u32) = 1397
stats.knet.node3.link1.rx_data_bytes (u64) = 0
stats.knet.node3.link1.rx_data_packets (u64) = 0
stats.knet.node3.link1.rx_ping_bytes (u64) = 440232
stats.knet.node3.link1.rx_ping_packets (u64) = 16932
stats.knet.node3.link1.rx_pmtu_bytes (u64) = 1105522
stats.knet.node3.link1.rx_pmtu_packets (u64) = 1546
stats.knet.node3.link1.rx_pong_bytes (u64) = 440466
stats.knet.node3.link1.rx_pong_packets (u64) = 16941
stats.knet.node3.link1.rx_total_bytes (u64) = 1986220
stats.knet.node3.link1.rx_total_packets (u64) = 35419
stats.knet.node3.link1.rx_total_retries (u64) = 0
stats.knet.node3.link1.tx_data_bytes (u64) = 0
stats.knet.node3.link1.tx_data_errors (u32) = 0
stats.knet.node3.link1.tx_data_packets (u64) = 0
stats.knet.node3.link1.tx_data_retries (u32) = 0
stats.knet.node3.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node3.link1.tx_ping_errors (u32) = 0
stats.knet.node3.link1.tx_ping_packets (u64) = 16941
stats.knet.node3.link1.tx_ping_retries (u32) = 0
stats.knet.node3.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node3.link1.tx_pmtu_errors (u32) = 0
stats.knet.node3.link1.tx_pmtu_packets (u64) = 774
stats.knet.node3.link1.tx_pmtu_retries (u32) = 0
stats.knet.node3.link1.tx_pong_bytes (u64) = 1354560
stats.knet.node3.link1.tx_pong_errors (u32) = 0
stats.knet.node3.link1.tx_pong_packets (u64) = 16932
stats.knet.node3.link1.tx_pong_retries (u32) = 0
stats.knet.node3.link1.tx_total_bytes (u64) = 3849168
stats.knet.node3.link1.tx_total_errors (u64) = 0
stats.knet.node3.link1.tx_total_packets (u64) = 34647
stats.knet.node3.link1.up_count (u32) = 1
...
stats.knet.node5.link1.connected (u8) = 1
stats.knet.node5.link1.down_count (u32) = 474
stats.knet.node5.link1.enabled (u8) = 1
stats.knet.node5.link1.latency_ave (u32) = 134
stats.knet.node5.link1.latency_max (u32) = 1029854
stats.knet.node5.link1.latency_min (u32) = 134
stats.knet.node5.link1.latency_samples (u32) = 2048
stats.knet.node5.link1.mtu (u32) = 1397
stats.knet.node5.link1.rx_data_bytes (u64) = 0
stats.knet.node5.link1.rx_data_packets (u64) = 0
stats.knet.node5.link1.rx_ping_bytes (u64) = 425048
stats.knet.node5.link1.rx_ping_packets (u64) = 16348
stats.knet.node5.link1.rx_pmtu_bytes (u64) = 616198
stats.knet.node5.link1.rx_pmtu_packets (u64) = 1380
stats.knet.node5.link1.rx_pong_bytes (u64) = 425360
stats.knet.node5.link1.rx_pong_packets (u64) = 16360
stats.knet.node5.link1.rx_total_bytes (u64) = 1466606
stats.knet.node5.link1.rx_total_packets (u64) = 34088
stats.knet.node5.link1.rx_total_retries (u64) = 0
stats.knet.node5.link1.tx_data_bytes (u64) = 0
stats.knet.node5.link1.tx_data_errors (u32) = 0
stats.knet.node5.link1.tx_data_packets (u64) = 0
stats.knet.node5.link1.tx_data_retries (u32) = 0
stats.knet.node5.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node5.link1.tx_ping_errors (u32) = 0
stats.knet.node5.link1.tx_ping_packets (u64) = 16941
stats.knet.node5.link1.tx_ping_retries (u32) = 0
stats.knet.node5.link1.tx_pmtu_bytes (u64) = 1442560
stats.knet.node5.link1.tx_pmtu_errors (u32) = 0
stats.knet.node5.link1.tx_pmtu_packets (u64) = 980
stats.knet.node5.link1.tx_pmtu_retries (u32) = 0
stats.knet.node5.link1.tx_pong_bytes (u64) = 1307840
stats.knet.node5.link1.tx_pong_errors (u32) = 0
stats.knet.node5.link1.tx_pong_packets (u64) = 16348
stats.knet.node5.link1.tx_pong_retries (u32) = 0
stats.knet.node5.link1.tx_total_bytes (u64) = 4105680
stats.knet.node5.link1.tx_total_errors (u64) = 0
stats.knet.node5.link1.tx_total_packets (u64) = 34269
stats.knet.node5.link1.up_count (u32) = 474
...
 
Update: So one of two things made this go away:

1. Rebooting all the nodes.
2. Migrating a VM off of one of the nodes.

When doing the live migration of the VM... it was going absurdly slowly, like 56k over a 10GbE link with no other traffic. After migrating that VM and rebooting, all is well. Maybe there's something wrong with that VM; I'll have to debug some more.
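If it comes back, I'll probably run a raw throughput test between two nodes to rule out the link itself, something like (10.10.10.2 being a placeholder for the target node's 10GbE address):

Code:
# on the target node
iperf3 -s

# on the source node
iperf3 -c 10.10.10.2 -t 30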
 
if you see the issue again, dumping the stats cmap on all nodes is probably a good idea. the latency seems to be quite all over the place, but with a low average.. and in your output, it's only the link to node5 that seems to flap, so possibly the stats on that node would (have?) give(n) more insight..
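something like the following collects the dumps from all nodes in one go (hostnames are placeholders):

Code:
for h in pve1 pve2 pve3 pve4 pve5; do
    ssh root@$h corosync-cmapctl -m stats > stats-$h.txt
done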
 
Sorry to dig up an old thread, but I'm having this same issue so I didn't want to make a new one.

I have a dedicated corosync network with 1 Gb interfaces on an 8-port managed switch.

One node in particular seems to be having issues with the heartbeat and occasionally drops from the cluster momentarily.

If I log into one of the other nodes and look at the cluster, ALL other nodes seem to still be talking to each other and connected.

I'm in the process of adding a secondary ring, but I need to find out why this primary node is acting up.

Code:
root@pve-bighp:~# corosync-cmapctl -m stats
stats.ipcs.global.active (u64) = 5
stats.ipcs.global.closed (u64) = 215
stats.ipcs.service0.2453.0x563c8d2f6520.dispatched (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.flow_control (u32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.flow_control_count (u64) = 2382
stats.ipcs.service0.2453.0x563c8d2f6520.invalid_request (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.overload (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.procname (str) = pmxcfs
stats.ipcs.service0.2453.0x563c8d2f6520.queued (u32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.queueing (i32) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.recv_retries (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.requests (u64) = 32
stats.ipcs.service0.2453.0x563c8d2f6520.responses (u64) = 32
stats.ipcs.service0.2453.0x563c8d2f6520.send_retries (u64) = 0
stats.ipcs.service0.2453.0x563c8d2f6520.sent (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.dispatched (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.flow_control (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.flow_control_count (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.invalid_request (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.overload (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.procname (str) = corosync-cmapct
stats.ipcs.service0.734064.0x563c8d2df040.queued (u32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.queueing (i32) = 0
stats.ipcs.service0.734064.0x563c8d2df040.recv_retries (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.requests (u64) = 54
stats.ipcs.service0.734064.0x563c8d2df040.responses (u64) = 55
stats.ipcs.service0.734064.0x563c8d2df040.send_retries (u64) = 0
stats.ipcs.service0.734064.0x563c8d2df040.sent (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.dispatched (u64) = 2726
stats.ipcs.service2.2453.0x563c8d2d77e0.flow_control (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.flow_control_count (u64) = 44
stats.ipcs.service2.2453.0x563c8d2d77e0.invalid_request (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.overload (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.procname (str) = pmxcfs
stats.ipcs.service2.2453.0x563c8d2d77e0.queued (u32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.queueing (i32) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.recv_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.requests (u64) = 904
stats.ipcs.service2.2453.0x563c8d2d77e0.responses (u64) = 2
stats.ipcs.service2.2453.0x563c8d2d77e0.send_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2d77e0.sent (u32) = 2726
stats.ipcs.service2.2453.0x563c8d2ee610.dispatched (u64) = 50420
stats.ipcs.service2.2453.0x563c8d2ee610.flow_control (u32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.flow_control_count (u64) = 1126
stats.ipcs.service2.2453.0x563c8d2ee610.invalid_request (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.overload (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.procname (str) = pmxcfs
stats.ipcs.service2.2453.0x563c8d2ee610.queued (u32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.queueing (i32) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.recv_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.requests (u64) = 34266
stats.ipcs.service2.2453.0x563c8d2ee610.responses (u64) = 2
stats.ipcs.service2.2453.0x563c8d2ee610.send_retries (u64) = 0
stats.ipcs.service2.2453.0x563c8d2ee610.sent (u32) = 50420
stats.ipcs.service3.2453.0x563c8d2f12a0.dispatched (u64) = 1192
stats.ipcs.service3.2453.0x563c8d2f12a0.flow_control (u32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.flow_control_count (u64) = 2382
stats.ipcs.service3.2453.0x563c8d2f12a0.invalid_request (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.overload (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.procname (str) = pmxcfs
stats.ipcs.service3.2453.0x563c8d2f12a0.queued (u32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.queueing (i32) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.recv_retries (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.requests (u64) = 2
stats.ipcs.service3.2453.0x563c8d2f12a0.responses (u64) = 2
stats.ipcs.service3.2453.0x563c8d2f12a0.send_retries (u64) = 0
stats.ipcs.service3.2453.0x563c8d2f12a0.sent (u32) = 1192
stats.knet.handle.rx_compress_time_ave (u64) = 0
stats.knet.handle.rx_compress_time_max (u64) = 0
stats.knet.handle.rx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.rx_compressed_original_bytes (u64) = 0
stats.knet.handle.rx_compressed_packets (u64) = 0
stats.knet.handle.rx_compressed_size_bytes (u64) = 0
stats.knet.handle.rx_crypt_packets (u64) = 814334
stats.knet.handle.rx_crypt_time_ave (u64) = 12198
stats.knet.handle.rx_crypt_time_max (u64) = 400912
stats.knet.handle.rx_crypt_time_min (u64) = 8628
stats.knet.handle.tx_compress_time_ave (u64) = 0
stats.knet.handle.tx_compress_time_max (u64) = 0
stats.knet.handle.tx_compress_time_min (u64) = 18446744073709551615
stats.knet.handle.tx_compressed_original_bytes (u64) = 0
stats.knet.handle.tx_compressed_packets (u64) = 0
stats.knet.handle.tx_compressed_size_bytes (u64) = 0
stats.knet.handle.tx_crypt_byte_overhead (u64) = 43159215
stats.knet.handle.tx_crypt_packets (u64) = 923894
stats.knet.handle.tx_crypt_time_ave (u64) = 14335
stats.knet.handle.tx_crypt_time_max (u64) = 427560
stats.knet.handle.tx_crypt_time_min (u64) = 9275
stats.knet.handle.tx_uncompressed_packets (u64) = 0
stats.knet.node1.link0.connected (u8) = 1
stats.knet.node1.link0.down_count (u32) = 0
stats.knet.node1.link0.enabled (u8) = 1
stats.knet.node1.link0.latency_ave (u32) = 0
stats.knet.node1.link0.latency_max (u32) = 0
stats.knet.node1.link0.latency_min (u32) = 4294967295
stats.knet.node1.link0.latency_samples (u32) = 0
stats.knet.node1.link0.mtu (u32) = 65535
stats.knet.node1.link0.rx_data_bytes (u64) = 0
stats.knet.node1.link0.rx_data_packets (u64) = 0
stats.knet.node1.link0.rx_ping_bytes (u64) = 0
stats.knet.node1.link0.rx_ping_packets (u64) = 0
stats.knet.node1.link0.rx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.rx_pmtu_packets (u64) = 0
stats.knet.node1.link0.rx_pong_bytes (u64) = 0
stats.knet.node1.link0.rx_pong_packets (u64) = 0
stats.knet.node1.link0.rx_total_bytes (u64) = 0
stats.knet.node1.link0.rx_total_packets (u64) = 0
stats.knet.node1.link0.rx_total_retries (u64) = 0
stats.knet.node1.link0.tx_data_bytes (u64) = 179427381
stats.knet.node1.link0.tx_data_errors (u32) = 0
stats.knet.node1.link0.tx_data_packets (u64) = 883640
stats.knet.node1.link0.tx_data_retries (u32) = 0
stats.knet.node1.link0.tx_ping_bytes (u64) = 0
stats.knet.node1.link0.tx_ping_errors (u32) = 0
stats.knet.node1.link0.tx_ping_packets (u64) = 0
stats.knet.node1.link0.tx_ping_retries (u32) = 0
stats.knet.node1.link0.tx_pmtu_bytes (u64) = 0
stats.knet.node1.link0.tx_pmtu_errors (u32) = 0
stats.knet.node1.link0.tx_pmtu_packets (u64) = 0
stats.knet.node1.link0.tx_pmtu_retries (u32) = 0
stats.knet.node1.link0.tx_pong_bytes (u64) = 0
stats.knet.node1.link0.tx_pong_errors (u32) = 0
stats.knet.node1.link0.tx_pong_packets (u64) = 0
stats.knet.node1.link0.tx_pong_retries (u32) = 0
stats.knet.node1.link0.tx_total_bytes (u64) = 179427381
stats.knet.node1.link0.tx_total_errors (u64) = 0
stats.knet.node1.link0.tx_total_packets (u64) = 883640
stats.knet.node1.link0.up_count (u32) = 1
stats.knet.node2.link0.connected (u8) = 1
stats.knet.node2.link0.down_count (u32) = 1405
stats.knet.node2.link0.enabled (u8) = 1
stats.knet.node2.link0.latency_ave (u32) = 1130
stats.knet.node2.link0.latency_max (u32) = 43029
stats.knet.node2.link0.latency_min (u32) = 324
stats.knet.node2.link0.latency_samples (u32) = 2048
stats.knet.node2.link0.mtu (u32) = 1397
stats.knet.node2.link0.rx_data_bytes (u64) = 77757318
stats.knet.node2.link0.rx_data_packets (u64) = 120724
stats.knet.node2.link0.rx_ping_bytes (u64) = 745888
stats.knet.node2.link0.rx_ping_packets (u64) = 28688
stats.knet.node2.link0.rx_pmtu_bytes (u64) = 1970901
stats.knet.node2.link0.rx_pmtu_packets (u64) = 2647
stats.knet.node2.link0.rx_pong_bytes (u64) = 571298
stats.knet.node2.link0.rx_pong_packets (u64) = 21973
stats.knet.node2.link0.rx_total_bytes (u64) = 81045405
stats.knet.node2.link0.rx_total_packets (u64) = 174032
stats.knet.node2.link0.rx_total_retries (u64) = 0
stats.knet.node2.link0.tx_data_bytes (u64) = 172093296
stats.knet.node2.link0.tx_data_errors (u32) = 0
stats.knet.node2.link0.tx_data_packets (u64) = 752839
stats.knet.node2.link0.tx_data_retries (u32) = 0
stats.knet.node2.link0.tx_ping_bytes (u64) = 3957280
stats.knet.node2.link0.tx_ping_errors (u32) = 0
stats.knet.node2.link0.tx_ping_packets (u64) = 49466
stats.knet.node2.link0.tx_ping_retries (u32) = 0
stats.knet.node2.link0.tx_pmtu_bytes (u64) = 2022528
stats.knet.node2.link0.tx_pmtu_errors (u32) = 0
stats.knet.node2.link0.tx_pmtu_packets (u64) = 1374
stats.knet.node2.link0.tx_pmtu_retries (u32) = 0
stats.knet.node2.link0.tx_pong_bytes (u64) = 2294960
stats.knet.node2.link0.tx_pong_errors (u32) = 0
stats.knet.node2.link0.tx_pong_packets (u64) = 28687
stats.knet.node2.link0.tx_pong_retries (u32) = 0
stats.knet.node2.link0.tx_total_bytes (u64) = 180368064
stats.knet.node2.link0.tx_total_errors (u64) = 0
stats.knet.node2.link0.tx_total_packets (u64) = 832366
stats.knet.node2.link0.up_count (u32) = 1405
stats.knet.node4.link0.connected (u8) = 1
stats.knet.node4.link0.down_count (u32) = 1406
stats.knet.node4.link0.enabled (u8) = 1
stats.knet.node4.link0.latency_ave (u32) = 1332
stats.knet.node4.link0.latency_max (u32) = 43534
stats.knet.node4.link0.latency_min (u32) = 365
stats.knet.node4.link0.latency_samples (u32) = 2048
stats.knet.node4.link0.mtu (u32) = 1397
stats.knet.node4.link0.rx_data_bytes (u64) = 111339683
stats.knet.node4.link0.rx_data_packets (u64) = 693613
stats.knet.node4.link0.rx_ping_bytes (u64) = 758940
stats.knet.node4.link0.rx_ping_packets (u64) = 29190
stats.knet.node4.link0.rx_pmtu_bytes (u64) = 1783335
stats.knet.node4.link0.rx_pmtu_packets (u64) = 2545
stats.knet.node4.link0.rx_pong_bytes (u64) = 568698
stats.knet.node4.link0.rx_pong_packets (u64) = 21873
stats.knet.node4.link0.rx_total_bytes (u64) = 114450656
stats.knet.node4.link0.rx_total_packets (u64) = 747221
stats.knet.node4.link0.rx_total_retries (u64) = 0
stats.knet.node4.link0.tx_data_bytes (u64) = 98114448
stats.knet.node4.link0.tx_data_errors (u32) = 0
stats.knet.node4.link0.tx_data_packets (u64) = 176603
stats.knet.node4.link0.tx_data_retries (u32) = 0
stats.knet.node4.link0.tx_ping_bytes (u64) = 3957280
stats.knet.node4.link0.tx_ping_errors (u32) = 0
stats.knet.node4.link0.tx_ping_packets (u64) = 49466
stats.knet.node4.link0.tx_ping_retries (u32) = 0
stats.knet.node4.link0.tx_pmtu_bytes (u64) = 2715840
stats.knet.node4.link0.tx_pmtu_errors (u32) = 0
stats.knet.node4.link0.tx_pmtu_packets (u64) = 1845
stats.knet.node4.link0.tx_pmtu_retries (u32) = 0
stats.knet.node4.link0.tx_pong_bytes (u64) = 2335200
stats.knet.node4.link0.tx_pong_errors (u32) = 0
stats.knet.node4.link0.tx_pong_packets (u64) = 29190
stats.knet.node4.link0.tx_pong_retries (u32) = 0
stats.knet.node4.link0.tx_total_bytes (u64) = 107122768
stats.knet.node4.link0.tx_total_errors (u64) = 0
stats.knet.node4.link0.tx_total_packets (u64) = 257104
stats.knet.node4.link0.up_count (u32) = 1406
stats.pg.msg_queue_avail (u32) = 0
stats.pg.msg_reserved (u32) = 2
stats.srp.avg_backlog_calc (u32) = 0
stats.srp.avg_token_workload (u32) = 0
stats.srp.commit_entered (u64) = 1364
stats.srp.commit_token_lost (u64) = 75
stats.srp.consensus_timeouts (u64) = 830
stats.srp.continuous_gather (u32) = 0
stats.srp.continuous_sendmsg_failures (u32) = 0
stats.srp.firewall_enabled_or_nic_failure (u8) = 0
stats.srp.gather_entered (u64) = 2648
stats.srp.gather_token_lost (u64) = 0
stats.srp.mcast_retx (u64) = 119
stats.srp.mcast_rx (u64) = 118496
stats.srp.mcast_tx (u64) = 78312
stats.srp.memb_commit_token_rx (u64) = 2563
stats.srp.memb_commit_token_tx (u64) = 2644
stats.srp.memb_join_rx (u64) = 118841
stats.srp.memb_join_tx (u64) = 100049
stats.srp.memb_merge_detect_rx (u64) = 65851
stats.srp.memb_merge_detect_tx (u64) = 65698
stats.srp.mtt_rx_token (u32) = 2
stats.srp.operational_entered (u64) = 1248
stats.srp.operational_token_lost (u64) = 557
stats.srp.orf_token_rx (u64) = 1190866
stats.srp.orf_token_tx (u64) = 1272
stats.srp.recovery_entered (u64) = 1280
stats.srp.recovery_token_lost (u64) = 32
stats.srp.rx_msg_dropped (u64) = 35
stats.srp.time_since_token_last_received (u64) = 288
stats.srp.token_hold_cancel_rx (u64) = 28357
stats.srp.token_hold_cancel_tx (u64) = 19077
root@pve-bighp:~#
 
can't really say, but the latency variance also seems quite high..
 
Sorry, do you mean this area in the logs:

Code:
stats.knet.node2.link0.latency_ave (u32) = 1130
stats.knet.node2.link0.latency_max (u32) = 43029
stats.knet.node2.link0.latency_min (u32) = 324

I would think any old switch would be up to the task, but any thoughts on that? I have some other old stock unmanaged switches that I can try, but I felt this one was my "best" one lol.

OK, three different switches and it's the same result. I will swap out all the cables later on today and do some more testing, but this is strange lol
 
yes, exactly. it's not a given that the switch is at fault, it could also be a scheduling issue on the node if it is overloaded, faulty cabling, .. hard to tell without more analysis. if you have monitoring in place, correlating the link down events from corosync logs with other events might be helpful.
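the link events themselves are easy to pull out of the journal for lining up with other logs/monitoring, e.g.:

Code:
journalctl -u corosync | grep -E 'link:.*is (down|up)'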
 
I've changed everything other than the network cards themselves and the variance is still the same.

The system isn't even that loaded, so this is just bizarre. But now that you mention load: I did notice the other day, when rebooting the cluster and having ALL VMs on one host auto-start, that this same host had the same issue, but only for a few minutes while all the VMs booted up.

I know the machine itself didn't go down and other servers were still running, because it also hosts part of my Ceph cluster and that never reported any warnings or issues.

I should have the parts for the second ring soon and I'll see if that helps. If the new USB-Ethernet-based ring works better, I'll swap that to the primary.
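For reference, my understanding is that the second ring ends up as an extra ringX_addr per node in /etc/pve/corosync.conf, roughly like this (names, IDs and addresses are placeholders, and config_version has to be bumped when editing):

Code:
nodelist {
  node {
    name: pve-node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
    ring1_addr: 10.1.0.1
  }
  # ... same ring1_addr addition for the other nodes
}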
 
