What does this mean? corosync[3587892]: [KNET ] link: host: 2 link: 0 is down

I have been having an issue with a VM intermittently losing internet connectivity. While checking the hosts for problems, I found this:

May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] link: host: 2 link: 0 is down
May 17 02:09:15 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] rx: host: 2 link: 0 is up
May 17 02:09:17 pve-j-dal corosync[3587892]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 02:09:31 pve-j-dal pveproxy[3963507]: Clearing outdated entries from certificate cache
 

fabian

Proxmox Staff Member
this means that your network (or at least link 0, the one configured for corosync usage) went down for a few seconds.
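for reference, a quick way to check the current state of the corosync links from a node - just a sketch, assuming corosync 3.x with knet as shipped with current Proxmox VE:

Code:
# ring / link status as seen by the local corosync
corosync-cfgtool -s
# per-node link status (corosync 3.x / knet)
corosync-cfgtool -n
# corosync log lines around the event
journalctl -u corosync --since "-1 hour" | grep -i knet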
 

itNGO

Well-Known Member
We have a comparable issue on one of our 3-node clusters in the datacenter.
It is the "secondary" corosync link which does this every few minutes....

May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] link: host: 2 link: 1 is down
May 17 17:55:42 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:55:44 RZB-CPVE1 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)

Every node has this:
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 3 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] link: host: 1 link: 1 is down
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:56 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 3 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] rx: host: 1 link: 1 is up
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 17 17:28:58 RZB-CPVE2 corosync[1442]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)


May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] link: host: 2 link: 1 is down
May 17 17:26:05 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] rx: host: 2 link: 1 is up
May 17 17:26:07 RZB-CPVE3 corosync[1443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)


But all at different times... We use a bond with TLB (balance-tlb) on this link, and the VMs are connected to it as well. The bond has 2 NICs on different switches. Maybe there is an issue there?
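For context, a balance-tlb bond bridged for the VMs typically looks something like this on Proxmox VE - only a sketch with placeholder NIC names and addresses, not our exact config:

Code:
# /etc/network/interfaces (sketch, placeholder names/addresses)
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode balance-tlb
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.11/24
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0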
 


fabian

Proxmox Staff Member
@itNGO it's possible that the bond is interfering with corosync - both try to monitor the link and fail over, after all. the link-down detection in corosync/knet is pretty simple - it sends heartbeat packets over UDP, and if no reply comes back (in time) the link gets marked as down (or if sending actual data fails ;)). once a few heartbeats go through again, the link is marked as up.
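the heartbeat behaviour is tunable per link via the knet options in corosync.conf (see corosync.conf(5)) - a sketch only, the values below are illustrative and the defaults (derived from the token timeout) are usually fine:

Code:
# inside the totem section of /etc/pve/corosync.conf (sketch, illustrative values)
interface {
    linknumber: 0
    knet_ping_interval: 200    # ms between heartbeat pings
    knet_ping_timeout: 2000    # ms without a pong before the link is marked down
    knet_pong_count: 2         # pongs needed before the link is marked up again
}

remember to bump config_version in the totem section when editing /etc/pve/corosync.conf, since the file is synced cluster-wide.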
 

ikogan

Active Member
We had a power outage yesterday. For unrelated reasons, 4 of the 5 nodes did not power up successfully. As I powered them up, I had to fix each node, which added maybe 5 minutes to each node's boot. After this, I started seeing these messages constantly on all nodes, but _only_ for the secondary ring.

This happens to be the same interface that I use for Ceph replication, and there _was_ a lot of rebalancing activity for a few days. That has since stopped and traffic on that interface is now fairly minuscule. I can reliably ping between all nodes at <1 ms with no drops, yet these messages continue - the link just constantly goes up and down.

What's the best way to continue diagnosing this? If it's losing a UDP packet here and there... how do I see that? None of the interfaces seem to have any significant dropped frames, and none of them are bonded - just 2 VLANs on a 10GbE interface across a single switch.
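For the record, this is roughly how I've been checking for drops so far (standard tools, the interface name is a placeholder):

Code:
# interface-level counters on the corosync/Ceph NIC
ip -s link show dev eno1
# NIC/driver-level drop and error counters
ethtool -S eno1 | grep -iE 'drop|discard|err'
# watch knet link events live
journalctl -fu corosync | grep -i 'link:'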
 

fabian

Proxmox Staff Member
if you use the link for corosync and for Ceph, then it's likely that the packets just arrived too slowly/with too much delay. you can take a look at the stats cmap: corosync-cmapctl -m stats
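the output is quite long, so something like this narrows it down to the interesting counters (just a sketch - adjust the link number to the one that flaps):

Code:
corosync-cmapctl -m stats | grep -E '\.link1\.(connected|enabled|down_count|up_count|latency_(min|ave|max))'

down_count/up_count tell you how often each link has flapped since corosync started, and latency_max shows the worst heartbeat round trip knet has seen.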
 

ikogan

Active Member
Right, I did think that Ceph was just saturating the link, but after the rebalancing stopped I started seeing very little traffic on the 10G link... that's when I got suspicious. I'll do some digging on how to read that stats output, but if you could help me interpret it, that would be helpful:


Code:
...
stats.knet.node1.link1.connected (u8) = 1
stats.knet.node1.link1.down_count (u32) = 1
stats.knet.node1.link1.enabled (u8) = 1
stats.knet.node1.link1.latency_ave (u32) = 108
stats.knet.node1.link1.latency_max (u32) = 665876
stats.knet.node1.link1.latency_min (u32) = 108
stats.knet.node1.link1.latency_samples (u32) = 2048
stats.knet.node1.link1.mtu (u32) = 1397
stats.knet.node1.link1.rx_data_bytes (u64) = 1370
stats.knet.node1.link1.rx_data_packets (u64) = 6
stats.knet.node1.link1.rx_ping_bytes (u64) = 440388
stats.knet.node1.link1.rx_ping_packets (u64) = 16938
stats.knet.node1.link1.rx_pmtu_bytes (u64) = 1108368
stats.knet.node1.link1.rx_pmtu_packets (u64) = 1548
stats.knet.node1.link1.rx_pong_bytes (u64) = 440466
stats.knet.node1.link1.rx_pong_packets (u64) = 16941
stats.knet.node1.link1.rx_total_bytes (u64) = 1990592
stats.knet.node1.link1.rx_total_packets (u64) = 35433
stats.knet.node1.link1.rx_total_retries (u64) = 0
stats.knet.node1.link1.tx_data_bytes (u64) = 0
stats.knet.node1.link1.tx_data_errors (u32) = 0
stats.knet.node1.link1.tx_data_packets (u64) = 0
stats.knet.node1.link1.tx_data_retries (u32) = 0
stats.knet.node1.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node1.link1.tx_ping_errors (u32) = 0
stats.knet.node1.link1.tx_ping_packets (u64) = 16941
stats.knet.node1.link1.tx_ping_retries (u32) = 0
stats.knet.node1.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node1.link1.tx_pmtu_errors (u32) = 0
stats.knet.node1.link1.tx_pmtu_packets (u64) = 774
stats.knet.node1.link1.tx_pmtu_retries (u32) = 0
stats.knet.node1.link1.tx_pong_bytes (u64) = 1355040
stats.knet.node1.link1.tx_pong_errors (u32) = 0
stats.knet.node1.link1.tx_pong_packets (u64) = 16938
stats.knet.node1.link1.tx_pong_retries (u32) = 0
stats.knet.node1.link1.tx_total_bytes (u64) = 3849648
stats.knet.node1.link1.tx_total_errors (u64) = 0
stats.knet.node1.link1.tx_total_packets (u64) = 34653
stats.knet.node1.link1.up_count (u32) = 1
...
stats.knet.node2.link1.connected (u8) = 1
stats.knet.node2.link1.down_count (u32) = 1
stats.knet.node2.link1.enabled (u8) = 1
stats.knet.node2.link1.latency_ave (u32) = 125
stats.knet.node2.link1.latency_max (u32) = 1029755
stats.knet.node2.link1.latency_min (u32) = 125
stats.knet.node2.link1.latency_samples (u32) = 2048
stats.knet.node2.link1.mtu (u32) = 1397
stats.knet.node2.link1.rx_data_bytes (u64) = 0
stats.knet.node2.link1.rx_data_packets (u64) = 0
stats.knet.node2.link1.rx_ping_bytes (u64) = 440154
stats.knet.node2.link1.rx_ping_packets (u64) = 16929
stats.knet.node2.link1.rx_pmtu_bytes (u64) = 1114060
stats.knet.node2.link1.rx_pmtu_packets (u64) = 1552
stats.knet.node2.link1.rx_pong_bytes (u64) = 440466
stats.knet.node2.link1.rx_pong_packets (u64) = 16941
stats.knet.node2.link1.rx_total_bytes (u64) = 1994680
stats.knet.node2.link1.rx_total_packets (u64) = 35422
stats.knet.node2.link1.rx_total_retries (u64) = 0
stats.knet.node2.link1.tx_data_bytes (u64) = 0
stats.knet.node2.link1.tx_data_errors (u32) = 0
stats.knet.node2.link1.tx_data_packets (u64) = 0
stats.knet.node2.link1.tx_data_retries (u32) = 0
stats.knet.node2.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node2.link1.tx_ping_errors (u32) = 0
stats.knet.node2.link1.tx_ping_packets (u64) = 16941
stats.knet.node2.link1.tx_ping_retries (u32) = 0
stats.knet.node2.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node2.link1.tx_pmtu_errors (u32) = 0
stats.knet.node2.link1.tx_pmtu_packets (u64) = 774
stats.knet.node2.link1.tx_pmtu_retries (u32) = 0
stats.knet.node2.link1.tx_pong_bytes (u64) = 1354320
stats.knet.node2.link1.tx_pong_errors (u32) = 0
stats.knet.node2.link1.tx_pong_packets (u64) = 16929
stats.knet.node2.link1.tx_pong_retries (u32) = 0
stats.knet.node2.link1.tx_total_bytes (u64) = 3848928
stats.knet.node2.link1.tx_total_errors (u64) = 0
stats.knet.node2.link1.tx_total_packets (u64) = 34644
stats.knet.node2.link1.up_count (u32) = 1
...
stats.knet.node3.link1.connected (u8) = 1
stats.knet.node3.link1.down_count (u32) = 1
stats.knet.node3.link1.enabled (u8) = 1
stats.knet.node3.link1.latency_ave (u32) = 150
stats.knet.node3.link1.latency_max (u32) = 1029885
stats.knet.node3.link1.latency_min (u32) = 150
stats.knet.node3.link1.latency_samples (u32) = 2048
stats.knet.node3.link1.mtu (u32) = 1397
stats.knet.node3.link1.rx_data_bytes (u64) = 0
stats.knet.node3.link1.rx_data_packets (u64) = 0
stats.knet.node3.link1.rx_ping_bytes (u64) = 440232
stats.knet.node3.link1.rx_ping_packets (u64) = 16932
stats.knet.node3.link1.rx_pmtu_bytes (u64) = 1105522
stats.knet.node3.link1.rx_pmtu_packets (u64) = 1546
stats.knet.node3.link1.rx_pong_bytes (u64) = 440466
stats.knet.node3.link1.rx_pong_packets (u64) = 16941
stats.knet.node3.link1.rx_total_bytes (u64) = 1986220
stats.knet.node3.link1.rx_total_packets (u64) = 35419
stats.knet.node3.link1.rx_total_retries (u64) = 0
stats.knet.node3.link1.tx_data_bytes (u64) = 0
stats.knet.node3.link1.tx_data_errors (u32) = 0
stats.knet.node3.link1.tx_data_packets (u64) = 0
stats.knet.node3.link1.tx_data_retries (u32) = 0
stats.knet.node3.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node3.link1.tx_ping_errors (u32) = 0
stats.knet.node3.link1.tx_ping_packets (u64) = 16941
stats.knet.node3.link1.tx_ping_retries (u32) = 0
stats.knet.node3.link1.tx_pmtu_bytes (u64) = 1139328
stats.knet.node3.link1.tx_pmtu_errors (u32) = 0
stats.knet.node3.link1.tx_pmtu_packets (u64) = 774
stats.knet.node3.link1.tx_pmtu_retries (u32) = 0
stats.knet.node3.link1.tx_pong_bytes (u64) = 1354560
stats.knet.node3.link1.tx_pong_errors (u32) = 0
stats.knet.node3.link1.tx_pong_packets (u64) = 16932
stats.knet.node3.link1.tx_pong_retries (u32) = 0
stats.knet.node3.link1.tx_total_bytes (u64) = 3849168
stats.knet.node3.link1.tx_total_errors (u64) = 0
stats.knet.node3.link1.tx_total_packets (u64) = 34647
stats.knet.node3.link1.up_count (u32) = 1
...
stats.knet.node5.link1.connected (u8) = 1
stats.knet.node5.link1.down_count (u32) = 474
stats.knet.node5.link1.enabled (u8) = 1
stats.knet.node5.link1.latency_ave (u32) = 134
stats.knet.node5.link1.latency_max (u32) = 1029854
stats.knet.node5.link1.latency_min (u32) = 134
stats.knet.node5.link1.latency_samples (u32) = 2048
stats.knet.node5.link1.mtu (u32) = 1397
stats.knet.node5.link1.rx_data_bytes (u64) = 0
stats.knet.node5.link1.rx_data_packets (u64) = 0
stats.knet.node5.link1.rx_ping_bytes (u64) = 425048
stats.knet.node5.link1.rx_ping_packets (u64) = 16348
stats.knet.node5.link1.rx_pmtu_bytes (u64) = 616198
stats.knet.node5.link1.rx_pmtu_packets (u64) = 1380
stats.knet.node5.link1.rx_pong_bytes (u64) = 425360
stats.knet.node5.link1.rx_pong_packets (u64) = 16360
stats.knet.node5.link1.rx_total_bytes (u64) = 1466606
stats.knet.node5.link1.rx_total_packets (u64) = 34088
stats.knet.node5.link1.rx_total_retries (u64) = 0
stats.knet.node5.link1.tx_data_bytes (u64) = 0
stats.knet.node5.link1.tx_data_errors (u32) = 0
stats.knet.node5.link1.tx_data_packets (u64) = 0
stats.knet.node5.link1.tx_data_retries (u32) = 0
stats.knet.node5.link1.tx_ping_bytes (u64) = 1355280
stats.knet.node5.link1.tx_ping_errors (u32) = 0
stats.knet.node5.link1.tx_ping_packets (u64) = 16941
stats.knet.node5.link1.tx_ping_retries (u32) = 0
stats.knet.node5.link1.tx_pmtu_bytes (u64) = 1442560
stats.knet.node5.link1.tx_pmtu_errors (u32) = 0
stats.knet.node5.link1.tx_pmtu_packets (u64) = 980
stats.knet.node5.link1.tx_pmtu_retries (u32) = 0
stats.knet.node5.link1.tx_pong_bytes (u64) = 1307840
stats.knet.node5.link1.tx_pong_errors (u32) = 0
stats.knet.node5.link1.tx_pong_packets (u64) = 16348
stats.knet.node5.link1.tx_pong_retries (u32) = 0
stats.knet.node5.link1.tx_total_bytes (u64) = 4105680
stats.knet.node5.link1.tx_total_errors (u64) = 0
stats.knet.node5.link1.tx_total_packets (u64) = 34269
stats.knet.node5.link1.up_count (u32) = 474
...
 

ikogan

Active Member
Update: So one of two things made this go away:

1. Rebooting all the nodes.
2. Migrating a VM off of one of the nodes.

When doing the live migration of the VM... it was going absurdly slowly, like 56k over a 10GbE link with no other traffic. After migrating that VM and rebooting, all is well. Maybe there's something wrong there; I'll have to debug some more.
 

fabian

Proxmox Staff Member
if you see the issue again, dumping the stats cmap on all nodes is probably a good idea. the latency seems to be all over the place, but with a low average... and in your output, it's only the link to node5 that seems to flap, so the stats on that node would probably have given more insight.
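something along these lines would grab the relevant counters from all nodes in one go (a sketch - the hostnames are placeholders):

Code:
for n in node1 node2 node3 node4 node5; do
    echo "== $n =="
    ssh root@$n "corosync-cmapctl -m stats | grep -E '\.link1\.(down_count|up_count|latency_max)'"
done > knet-stats-$(date +%F).txt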
 
