link: host: 1 link: 0 is down

CRCinAU

Member
May 4, 2020
Hi all,

I'm having problems with my 2-node Proxmox cluster. One is a home-built PC that runs several VMs, and the other a second-hand Intel NUC to play around with Plex and hardware transcoding using Intel Quick Sync.

I joined these two in a cluster - but have noticed that I get a lot of dropouts with corosync. I've been experimenting with the configuration, but still haven't managed to get a setup that works.

Error:
Code:
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 0 is down
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 1 is down
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 0 active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has no active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 0 active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has no active links
Jun 28 00:33:31 mel-pm2 corosync[793213]:   [TOTEM ] Token has not been received in 2250 ms
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 1 is up
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 2 active links
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 2 active links
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [QUORUM] Sync members[2]: 1 2
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [TOTEM ] A new membership (1.2315) was formed. Members
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [QUORUM] Members[2]: 1 2
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [MAIN  ] Completed service synchronization, ready to provide service.
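Rather than eyeballing the journal, the KNET up/down flaps can be counted per host/link to see how often this actually happens. A minimal sketch — the sample lines are embedded here for illustration; on a node the real input would come from `journalctl -u corosync`:

```shell
# Count KNET link up/down transitions per host/link from corosync log lines.
# Sample lines embedded for illustration; on a node, pipe in:
#   journalctl -u corosync
log='Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 0 is down
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 1 is down
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 1 is up'
flaps=$(printf '%s\n' "$log" |
  awk '/\[KNET/ && / is (up|down)$/ {
         # line ends in: ... host: <h> link: <l> is <up|down>
         state = $NF; link = $(NF-2); host = $(NF-4)
         n["host " host " link " link " " state]++
       }
       END { for (k in n) print n[k], k }' | sort)
printf '%s\n' "$flaps"
```

A burst of matching down/up pairs every few minutes points at the NIC or cabling rather than corosync itself.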

This is my current `/etc/corosync/corosync.conf` - I attempted to add a ring1 using IPv4 - so the two nodes have paths via IPv4 and IPv6 to try:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: mel-pm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: <ipv6 prefix>:100::1
    ring1_addr: 172.31.1.1
  }
  node {
    name: mel-pm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: <ipv6 prefix>:100::2
    ring1_addr: 172.31.1.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Melbourne
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token: 3000
}
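For reference, with two links and `link_mode: passive`, kronosnet uses only the highest-priority link and fails over when it drops; per-link priorities can be set in `interface` subsections. A sketch — the `knet_link_priority` values here are illustrative assumptions, not taken from this setup; see corosync.conf(5) for the exact semantics:

```
totem {
  interface {
    linknumber: 0
    knet_link_priority: 1
  }
  interface {
    linknumber: 1
    knet_link_priority: 0
  }
}
```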

I tried extending the token timeout to 3000ms - but this doesn't seem to have corrected anything. (The "waiting 3600ms for consensus" in the log matches corosync's default consensus of 1.2 × token.)

Both systems are plugged into the same switch, so there is no firewall or WAN link or similar between the nodes...

Has anyone come across this before and maybe resolved it?
 

CRCinAU
Actually, I think this might be the source of the issue. On the Intel NUC, I see the following in dmesg when a full hang of the cluster happens:

Code:
[293502.911056] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                  TDH                  <fe>
                  TDT                  <65>
                  next_to_use          <65>
                  next_to_clean        <fd>
                buffer_info[next_to_clean]:
                  time_stamp           <1045e7288>
                  next_to_watch        <fe>
                  jiffies              <1045e7400>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[293504.899112] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                  TDH                  <fe>
                  TDT                  <65>
                  next_to_use          <65>
                  next_to_clean        <fd>
                buffer_info[next_to_clean]:
                  time_stamp           <1045e7288>
                  next_to_watch        <fe>
                  jiffies              <1045e75f1>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>

Testing changes to `/etc/network/interfaces` as follows:

Code:
iface eno1 inet manual
        post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"
 

CRCinAU
So, even after the above, I'm still seeing the corosync problems. It still manifests as:

Code:
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] link: host: 1 link: 0 is down
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 has no active links
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [TOTEM ] Token has not been received in 2250 ms
Jun 28 12:21:16 mel-pm2 corosync[1008]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [QUORUM] Sync members[2]: 1 2
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [TOTEM ] A new membership (1.2393) was formed. Members
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [QUORUM] Members[2]: 1 2
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [MAIN  ] Completed service synchronization, ready to provide service.

I've swapped the ethernet out for a USB3 Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter - with the Intel NUCs, there is no way to change the network card, as it's all onboard.

Testing continues....
 

CRCinAU
After running for many hours using the Realtek USB3 network adapter, there has not been a single corosync / network failure.

I have made contact with Intel and supplied the team I was referred to with a whole heap of configuration and hardware details - as the NUC is a 100% Intel-provided product, it can't be a problem with parts from a different manufacturer.
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
I have made contact with Intel and supplied the team I was referred to with a whole heap of configuration and hardware details - as the NUC is a 100% Intel-provided product, it can't be a problem with parts from a different manufacturer.
Thanks for keeping this thread going - and also for taking up the effort to resolve the issue with some e1000e NICs - please keep us posted about any results!
 

CRCinAU
I've had some correspondence from Intel.

Their suggestion: create a blacklist file for the following modules, e.g.:

File: /etc/modprobe.d/intel-blacklist.conf

With the following content:

Code:
blacklist mei_me
blacklist mei_hdcp
blacklist mei
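Applying this can be scripted. A sketch - it writes to a temp file here for illustration; on the node itself, the target path is `/etc/modprobe.d/intel-blacklist.conf`, typically followed by `update-initramfs -u` and a reboot so the modules stay out of early boot as well:

```shell
# Write the module blacklist. Real target on the node:
#   /etc/modprobe.d/intel-blacklist.conf (temp file here for illustration)
conf="$(mktemp)"
printf 'blacklist %s\n' mei_me mei_hdcp mei > "$conf"
cat "$conf"
# On the node, follow with:
#   update-initramfs -u   # keep the modules out of the initramfs too
#   reboot
```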

After rebooting with this blacklist in place, I no longer seem to see network dropouts to other nodes in the cluster, and the network doesn't seem to become unavailable.

Further testing is required - as I have only made these changes today - so we'll only really get the full picture after several days of operation...

For the record, this is on an Intel NUC:

System Information
Manufacturer: Intel(R) Client Systems
Product Name: NUC8v7PNH
Version: K60013-402
...
SKU Number: BKNUC8v7PNH
Family: PN
 

CRCinAU
Just to follow up on this - I haven't had a network failure at all since blacklisting those modules.

The onboard NIC is now functioning as I'd expect.

The Intel team seems to indicate a BIOS problem - but I'm not sure who at Intel I should relay this to...
 
