link: host: 1 link: 0 is down

Hi all,

I'm having problems with my 2-node Proxmox cluster. One is a home-built PC that runs several VMs, and the other is a second-hand Intel NUC I use to play around with Plex and hardware transcoding via Intel Quick Sync.

I joined these two in a cluster - but have noticed that I get a lot of dropouts with corosync. I've been experimenting with the configuration, but still haven't managed to get a setup that works.
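For context, the cluster itself was set up with the standard pvecm tooling - roughly the following, with the address as a placeholder:

Code:
# on the first node (mel-pm)
pvecm create Melbourne

# on the second node (mel-pm2), joining via the first node's address
pvecm add <address-of-mel-pm>

# sanity check from either node
pvecm status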

Error:
Code:
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 0 is down
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] link: host: 1 link: 1 is down
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 0 active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has no active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 0 active links
Jun 28 00:33:29 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has no active links
Jun 28 00:33:31 mel-pm2 corosync[793213]:   [TOTEM ] Token has not been received in 2250 ms
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] rx: host: 1 link: 1 is up
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 2 active links
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [KNET  ] host: host: 1 has 2 active links
Jun 28 00:33:32 mel-pm2 corosync[793213]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [QUORUM] Sync members[2]: 1 2
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [TOTEM ] A new membership (1.2315) was formed. Members
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [QUORUM] Members[2]: 1 2
Jun 28 00:33:34 mel-pm2 corosync[793213]:   [MAIN  ] Completed service synchronization, ready to provide service.

This is my current `/etc/corosync/corosync.conf` - I attempted to add a ring1 using IPv4 - so the two nodes have paths via IPv4 and IPv6 to try:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: mel-pm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: <ipv6 prefix>:100::1
    ring1_addr: 172.31.1.1
  }
  node {
    name: mel-pm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: <ipv6 prefix>:100::2
    ring1_addr: 172.31.1.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Melbourne
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token: 3000
}

I tried extending the token timeout to 3000ms - but this doesn't seem to have corrected anything.
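For what it's worth, this is roughly how I've been watching the links while the drops happen - nothing exotic, just the standard tools:

Code:
# per-link knet status as corosync sees it
corosync-cfgtool -s

# quorum / membership view from the Proxmox side
pvecm status

# follow corosync's log to catch the flaps as they happen
journalctl -u corosync -f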

Both systems are plugged into the same switch, so there is no firewall or WAN link or similar between the nodes...

Has anyone come across this before and maybe resolved it?
 
Actually, I think this might be the source of the issue. On the Intel NUC, I see the following in dmesg when a full hang of the cluster happens:

Code:
[293502.911056] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                  TDH                  <fe>
                  TDT                  <65>
                  next_to_use          <65>
                  next_to_clean        <fd>
                buffer_info[next_to_clean]:
                  time_stamp           <1045e7288>
                  next_to_watch        <fe>
                  jiffies              <1045e7400>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[293504.899112] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                  TDH                  <fe>
                  TDT                  <65>
                  next_to_use          <65>
                  next_to_clean        <fd>
                buffer_info[next_to_clean]:
                  time_stamp           <1045e7288>
                  next_to_watch        <fe>
                  jiffies              <1045e75f1>
                  next_to_watch.status <0>
                MAC Status             <80083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
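To catch these as they happen rather than trawling dmesg afterwards, following the kernel log works well enough - something like:

Code:
# watch the kernel log live and flag the e1000e hangs
dmesg -wT | grep --line-buffered -i "Hardware Unit Hang"

# or the same via journald
journalctl -k -f | grep --line-buffered -i "Hardware Unit Hang"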

Testing changes to `/etc/network/interfaces` as follows:

Code:
iface eno1 inet manual
        post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"
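After bringing the interface back up, the offload state can be double-checked with ethtool to confirm the post-up actually took effect:

Code:
# confirm TSO / GSO are really off on eno1
ethtool -k eno1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'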
 
So, even after the above, I'm still seeing the corosync problems. They still manifest as:

Code:
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] link: host: 1 link: 0 is down
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 has no active links
Jun 28 12:21:15 mel-pm2 corosync[1008]:   [TOTEM ] Token has not been received in 2250 ms
Jun 28 12:21:16 mel-pm2 corosync[1008]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [QUORUM] Sync members[2]: 1 2
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [TOTEM ] A new membership (1.2393) was formed. Members
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [QUORUM] Members[2]: 1 2
Jun 28 12:21:19 mel-pm2 corosync[1008]:   [MAIN  ] Completed service synchronization, ready to provide service.

I've swapped the Ethernet out for a USB3 Realtek RTL8153 Gigabit Ethernet adapter - with the Intel NUCs there is no way to change the network card, as it's all onboard.
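For anyone wanting to do the same, the change is just pointing the bridge at the USB NIC instead of eno1 in /etc/network/interfaces - something along these lines (the enx... name and addresses are placeholders; the real name is derived from the adapter's MAC):

Code:
auto vmbr0
iface vmbr0 inet static
        address <node address>/24
        gateway <gateway>
        bridge-ports enx001122334455
        bridge-stp off
        bridge-fd 0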

Testing continues....
 
After running for many hours using the Realtek USB3 network adapter, there has not been a single corosync / network failure.

I have made contact with Intel and supplied the relevant team with a whole heap of configuration and hardware details - as the NUC is a 100% Intel-provided product, the problem can't be put down to parts from a different manufacturer.
 
Thanks for keeping this thread going - and also for taking up the effort to resolve the issue with some e1000e NICs - please keep us posted about any results!
 
I've had some correspondence from Intel.

The suggestion is to create a blacklist file for the following modules, e.g.:

File: /etc/modprobe.d/intel-blacklist.conf

With the following content:

Code:
blacklist mei_me
blacklist mei_hdcp
blacklist mei
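For completeness, the usual follow-up after creating the file is to regenerate the initramfs and then confirm after the reboot that the modules stayed out:

Code:
# pick the blacklist up in the initramfs as well
update-initramfs -u

# after rebooting, confirm none of the mei modules are loaded
lsmod | grep mei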

After rebooting with this blacklist in place, I no longer seem to see network dropouts with other nodes on the cluster, and the network doesn't seem to become unavailable.

Further testing is required - as I have only made these changes today - so we'll only really get the full picture after several days of operation...

For the record, this is on an Intel NUC:

System Information
Manufacturer: Intel(R) Client Systems
Product Name: NUC8v7PNH
Version: K60013-402
...
SKU Number: BKNUC8v7PNH
Family: PN
 
Just to follow up on this - I haven't had a network failure at all since blacklisting those modules.

The onboard NIC is now functioning as I'd expect.

The Intel team seems to be pointing at a BIOS problem - but I'm not sure who at Intel to relay this to, or who would handle that area...
 
