[SOLVED] 2nd Node continues to go offline

ljhardy · Apr 12, 2025

I just added a second node and migrated some VMs to it. The node keeps going offline, but it is still accessible via the local terminal. Is this a NIC issue?

Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] link: host: 1 link: 0 is down
Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] host: host: 1 has no active links
Apr 12 13:32:10 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010f9640> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:11 proxmox2 corosync[951]: [TOTEM ] Token has not been received in 2250 ms
Apr 12 13:32:11 proxmox2 corosync[951]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Apr 12 13:32:12 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010f9e00> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:14 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fa600> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Sync members[1]: 2
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Sync left[1]: 1
Apr 12 13:32:15 proxmox2 corosync[951]: [TOTEM ] A new membership (2.49) was formed. Members left: 1
Apr 12 13:32:15 proxmox2 corosync[951]: [TOTEM ] Failed to receive the leave message. failed: 1
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] notice: members: 2/860
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [status] notice: members: 2/860
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Members[1]: 2
Apr 12 13:32:15 proxmox2 corosync[951]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [status] notice: node lost quorum
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] crit: received write while not quorate - trigger resync
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] crit: leaving CPG group
Apr 12 13:32:15 proxmox2 pve-ha-lrm[1020]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmox2/lrm_status.tmp.1020' - Permission denied
Apr 12 13:32:15 proxmox2 pvestatd[986]: storage 'omv' is not online
Apr 12 13:32:15 proxmox2 pvestatd[986]: status update time (5.135 seconds)
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] notice: start cluster connection
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] crit: cpg_join failed: 14
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] crit: can't initialize service
Apr 12 13:32:16 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fadc0> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:18 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fb580> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:20 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fbd40> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:22 proxmox2 pmxcfs[860]: [dcdb] notice: members: 2/860
Apr 12 13:32:22 proxmox2 pmxcfs[860]: [dcdb] notice: all data is up to date
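
One way to check whether the NIC is the trigger is to line up the kernel's e1000e hang reports against corosync's link-down events in the same log. The following is a minimal Python sketch, not a Proxmox tool: the function name `summarize` and both regular expressions are assumptions written against the log format pasted above.

```python
import re
import sys
from collections import Counter

# Patterns written against the journal lines pasted above (assumed format).
HANG = re.compile(r"kernel: e1000e \S+ (\S+): Detected Hardware Unit Hang")
LINK_DOWN = re.compile(
    r"corosync\[\d+\]:\s+\[KNET\s*\] link: host: (\d+) link: (\d+) is down"
)


def summarize(lines):
    """Count e1000e hang reports per interface and collect corosync link-down events."""
    hangs = Counter()
    link_down = []
    for line in lines:
        match = HANG.search(line)
        if match:
            hangs[match.group(1)] += 1
            continue
        match = LINK_DOWN.search(line)
        if match:
            # (remote host id, link number) as reported by corosync
            link_down.append((int(match.group(1)), int(match.group(2))))
    return hangs, link_down


if __name__ == "__main__":
    # Reads log text from standard input and prints a short summary.
    hangs, link_down = summarize(sys.stdin)
    for iface, count in hangs.items():
        print(f"{iface}: {count} hardware unit hang report(s)")
    for host, link in link_down:
        print(f"corosync link {link} to host {host} went down")
```

Saved under a hypothetical name such as `hang_summary.py`, it could be fed with `journalctl -b | python3 hang_summary.py`. If every link-down event sits next to a burst of hang reports on the cluster interface, the NIC or its driver is the likely cause rather than corosync itself.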

jimmycav · Apr 13, 2025

The hardware hang looks similar to:

Thread 'e1000 driver hang'

Sep 23, 2019

In the past week we are seeing random e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang failures across all our nodes, even on different hardware hosts. Must do a reset of the host.

There are lots of references to this issue going back 5+ years. Was there a driver change with the latest updates? We've run years with this hardware without issue. Now just this week it's popping up all over.

Kernel Version Linux 5.0.21-2-pve #1 SMP PVE 5.0.21-3 (Thu, 05 Sep 2019 13:56:01 +0200)
PVE Manager Version pve-manager/6.0-7/2898402

Sep 22 20:03:08 vmhost03 kernel: [154458.471981] e1000e...

caused by this bug:

https://bugzilla.proxmox.com/show_bug.cgi?id=6273

ljhardy · Apr 13, 2025

I saw that, but it was 6 years ago. Still a bug?

ljhardy · Apr 13, 2025

This may be the solution. I've applied it and am now waiting to see if the node stays online:

Thread 'Trap error on e1000 network adapter'

Feb 28, 2022

Hello everyone, I realized that I often see trap errors on Intel card:
Pc DELL T40

[394961.232725] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <9f>
TDT <df>
next_to_use <df>
next_to_clean <9e>
buffer_info[next_to_clean]:
time_stamp <105e1836d>
next_to_watch <9f>
jiffies <105e185e0>
next_to_watch.status <0>
MAC Status...

jimmycav · Apr 14, 2025

ljhardy said:
This may be the solution. I've applied it and am now waiting to see if the node stays online:

Thread 'Trap error on e1000 network adapter'

frankz · Replies: 18 · Forum: Proxmox VE: Installation and configuration

The original thread is 6 years old, but the bug appears to have been reintroduced.

ljhardy · Apr 14, 2025

Yes, so far so good though. The node has been up about 8 hours; it was dropping every hour or so before I applied the change to the e1000 settings via ethtool.
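
The post does not say which ethtool settings were changed. A workaround often reported for e1000e "Detected Hardware Unit Hang" messages is turning off TCP and generic segmentation offload, so the sketch below assumes that. The interface name `enp0s25` comes from the log in the first post; the helper names `offload_off_command` and `apply`, and the choice of `tso` and `gso`, are assumptions and not something confirmed in this thread.

```python
import shlex
import subprocess


def offload_off_command(iface, features=("tso", "gso")):
    """Build the `ethtool -K` argument list that switches the given offloads off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def apply(iface, dry_run=True):
    """Print the command (dry run) or execute it; executing requires root."""
    cmd = offload_off_command(iface)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Assumed interface name, taken from the kernel messages in the first post.
    apply("enp0s25")
```

By default this only prints the command so it can be reviewed before running. Settings applied with `ethtool -K` do not survive a reboot, so if the workaround helps it needs to be reapplied at boot, for example from a `post-up` line in `/etc/network/interfaces`.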

ljhardy · Apr 14, 2025

Still running, I'll consider this solved!
