[SOLVED] 2nd Node continues to go offline

ljhardy

New Member
Mar 8, 2025
I just added a second node and migrated some VMs to it. The node keeps dropping off the network, although it is still accessible from the local terminal. Is this a NIC issue?

Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] link: host: 1 link: 0 is down
Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 12 13:32:09 proxmox2 corosync[951]: [KNET ] host: host: 1 has no active links
Apr 12 13:32:10 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010f9640> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:11 proxmox2 corosync[951]: [TOTEM ] Token has not been received in 2250 ms
Apr 12 13:32:11 proxmox2 corosync[951]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Apr 12 13:32:12 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010f9e00> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:14 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fa600> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Sync members[1]: 2
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Sync left[1]: 1
Apr 12 13:32:15 proxmox2 corosync[951]: [TOTEM ] A new membership (2.49) was formed. Members left: 1
Apr 12 13:32:15 proxmox2 corosync[951]: [TOTEM ] Failed to receive the leave message. failed: 1
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] notice: members: 2/860
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [status] notice: members: 2/860
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 12 13:32:15 proxmox2 corosync[951]: [QUORUM] Members[1]: 2
Apr 12 13:32:15 proxmox2 corosync[951]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [status] notice: node lost quorum
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] crit: received write while not quorate - trigger resync
Apr 12 13:32:15 proxmox2 pmxcfs[860]: [dcdb] crit: leaving CPG group
Apr 12 13:32:15 proxmox2 pve-ha-lrm[1020]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmox2/lrm_status.tmp.1020' - Permission denied
Apr 12 13:32:15 proxmox2 pvestatd[986]: storage 'omv' is not online
Apr 12 13:32:15 proxmox2 pvestatd[986]: status update time (5.135 seconds)
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] notice: start cluster connection
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] crit: cpg_join failed: 14
Apr 12 13:32:16 proxmox2 pmxcfs[860]: [dcdb] crit: can't initialize service
Apr 12 13:32:16 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fadc0> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:18 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fb580> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:20 proxmox2 kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: TDH <c6> TDT <41> next_to_use <41> next_to_clean <c5>buffer_info[next_to_clean]: time_stamp <1010f8c1c> next_to_watch <c6> jiffies <1010fbd40> next_to_watch.status <0>MAC Status <80083>PHY Status <796d>PHY 1000BASE-T Status <3800>PHY Extended Status <3000>PCI Status <10>
Apr 12 13:32:22 proxmox2 pmxcfs[860]: [dcdb] notice: members: 2/860
Apr 12 13:32:22 proxmox2 pmxcfs[860]: [dcdb] notice: all data is up to date
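To see how often the NIC hangs relative to the corosync link drops, a log excerpt like the one above can be tallied with a short script. This is only a rough sketch: the regular expression and the `summarize` helper are my own, written against the line format shown in this post, and other kernel or driver versions may word the messages differently.

```python
import re
import sys
from collections import Counter

# Matches kernel lines of the form seen in the log above, e.g.
#   "... kernel: e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang: ..."
# Group 1 is the driver name, group 2 the interface name.
HANG_RE = re.compile(r"kernel: (\S+) \S+ (\S+): Detected Hardware Unit Hang")


def summarize(lines):
    """Count hang reports per (driver, interface) and corosync KNET link-down lines."""
    hangs = Counter()
    link_down = 0
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[(match.group(1), match.group(2))] += 1
        elif "[KNET" in line and "is down" in line:
            link_down += 1
    return hangs, link_down


if __name__ == "__main__":
    # Reads log text from standard input.
    hangs, link_down = summarize(sys.stdin)
    for (driver, iface), count in hangs.items():
        print(f"{driver} on {iface}: {count} hang report(s)")
    print(f"corosync link-down events: {link_down}")
```

It could be fed from the journal, for example `journalctl -b | python3 hangs.py` (the file name is arbitrary). If the hang reports cluster around every link-down event, that points at the NIC rather than at corosync itself.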
 
The hardware hang looks similar to:


caused by this bug:

https://bugzilla.proxmox.com/show_bug.cgi?id=6273
 
This may be the solution. I've applied it and am now waiting to see whether the node stays online:

 
The original thread is 6 years old, but the problem appears to have been re-introduced.
 
Yes, so far so good. The node has been up for about 8 hours; before I applied the change to the e1000e NIC via ethtool, it was dropping every hour or so.
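For anyone wanting to confirm that an ethtool change is still in effect (such settings do not survive a reboot unless made persistent), the output of `ethtool -k <iface>` can be checked with a small parser. This is a sketch under assumptions: the post does not say which settings were changed, so the feature names checked by default below (TCP and generic segmentation offload) are my guess based on the workaround commonly suggested for e1000e hangs, and `parse_offloads` and `still_enabled` are hypothetical helper names.

```python
def parse_offloads(ethtool_k_output):
    """Parse the text printed by `ethtool -k <iface>` into {feature_name: bool}."""
    features = {}
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "name: value" line
        words = value.split()
        # Values look like "on", "off" or "off [fixed]"; the header line
        # ("Features for enp0s25:") has no value and is skipped.
        if words and words[0] in ("on", "off"):
            features[name] = words[0] == "on"
    return features


def still_enabled(features, names=("tcp-segmentation-offload",
                                   "generic-segmentation-offload")):
    """Return the listed features that are reported as 'on'.

    The default names are an assumption, not taken from the original post.
    """
    return [name for name in names if features.get(name)]
```

With the interface from the log, one could capture `ethtool -k enp0s25` (for example via `subprocess.run(["ethtool", "-k", "enp0s25"], capture_output=True, text=True).stdout`) and pass the text to `parse_offloads`. If `still_enabled` returns an empty list, the offloads assumed here are off.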