Node suddenly going offline, repeatedly

triumphtruth

New Member
Jan 3, 2025
Hello Everyone,
I am facing quite a puzzling issue lately, and I am not sure what is causing it. At first I thought the Netbird VM was responsible, but even after shutting it down the issue persists. Read below for details:

I have a 2-node Proxmox cluster at home.

Node 1

Type: Primary Node

Hosts: OPNsense firewall, Home Assistant, Traefik, and a few more things.

Memory: 16GB

Storage: 128 GB, most of it free

X------------------------------------------------X-------------------------------------------X
Node 2

Type: Secondary Node

Hosts: Nextcloud, Jellyfin, Netbird and a few more things.

Memory: 64GB

Storage: 128 GB SSD, plus 2 x 12 TB HDDs in a RAID 1 configuration set up within Proxmox. Nextcloud, Jellyfin, etc. are installed on the HDDs, and all the data (pictures and so on) is stored there as well.

This started happening after I set up Netbird on Node 2. But even after shutting down that VM and not using Netbird for around 2 days, the issue recurred, and now it repeats every few hours.

I need help finding the root cause of this issue: what logs should I look at, and where? By the way, all workloads are assigned static IPs from OPNsense, and every time it is only Node 2 that fails; Node 1 never has any problems whatsoever.

Problem:

Every now and then, without any resource crunch or visible network issue, Node 2 gets disconnected from the cluster. At that point nothing can be done with the node, and none of its workloads remain accessible. Once I power it off (by pressing the shutdown button and letting it shut down gracefully) and start it up again, it works like nothing happened.

I am not sure what to look for to resolve this problem. I have been running this setup for around 6 months, and only this month did things start to break apart like this.

Just today I hit this issue twice within about an 8-hour period. What could be going wrong?
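In case it helps, these are the commands I plan to run on Node 2 the next time this happens (assuming I can still get console access; these are standard Proxmox/systemd tools, and the timestamps are just placeholders to be adjusted to the incident window):

```shell
# Cluster membership and quorum as seen from this node
pvecm status

# Per-link knet status for the corosync rings
corosync-cfgtool -s

# Cluster-stack logs around the failure window (adjust the times)
journalctl -u corosync -u pve-cluster --since "14:00" --until "14:15"

# Kernel messages from the NIC driver
journalctl -k | grep -i e1000e
```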

Best Regards
 
Update 1

This is the exact time things failed, and below are the logs I found:

Apr 22 14:06:12 pve2 corosync[1498]: [KNET ] link: host: 1 link: 0 is down
Apr 22 14:06:12 pve2 corosync[1498]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 22 14:06:12 pve2 corosync[1498]: [KNET ] host: host: 1 has no active links
Apr 22 14:06:12 pve2 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang: TDH <68> TDT <a6> next_to_use <a6> next_to_clean <67> buffer_info[next_to_clean]: time_stamp <101174011> next_to_watch <68> jiffies <101174700> next_to_watch.status <0> MAC Status <80083> PHY Status <796d> PHY 1000BASE-T Status <3800> PHY Extended Status <3000> PCI Status <10>
Apr 22 14:06:12 pve2 corosync[1498]: [TOTEM ] Token has not been received in 2250 ms
Apr 22 14:06:13 pve2 corosync[1498]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Apr 22 14:06:14 pve2 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang: TDH <68> TDT <a6> next_to_use <a6> next_to_clean <67> buffer_info[next_to_clean]: time_stamp <101174011> next_to_watch <68> jiffies <101174ec0> next_to_watch.status <0> MAC Status <80083> PHY Status <796d> PHY 1000BASE-T Status <3800> PHY Extended Status <3000> PCI Status <10>
Apr 22 14:06:16 pve2 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang: TDH <68> TDT <a6> next_to_use <a6> next_to_clean <67> buffer_info[next_to_clean]: time_stamp <101174011> next_to_watch <68> jiffies <101175680> next_to_watch.status <0> MAC Status <80083> PHY Status <796d> PHY 1000BASE-T Status <3800> PHY Extended Status <3000> PCI Status <10>
Apr 22 14:06:17 pve2 corosync[1498]: [QUORUM] Sync members[1]: 2
Apr 22 14:06:17 pve2 corosync[1498]: [QUORUM] Sync left[1]: 1
Apr 22 14:06:17 pve2 corosync[1498]: [TOTEM ] A new membership (2.16d) was formed. Members left: 1
Apr 22 14:06:17 pve2 corosync[1498]: [TOTEM ] Failed to receive the leave message. failed: 1
Apr 22 14:06:17 pve2 pmxcfs[1406]: [dcdb] notice: members: 2/1406
Apr 22 14:06:17 pve2 corosync[1498]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 22 14:06:17 pve2 pmxcfs[1406]: [status] notice: members: 2/1406
Apr 22 14:06:17 pve2 corosync[1498]: [QUORUM] Members[1]: 2
Apr 22 14:06:17 pve2 corosync[1498]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 22 14:06:17 pve2 pmxcfs[1406]: [status] notice: node lost quorum
Apr 22 14:06:18 pve2 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang: TDH <68> TDT <a6> next_to_use <a6> next_to_clean <67> buffer_info[next_to_clean]: time_stamp <101174011> next_to_watch <68> jiffies <101175e80> next_to_watch.status <0> MAC Status <80083> PHY Status <796d> PHY 1000BASE-T Status <3800> PHY Extended Status <3000> PCI Status <10>

After this, the "Detected Hardware Unit Hang" message kept repeating in the logs until I shut the node down; after the restart, everything came up just fine.
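From searching around, this e1000e "Detected Hardware Unit Hang" looks like a known issue with Intel onboard NICs, often worked around by disabling hardware offloads. I am considering trying something like the sketch below; enp0s31f6 is the NIC name from my logs, and I have not yet confirmed this actually fixes the problem:

```shell
# Temporary test (does not survive a reboot): turn off the offload
# features most often implicated in e1000e hangs.
ethtool -K enp0s31f6 tso off gso off gro off tx off rx off

# To make it persistent, the same command can be added as a post-up
# line for the interface in /etc/network/interfaces, e.g.:
#
#   iface enp0s31f6 inet manual
#       post-up /usr/sbin/ethtool -K enp0s31f6 tso off gso off gro off tx off rx off
```

If the hangs stop after this, that would point at the NIC/driver rather than Netbird or the cluster stack.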