One node of my 3-node mesh cluster crashes

jayjayjay

New Member
Feb 26, 2023
Hello,

A few months ago, I installed a 3-node mesh Proxmox cluster with Ceph for my home lab. It works great, except that one of the 3 nodes often crashes in a strange manner.

My 3 nodes are exactly the same. I use the Ethernet NIC on the motherboard for communication outside the cluster and two USB 2.5 Gbps NICs for internal communication (Proxmox cluster + Ceph).

Every few days, I lose both USB 2.5 Gbps NICs at the same time:
Code:
Oct 28 09:52:18 proxmox02 kernel: [99883.281988] ------------[ cut here ]------------
Oct 28 09:52:18 proxmox02 kernel: [99883.282006] NETDEV WATCHDOG: enx3c4937036ac3 (r8152): transmit queue 0 timed out
Oct 28 09:52:18 proxmox02 kernel: [99883.282030] WARNING: CPU: 13 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
Oct 28 09:52:18 proxmox02 kernel: [99883.282043] Modules linked in: binfmt_misc veth rbd tcp_diag udp_diag inet_diag ceph libceph fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nfnetlink_cttimeout bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet intel_rapl_msr amdgpu intel_rapl_common edac_mce_amd r8152 mii btusb snd_hda_codec_realtek iwlmvm kvm_amd btrtl snd_hda_codec_generic btbcm iommu_v2 btintel snd_acp3x_rn ledtrig_audio gpu_sched snd_acp3x_pdm_dma snd_soc_dmic snd_hda_codec_hdmi kvm snd_soc_core bluetooth mac80211 irqbypass snd_compress snd_hda_intel drm_ttm_helper ac97_bus ecdh_generic crct10dif_pclmul ttm snd_intel_dspcfg libarc4 snd_pcm_dmaengine ecc ghash_clmulni_intel snd_intel_sdw_acpi aesni_intel snd_hda_codec drm_kms_helper iwlwifi crypto_simd snd_hda_core cec snd_pci_acp6x
Oct 28 09:52:18 proxmox02 kernel: [99883.282329]  snd_hwdep cryptd wmi_bmof rc_core snd_pcm rapl i2c_algo_bit cfg80211 snd_pci_acp5x fb_sys_fops snd_timer syscopyarea sysfillrect snd snd_rn_pci_acp3x sysimgblt soundcore pcspkr k10temp snd_pci_acp3x serio_raw efi_pstore ccp cm32181 industrialio mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb xhci_pci crc32_pclmul xhci_pci_renesas nvme ahci psmouse i2c_piix4 libahci xhci_hcd nvme_core r8169 realtek wmi video i2c_hid_acpi i2c_hid hid
Oct 28 09:52:18 proxmox02 kernel: [99883.282684] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O      5.15.116-1-pve #1
Oct 28 09:52:18 proxmox02 kernel: [99883.282691] Hardware name: AZW SER/SER, BIOS 5800H501 01/04/2023
Oct 28 09:52:18 proxmox02 kernel: [99883.282695] RIP: 0010:dev_watchdog+0x277/0x280
Oct 28 09:52:18 proxmox02 kernel: [99883.282702] Code: eb 97 48 8b 5d d0 c6 05 69 89 6c 01 01 48 89 df e8 be 4f f9 ff 44 89 e1 48 89 de 48 c7 c7 c0 1a 8b a6 48 89 c2 e8 44 99 1c 00 <0f> 0b eb 80 e9 62 f1 25 00 0f 1f 44 00 00 55 49 89 ca 48 89 e5 41
Oct 28 09:52:18 proxmox02 kernel: [99883.282707] RSP: 0018:ffffc14980498e70 EFLAGS: 00010282
Oct 28 09:52:18 proxmox02 kernel: [99883.282715] RAX: 0000000000000000 RBX: ffff9d7cd05df000 RCX: 0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282720] RDX: ffff9d831116cb40 RSI: ffff9d8311160580 RDI: 0000000000000300
Oct 28 09:52:18 proxmox02 kernel: [99883.282724] RBP: ffffc14980498ea8 R08: 0000000000000003 R09: 0000000000000001
Oct 28 09:52:18 proxmox02 kernel: [99883.282728] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282731] R13: ffff9d7ccf98e080 R14: 0000000000000001 R15: ffff9d7cd05df4c0
Oct 28 09:52:18 proxmox02 kernel: [99883.282735] FS:  0000000000000000(0000) GS:ffff9d8311140000(0000) knlGS:0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 09:52:18 proxmox02 kernel: [99883.282745] CR2: 000055e861126000 CR3: 000000018c610000 CR4: 0000000000750ee0
Oct 28 09:52:18 proxmox02 kernel: [99883.282749] PKRU: 55555554
Oct 28 09:52:18 proxmox02 kernel: [99883.282753] Call Trace:
Oct 28 09:52:18 proxmox02 kernel: [99883.282757]  <IRQ>
Oct 28 09:52:18 proxmox02 kernel: [99883.282762]  ? show_regs.cold+0x1a/0x1f
Oct 28 09:52:18 proxmox02 kernel: [99883.282772]  ? dev_watchdog+0x277/0x280
Oct 28 09:52:18 proxmox02 kernel: [99883.282777]  ? __warn+0x8c/0x100
Oct 28 09:52:18 proxmox02 kernel: [99883.282785]  ? dev_watchdog+0x277/0x280
Oct 28 09:52:18 proxmox02 kernel: [99883.282790]  ? report_bug+0xa4/0xd0
Oct 28 09:52:18 proxmox02 kernel: [99883.282799]  ? arch_irq_work_raise+0x3a/0x50
Oct 28 09:52:18 proxmox02 kernel: [99883.282809]  ? handle_bug+0x39/0x90
Oct 28 09:52:18 proxmox02 kernel: [99883.282817]  ? exc_invalid_op+0x19/0x70
Oct 28 09:52:18 proxmox02 kernel: [99883.282822]  ? asm_exc_invalid_op+0x1b/0x20
Oct 28 09:52:18 proxmox02 kernel: [99883.282831]  ? dev_watchdog+0x277/0x280
Oct 28 09:52:18 proxmox02 kernel: [99883.282835]  ? pfifo_fast_enqueue+0x160/0x160
Oct 28 09:52:18 proxmox02 kernel: [99883.282840]  call_timer_fn+0x2b/0x120
Oct 28 09:52:18 proxmox02 kernel: [99883.282847]  __run_timers.part.0+0x1e1/0x270
Oct 28 09:52:18 proxmox02 kernel: [99883.282852]  ? ktime_get+0x46/0xc0
Oct 28 09:52:18 proxmox02 kernel: [99883.282858]  ? native_x2apic_icr_read+0x20/0x20
Oct 28 09:52:18 proxmox02 kernel: [99883.282865]  ? lapic_next_event+0x21/0x30
Oct 28 09:52:18 proxmox02 kernel: [99883.282874]  ? clockevents_program_event+0xab/0x130
Oct 28 09:52:18 proxmox02 kernel: [99883.282888]  run_timer_softirq+0x2a/0x60
Oct 28 09:52:18 proxmox02 kernel: [99883.282894]  __do_softirq+0xd9/0x2ea
Oct 28 09:52:18 proxmox02 kernel: [99883.282899]  irq_exit_rcu+0x94/0xc0
Oct 28 09:52:18 proxmox02 kernel: [99883.282906]  sysvec_apic_timer_interrupt+0x80/0x90
Oct 28 09:52:18 proxmox02 kernel: [99883.282912]  </IRQ>
Oct 28 09:52:18 proxmox02 kernel: [99883.282915]  <TASK>
Oct 28 09:52:18 proxmox02 kernel: [99883.282919]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Oct 28 09:52:18 proxmox02 kernel: [99883.282926] RIP: 0010:cpuidle_enter_state+0xd9/0x620
Oct 28 09:52:18 proxmox02 kernel: [99883.282933] Code: 3d 74 26 3e 5a e8 07 e8 6c ff 49 89 c7 0f 1f 44 00 00 31 ff e8 48 f5 6c ff 80 7d d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
Oct 28 09:52:18 proxmox02 kernel: [99883.282939] RSP: 0018:ffffc149801dfe38 EFLAGS: 00000246
Oct 28 09:52:18 proxmox02 kernel: [99883.282945] RAX: ffff9d83111714c0 RBX: ffff9d7ccb29b800 RCX: 0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282949] RDX: 0000000000002d6e RSI: 00000000281348c4 RDI: 0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282953] RBP: ffffc149801dfe88 R08: 00005ad7e38a5d7b R09: 0000000000000000
Oct 28 09:52:18 proxmox02 kernel: [99883.282957] R10: 0000000000000000 R11: ffffc149801dfd28 R12: ffffffffa72e92a0
Oct 28 09:52:18 proxmox02 kernel: [99883.282961] R13: 0000000000000003 R14: 0000000000000003 R15: 00005ad7e38a5d7b
Oct 28 09:52:18 proxmox02 kernel: [99883.282966]  ? cpuidle_enter_state+0xc8/0x620
Oct 28 09:52:18 proxmox02 kernel: [99883.282972]  cpuidle_enter+0x2e/0x50
Oct 28 09:52:18 proxmox02 kernel: [99883.282977]  do_idle+0x20d/0x2b0
Oct 28 09:52:18 proxmox02 kernel: [99883.282985]  cpu_startup_entry+0x20/0x30
Oct 28 09:52:18 proxmox02 kernel: [99883.282992]  start_secondary+0x12a/0x180
Oct 28 09:52:18 proxmox02 kernel: [99883.282998]  secondary_startup_64_no_verify+0xc2/0xcb
Oct 28 09:52:18 proxmox02 kernel: [99883.283007]  </TASK>
Oct 28 09:52:18 proxmox02 kernel: [99883.283012] ---[ end trace fabe145faad1aac2 ]---
Oct 28 09:52:18 proxmox02 kernel: [99883.283020] r8152 2-1:1.0 enx3c4937036ac3: Tx timeout
Oct 28 09:52:21 proxmox02 corosync[2172]:   [KNET  ] rx: host: 1 link: 0 is up
Oct 28 09:52:21 proxmox02 corosync[2172]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Oct 28 09:52:21 proxmox02 corosync[2172]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 28 09:52:21 proxmox02 corosync[2172]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 28 09:52:23 proxmox02 kernel: [99888.145633] r8152 2-1:1.0 enx3c4937036ac3: Tx timeout
Oct 28 09:52:23 proxmox02 kernel: [99888.401622] xhci_hcd 0000:04:00.3: xHCI host not responding to stop endpoint command.
Oct 28 09:52:23 proxmox02 kernel: [99888.401631] xhci_hcd 0000:04:00.3: USBSTS: 0x00000000
Oct 28 09:52:23 proxmox02 kernel: [99888.417453] xhci_hcd 0000:04:00.3: xHCI host controller not responding, assume dead
Oct 28 09:52:23 proxmox02 kernel: [99888.417467] xhci_hcd 0000:04:00.3: HC died; cleaning up
Oct 28 09:52:23 proxmox02 kernel: [99888.417487] r8152 2-2:1.0 enx3c4937036af0: Stop submitting intr, status -108
Oct 28 09:52:23 proxmox02 kernel: [99888.417498] r8152 2-1:1.0 enx3c4937036ac3: Tx status -108
Oct 28 09:52:23 proxmox02 kernel: [99888.417500] r8152 2-1:1.0 enx3c4937036ac3: Tx status -108
Oct 28 09:52:23 proxmox02 kernel: [99888.417501] r8152 2-1:1.0 enx3c4937036ac3: Tx status -108
Oct 28 09:52:23 proxmox02 kernel: [99888.417501] r8152 2-1:1.0 enx3c4937036ac3: Tx status -108
Oct 28 09:52:23 proxmox02 kernel: [99888.417514] usb 1-4: USB disconnect, device number 2
Oct 28 09:52:23 proxmox02 kernel: [99888.417905] r8152-cfgselector 2-1: USB disconnect, device number 2
Oct 28 09:52:23 proxmox02 kernel: [99888.419550] r8152 2-1:1.0 enx3c4937036ac3: Get ether addr fail
Oct 28 09:52:23 proxmox02 kernel: [99888.419558] r8152 2-1:1.0 enx3c4937036ac3: Promiscuous mode enabled
Oct 28 09:52:23 proxmox02 kernel: [99888.419949] device enx3c4937036ac3 left promiscuous mode
Oct 28 09:52:23 proxmox02 kernel: [99888.486043] r8152-cfgselector 2-2: USB disconnect, device number 3
Oct 28 09:52:23 proxmox02 kernel: [99888.486658] device enx3c4937036af0 left promiscuous mode
Oct 28 09:52:24 proxmox02 ntpd[1890]: Deleting interface #10 enx3c4937036ac3, fe80::3e49:37ff:fe03:6ac3%4#123, interface stats: received=0, sent=0, dropped=0, active_time=99876 secs
Oct 28 09:52:24 proxmox02 ntpd[1890]: Deleting interface #11 enx3c4937036af0, fe80::3e49:37ff:fe03:6af0%5#123, interface stats: received=0, sent=0, dropped=0, active_time=99876 secs
Oct 28 09:52:25 proxmox02 corosync[2172]:   [KNET  ] link: host: 3 link: 0 is down
Oct 28 09:52:25 proxmox02 corosync[2172]:   [KNET  ] link: host: 1 link: 0 is down
Oct 28 09:52:25 proxmox02 corosync[2172]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Oct 28 09:52:25 proxmox02 corosync[2172]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 28 09:52:26 proxmox02 pvestatd[2372]: got timeout
Oct 28 09:52:31 proxmox02 pvestatd[2372]: got timeout
Oct 28 09:52:31 proxmox02 pvestatd[2372]: status update time (7.088 seconds)
Oct 28 09:52:32 proxmox02 kernel: [99897.459964] libceph: mon1 (1)10.10.10.202:6789 socket closed (con state OPEN)
Oct 28 09:52:32 proxmox02 kernel: [99897.460020] libceph: mon1 (1)10.10.10.202:6789 session lost, hunting for new mon
Oct 28 09:52:33 proxmox02 kernel: [99897.841214] libceph: mon2 (1)10.10.10.203:6789 socket closed (con state V1_BANNER)
Oct 28 09:52:34 proxmox02 ceph-osd: 2023-10-28T09:52:34.946+0200 7fc30017d700 -1 osd.2 7993 heartbeat_check: no reply from 10.10.10.201:6811 osd.1 since back 2023-10-28T09:52:13.111684+0200 front 2023-10-28T09:52:13.111663+0200 (oldest deadline 2023-10-28T09:52:34.791768+0200)
Oct 28 09:52:34 proxmox02 ceph-osd[2488]: 2023-10-28T09:52:34.946+0200 7fc30017d700 -1 osd.2 7993 heartbeat_check: no reply from 10.10.10.201:6811 osd.1 since back 2023-10-28T09:52:13.111684+0200 front 2023-10-28T09:52:13.111663+0200 (oldest deadline 2023-10-28T09:52:34.791768+0200)
Oct 28 09:52:34 proxmox02 ceph-osd[2488]: 2023-10-28T09:52:34.946+0200 7fc30017d700 -1 osd.2 7993 heartbeat_check: no reply from 10.10.10.201:6810 osd.4 since back 2023-10-28T09:52:13.111740+0200 front 2023-10-28T09:52:13.111749+0200 (oldest deadline 2023-10-28T09:52:34.791768+0200)
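
The log above follows a recognisable order: `Tx timeout` from r8152, then `HC died` from xhci_hcd, then `USB disconnect`. A minimal sketch for pulling those events out of a longer journal, so their timing can be compared between crashes, could look like the following. The function name `find_usb_events` and the marker names are hypothetical; the matched substrings are taken from the messages above, and the line format assumed is the syslog-style prefix shown in this thread.

```python
import re

# Substrings copied from the kernel messages in the log above.
MARKERS = {
    "tx_timeout": "Tx timeout",
    "hc_died": "HC died",
    "usb_disconnect": "USB disconnect",
}

# Assumed line format: "Oct 28 09:52:18 proxmox02 kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+kernel:\s+(.*)$")


def find_usb_events(lines):
    """Return (timestamp, host, event) tuples for kernel lines matching a marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a kernel line (e.g. corosync, ceph-osd)
        stamp, host, message = match.groups()
        for name, needle in MARKERS.items():
            if needle in message:
                events.append((stamp, host, name))
                break
    return events
```

It only filters and labels lines; it does not explain why the host controller stops responding.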

I can still access the node via SSH, as the third NIC is OK, but I no longer see my 2 USB NICs:
Code:
# lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 8087:0029 Intel Corp. AX200 Bluetooth
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

What I see on a working node:
Code:
# lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 8087:0029 Intel Corp. AX200 Bluetooth
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 003: ID 0bda:8156 Realtek Semiconductor Corp. USB 10/100/1G/2.5G LAN
Bus 002 Device 002: ID 0bda:8156 Realtek Semiconductor Corp. USB 10/100/1G/2.5G LAN
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 1d5c:7102 Fresco Logic Generic Billboard Device
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
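
Comparing the two `lsusb` listings by eye works for a handful of devices, but the comparison can also be scripted. The following is a minimal sketch, assuming the plain `lsusb` line format shown above; the names `device_ids` and `missing_devices` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# "Bus 002 Device 003: ID 0bda:8156 Realtek Semiconductor Corp. USB 10/100/1G/2.5G LAN"
LSUSB_RE = re.compile(r"^Bus \d{3} Device \d{3}: ID ([0-9a-f]{4}:[0-9a-f]{4})")


def device_ids(lsusb_output):
    """Count how many times each vendor:product ID appears in lsusb output."""
    counts = Counter()
    for line in lsusb_output.splitlines():
        match = LSUSB_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_devices(failed_output, working_output):
    """Return the IDs (with counts) present on the working node but not on the failed one."""
    return device_ids(working_output) - device_ids(failed_output)
```

Applied to the two listings above, it reports both `0bda:8156` adapters and the `1d5c:7102` device as missing on the failed node, while the root hubs and the Bluetooth device are present on both.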

I am not even able to reboot the server from the command line; the only way to solve the issue is to force a shutdown of the node via the physical button and start it again. Then it is OK for a few days, until the next crash.

Any help would be much appreciated.

Thanks a lot

Jay
 
Hi @jayjayjay - I am experiencing the same problem. I lose my network adapter every few days, with the same message:

Code:
Feb 27 01:01:52 proxmox-2 kernel: [109559.954737] xhci_hcd 0000:3e:00.0: xHCI host controller not responding, assume dead
Feb 27 01:01:52 proxmox-2 kernel: [109559.955051] xhci_hcd 0000:3e:00.0: HC died; cleaning up
Feb 27 01:01:52 proxmox-2 kernel: [109559.955418] r8152 4-1:1.0 enx00e04c3628da: Tx status -108
Feb 27 01:01:52 proxmox-2 kernel: [109559.955616] usb 4-1: USB disconnect, device number 2
Feb 27 01:01:52 proxmox-2 kernel: [109559.959961] bond0: (slave enx00e04c3628da): Releasing backup interface
Feb 27 01:01:52 proxmox-2 kernel: [109559.959984] device enx00e04c3628da left promiscuous mode

Have you discovered anything on your side?
 
I have exactly the same problem, and no fix yet. I suspect it could be related in some way to overheating, but I need to investigate this further.

I have a cluster of two Proxmox servers (pve1 and pve2). Each has its onboard 1 Gb Ethernet interface connected to the LAN and a USB-to-2.5 Gb Ethernet adapter that connects the two servers and is used for live migration. Every two to three days, the USB Ethernet interface suddenly disconnects, only on node pve2, and I have to reboot the node to temporarily fix the issue.

Below is the related log from journalctl:

Code:
Apr 18 02:02:05 pve2 kernel: r8152-cfgselector 2-1: USB disconnect, device number 2
Apr 18 02:02:05 pve2 kernel: xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Apr 18 02:02:05 pve2 kernel: r8152 2-1:1.0 enx00e066680053: Tx status -108
Apr 18 02:02:05 pve2 kernel: xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Apr 18 02:02:05 pve2 kernel: vmbr1: port 1(enx00e066680053) entered disabled state
Apr 18 02:02:05 pve2 kernel: r8152 2-1:1.0 enx00e066680053 (unregistering): left allmulticast mode
Apr 18 02:02:05 pve2 kernel: r8152 2-1:1.0 enx00e066680053 (unregistering): left promiscuous mode
Apr 18 02:02:05 pve2 kernel: vmbr1: port 1(enx00e066680053) entered disabled state
Apr 18 02:02:05 pve2 kernel: usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
Apr 18 02:02:05 pve2 kernel: usb 2-1: New USB device found, idVendor=0bda, idProduct=8156, bcdDevice=31.04
Apr 18 02:02:05 pve2 kernel: usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Apr 18 02:02:05 pve2 kernel: usb 2-1: Product: USB 10/100/1G/2.5G LAN
Apr 18 02:02:05 pve2 kernel: usb 2-1: Manufacturer: Realtek
Apr 18 02:02:05 pve2 kernel: usb 2-1: SerialNumber: 4013000001
Apr 18 02:02:05 pve2 kernel: cdc_ncm 2-1:2.0: MAC-Address: 00:e0:66:68:00:53
Apr 18 02:02:05 pve2 kernel: cdc_ncm 2-1:2.0: setting rx_max = 16384
Apr 18 02:02:05 pve2 kernel: cdc_ncm 2-1:2.0: setting tx_max = 16384
Apr 18 02:02:05 pve2 kernel: cdc_ncm 2-1:2.0 eth0: register 'cdc_ncm' at usb-0000:00:14.0-1, CDC NCM (NO ZLP), 00:e0:66:68:00:53
Apr 18 02:02:05 pve2 kernel: cdc_ncm 2-1:2.0 eth0: unregister 'cdc_ncm' usb-0000:00:14.0-1, CDC NCM (NO ZLP)
Apr 18 02:02:05 pve2 kernel: r8152-cfgselector 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
Apr 18 02:02:05 pve2 kernel: r8152 2-1:1.0: load rtl8156b-2 v3 10/20/23 successfully
Apr 18 02:02:06 pve2 kernel: r8152 2-1:1.0 eth0: v1.12.13
Apr 18 02:02:06 pve2 kernel: usbcore: registered new interface driver cdc_wdm
Apr 18 02:02:06 pve2 kernel: usbcore: registered new interface driver cdc_mbim
Apr 18 02:02:06 pve2 kernel: r8152 2-1:1.0 enx00e066680053: renamed from eth0

It seems that the USB device gets disconnected and is reconnected immediately afterwards, but connectivity is not re-established.
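
One thing that might be worth checking after such a reconnect (an assumption, not a confirmed diagnosis) is whether the re-created interface is still attached to `vmbr1`, since the log shows the bridge port entering the disabled state before the device is registered again. On Linux, an interface attached to a bridge or bond has a `master` symlink under `/sys/class/net/<iface>/`. A minimal sketch, with the hypothetical helper name `master_of` and a `sysfs_root` parameter added only to make it testable:

```python
import os


def master_of(iface, sysfs_root="/sys/class/net"):
    """Return the bridge/bond an interface is attached to, or None if it has none."""
    master_link = os.path.join(sysfs_root, iface, "master")
    if not os.path.islink(master_link):
        return None  # interface missing, or not attached to any bridge/bond
    return os.path.basename(os.readlink(master_link))
```

For example, `master_of("enx00e066680053")` would return `"vmbr1"` if the interface is attached to that bridge, and `None` if it is not.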
 
