Proxmox Cluster Failure

krs342

Active Member
Dec 22, 2017
Hello,

We have a 30-node Proxmox cluster that uses only clustering, without the HA feature. Today we had a very strange problem: many of our nodes failed and went down immediately, without any warning. We rebooted the nodes one by one and recovered from the problem. There is a strange error in the log file about the problem; it seems all the Ethernet card connections were closed. What could be the problem here? How can I avoid this in the future? Please check the log file for the full error output.

We have 4 switches: 2 of the switches are used for the cluster heartbeat and 2 are used for the internet and storage connections.

root@kvm115:~# uname -a
Linux kvm115 5.3.13-1-pve #1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100) x86_64 GNU/Linux

Jan 20 13:58:54 kvm115 kernel: [2172027.473647] ------------[ cut here ]------------
Jan 20 13:58:54 kvm115 kernel: [2172027.473652] NETDEV WATCHDOG: eno1 (igb): transmit queue 1 timed out
Jan 20 13:58:54 kvm115 kernel: [2172027.473673] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
Jan 20 13:58:54 kvm115 kernel: [2172027.473674] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter 8021q garp mrp bonding softdog nfnetlink_log nfnetlink tcp_bbr sch_fq intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel zfs(PO) aesni_intel aes_x86_64 crypto_simd zunicode(PO) cryptd glue_helper zlua(PO) zavl(PO) icp(PO) intel_cstate ipmi_ssif dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio mgag200 drm_vram_helper ttm drm_kms_helper joydev drm intel_rapl_perf input_leds fb_sys_fops syscopyarea sysfillrect sysimgblt pcspkr mei_me mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 hid_generic
Jan 20 13:58:54 kvm115 kernel: [2172027.473708] usbkbd usbmouse usbhid hid btrfs xor zstd_compress raid6_pq libcrc32c megaraid_sas isci ahci libsas sfc i2c_i801 libahci lpc_ich scsi_transport_sas igb mtd i2c_algo_bit mdio dca wmi

Jan 20 13:58:54 kvm115 kernel: [2172027.473718] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 5.3.13-1-pve #1

Jan 20 13:58:54 kvm115 kernel: [2172027.473719] Hardware name: Supermicro SYS-2027PR-HC1R/X9DRT-P, BIOS 3.3 11/28/2018

Jan 20 13:58:54 kvm115 kernel: [2172027.473721] RIP: 0010:dev_watchdog+0x264/0x270

Jan 20 13:58:54 kvm115 kernel: [2172027.473723] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 f2 6f ea 00 01 e8 10 ed fa ff 89 d9 4c 89 ee 48 c7 c7 88 cc e1 8d 48 89 c2 e8 0d 3a 73 ff <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41

Jan 20 13:58:54 kvm115 kernel: [2172027.473724] RSP: 0018:ffffaeb580003e58 EFLAGS: 00010282

Jan 20 13:58:54 kvm115 kernel: [2172027.473724] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000006

Jan 20 13:58:54 kvm115 kernel: [2172027.473725] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff951aff617440

Jan 20 13:58:54 kvm115 kernel: [2172027.473726] RBP: ffffaeb580003e88 R08: 00000000000006a7 R09: 0000000000000004

Jan 20 13:58:54 kvm115 kernel: [2172027.473726] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000008

Jan 20 13:58:54 kvm115 kernel: [2172027.473727] R13: ffff951af13bc000 R14: ffff951af13bc480 R15: ffff951af1bb4940

Jan 20 13:58:54 kvm115 kernel: [2172027.473728] FS: 0000000000000000(0000) GS:ffff951aff600000(0000) knlGS:0000000000000000

Jan 20 13:58:54 kvm115 kernel: [2172027.473729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Jan 20 13:58:54 kvm115 kernel: [2172027.473729] CR2: ffffffffff600400 CR3: 00000008a3a76002 CR4: 00000000001626f0

Jan 20 13:58:54 kvm115 kernel: [2172027.473730] Call Trace:

Jan 20 13:58:54 kvm115 kernel: [2172027.473732] <IRQ>

Jan 20 13:58:54 kvm115 kernel: [2172027.473736] ? pfifo_fast_enqueue+0x160/0x160

Jan 20 13:58:54 kvm115 kernel: [2172027.473740] call_timer_fn+0x32/0x130

Jan 20 13:58:54 kvm115 kernel: [2172027.473741] run_timer_softirq+0x19d/0x420

Jan 20 13:58:54 kvm115 kernel: [2172027.473743] ? enqueue_hrtimer+0x3c/0x90

Jan 20 13:58:54 kvm115 kernel: [2172027.473744] ? ktime_get+0x40/0xa0

Jan 20 13:58:54 kvm115 kernel: [2172027.473748] ? lapic_next_deadline+0x26/0x30

Jan 20 13:58:54 kvm115 kernel: [2172027.473750] ? clockevents_program_event+0x93/0xf0

Jan 20 13:58:54 kvm115 kernel: [2172027.473754] __do_softirq+0xdc/0x2d4

Jan 20 13:58:54 kvm115 kernel: [2172027.473758] irq_exit+0xa9/0xb0

Jan 20 13:58:54 kvm115 kernel: [2172027.473760] smp_apic_timer_interrupt+0x79/0x130

Jan 20 13:58:54 kvm115 kernel: [2172027.473761] apic_timer_interrupt+0xf/0x20

Jan 20 13:58:54 kvm115 kernel: [2172027.473762] </IRQ>

Jan 20 13:58:54 kvm115 kernel: [2172027.473765] RIP: 0010:cpuidle_enter_state+0xbd/0x450

Jan 20 13:58:54 kvm115 kernel: [2172027.473766] Code: ff e8 87 ea 82 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 6a 51 89 ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 cf 01 00 00 41 c7 44 24 10 00 00 00 00 48 83 c4 18

Jan 20 13:58:54 kvm115 kernel: [2172027.473767] RSP: 0018:ffffffff8e003de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

Jan 20 13:58:54 kvm115 kernel: [2172027.473768] RAX: ffff951aff62a740 RBX: ffffffff8e158900 RCX: 000000000000001f

Jan 20 13:58:54 kvm115 kernel: [2172027.473769] RDX: 0007b7728e023512 RSI: 000000002db6db6d RDI: 0000000000000000

Jan 20 13:58:54 kvm115 kernel: [2172027.473769] RBP: ffffffff8e003e28 R08: 0000000000000002 R09: 0000000000029fc0

Jan 20 13:58:54 kvm115 kernel: [2172027.473770] R10: 001a7cad49111492 R11: ffff951aff6294c4 R12: ffffce957f801080

Jan 20 13:58:54 kvm115 kernel: [2172027.473770] R13: 0000000000000002 R14: ffffffff8e1589d8 R15: ffffffff8e1589c0

Jan 20 13:58:54 kvm115 kernel: [2172027.473772] ? cpuidle_enter_state+0x99/0x450

Jan 20 13:58:54 kvm115 kernel: [2172027.473773] cpuidle_enter+0x2e/0x40

Jan 20 13:58:54 kvm115 kernel: [2172027.473776] call_cpuidle+0x23/0x40

Jan 20 13:58:54 kvm115 kernel: [2172027.473778] do_idle+0x22c/0x270

Jan 20 13:58:54 kvm115 kernel: [2172027.473779] cpu_startup_entry+0x1d/0x20

Jan 20 13:58:54 kvm115 kernel: [2172027.473782] rest_init+0xae/0xb0

Jan 20 13:58:54 kvm115 kernel: [2172027.473786] arch_call_rest_init+0xe/0x1b

Jan 20 13:58:54 kvm115 kernel: [2172027.473787] start_kernel+0x56c/0x58b

Jan 20 13:58:54 kvm115 kernel: [2172027.473789] x86_64_start_reservations+0x24/0x26

Jan 20 13:58:54 kvm115 kernel: [2172027.473790] x86_64_start_kernel+0x74/0x77

Jan 20 13:58:54 kvm115 kernel: [2172027.473793] secondary_startup_64+0xa4/0xb0

Jan 20 13:58:54 kvm115 kernel: [2172027.473795] ---[ end trace 71856e23d9807cb0 ]---

Jan 20 13:58:54 kvm115 kernel: [2172027.473811] igb 0000:06:00.0 eno1: Reset adapter

Jan 20 13:58:55 kvm115 kernel: [2172028.369667] igb 0000:06:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Jan 20 14:02:28 kvm115 kernel: [2172241.389386] clocksource: timekeeping watchdog on CPU2: Marking clocksource 'tsc' as unstable because the skew is too large:
Jan 20 14:02:28 kvm115 kernel: [2172241.389389] clocksource: 'hpet' wd_now: 9ed4fdb9 wd_last: f1b8f686 mask: ffffffff
Jan 20 14:02:28 kvm115 kernel: [2172241.389389] clocksource: 'tsc' cs_now: 1a7d38be1bad26 cs_last: 1a7cb481962d4f mask: ffffffffffffffff
Jan 20 14:02:28 kvm115 kernel: [2172241.389393] tsc: Marking TSC unstable due to clocksource watchdog
Jan 20 14:02:28 kvm115 kernel: [2172241.396987] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jan 20 14:02:28 kvm115 kernel: [2172241.396992] sched_clock: Marking unstable (2172225767286081, 15628372287)<-(2172241602053943, -205072864)
Jan 20 14:02:28 kvm115 kernel: [2172241.395899] clocksource: Switched to clocksource hpet
Jan 20 14:02:28 kvm115 kernel: [2172241.408643] igb 0000:06:00.1 eno2: Reset adapter
Jan 20 14:02:28 kvm115 kernel: [2172241.418126] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Jan 20 14:02:28 kvm115 kernel: [2172241.452063] igb 0000:06:00.0 eno1: Reset adapter
Jan 20 14:02:28 kvm115 kernel: [2172241.470220] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Down

Jan 20 14:02:32 kvm115 kernel: [2172245.234205] igb 0000:06:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Jan 20 14:02:32 kvm115 kernel: [2172245.594092] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Jan 20 14:03:48 kvm115 kernel: [2172321.365740] igb 0000:06:00.0 eno1: Reset adapter

Jan 20 14:03:50 kvm115 kernel: [2172323.253745] igb 0000:06:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Jan 20 14:05:28 kvm115 kernel: [2172421.436414] igb 0000:06:00.1 eno2: Reset adapter

Jan 20 14:05:28 kvm115 kernel: [2172421.442088] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Jan 20 14:05:28 kvm115 kernel: [2172421.480473] igb 0000:06:00.0 eno1: Reset adapter

Jan 20 14:05:28 kvm115 kernel: [2172421.500474] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Down
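
To make the pattern in the log above easier to see, here is a minimal Python sketch that counts the igb events per interface. The regular expressions are written against the message formats shown in this log only; the function name `summarize` and the `SAMPLE` list are made up for illustration, and other driver or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns based on the igb / NETDEV WATCHDOG messages in the log above.
# Assumption: other kernel or driver versions may format these differently.
PATTERNS = {
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: (\S+) \(igb\): transmit queue \d+ timed out"),
    "reset": re.compile(r"igb \S+ (\S+): Reset adapter"),
    "link_up": re.compile(r"igb: (\S+) NIC Link is Up"),
    "link_down": re.compile(r"igb: (\S+) NIC Link is Down"),
}


def summarize(lines):
    """Count igb events per (interface, event kind)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), kind)] += 1
                break  # a log line carries at most one of these events
    return counts


# A few lines copied from the kvm115 log above (hypothetical variable name).
SAMPLE = [
    "Jan 20 13:58:54 kvm115 kernel: [2172027.473652] NETDEV WATCHDOG: eno1 (igb): transmit queue 1 timed out",
    "Jan 20 13:58:54 kvm115 kernel: [2172027.473811] igb 0000:06:00.0 eno1: Reset adapter",
    "Jan 20 13:58:55 kvm115 kernel: [2172028.369667] igb 0000:06:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX",
    "Jan 20 14:02:28 kvm115 kernel: [2172241.408643] igb 0000:06:00.1 eno2: Reset adapter",
    "Jan 20 14:02:28 kvm115 kernel: [2172241.470220] igb 0000:06:00.1 eno2: igb: eno2 NIC Link is Down",
]

if __name__ == "__main__":
    for (iface, kind), count in sorted(summarize(SAMPLE).items()):
        print(f"{iface:6} {kind:10} {count}")
```

Feeding it the full kernel log of a node (for example, a saved copy of the log read line by line) would show how often each port was reset and whether both ports were affected equally.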
 


BTW, strangely, I see the same message in the kernel log on 10 nodes:

Jan 20 13:58:52 kvm120 kernel: [2171922.343850] NETDEV WATCHDOG: eno1 (igb): transmit queue 6 timed out
Jan 20 13:58:52 kvm120 kernel: [2171922.343866] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
Jan 20 13:58:52 kvm120 kernel: [2171922.343866] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter 8021q garp mrp bonding softdog nfnetlink_log nfnetlink tcp_bbr sch_fq intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd zfs(PO) glue_helper intel_cstate zunicode(PO) zlua(PO) zavl(PO) icp(PO) ipmi_ssif dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio mgag200 drm_vram_helper ttm drm_kms_helper intel_rapl_perf joydev input_leds drm fb_sys_fops syscopyarea sysfillrect sysimgblt pcspkr mei_me mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables
Jan 20 13:58:52 kvm120 kernel: [2171922.343903] autofs4 hid_generic usbkbd usbmouse usbhid hid btrfs xor zstd_compress raid6_pq libcrc32c megaraid_sas isci ahci i2c_i801 libsas libahci sfc lpc_ich scsi_transport_sas igb mtd i2c_algo_bit mdio dca wmi
Jan 20 13:58:52 kvm120 kernel: [2171922.343913] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 5.3.13-1-pve #1
Jan 20 13:58:52 kvm120 kernel: [2171922.343914] Hardware name: Supermicro SYS-2027PR-HC1R/X9DRT-P, BIOS 3.3 11/28/2018
Jan 20 13:58:52 kvm120 kernel: [2171922.343916] RIP: 0010:dev_watchdog+0x264/0x270
Jan 20 13:58:52 kvm120 kernel: [2171922.343918] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 f2 6f ea 00 01 e8 10 ed fa ff 89 d9 4c 89 ee 48 c7 c7 88 cc 01 a5 48 89 c2 e8 0d 3a 73 ff <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
Jan 20 13:58:52 kvm120 kernel: [2171922.343918] RSP: 0018:ffffbae040003e58 EFLAGS: 00010282
Jan 20 13:58:52 kvm120 kernel: [2171922.343919] RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000006
Jan 20 13:58:52 kvm120 kernel: [2171922.343920] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffffa1037f617440
Jan 20 13:58:52 kvm120 kernel: [2171922.343921] RBP: ffffbae040003e88 R08: 0000000000000707 R09: 0000000000000004
Jan 20 13:58:52 kvm120 kernel: [2171922.343921] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000008
Jan 20 13:58:52 kvm120 kernel: [2171922.343922] R13: ffffa103716a8000 R14: ffffa103716a8480 R15: ffffa10377cd2940
Jan 20 13:58:52 kvm120 kernel: [2171922.343923] FS: 0000000000000000(0000) GS:ffffa1037f600000(0000) knlGS:0000000000000000
Jan 20 13:58:52 kvm120 kernel: [2171922.343923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 20 13:58:52 kvm120 kernel: [2171922.343924] CR2: 00002aeec40a5000 CR3: 00000010fa40a001 CR4: 00000000001626f0
Jan 20 13:58:52 kvm120 kernel: [2171922.343925] Call Trace:
Jan 20 13:58:52 kvm120 kernel: [2171922.343926] <IRQ>
Jan 20 13:58:52 kvm120 kernel: [2171922.343930] ? pfifo_fast_enqueue+0x160/0x160
Jan 20 13:58:52 kvm120 kernel: [2171922.343932] call_timer_fn+0x32/0x130
Jan 20 13:58:52 kvm120 kernel: [2171922.343934] run_timer_softirq+0x19d/0x420
Jan 20 13:58:52 kvm120 kernel: [2171922.343935] ? enqueue_hrtimer+0x3c/0x90
Jan 20 13:58:52 kvm120 kernel: [2171922.343936] ? ktime_get+0x40/0xa0
Jan 20 13:58:52 kvm120 kernel: [2171922.343939] ? lapic_next_deadline+0x26/0x30
Jan 20 13:58:52 kvm120 kernel: [2171922.343941] ? clockevents_program_event+0x93/0xf0
Jan 20 13:58:52 kvm120 kernel: [2171922.343944] __do_softirq+0xdc/0x2d4
Jan 20 13:58:52 kvm120 kernel: [2171922.343947] irq_exit+0xa9/0xb0
Jan 20 13:58:52 kvm120 kernel: [2171922.343948] smp_apic_timer_interrupt+0x79/0x130
Jan 20 13:58:52 kvm120 kernel: [2171922.343949] apic_timer_interrupt+0xf/0x20
Jan 20 13:58:52 kvm120 kernel: [2171922.343950] </IRQ>
Jan 20 13:58:52 kvm120 kernel: [2171922.343952] RIP: 0010:cpuidle_enter_state+0xbd/0x450
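
Since the watchdog fired on kvm115 and kvm120 within about two seconds of each other, it may help to line up the first timeout per node. The following Python sketch is one way to do that, not a definitive tool: the names `first_timeouts` and `spread_seconds` are hypothetical, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches the syslog prefix plus the NETDEV WATCHDOG message seen in both logs.
LINE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: .*"
    r"NETDEV WATCHDOG: (\S+) \(\w+\): transmit queue (\d+) timed out"
)


def first_timeouts(lines, year=2020):
    """Return {host: (timestamp, interface, queue)} for each host's first TX timeout.

    Assumption: `year` is supplied by the caller, since syslog omits it.
    """
    first = {}
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp, host, iface, queue = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if host not in first or when < first[host][0]:
            first[host] = (when, iface, int(queue))
    return first


def spread_seconds(first):
    """Seconds between the earliest and latest first-timeout across hosts."""
    times = [entry[0] for entry in first.values()]
    return (max(times) - min(times)).total_seconds()


# The two watchdog lines quoted in this thread (hypothetical variable name).
SAMPLE = [
    "Jan 20 13:58:54 kvm115 kernel: [2172027.473652] NETDEV WATCHDOG: eno1 (igb): transmit queue 1 timed out",
    "Jan 20 13:58:52 kvm120 kernel: [2171922.343850] NETDEV WATCHDOG: eno1 (igb): transmit queue 6 timed out",
]

if __name__ == "__main__":
    result = first_timeouts(SAMPLE)
    for host, (when, iface, queue) in sorted(result.items()):
        print(host, when.isoformat(), iface, f"queue {queue}")
    print("spread in seconds:", spread_seconds(result))
```

A small spread across many nodes would point toward something the nodes share rather than a fault on one machine, but that is only a hint for where to look, not a diagnosis.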
 
