Hi,
I have a 3-node cluster (new installation with the newest kernel, "Linux 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-4 (2023-07-07T04:22Z)").
One of the nodes (N-2) went offline 4 times today (the cluster and the nodes were installed today).
I see trouble in the corosync log, but I have no idea how to solve it. This node ran on Proxmox 7 for over a year.
I use a simple 1 Gbit network.
All nodes are on the same switch. Only N-2 goes offline.
Here are the logs from the moment it went offline (monitoring alarm at 18:55).
Log:
Code:
Jul 11 18:40:07 N-2 corosync[1528]: [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:50:23 N-2 pmxcfs[33890]: [status] notice: cpg_send_message retried 13 times
Jul 11 18:50:30 N-2 corosync[1528]: [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:50:30 N-2 corosync[1528]: [QUORUM] Sync members[3]: 1 2 3
Jul 11 18:50:30 N-2 corosync[1528]: [TOTEM ] A new membership (1.2d7) was formed. Members
Jul 11 18:50:30 N-2 corosync[1528]: [QUORUM] Members[3]: 1 2 3
Jul 11 18:50:30 N-2 corosync[1528]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 11 18:50:36 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:50:36 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:50:36 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:50:38 N-2 corosync[1528]: [KNET ] rx: host: 3 link: 1 is up
Jul 11 18:50:38 N-2 corosync[1528]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:50:38 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:50:38 N-2 corosync[1528]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 11 18:50:55 N-2 sshd[420595]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:50:55 N-2 sshd[420595]: Connection closed by 10.0.0.31 port 49148
Jul 11 18:51:18 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:51:18 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:18 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:51:20 N-2 corosync[1528]: [KNET ] rx: host: 3 link: 1 is up
Jul 11 18:51:20 N-2 corosync[1528]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:51:20 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:20 N-2 corosync[1528]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 11 18:51:48 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:51:48 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:48 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:51:50 N-2 corosync[1528]: [KNET ] rx: host: 3 link: 1 is up
Jul 11 18:51:50 N-2 corosync[1528]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:51:50 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:50 N-2 corosync[1528]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 11 18:51:55 N-2 sshd[421282]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:51:55 N-2 sshd[421282]: Connection closed by 10.0.0.31 port 42238
Jul 11 18:52:55 N-2 sshd[421965]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:52:55 N-2 sshd[421965]: Connection closed by 10.0.0.31 port 45466
Jul 11 18:53:07 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:53:07 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:07 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:53:09 N-2 corosync[1528]: [KNET ] rx: host: 3 link: 1 is up
Jul 11 18:53:09 N-2 corosync[1528]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:53:09 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:09 N-2 corosync[1528]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 11 18:53:53 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:53:53 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:53 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:53:54 N-2 corosync[1528]: [KNET ] rx: host: 3 link: 1 is up
Jul 11 18:53:54 N-2 corosync[1528]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:53:54 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:54 N-2 corosync[1528]: [KNET ] pmtud: Global data MTU changed to: 1397
Jul 11 18:53:55 N-2 sshd[422791]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:53:55 N-2 sshd[422791]: Connection closed by 10.0.0.31 port 47236
Jul 11 18:53:58 N-2 corosync[1528]: [TOTEM ] Retransmit List: 3de
Jul 11 18:54:09 N-2 corosync[1528]: [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:54:09 N-2 pmxcfs[33890]: [status] notice: received log
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] link: host: 3 link: 1 is down
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] link: host: 1 link: 1 is down
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] host: host: 3 has no active links
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jul 11 18:54:38 N-2 corosync[1528]: [KNET ] host: host: 1 has no active links
Jul 11 18:54:39 N-2 corosync[1528]: [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:54:40 N-2 corosync[1528]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jul 11 18:54:43 N-2 corosync[1528]: [QUORUM] Sync members[1]: 2
Jul 11 18:54:43 N-2 corosync[1528]: [QUORUM] Sync left[2]: 1 3
Jul 11 18:54:43 N-2 corosync[1528]: [TOTEM ] A new membership (2.2db) was formed. Members left: 1 3
Jul 11 18:54:43 N-2 corosync[1528]: [TOTEM ] Failed to receive the leave message. failed: 1 3
Jul 11 18:54:43 N-2 corosync[1528]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 11 18:54:43 N-2 corosync[1528]: [QUORUM] Members[1]: 2
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] notice: members: 2/33890
Jul 11 18:54:43 N-2 corosync[1528]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 11 18:54:43 N-2 pmxcfs[33890]: [status] notice: node lost quorum
Jul 11 18:54:43 N-2 pmxcfs[33890]: [status] notice: members: 2/33890
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] crit: received write while not quorate - trigger resync
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] crit: leaving CPG group
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] notice: start cluster connection
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] crit: cpg_join failed: 14
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] crit: can't initialize service
Jul 11 18:54:44 N-2 pve-ha-crm[46243]: status change slave => wait_for_quorum
Jul 11 18:54:44 N-2 pve-ha-lrm[45585]: lost lock 'ha_agent_N-2_lock - cfs lock update failed - Device or resource busy
Jul 11 18:54:46 N-2 pve-ha-lrm[45585]: status change active => lost_agent_lock
Jul 11 18:54:50 N-2 pmxcfs[33890]: [dcdb] notice: members: 2/33890
Jul 11 18:54:50 N-2 pmxcfs[33890]: [dcdb] notice: all data is up to date
Jul 11 18:54:52 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:54:52 N-2 pvestatd[1566]: status update time (10.297 seconds)
Jul 11 18:55:02 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:55:02 N-2 pvestatd[1566]: status update time (10.320 seconds)
Jul 11 18:55:07 N-2 kernel: ------------[ cut here ]------------
Jul 11 18:55:07 N-2 kernel: NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out
Jul 11 18:55:07 N-2 kernel: WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
Jul 11 18:55:07 N-2 kernel: Modules linked in: tcp_diag inet_diag 8021q garp mrp veth nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_hda_codec_hdmi snd_sof_pci_intel_apl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_hda_codec_realtek snd_hda_codec_generic snd_soc_avs intel_rapl_msr snd_soc_hda_codec intel_rapl_common intel_pmc_bxt snd_soc_skl intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core x86_pkg_temp_thermal intel_powerclamp snd_soc_hdac_hda coretemp snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel snd_soc_acpi mei_hdcp mei_pxp snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine
Jul 11 18:55:07 N-2 kernel: snd_hda_intel i915 kvm dell_wmi irqbypass snd_intel_dspcfg ledtrig_audio crct10dif_pclmul polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha512_ssse3 aesni_intel crypto_simd drm_buddy snd_hda_codec snd_hda_core ttm cryptd dell_smbios snd_hwdep drm_display_helper rapl cec snd_pcm rc_core dcdbas sparse_keymap intel_cstate snd_timer wmi_bmof dell_wmi_descriptor drm_kms_helper ucsi_acpi pcspkr i2c_algo_bit mei_me syscopyarea typec_ucsi sysfillrect ee1004 sysimgblt typec snd mei soundcore mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 uas usb_storage btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci intel_lpss_pci xhci_pci_renesas sdhci_pci intel_lpss crc32_pclmul cqhci i2c_i801 i2c_smbus r8169 realtek sdhci xhci_hcd idma64 ahci libahci video wmi pinctrl_geminilake
Jul 11 18:55:07 N-2 kernel: CPU: 3 PID: 0 Comm: swapper/3 Tainted: P O 6.2.16-3-pve #1
Jul 11 18:55:07 N-2 kernel: Hardware name: Dell Inc. Wyse 5070 Thin Client/02DXT3, BIOS 1.1.4 12/06/2018
Jul 11 18:55:07 N-2 kernel: RIP: 0010:dev_watchdog+0x23a/0x250
Jul 11 18:55:07 N-2 kernel: Code: 00 e9 2b ff ff ff 48 89 df c6 05 8a 6f 7d 01 01 e8 6b 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 58 64 60 8b 48 89 c2 e8 06 ab 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Jul 11 18:55:07 N-2 kernel: RSP: 0018:ffff9d58c0184e38 EFLAGS: 00010246
Jul 11 18:55:07 N-2 kernel: RAX: 0000000000000000 RBX: ffff8c6551794000 RCX: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RBP: ffff9d58c0184e68 R08: 0000000000000000 R09: 0000000000000000
Jul 11 18:55:07 N-2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c65517944c8
Jul 11 18:55:07 N-2 kernel: R13: ffff8c655179441c R14: 0000000000000000 R15: 0000000000000000
Jul 11 18:55:07 N-2 kernel: FS: 0000000000000000(0000) GS:ffff8c68afd80000(0000) knlGS:0000000000000000
Jul 11 18:55:07 N-2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 11 18:55:07 N-2 kernel: CR2: 00007f81e400bef8 CR3: 00000001b3610000 CR4: 0000000000352ee0
Jul 11 18:55:07 N-2 kernel: Call Trace:
Jul 11 18:55:07 N-2 kernel: <IRQ>
Jul 11 18:55:07 N-2 kernel: ? __pfx_dev_watchdog+0x10/0x10
Jul 11 18:55:07 N-2 kernel: call_timer_fn+0x29/0x160
Jul 11 18:55:07 N-2 kernel: ? __pfx_dev_watchdog+0x10/0x10
Jul 11 18:55:07 N-2 kernel: __run_timers+0x259/0x310
Jul 11 18:55:07 N-2 kernel: run_timer_softirq+0x1d/0x40
Jul 11 18:55:07 N-2 kernel: __do_softirq+0xd6/0x346
Jul 11 18:55:07 N-2 kernel: ? hrtimer_interrupt+0x11f/0x250
Jul 11 18:55:07 N-2 kernel: __irq_exit_rcu+0xa2/0xd0
Jul 11 18:55:07 N-2 kernel: irq_exit_rcu+0xe/0x20
Jul 11 18:55:07 N-2 kernel: sysvec_apic_timer_interrupt+0x92/0xd0
Jul 11 18:55:07 N-2 kernel: </IRQ>
Jul 11 18:55:07 N-2 kernel: <TASK>
Jul 11 18:55:07 N-2 kernel: asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jul 11 18:55:07 N-2 kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Jul 11 18:55:07 N-2 kernel: Code: 2a 77 75 e8 54 7e 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 82 86 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
Jul 11 18:55:07 N-2 kernel: RSP: 0018:ffff9d58c00dfe38 EFLAGS: 00000246
Jul 11 18:55:07 N-2 kernel: RAX: 0000000000000000 RBX: ffff8c68afdbd900 RCX: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RBP: ffff9d58c00dfe88 R08: 0000000000000000 R09: 0000000000000000
Jul 11 18:55:07 N-2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8c0c33a0
Jul 11 18:55:07 N-2 kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000044e8f1bb65e
Jul 11 18:55:07 N-2 kernel: ? cpuidle_enter_state+0xce/0x6f0
Jul 11 18:55:07 N-2 kernel: cpuidle_enter+0x2e/0x50
Jul 11 18:55:07 N-2 kernel: do_idle+0x216/0x2a0
Jul 11 18:55:07 N-2 kernel: cpu_startup_entry+0x1d/0x20
Jul 11 18:55:07 N-2 kernel: start_secondary+0x122/0x160
Jul 11 18:55:07 N-2 kernel: secondary_startup_64_no_verify+0xe5/0xeb
Jul 11 18:55:07 N-2 kernel: </TASK>
Jul 11 18:55:07 N-2 kernel: ---[ end trace 0000000000000000 ]---
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:08 N-2 pvescheduler[423712]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 11 18:55:08 N-2 pvescheduler[423711]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 11 18:55:13 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:55:13 N-2 pvestatd[1566]: status update time (10.272 seconds)
-- Reboot --
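
To make a log like the one above easier to read, here is a minimal Python sketch that counts the corosync "link is down" events per host/link, counts the "Token has not been received" messages, and picks out the kernel's NETDEV WATCHDOG transmit timeouts. The function name `summarize` and the regular expressions are my own assumptions, written only against the message formats visible in this log; other corosync or kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... corosync[1528]: [KNET ] link: host: 3 link: 1 is down"
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")

# Matches the kernel transmit-timeout warning, for example:
#   "NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out"
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out"
)


def summarize(lines):
    """Summarize corosync link flaps and NIC watchdog timeouts.

    `lines` is any iterable of log lines (a list, or an open file).
    Returns a dict with:
      link_down      - Counter keyed by (host_id, link_id)
      token_timeouts - number of "Token has not been received" lines
      watchdog       - list of (interface, driver) tuples
    """
    link_down = Counter()
    token_timeouts = 0
    watchdog = []

    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            link_down[(int(match.group(1)), int(match.group(2)))] += 1
            continue

        match = WATCHDOG.search(line)
        if match:
            watchdog.append((match.group(1), match.group(2)))
            continue

        if "Token has not been received" in line:
            token_timeouts += 1

    return {
        "link_down": link_down,
        "token_timeouts": token_timeouts,
        "watchdog": watchdog,
    }
```

Calling `summarize()` on the saved log lines (for example on an open text file containing the excerpt above) gives a quick overview of which peers the node keeps losing and whether the network driver itself reported a transmit timeout. It only summarizes what is in the log and does not diagnose the cause.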