[SOLVED] Node goes offline

corin.corvus

Active Member
Apr 8, 2020
Hi,

I have a 3-node cluster (new installation with the newest kernel, "Linux 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-4 (2023-07-07T04:22Z)").

One of the nodes (N-2) went offline 4 times today (the cluster and the nodes were installed today).

I see trouble in the corosync log, but I have no idea how to solve it. On Proxmox 7 this node ran for over a year.
I use a simple 1 Gbit network.
All nodes are on the same switch. Only N-2 goes offline.
[Attached image: 1689095473612.png]

Here are the logs from the moment it went offline (monitoring alarm at 18:55):
Code:
Jul 11 18:40:07 N-2 corosync[1528]:   [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:50:23 N-2 pmxcfs[33890]: [status] notice: cpg_send_message retried 13 times
Jul 11 18:50:30 N-2 corosync[1528]:   [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:50:30 N-2 corosync[1528]:   [QUORUM] Sync members[3]: 1 2 3
Jul 11 18:50:30 N-2 corosync[1528]:   [TOTEM ] A new membership (1.2d7) was formed. Members
Jul 11 18:50:30 N-2 corosync[1528]:   [QUORUM] Members[3]: 1 2 3
Jul 11 18:50:30 N-2 corosync[1528]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 11 18:50:36 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:50:36 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:50:36 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:50:38 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 11 18:50:38 N-2 corosync[1528]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:50:38 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:50:38 N-2 corosync[1528]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 11 18:50:55 N-2 sshd[420595]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:50:55 N-2 sshd[420595]: Connection closed by 10.0.0.31 port 49148
Jul 11 18:51:18 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:51:18 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:18 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:51:20 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 11 18:51:20 N-2 corosync[1528]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:51:20 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:20 N-2 corosync[1528]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 11 18:51:48 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:51:48 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:48 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:51:50 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 11 18:51:50 N-2 corosync[1528]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:51:50 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:51:50 N-2 corosync[1528]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 11 18:51:55 N-2 sshd[421282]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:51:55 N-2 sshd[421282]: Connection closed by 10.0.0.31 port 42238
Jul 11 18:52:55 N-2 sshd[421965]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:52:55 N-2 sshd[421965]: Connection closed by 10.0.0.31 port 45466
Jul 11 18:53:07 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:53:07 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:07 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:53:09 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 11 18:53:09 N-2 corosync[1528]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:53:09 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:09 N-2 corosync[1528]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 11 18:53:53 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:53:53 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:53 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:53:54 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up
Jul 11 18:53:54 N-2 corosync[1528]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 11 18:53:54 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:53:54 N-2 corosync[1528]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 11 18:53:55 N-2 sshd[422791]: error: kex_exchange_identification: Connection closed by remote host
Jul 11 18:53:55 N-2 sshd[422791]: Connection closed by 10.0.0.31 port 47236
Jul 11 18:53:58 N-2 corosync[1528]:   [TOTEM ] Retransmit List: 3de
Jul 11 18:54:09 N-2 corosync[1528]:   [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:54:09 N-2 pmxcfs[33890]: [status] notice: received log
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] link: host: 1 link: 1 is down
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 1)
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] host: host: 3 has no active links
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] host: host: 1 has no active links
Jul 11 18:54:39 N-2 corosync[1528]:   [TOTEM ] Token has not been received in 2250 ms
Jul 11 18:54:40 N-2 corosync[1528]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jul 11 18:54:43 N-2 corosync[1528]:   [QUORUM] Sync members[1]: 2
Jul 11 18:54:43 N-2 corosync[1528]:   [QUORUM] Sync left[2]: 1 3
Jul 11 18:54:43 N-2 corosync[1528]:   [TOTEM ] A new membership (2.2db) was formed. Members left: 1 3
Jul 11 18:54:43 N-2 corosync[1528]:   [TOTEM ] Failed to receive the leave message. failed: 1 3
Jul 11 18:54:43 N-2 corosync[1528]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 11 18:54:43 N-2 corosync[1528]:   [QUORUM] Members[1]: 2
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] notice: members: 2/33890
Jul 11 18:54:43 N-2 corosync[1528]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 11 18:54:43 N-2 pmxcfs[33890]: [status] notice: node lost quorum
Jul 11 18:54:43 N-2 pmxcfs[33890]: [status] notice: members: 2/33890
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] crit: received write while not quorate - trigger resync
Jul 11 18:54:43 N-2 pmxcfs[33890]: [dcdb] crit: leaving CPG group
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] notice: start cluster connection
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] crit: cpg_join failed: 14
Jul 11 18:54:44 N-2 pmxcfs[33890]: [dcdb] crit: can't initialize service
Jul 11 18:54:44 N-2 pve-ha-crm[46243]: status change slave => wait_for_quorum
Jul 11 18:54:44 N-2 pve-ha-lrm[45585]: lost lock 'ha_agent_N-2_lock - cfs lock update failed - Device or resource busy
Jul 11 18:54:46 N-2 pve-ha-lrm[45585]: status change active => lost_agent_lock
Jul 11 18:54:50 N-2 pmxcfs[33890]: [dcdb] notice: members: 2/33890
Jul 11 18:54:50 N-2 pmxcfs[33890]: [dcdb] notice: all data is up to date
Jul 11 18:54:52 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:54:52 N-2 pvestatd[1566]: status update time (10.297 seconds)
Jul 11 18:55:02 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:55:02 N-2 pvestatd[1566]: status update time (10.320 seconds)
Jul 11 18:55:07 N-2 kernel: ------------[ cut here ]------------
Jul 11 18:55:07 N-2 kernel: NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out
Jul 11 18:55:07 N-2 kernel: WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
Jul 11 18:55:07 N-2 kernel: Modules linked in: tcp_diag inet_diag 8021q garp mrp veth nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_hda_codec_hdmi snd_sof_pci_intel_apl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_hda_codec_realtek snd_hda_codec_generic snd_soc_avs intel_rapl_msr snd_soc_hda_codec intel_rapl_common intel_pmc_bxt snd_soc_skl intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core x86_pkg_temp_thermal intel_powerclamp snd_soc_hdac_hda coretemp snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel snd_soc_acpi mei_hdcp mei_pxp snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine
Jul 11 18:55:07 N-2 kernel:  snd_hda_intel i915 kvm dell_wmi irqbypass snd_intel_dspcfg ledtrig_audio crct10dif_pclmul polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha512_ssse3 aesni_intel crypto_simd drm_buddy snd_hda_codec snd_hda_core ttm cryptd dell_smbios snd_hwdep drm_display_helper rapl cec snd_pcm rc_core dcdbas sparse_keymap intel_cstate snd_timer wmi_bmof dell_wmi_descriptor drm_kms_helper ucsi_acpi pcspkr i2c_algo_bit mei_me syscopyarea typec_ucsi sysfillrect ee1004 sysimgblt typec snd mei soundcore mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 uas usb_storage btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci intel_lpss_pci xhci_pci_renesas sdhci_pci intel_lpss crc32_pclmul cqhci i2c_i801 i2c_smbus r8169 realtek sdhci xhci_hcd idma64 ahci libahci video wmi pinctrl_geminilake
Jul 11 18:55:07 N-2 kernel: CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O       6.2.16-3-pve #1
Jul 11 18:55:07 N-2 kernel: Hardware name: Dell Inc. Wyse 5070 Thin Client/02DXT3, BIOS 1.1.4 12/06/2018
Jul 11 18:55:07 N-2 kernel: RIP: 0010:dev_watchdog+0x23a/0x250
Jul 11 18:55:07 N-2 kernel: Code: 00 e9 2b ff ff ff 48 89 df c6 05 8a 6f 7d 01 01 e8 6b 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 58 64 60 8b 48 89 c2 e8 06 ab 30 ff <0f> 0b e9 1c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
Jul 11 18:55:07 N-2 kernel: RSP: 0018:ffff9d58c0184e38 EFLAGS: 00010246
Jul 11 18:55:07 N-2 kernel: RAX: 0000000000000000 RBX: ffff8c6551794000 RCX: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RBP: ffff9d58c0184e68 R08: 0000000000000000 R09: 0000000000000000
Jul 11 18:55:07 N-2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c65517944c8
Jul 11 18:55:07 N-2 kernel: R13: ffff8c655179441c R14: 0000000000000000 R15: 0000000000000000
Jul 11 18:55:07 N-2 kernel: FS:  0000000000000000(0000) GS:ffff8c68afd80000(0000) knlGS:0000000000000000
Jul 11 18:55:07 N-2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 11 18:55:07 N-2 kernel: CR2: 00007f81e400bef8 CR3: 00000001b3610000 CR4: 0000000000352ee0
Jul 11 18:55:07 N-2 kernel: Call Trace:
Jul 11 18:55:07 N-2 kernel:  <IRQ>
Jul 11 18:55:07 N-2 kernel:  ? __pfx_dev_watchdog+0x10/0x10
Jul 11 18:55:07 N-2 kernel:  call_timer_fn+0x29/0x160
Jul 11 18:55:07 N-2 kernel:  ? __pfx_dev_watchdog+0x10/0x10
Jul 11 18:55:07 N-2 kernel:  __run_timers+0x259/0x310
Jul 11 18:55:07 N-2 kernel:  run_timer_softirq+0x1d/0x40
Jul 11 18:55:07 N-2 kernel:  __do_softirq+0xd6/0x346
Jul 11 18:55:07 N-2 kernel:  ? hrtimer_interrupt+0x11f/0x250
Jul 11 18:55:07 N-2 kernel:  __irq_exit_rcu+0xa2/0xd0
Jul 11 18:55:07 N-2 kernel:  irq_exit_rcu+0xe/0x20
Jul 11 18:55:07 N-2 kernel:  sysvec_apic_timer_interrupt+0x92/0xd0
Jul 11 18:55:07 N-2 kernel:  </IRQ>
Jul 11 18:55:07 N-2 kernel:  <TASK>
Jul 11 18:55:07 N-2 kernel:  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jul 11 18:55:07 N-2 kernel: RIP: 0010:cpuidle_enter_state+0xde/0x6f0
Jul 11 18:55:07 N-2 kernel: Code: 2a 77 75 e8 54 7e 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 82 86 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
Jul 11 18:55:07 N-2 kernel: RSP: 0018:ffff9d58c00dfe38 EFLAGS: 00000246
Jul 11 18:55:07 N-2 kernel: RAX: 0000000000000000 RBX: ffff8c68afdbd900 RCX: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
Jul 11 18:55:07 N-2 kernel: RBP: ffff9d58c00dfe88 R08: 0000000000000000 R09: 0000000000000000
Jul 11 18:55:07 N-2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8c0c33a0
Jul 11 18:55:07 N-2 kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000044e8f1bb65e
Jul 11 18:55:07 N-2 kernel:  ? cpuidle_enter_state+0xce/0x6f0
Jul 11 18:55:07 N-2 kernel:  cpuidle_enter+0x2e/0x50
Jul 11 18:55:07 N-2 kernel:  do_idle+0x216/0x2a0
Jul 11 18:55:07 N-2 kernel:  cpu_startup_entry+0x1d/0x20
Jul 11 18:55:07 N-2 kernel:  start_secondary+0x122/0x160
Jul 11 18:55:07 N-2 kernel:  secondary_startup_64_no_verify+0xe5/0xeb
Jul 11 18:55:07 N-2 kernel:  </TASK>
Jul 11 18:55:07 N-2 kernel: ---[ end trace 0000000000000000 ]---
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:07 N-2 kernel: r8169 0000:01:00.0 enp1s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
Jul 11 18:55:08 N-2 pvescheduler[423712]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 11 18:55:08 N-2 pvescheduler[423711]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 11 18:55:13 N-2 pvestatd[1566]: storage 'Backup' is not online
Jul 11 18:55:13 N-2 pvestatd[1566]: status update time (10.272 seconds)
-- Reboot --
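
To make the pattern in a long excerpt like the one above easier to see, the log can be tallied with a few lines of Python. This is only a minimal sketch: the names `summarize`, `LINK_DOWN` and `WATCHDOG` are made up for illustration, and the regular expressions assume the journal line format shown above (timestamp, hostname, then `corosync[pid]:` or `kernel:`). It counts the KNET "link ... is down" messages per peer host and lists any `NETDEV WATCHDOG` transmit timeouts reported by the kernel.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above, e.g.:
#   Jul 11 18:50:36 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down
LINK_DOWN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ corosync\[\d+\]:\s+"
    r"\[KNET\s*\] link: host: (?P<host>\d+) link: (?P<link>\d+) is down"
)

# e.g.: Jul 11 18:55:07 N-2 kernel: NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out
WATCHDOG = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\)"
)


def summarize(lines):
    """Return (link-down count per peer host, list of NIC watchdog hits)."""
    downs = Counter()
    watchdogs = []
    for line in lines:
        m = LINK_DOWN.match(line)
        if m:
            downs[int(m["host"])] += 1
            continue
        m = WATCHDOG.match(line)
        if m:
            watchdogs.append((m["ts"], m["iface"], m["driver"]))
    return downs, watchdogs


if __name__ == "__main__":
    # A few lines copied from the log above; replace with the full journal text.
    sample = [
        "Jul 11 18:50:36 N-2 corosync[1528]:   [KNET  ] link: host: 3 link: 1 is down",
        "Jul 11 18:50:38 N-2 corosync[1528]:   [KNET  ] rx: host: 3 link: 1 is up",
        "Jul 11 18:54:38 N-2 corosync[1528]:   [KNET  ] link: host: 1 link: 1 is down",
        "Jul 11 18:55:07 N-2 kernel: NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out",
    ]
    downs, watchdogs = summarize(sample)
    for host, count in sorted(downs.items()):
        print(f"host {host}: link reported down {count} time(s)")
    for ts, iface, driver in watchdogs:
        print(f"{ts}: transmit watchdog on {iface} (driver {driver})")
```

Running the same function over the logs of the other nodes would show whether they also see link drops to each other or only towards N-2, which helps to tell a problem on one node's network card from a problem on the switch.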