Related: (Broken dependencies with pve-ha-manager)
https://forum.proxmox.com/threads/cannot-remove-pve-ha-manager-why.141940/#post-636316
Circling back to this. I've got exactly the same issue occurring on my cluster. It is a cluster of 7 nodes: four of them (slightly older) have been stable for 400+ days, while the 3 newer nodes (Xeon Platinum) continually reboot without any warning, notice, pattern or log entry.
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]: [KNET ] link: host: 10 link: 0 is down
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]: [KNET ] host: host: 10 has no active links
Aug 16 19:21:09 pve04-bne-br1 corosync[2929]: [TOTEM ] Token has not been received in 4687 ms
Aug 16 19:26:33 pve04-bne-br1 corosync[2929]: [TOTEM ] Token has not been received in 5662 ms
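If it helps, corosync's own view of the knet links can be captured when this happens; these are standard corosync/PVE tools, offered as a suggestion rather than something from the thread:

corosync-cfgtool -s      # per-link knet status as this node sees it
pvecm status             # current quorum membership

Comparing the output on a healthy node and a flapping one at the same moment can show whether only one side sees the links drop.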
I stumbled across this thread and followed the advice about disabling the HA services, and I thought I had fixed it, but another node rebooted last night and the night before.
We are actually at the point where we have built 3 new servers to match the 4 that are working, and we are going to swap them in over the next few weeks.
I've attached as much logging and information as I can to this.
pve04 is one of the happy nodes. pve06 (in that log) is an unhappy node. The loss of quorum happens due to the node rebooting.
I've attached a full boot log of pve06 which is an offender.
Please use journalctl -b -1 (not -k). I take it that it restarts often enough, so the previous boot (-1; or you can refer to boots by ID, check --list-boots) is one where there would be log entries showing how it went. You can narrow things down with --since="YYYY-MM-DD HH:mm" and --utc as well.
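A concrete sketch (the boot index and timestamp below are examples only):

journalctl --list-boots                    # enumerate boots with their IDs and time ranges
journalctl -b -1 -n 200                    # tail of the previous boot's log
journalctl -b -1 --since "2024-08-16 19:00" --utc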
Are they all [nodes] in the same location?
Yes. All nodes are in the same rack. Each has one 10G link to Nexus switch 1 and one 10G link to Nexus switch 2, in a bond.
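For context, a bond like that is typically declared in /etc/network/interfaces along these lines; the interface names and LACP mode here are assumptions, not taken from this thread:

auto bond0
iface bond0 inet manual
    bond-slaves enp59s0f0 enp59s0f1   # one 10G port to each Nexus switch (names hypothetical)
    bond-miimon 100
    bond-mode 802.3ad                 # assuming LACP, e.g. across a Nexus vPC pair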
As I've done upgrades on the nodes, there were a couple of fresh boots. Here is a copy of journalctl -b -3. It shows some disk maintenance in there as well, so don't be too concerned about that; the node was running before and after it, then just randomly crashed with no errors worth a mention. The logs just seem to stop.
Aug 06 02:45:37 pve06-bne-br1 kernel: Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()
I understand you may have updated since, but how exactly is there any fresh boot since ...?
Aug 06 02:45:37 pve06-bne-br1 kernel: Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()
Or is this the older log from before the update, and you have yet to encounter the crash after the updates?
Yes, that's correct. It has not occurred since the update. But it is unpredictable; I once had a node go 66 days before it happened.
Aug 16 19:24:05 pve06-bne-br1 kernel: [ 0.000000] Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()
Since the upgrade I do note that I have a watchdog-mux service (though not the watchdog module, because I blacklisted it).
lsmod | grep -e dog -e wdt
Yes, the bond is shared by corosync and Ceph; however, they all operate on separate VLANs. There is a 1G copper management interface that is set as the second corosync link, in case the first were ever to go down.
I did notice the config only contained one ring, but the nodes were definitely originally joined with two. Anyway, that's another issue.
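For reference, a node entry carrying two corosync links (ring0 on the bond's corosync VLAN, ring1 on the 1G management network) would look roughly like this in /etc/pve/corosync.conf; the addresses below are placeholders:

node {
    name: pve06-bne-br1
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.16
    ring1_addr: 192.168.1.16
}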
If it's losing quorum, then that's an issue I need to resolve separately, but my concern is that it's rebooting without warning. IPMI shows nothing of use. We even tried swapping one of the processors out for a Xeon Gold, thinking it might be the Platinum cores having issues with Proxmox; no change.
Can you post multiple log endings (e.g. journalctl -b $BOOTID -n 100 for each) of such crashes?

We've got 3 servers built and on standby to swap in, which are identical to another 10 we have in a separate cluster that has been stable, but I really feel like this is a software issue, not a hardware issue.
My concern is that this cluster is about 6 years old, none of the original nodes exist, and it has crossed many Proxmox upgrades over the years. I guess my fear is there is some old config in here somewhere that might be breaking things, but I can't be sure.
It's also worth noting that many years ago we used to use HA, until it would reboot nodes for no apparent reason, so we turned it all off and the problem went away; when these 3 new nodes went in, it seemed to resurface on its own.
It is also worth noting that when these 3 new nodes went in, the original nodes (pve01, pve02, pve03) still existed in the cluster and were removed only after migrating everything to the new ones.
On all nodes in the cluster I disabled the two HA services (pve-ha-lrm and pve-ha-crm), and I blacklisted the softdog module via /etc/modprobe.d/softdog-deny.conf.
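For anyone following along, those changes amount to something like this (a sketch; the deny-file name is the one used above):

systemctl disable --now pve-ha-lrm pve-ha-crm     # stop the HA stack from arming the watchdog
echo "blacklist softdog" > /etc/modprobe.d/softdog-deny.conf
update-initramfs -u -k all                        # apply the blacklist at early boot too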
Typically I would be looking for any commonalities (or lack thereof) first. You mention 3 newer servers that "continually reboot" in your original post, but you also "had a node go 66 days" (presumably one of the new ones) without such an event. They also used to run Debian 10 before the recent upgrade (though I note you might have simply installed a matching older version on them). Let's quantify it then:
- These new servers - let's call them nos. 5, 6 and 7 - how often did each of them reboot at random in, say, the past 90 days?
- Can you post multiple log endings (e.g. journalctl -b $BOOTID -n 100 for each) of such crashes?
- The changes you described above (blacklisting softdog, etc.) - when did you perform them?
EDIT: Why not swap them? One of them: take one of the 3, have it go to the 10-node no-issues cluster, and take one from those and make it a member of the doomed company of 3. You will know whether it's hardware or not ...
- Well, this is interesting: on what PVE / Debian version are those 10, and since when?
- Are these the same 3 servers that reboot when added to the "old cluster" of 4 to make 7?
- If so, that would point to ... configuration, network, maybe power?
- Can you afford to run the 3 new ones in a separate cluster of their own? (A rough outline of that experiment is sketched below.)
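If you do try the separate-cluster experiment, the rough steps would be as follows. Note that a node removed with pvecm delnode must be reinstalled (or have its cluster state wiped) before it can form or join another cluster, so treat this as a sketch, not a procedure; the hostname is hypothetical:

pvecm delnode pve05-bne-br1        # on a remaining member, after migrating guests off
pvecm create testcluster           # on the first freshly reinstalled new node
pvecm add <IP-of-first-new-node>   # on each of the other two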
I've gone through /etc/pve to look for anything I consider concerning, which I can't find, and because this is synced across nodes it wouldn't be any different from one node to the next.

You can compare configs with what you have on the good one to simply see any obvious differences. But you are not mentioning any, and this should not be related to any VM configuration.
Added some background to this above.

I would completely disregard this theory for now, as it might have been simply the intermittent loss of quorum (which by design causes the reboot) and, as much as I am skeptical about the quality of the HA stack, from the single log so far this reboot would not be due to that now.
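One quick way to confirm the HA stack really is idle (standard PVE commands, offered as a suggestion):

ha-manager status                          # should show no HA resources configured
systemctl status pve-ha-lrm pve-ha-crm     # should be inactive if HA is disabled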
This adds some confusion: are you saying you had 4 [1..4], added 3 [5..7] and then removed the first 3 [1..3], so you are now down to 4 again [4..7], of which only one is an original?
- Have you reused names/IPs for the nodes as they came and went? If so, can you do the 3 new ones on "clean" (never used) IPs and names?

No, we were very careful about that; in fact, the original nodes ran for a few days once the new nodes went in, so there was definitely no possibility of IP conflicts or hostname reuse.
Also, can you post the following from one of the (suspect) running nodes...
lsmod
sysctl kernel vm
journalctl | grep -e "soft lockup"
root@pve06-bne-br1:~# lsmod
Module Size Used by
veth 40960 0
rbd 122880 2
8021q 45056 0
garp 20480 1 8021q
mrp 20480 1 8021q
ceph 593920 1
libceph 503808 2 ceph,rbd
netfs 495616 1 ceph
ebtable_filter 12288 0
ebtables 45056 1 ebtable_filter
ip_set 61440 0
ip6table_raw 12288 0
iptable_raw 12288 0
ip6table_filter 12288 0
ip6_tables 32768 2 ip6table_filter,ip6table_raw
iptable_filter 12288 0
sctp 462848 2
ip6_udp_tunnel 16384 1 sctp
udp_tunnel 28672 1 sctp
scsi_transport_iscsi 167936 1
nf_tables 327680 0
nvme_fabrics 36864 0
nvme_core 196608 1 nvme_fabrics
nvme_auth 24576 1 nvme_core
bonding 225280 0
tls 139264 1 bonding
sunrpc 770048 1
nfnetlink_log 20480 1
binfmt_misc 24576 1
nfnetlink 20480 4 nf_tables,ip_set,nfnetlink_log
intel_rapl_msr 20480 0
intel_rapl_common 36864 1 intel_rapl_msr
intel_uncore_frequency 12288 0
intel_uncore_frequency_common 16384 1 intel_uncore_frequency
ipmi_ssif 40960 0
isst_if_common 20480 0
skx_edac 24576 0
nfit 73728 1 skx_edac
x86_pkg_temp_thermal 16384 0
intel_powerclamp 16384 0
kvm_intel 389120 224
joydev 24576 0
kvm 1249280 113 kvm_intel
input_leds 12288 0
irqbypass 12288 450 kvm
crct10dif_pclmul 12288 1
polyval_clmulni 12288 0
polyval_generic 12288 1 polyval_clmulni
ghash_clmulni_intel 16384 0
sha256_ssse3 32768 0
sha1_ssse3 32768 0
aesni_intel 356352 17
crypto_simd 16384 1 aesni_intel
hid_generic 12288 0
cryptd 24576 2 crypto_simd,ghash_clmulni_intel
usbkbd 12288 0
usbmouse 12288 0
rapl 20480 0
dell_smbios 32768 0
dcdbas 20480 1 dell_smbios
intel_cstate 20480 0
usbhid 69632 0
mgag200 73728 0
cmdlinepart 12288 0
dell_wmi_descriptor 16384 1 dell_smbios
wmi_bmof 12288 0
pcspkr 12288 0
i2c_algo_bit 16384 1 mgag200
spi_nor 151552 0
hid 163840 2 usbhid,hid_generic
mei_me 53248 0
mtd 94208 3 spi_nor,cmdlinepart
mei 163840 1 mei_me
intel_pch_thermal 16384 0
acpi_power_meter 20480 0
ipmi_si 77824 1
acpi_ipmi 20480 1 acpi_power_meter
ipmi_devintf 16384 0
ipmi_msghandler 77824 4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
mac_hid 12288 0
zfs 5988352 6
spl 143360 1 zfs
vhost_net 32768 57
vhost 61440 1 vhost_net
vhost_iotlb 16384 1 vhost
tap 28672 1 vhost_net
coretemp 16384 0
efi_pstore 12288 0
dmi_sysfs 20480 0
ip_tables 32768 2 iptable_filter,iptable_raw
x_tables 57344 7 ebtables,ip6table_filter,ip6table_raw,iptable_filter,ip6_tables,iptable_raw,ip_tables
autofs4 57344 2
btrfs 1839104 0
blake2b_generic 24576 0
xor 20480 1 btrfs
raid6_pq 118784 1 btrfs
dm_thin_pool 86016 1
dm_persistent_data 110592 1 dm_thin_pool
dm_bio_prison 24576 1 dm_thin_pool
dm_bufio 53248 1 dm_persistent_data
libcrc32c 12288 5 dm_persistent_data,btrfs,nf_tables,libceph,sctp
xhci_pci 24576 0
xhci_pci_renesas 16384 1 xhci_pci
crc32_pclmul 12288 0
i40e 540672 0
bnxt_en 368640 0
ahci 49152 5
megaraid_sas 184320 8
spi_intel_pci 12288 0
tg3 204800 0
i2c_i801 36864 0
xhci_hcd 356352 1 xhci_pci
libahci 53248 1 ahci
spi_intel 28672 1 spi_intel_pci
i2c_smbus 16384 1 i2c_i801
lpc_ich 28672 0
wmi 28672 3 wmi_bmof,dell_smbios,dell_wmi_descriptor
root@pve06-bne-br1:~# sysctl kernel vm
kernel.acct = 4 2 30
kernel.acpi_video_flags = 0
kernel.apparmor_display_secid_mode = 0
kernel.apparmor_restrict_unprivileged_io_uring = 0
kernel.apparmor_restrict_unprivileged_unconfined = 0
kernel.apparmor_restrict_unprivileged_userns = 0
kernel.apparmor_restrict_unprivileged_userns_complain = 0
kernel.apparmor_restrict_unprivileged_userns_force = 0
kernel.arch = x86_64
kernel.auto_msgmni = 0
kernel.bootloader_type = 114
kernel.bootloader_version = 2
kernel.bpf_stats_enabled = 0
kernel.cad_pid = 1
kernel.cap_last_cap = 40
kernel.core_pattern = core
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 0
kernel.ctrl-alt-del = 0
kernel.dmesg_restrict = 1
kernel.domainname = (none)
kernel.firmware_config.force_sysfs_fallback = 0
kernel.firmware_config.ignore_sysfs_fallback = 0
kernel.ftrace_dump_on_oops = 0
kernel.ftrace_enabled = 1
kernel.hardlockup_all_cpu_backtrace = 0
kernel.hardlockup_panic = 0
kernel.hostname = pve06-bne-br1
kernel.hotplug =
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 10
kernel.io_delay_type = 1
kernel.io_uring_disabled = 0
kernel.io_uring_group = -1
kernel.kexec_load_disabled = 0
kernel.kexec_load_limit_panic = -1
kernel.kexec_load_limit_reboot = -1
kernel.keys.gc_delay = 300
kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 2000
kernel.keys.persistent_keyring_expiry = 259200
kernel.keys.root_maxbytes = 25000000
kernel.keys.root_maxkeys = 1000000
kernel.kptr_restrict = 0
kernel.latencytop = 0
kernel.max_lock_depth = 1024
kernel.max_rcu_stall_to_panic = 0
kernel.modprobe = /sbin/modprobe
kernel.modules_disabled = 0
kernel.msg_next_id = -1
kernel.msgmax = 8192
kernel.msgmnb = 16384
kernel.msgmni = 32000
kernel.ngroups_max = 65536
kernel.nmi_watchdog = 1
kernel.ns_last_pid = 3444554
kernel.numa_balancing = 1
kernel.numa_balancing_promote_rate_limit_MBps = 65536
kernel.oops_all_cpu_backtrace = 0
kernel.oops_limit = 10000
kernel.osrelease = 6.8.12-1-pve
kernel.ostype = Linux
kernel.overflowgid = 65534
kernel.overflowuid = 65534
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 0
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.perf_cpu_time_max_percent = 25
kernel.perf_event_max_contexts_per_stack = 8
kernel.perf_event_max_sample_rate = 32000
kernel.perf_event_max_stack = 127
kernel.perf_event_mlock_kb = 516
kernel.perf_event_paranoid = 4
kernel.pid_max = 4194304
kernel.poweroff_cmd = /sbin/poweroff
kernel.print-fatal-signals = 0
kernel.printk = 3 4 1 7
kernel.printk_delay = 0
kernel.printk_devkmsg = on
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
kernel.pty.max = 4096
kernel.pty.nr = 1
kernel.pty.reserve = 1024
kernel.random.boot_id = 41c14646-7cfd-42ef-9394-ea6a4fd1b9b9
kernel.random.entropy_avail = 256
kernel.random.poolsize = 256
kernel.random.urandom_min_reseed_secs = 60
kernel.random.uuid = 8f2cea82-5db8-48cf-9e33-f8f3c53cb940
kernel.random.write_wakeup_threshold = 256
kernel.randomize_va_space = 2
kernel.real-root-dev = 0
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
kernel.seccomp.actions_avail = kill_process kill_thread trap errno user_notif trace log allow
kernel.seccomp.actions_logged = kill_process kill_thread trap errno user_notif trace log
kernel.sem = 32000 1024000000 500 32000
kernel.sem_next_id = -1
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
kernel.soft_watchdog = 1
kernel.softlockup_all_cpu_backtrace = 0
kernel.softlockup_panic = 0
kernel.spl.gitrev = zfs-2.2.4-0-g256659204
kernel.spl.hostid = 29b7dae6
kernel.spl.kmem.slab_kvmem_alloc = 7340032
kernel.spl.kmem.slab_kvmem_max = 7340032
kernel.spl.kmem.slab_kvmem_total = 9510912
kernel.split_lock_mitigate = 1
kernel.stack_tracer_enabled = 0
kernel.sysctl_writes_strict = 1
kernel.sysrq = 438
kernel.tainted = 4097
kernel.task_delayacct = 0
kernel.threads-max = 6180383
kernel.timer_migration = 1
kernel.traceoff_on_warning = 0
kernel.tracepoint_printk = 0
kernel.unknown_nmi_panic = 0
kernel.unprivileged_bpf_disabled = 2
kernel.unprivileged_userns_apparmor_policy = 1
kernel.unprivileged_userns_clone = 1
kernel.user_events_max = 32768
kernel.usermodehelper.bset = 4294967295 511
kernel.usermodehelper.inheritable = 4294967295 511
kernel.version = #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z)
kernel.warn_limit = 0
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-95
kernel.watchdog_thresh = 10
kernel.yama.ptrace_scope = 1
vm.admin_reserve_kbytes = 8192
vm.compact_unevictable_allowed = 1
vm.compaction_proactiveness = 20
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.extfrag_threshold = 500
vm.hugetlb_optimize_vmemmap = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32 0 0
vm.max_map_count = 262144
vm.memfd_noexec = 0
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 112505
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 65536
vm.mmap_rnd_bits = 32
vm.mmap_rnd_compat_bits = 16
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.numa_stat = 1
vm.numa_zonelist_order = Node
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.page_lock_unfairness = 5
vm.panic_on_oom = 0
vm.percpu_pagelist_high_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.unprivileged_userfaultfd = 0
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.watermark_boost_factor = 15000
vm.watermark_scale_factor = 10
vm.zone_reclaim_mode = 0
root@pve06-bne-br1:~# journalctl | grep -e "soft lockup"
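Since the soft-lockup grep comes back empty, a few other places worth checking for traces of a silent reset (suggestions only; efi_pstore is loaded per the lsmod output above):

ls -l /sys/fs/pstore/                                      # panic records preserved across reboots, if any
journalctl -k -b -1 | grep -iE "mce|machine check|thermal|nmi"   # previous boot's kernel log
ipmitool sel list | tail -n 20                             # recent hardware events recorded by the BMC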
Apr 01 21:39:15 pve06-bne-br1 kernel: sd 0:2:6:0: [sdg] tag#3969 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 01 21:39:15 pve06-bne-br1 kernel: sd 0:2:6:0: [sdg] tag#3969 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 05 07:44:09 pve06-bne-br1 kernel: sd 0:2:3:0: [sdd] tag#5026 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 26 21:48:29 pve06-bne-br1 kernel: sd 0:2:5:0: [sdf] tag#1224 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Jun 14 00:13:06 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2101 BRCM Debug mfi stat 0x2d, data len requested/completed 0x2000/0x0
Jun 14 00:14:53 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2401 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x400
Jun 14 00:14:53 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2453 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
All identical builds: dual Xeon Platinum 8160, Dell R640s, 768GB RAM, 8x enterprise SAS SSDs, Dell M.2 BOSS card (2x M.2 for boot), dual-port 10G NIC.