Getting rid of watchdog emergency node reboot

Circling back to this. I've got exactly the same issue occurring on my cluster. It is a cluster of 7 nodes. Four of them (slightly older) have been stable for 400+ days, while the 3 newer nodes (Xeon Platinum) continually reboot without any warning, notice, pattern or log entry.

I've suspected HA for a long time, but had disabled everything HA related in the Proxmox WebUI.

I stumbled across this thread and followed the information about disabling the HA services, and I thought I had fixed it, but another node just rebooted last night and the night before.

I do note that I have nothing in `lsmod | grep softdog` and I have no watchdog-mux service.

I'm at a loss.

We are actually at the point where we have built up 3 new servers to match the 4 that are working, and we are going to replace them in the next few weeks. However, if this is a software problem, the same issue is just going to follow us and it'll be yet another nightmare scenario.

I've attached as much logging and information as I can to this.

I'm at a point where if someone knows what the fix is or even where to begin troubleshooting it (no logs makes it hard), I'm happy to pay for a consultant.
 

Circling back to this. I've got exactly the same issue occurring on my cluster. It is a cluster of 7 nodes. Four of them (slightly older) have been stable for 400+ days, while the 3 newer nodes (Xeon Platinum) continually reboot without any warning, notice, pattern or log entry.

Just to be clear, this excerpt from your attachment is one of those unlucky 3 newer nodes?

Code:
Aug 16 19:26:33 pve04-bne-br1 corosync[2929]:   [TOTEM ] Token has not been received in 5662 ms
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]:   [KNET  ] link: host: 10 link: 0 is down
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Aug 16 19:21:07 pve04-bne-br1 corosync[2929]:   [KNET  ] host: host: 10 has no active links
Aug 16 19:21:09 pve04-bne-br1 corosync[2929]:   [TOTEM ] Token has not been received in 4687 ms

I stumbled across this thread and followed the information about disabling the HA services, and I thought I had fixed it, but another node just rebooted last night and the night before.

How exactly did you "disable" it? There are actually several ways to do it, including some not in this thread.

One thing is for sure, you are losing quorum links. Are they all in the same location? Of course with HA disabled, there should be no fencing.
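
A quick way to watch that from a node is with the standard tooling (just a suggestion, nothing exotic):

Code:
corosync-cfgtool -s    # knet link status per node as corosync sees it
pvecm status           # quorum / membership overview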

We are actually at the point where we have built up 3 new servers to match the 4 that are working, and we are going to replace them in the next few weeks.

Are they all on the same version of everything?
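
(e.g. the following on each node, then diff the outputs - just a suggestion:)

Code:
pveversion -v    # full list of relevant package versions on this node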

I've attached as much logging and information as I can to this.

Are you willing to provide a full boot log now (not just corosync) of one of the nodes?
 
pve04 is one of the happy nodes. pve06 (in that log) is an unhappy node. The loss of quorum happens due to the node rebooting.

They are all on the same version, but it was version 7. I went ahead and upgraded the cluster to 8 yesterday to be safe (and because 7 is out of support now).

I disabled it by disabling and stopping the two HA services - pve-ha-lrm and pve-ha-crm - and I can confirm this was preserved across Proxmox upgrades.
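
For reference, that boils down to something like this (a sketch with the standard unit names):

Code:
systemctl disable --now pve-ha-lrm.service pve-ha-crm.service
systemctl status pve-ha-lrm.service pve-ha-crm.service    # both should report inactive (dead) and disabled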

I've attached a full boot log of pve06 which is an offender.
 


pve04 is one of the happy nodes. pve06 (in that log) is an unhappy node. The loss of quorum happens due to the node rebooting.

I see.

I've attached a full boot log of pve06 which is an offender.

I think that's just kernel messages after a fresh bootup, but I was specifically wondering about a full boot log (i.e. boot to crash), e.g. journalctl -b -1 (not -k). I took it that it restarts often enough, so the previous boot (-1, or you can refer to boots by ID, check --list-boots) should be one with log entries showing how it went.

If it's massive, you can limit it with --since "YYYY-MM-DD HH:MM".

EDIT: Of course, feel free to redact anything you do not want to share. You can use --utc as well.
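
So, roughly something like this (the boot offset, date and file name are placeholders):

Code:
journalctl --list-boots
journalctl -b -1 --utc --since "2024-08-16 00:00" > pve06-previous-boot.log    # previous boot, UTC, trimmed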

And just to be sure...

Are they all [nodes] in the same location?
 
Yes. All nodes are in the same rack. Each has one 10G link to Nexus switch 1 and one 10G link to Nexus switch 2, in a bond.

As I've done upgrades on the nodes, there have been a couple of fresh boots. Here is a copy of journalctl -b -3. It shows some disk maintenance in there as well, so don't be too concerned about that; the node was running before and after that, then just randomly crashed with no errors worth a mention. The logs just seem to stop.
 


Yes. All nodes are in the same rack. Each has one 10G link to Nexus switch 1 and one 10G link to Nexus switch 2, in a bond.

As I've done upgrades on the nodes, there have been a couple of fresh boots. Here is a copy of journalctl -b -3. It shows some disk maintenance in there as well, so don't be too concerned about that; the node was running before and after that, then just randomly crashed with no errors worth a mention. The logs just seem to stop.

I understand you may have updated since, but how exactly is there any fresh boot since ...?

Code:
Aug 06 02:45:37 pve06-bne-br1 kernel: Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()

Or is this the older log from before the update, and you have yet to encounter the crash after the updates?
 
I understand you may have updated since, but how exactly is there any fresh boot since ...?

Code:
Aug 06 02:45:37 pve06-bne-br1 kernel: Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()

Or is this the older log from before the update, and you have yet to encounter the crash after the updates?
Yes, that's correct. It has not occurred since the update, but it is unpredictable - I once had a node go 66 days before it happened.
 
Yes, that's correct. It has not occurred since the update, but it is unpredictable - I once had a node go 66 days before it happened.

Ok, but what are you booting NOW with? Because even the earlier provided kernel log, from as recent as:

Aug 16 19:24:05 pve06-bne-br1 kernel: [ 0.000000] Linux version 5.15.143-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.143-1 (2024-02-08T18:12Z) ()

is still showing a v5 kernel?
 
It's currently running: Linux pve06-bne-br1 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux
 
Yes, the bond is shared by corosync and Ceph; however, they all operate on separate VLANs. There is a 1G copper management interface that is set as the second interface for the cluster, in case the first were ever to go down.
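
For illustration, a bond + VLAN setup like ours is defined in /etc/network/interfaces roughly like this (interface names, VLAN IDs and addresses here are placeholders, not the real ones):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp94s0f0 enp94s0f1
        bond-mode 802.3ad
        bond-miimon 100

auto bond0.20
iface bond0.20 inet static
        address 10.0.20.16/24
        # corosync / ceph each get their own tagged VLAN like this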
 
Since the upgrade, I do note that I have a watchdog-mux service (though not the watchdog module, because I blacklisted it)

Code:
root@pve06-bne-br1:~# systemctl status watchdog-mux
○ watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: inactive (dead)
 
Since the upgrade, I do note that I have a watchdog-mux service (though not the watchdog module, because I blacklisted it)

That would be of least concern; you can check for loaded modules with lsmod | grep -e dog -e wdt

You also have NMI watchdog enabled, but if that's what's restarting your nodes, you have a problem completely different than you suspect.
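
(For what it's worth, checking it and temporarily turning it off looks like this - runtime only, you would need an /etc/sysctl.d/ entry to persist it:)

Code:
sysctl kernel.nmi_watchdog        # 1 = hard lockup detector enabled
sysctl -w kernel.nmi_watchdog=0   # disable at runtime only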

Yes the bond is being shared with corosync and ceph, however they all operate on separate VLANs. There is a 1G copper management interface that is set as the second interface for the cluster, if the first were ever to go down.

I am quite confused now; in your first post above, the attached corosync config output only contained one "ring". The issue (even with MLAG / LACP) is that failover can take seconds, which is enough to lose quorum; you would need to configure e.g. BFD to shorten the delay. But I do not want to detract from the original point - you should not be getting any reboots even if you are losing quorum with HA off, especially with blacklisted *dogs ... but then again your servers might be freezing for some other reason. Anything in IPMI?
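
For reference, a second link would normally show up as a ring1_addr per node in /etc/pve/corosync.conf, something like this (addresses are placeholders):

Code:
nodelist {
  node {
    name: pve06-bne-br1
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.0.20.16
    ring1_addr: 10.0.30.16
  }
  # ... one block per node ...
}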
 
I did notice it only contained one ring, but they were definitely originally joined with 2. Anyway, that's another issue.

If it's losing quorum, then that's an issue I need to resolve separately, but my concern is that it's rebooting without warning. IPMI shows nothing of use. We even tried switching one of the processors out for a Xeon Gold, thinking it might be the Platinum cores having issues with Proxmox - no change.

We've got 3 servers built and on standby to swap out, which are identical to another 10 we have in a separate cluster that has been stable, but I really feel like this is a software issue, not a hardware issue.

My concern is that this cluster is about 6 years old, none of the original nodes exist, and it's crossed many Proxmox upgrades over the years. I guess my fear is there is some old config in here somewhere that might be breaking things, but I can't be sure.

It's also worth noting that many years ago we used to use HA, until it would reboot nodes for no apparent reason, so we turned it all off and the problem went away - until these 3 new nodes went in, when it seemed to resurface on its own.

It is also worth noting that when these 3 new nodes went in, the original nodes (pve01, pve02, pve03) still existed in the cluster and were disjoined after migrating everything to the new ones.
 
I did notice it only contained one ring, but they were definitely originally joined with 2. Anyway, that's another issue.

:D

If it's losing quorum, then that's an issue I need to resolve separately, but my concern is that it's rebooting without warning. IPMI shows nothing of use. We even tried switching one of the processors out for a Xeon Gold, thinking it might be the Platinum cores having issues with Proxmox - no change.

I would typically be looking for any commonalities (or the lack thereof) first. You mention 3 newer servers that "continually reboot" in your original post, but you also "had a node go 66 days" (presumably one of the new ones) without such an event. They also used to run Debian 10 before the recent upgrade (but I do note you might have simply installed a matching older version on them). Let's quantify it then:

  • These new servers, let's call them no 5, 6 and 7 - how often did they go on random reboot in the past e.g. 90 days (per each)?
  • Can you post multiple log endings (e.g. journalctl -b $BOOTID -n 100 for each) of such crashes?
  • The changes you have described above (blacklisting softdog, etc.) - when did you perform them?
We've got 3 servers built and on standby to swap out, which are identical to another 10 we have in a separate cluster that has been stable, but I really feel like this is a software issue, not a hardware issue.

EDIT: Why not swap them? Just one of them: take one of the 3, have it go to the 10-node no-issues cluster, and take one from there and make it a member of the doomed company of 3. You will know whether it's hardware or not ...
  • Well, this is interesting - on what PVE / Debian version are those 10 and since when?
  • Are these the said 3 servers rebooting when added to the "old cluster" of 4 to make 7?
  • If so, that would point to ... configuration, network, maybe power?
  • Can you afford to run the new 3 ones in a separate cluster of their own?
My concern is that this cluster is about 6 years old, none of the original nodes exist, and it's crossed many Proxmox upgrades over the years. I guess my fear is there is some old config in here somewhere that might be breaking things, but I can't be sure.

You can compare configs with what you have on the good one to simply see any obvious differences. But you are not mentioning any and this should not be related to any VM configuration.

It's also worth noting that many years ago we used to use HA, until it would reboot nodes for no apparent reason, so we turned it all off and the problem went away - until these 3 new nodes went in, when it seemed to resurface on its own.

I would completely disregard this theory for now, as it might have simply been the intermittent loss of quorum (which by design causes it) and - as much as I am skeptical about the quality of the HA stack - from the single log so far, this reboot would not be due to that now.

It is also worth noting that when these 3 new nodes went in, the original nodes (pve01, pve02, pve03) still existed in the cluster and were disjoined after migrating everything to the new ones.

This adds some confusion, are you saying you had 4 [1..4], added 3 [5..7] and then removed first 3 [1..3] so you are now down to 4 again [4..7] out of which only one is the original one?

  • Have you reused names/IPs for the nodes as they came and went? If so, can you do the 3 new ones on "clean" (never used) IPs and names?

Also can you also post from one of the (suspect) running nodes...
  • lsmod
  • sysctl kernel vm
  • journalctl | grep -e "soft lockup"
 
:D



I would typically be looking for any commonalities (or the lack thereof) first. You mention 3 newer servers that "continually reboot" in your original post, but you also "had a node go 66 days" (presumably one of the new ones) without such an event. They also used to run Debian 10 before the recent upgrade (but I do note you might have simply installed a matching older version on them). Let's quantify it then:

  • These new servers, let's call them no 5, 6 and 7 - how often did they go on random reboot in the past e.g. 90 days (per each)?
  • Can you post multiple log endings (e.g. journalctl -b $BOOTID -n 100 for each) of such crashes?
  • The changes you have described above (blacklisting softdog, etc.) - when did you perform them?
On all nodes in the cluster I disabled the two lrm and crm services, and I added a blacklist entry for the softdog module to /etc/modprobe.d/softdog-deny.conf
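
For completeness, that blacklist amounts to something like this (a sketch; the install line and the initramfs refresh are extra safety steps, not necessarily exactly what I did):

Code:
echo "blacklist softdog" > /etc/modprobe.d/softdog-deny.conf
echo "install softdog /bin/false" >> /etc/modprobe.d/softdog-deny.conf   # also blocks explicit modprobe
update-initramfs -u -k all                                               # make sure the initramfs picks it up
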
EDIT: Why not swap them? Just one of them: take one of the 3, have it go to the 10-node no-issues cluster, and take one from there and make it a member of the doomed company of 3. You will know whether it's hardware or not ...
  • Well, this is interesting - on what PVE / Debian version are those 10 and since when?
  • Are these the said 3 servers rebooting when added to the "old cluster" of 4 to make 7?
  • If so, that would point to ... configuration, network, maybe power?
  • Can you afford to run the new 3 ones in a separate cluster of their own?

Let me draw a bigger picture of the cluster as a whole.

Originally it existed with: pve01, pve02, pve03, storage01, storage02
3 compute nodes, 2 storage nodes (hyperconverged storage still, but storage01 and storage02 didn't run any VMs)

pve04 and pve05 were added to the cluster for additional capacity. All was still good.

pve06, pve07, pve08 were recently added to the cluster (around 6 months ago), all identical builds - dual Xeon Platinum 8160, Dell R640's, 768GB RAM, 8x Enterprise SAS SSD, Dell M.2 Boss card (2x M.2 for boot), dual port 10G NIC.

pve01, pve02, pve03 were shortly afterwards decommissioned and removed from the cluster.

The cluster now consists of: pve04, pve05, pve06, pve07, pve08, storage01, storage02

Stable nodes are: pve04, pve05, storage01, storage02.

The 10 node cluster that is running fine is on Proxmox current.

The 3 replacement servers we have built up are being supplied by the vendor and unfortunately need to be swapped, as opposed to being allowed to run in parallel. Because they were part of the same order, it's an all-or-nothing type thing. They want all 3 back in exchange for the 3 new ones.

You can compare configs with what you have on the good one to simply see any obvious differences. But you are not mentioning any and this should not be related to any VM configuration.
I've gone through /etc/pve looking for anything I'd consider concerning, and I can't find anything; because this is synced across nodes, it wouldn't be any different from one node to the next.

Do you have any other places you think I should be looking?

The nodes were installed from the Proxmox ISOs. Network configuration was added to them, and then they were joined to the cluster straight away. Nothing else was changed or configured before that point.

I would completely disregard this theory for now, as it might have simply been the intermittent loss of quorum (which by design causes it) and - as much as I am skeptical about the quality of the HA stack - from the single log so far, this reboot would not be due to that now.



This adds some confusion, are you saying you had 4 [1..4], added 3 [5..7] and then removed first 3 [1..3] so you are now down to 4 again [4..7] out of which only one is the original one?
Added some background to this above
  • Have you reused names/IPs for the nodes as they came and went? If so, can you do the 3 new ones on "clean" (never used) IPs and names?
No, we were very careful about that; in fact, the original nodes ran for a few days after the new nodes went in, so there was definitely no possibility of IP conflicts or hostname reuse.

Also can you also post from one of the (suspect) running nodes...
  • lsmod
  • sysctl kernel vm
  • journalctl | grep -e "soft lockup"

Code:
root@pve06-bne-br1:~# lsmod
Module                  Size  Used by
veth                   40960  0
rbd                   122880  2
8021q                  45056  0
garp                   20480  1 8021q
mrp                    20480  1 8021q
ceph                  593920  1
libceph               503808  2 ceph,rbd
netfs                 495616  1 ceph
ebtable_filter         12288  0
ebtables               45056  1 ebtable_filter
ip_set                 61440  0
ip6table_raw           12288  0
iptable_raw            12288  0
ip6table_filter        12288  0
ip6_tables             32768  2 ip6table_filter,ip6table_raw
iptable_filter         12288  0
sctp                  462848  2
ip6_udp_tunnel         16384  1 sctp
udp_tunnel             28672  1 sctp
scsi_transport_iscsi   167936  1
nf_tables             327680  0
nvme_fabrics           36864  0
nvme_core             196608  1 nvme_fabrics
nvme_auth              24576  1 nvme_core
bonding               225280  0
tls                   139264  1 bonding
sunrpc                770048  1
nfnetlink_log          20480  1
binfmt_misc            24576  1
nfnetlink              20480  4 nf_tables,ip_set,nfnetlink_log
intel_rapl_msr         20480  0
intel_rapl_common      36864  1 intel_rapl_msr
intel_uncore_frequency    12288  0
intel_uncore_frequency_common    16384  1 intel_uncore_frequency
ipmi_ssif              40960  0
isst_if_common         20480  0
skx_edac               24576  0
nfit                   73728  1 skx_edac
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
kvm_intel             389120  224
joydev                 24576  0
kvm                  1249280  113 kvm_intel
input_leds             12288  0
irqbypass              12288  450 kvm
crct10dif_pclmul       12288  1
polyval_clmulni        12288  0
polyval_generic        12288  1 polyval_clmulni
ghash_clmulni_intel    16384  0
sha256_ssse3           32768  0
sha1_ssse3             32768  0
aesni_intel           356352  17
crypto_simd            16384  1 aesni_intel
hid_generic            12288  0
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
usbkbd                 12288  0
usbmouse               12288  0
rapl                   20480  0
dell_smbios            32768  0
dcdbas                 20480  1 dell_smbios
intel_cstate           20480  0
usbhid                 69632  0
mgag200                73728  0
cmdlinepart            12288  0
dell_wmi_descriptor    16384  1 dell_smbios
wmi_bmof               12288  0
pcspkr                 12288  0
i2c_algo_bit           16384  1 mgag200
spi_nor               151552  0
hid                   163840  2 usbhid,hid_generic
mei_me                 53248  0
mtd                    94208  3 spi_nor,cmdlinepart
mei                   163840  1 mei_me
intel_pch_thermal      16384  0
acpi_power_meter       20480  0
ipmi_si                77824  1
acpi_ipmi              20480  1 acpi_power_meter
ipmi_devintf           16384  0
ipmi_msghandler        77824  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
mac_hid                12288  0
zfs                  5988352  6
spl                   143360  1 zfs
vhost_net              32768  57
vhost                  61440  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    28672  1 vhost_net
coretemp               16384  0
efi_pstore             12288  0
dmi_sysfs              20480  0
ip_tables              32768  2 iptable_filter,iptable_raw
x_tables               57344  7 ebtables,ip6table_filter,ip6table_raw,iptable_filter,ip6_tables,iptable_raw,ip_tables
autofs4                57344  2
btrfs                1839104  0
blake2b_generic        24576  0
xor                    20480  1 btrfs
raid6_pq              118784  1 btrfs
dm_thin_pool           86016  1
dm_persistent_data    110592  1 dm_thin_pool
dm_bio_prison          24576  1 dm_thin_pool
dm_bufio               53248  1 dm_persistent_data
libcrc32c              12288  5 dm_persistent_data,btrfs,nf_tables,libceph,sctp
xhci_pci               24576  0
xhci_pci_renesas       16384  1 xhci_pci
crc32_pclmul           12288  0
i40e                  540672  0
bnxt_en               368640  0
ahci                   49152  5
megaraid_sas          184320  8
spi_intel_pci          12288  0
tg3                   204800  0
i2c_i801               36864  0
xhci_hcd              356352  1 xhci_pci
libahci                53248  1 ahci
spi_intel              28672  1 spi_intel_pci
i2c_smbus              16384  1 i2c_i801
lpc_ich                28672  0
wmi                    28672  3 wmi_bmof,dell_smbios,dell_wmi_descriptor
root@pve06-bne-br1:~# sysctl kernel vm
kernel.acct = 4 2       30
kernel.acpi_video_flags = 0
kernel.apparmor_display_secid_mode = 0
kernel.apparmor_restrict_unprivileged_io_uring = 0
kernel.apparmor_restrict_unprivileged_unconfined = 0
kernel.apparmor_restrict_unprivileged_userns = 0
kernel.apparmor_restrict_unprivileged_userns_complain = 0
kernel.apparmor_restrict_unprivileged_userns_force = 0
kernel.arch = x86_64
kernel.auto_msgmni = 0
kernel.bootloader_type = 114
kernel.bootloader_version = 2
kernel.bpf_stats_enabled = 0
kernel.cad_pid = 1
kernel.cap_last_cap = 40
kernel.core_pattern = core
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 0
kernel.ctrl-alt-del = 0
kernel.dmesg_restrict = 1
kernel.domainname = (none)
kernel.firmware_config.force_sysfs_fallback = 0
kernel.firmware_config.ignore_sysfs_fallback = 0
kernel.ftrace_dump_on_oops = 0
kernel.ftrace_enabled = 1
kernel.hardlockup_all_cpu_backtrace = 0
kernel.hardlockup_panic = 0
kernel.hostname = pve06-bne-br1
kernel.hotplug =
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 10
kernel.io_delay_type = 1
kernel.io_uring_disabled = 0
kernel.io_uring_group = -1
kernel.kexec_load_disabled = 0
kernel.kexec_load_limit_panic = -1
kernel.kexec_load_limit_reboot = -1
kernel.keys.gc_delay = 300
kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 2000
kernel.keys.persistent_keyring_expiry = 259200
kernel.keys.root_maxbytes = 25000000
kernel.keys.root_maxkeys = 1000000
kernel.kptr_restrict = 0
kernel.latencytop = 0
kernel.max_lock_depth = 1024
kernel.max_rcu_stall_to_panic = 0
kernel.modprobe = /sbin/modprobe
kernel.modules_disabled = 0
kernel.msg_next_id = -1
kernel.msgmax = 8192
kernel.msgmnb = 16384
kernel.msgmni = 32000
kernel.ngroups_max = 65536
kernel.nmi_watchdog = 1
kernel.ns_last_pid = 3444554
kernel.numa_balancing = 1
kernel.numa_balancing_promote_rate_limit_MBps = 65536
kernel.oops_all_cpu_backtrace = 0
kernel.oops_limit = 10000
kernel.osrelease = 6.8.12-1-pve
kernel.ostype = Linux
kernel.overflowgid = 65534
kernel.overflowuid = 65534
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 0
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.perf_cpu_time_max_percent = 25
kernel.perf_event_max_contexts_per_stack = 8
kernel.perf_event_max_sample_rate = 32000
kernel.perf_event_max_stack = 127
kernel.perf_event_mlock_kb = 516
kernel.perf_event_paranoid = 4
kernel.pid_max = 4194304
kernel.poweroff_cmd = /sbin/poweroff
kernel.print-fatal-signals = 0
kernel.printk = 3       4       1       7
kernel.printk_delay = 0
kernel.printk_devkmsg = on
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
kernel.pty.max = 4096
kernel.pty.nr = 1
kernel.pty.reserve = 1024
kernel.random.boot_id = 41c14646-7cfd-42ef-9394-ea6a4fd1b9b9
kernel.random.entropy_avail = 256
kernel.random.poolsize = 256
kernel.random.urandom_min_reseed_secs = 60
kernel.random.uuid = 8f2cea82-5db8-48cf-9e33-f8f3c53cb940
kernel.random.write_wakeup_threshold = 256
kernel.randomize_va_space = 2
kernel.real-root-dev = 0
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
kernel.seccomp.actions_avail = kill_process kill_thread trap errno user_notif trace log allow
kernel.seccomp.actions_logged = kill_process kill_thread trap errno user_notif trace log
kernel.sem = 32000      1024000000      500     32000
kernel.sem_next_id = -1
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
kernel.soft_watchdog = 1
kernel.softlockup_all_cpu_backtrace = 0
kernel.softlockup_panic = 0
kernel.spl.gitrev = zfs-2.2.4-0-g256659204
kernel.spl.hostid = 29b7dae6
kernel.spl.kmem.slab_kvmem_alloc = 7340032
kernel.spl.kmem.slab_kvmem_max = 7340032
kernel.spl.kmem.slab_kvmem_total = 9510912
kernel.split_lock_mitigate = 1
kernel.stack_tracer_enabled = 0
kernel.sysctl_writes_strict = 1
kernel.sysrq = 438
kernel.tainted = 4097
kernel.task_delayacct = 0
kernel.threads-max = 6180383
kernel.timer_migration = 1
kernel.traceoff_on_warning = 0
kernel.tracepoint_printk = 0
kernel.unknown_nmi_panic = 0
kernel.unprivileged_bpf_disabled = 2
kernel.unprivileged_userns_apparmor_policy = 1
kernel.unprivileged_userns_clone = 1
kernel.user_events_max = 32768
kernel.usermodehelper.bset = 4294967295 511
kernel.usermodehelper.inheritable = 4294967295  511
kernel.version = #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z)
kernel.warn_limit = 0
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-95
kernel.watchdog_thresh = 10
kernel.yama.ptrace_scope = 1
vm.admin_reserve_kbytes = 8192
vm.compact_unevictable_allowed = 1
vm.compaction_proactiveness = 20
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.extfrag_threshold = 500
vm.hugetlb_optimize_vmemmap = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256   256     32      0       0
vm.max_map_count = 262144
vm.memfd_noexec = 0
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 112505
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 65536
vm.mmap_rnd_bits = 32
vm.mmap_rnd_compat_bits = 16
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.numa_stat = 1
vm.numa_zonelist_order = Node
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.page_lock_unfairness = 5
vm.panic_on_oom = 0
vm.percpu_pagelist_high_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.unprivileged_userfaultfd = 0
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.watermark_boost_factor = 15000
vm.watermark_scale_factor = 10
vm.zone_reclaim_mode = 0
root@pve06-bne-br1:~# journalctl | grep -e "soft lockup"
 
On all nodes in the cluster I disabled the two lrm and crm services, and I added a blacklist entry for the softdog module to /etc/modprobe.d/softdog-deny.conf

Noted, and thanks for the rest of the explanation. How about:

Let's quantify it then:

  • These new servers, let's call them no 5, 6 and 7 - how often did they go on random reboot in the past e.g. 90 days (per each)?
  • Can you post multiple log endings (e.g. journalctl -b $BOOTID -n 100 for each) of such crashes?
  • The changes you have described above (blacklisting softdog, etc.) - when did you perform them?

I really would like to see the logs and the timestamps, and to know from what point on there was no softdog module loaded (I would typically see this from the log, but with only the trimmed last 100 lines I might not) ...
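
Something like this would collect the tail of every recorded boot in one go (just a sketch; the awk pattern assumes the usual --list-boots layout):

Code:
for id in $(journalctl --list-boots | awk '/^ *-?[0-9]+ /{print $2}'); do
    journalctl -b "$id" -n 100 > "boot-$id-tail.log"    # last 100 lines of each boot
done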
 
pve08 went 60 days at its best, but had often only gotten 5-7 days prior to that. pve06 and pve07 rebooted after anywhere from 24hrs to 7 days.

Attached are some boot logs of crashes on pve06 and pve07.

I performed the softdog blacklisting on Saturday, the day after I disabled the crm and lrm services (which did not work).
 


I think the watchdog reboot suspicion has all been a red herring in your case...

Code:
Apr 01 21:39:15 pve06-bne-br1 kernel: sd 0:2:6:0: [sdg] tag#3969 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 01 21:39:15 pve06-bne-br1 kernel: sd 0:2:6:0: [sdg] tag#3969 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 05 07:44:09 pve06-bne-br1 kernel: sd 0:2:3:0: [sdd] tag#5026 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Apr 26 21:48:29 pve06-bne-br1 kernel: sd 0:2:5:0: [sdf] tag#1224 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
Jun 14 00:13:06 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2101 BRCM Debug mfi stat 0x2d, data len requested/completed 0x2000/0x0
Jun 14 00:14:53 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2401 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x400
Jun 14 00:14:53 pve07-bne-br1 kernel: sd 0:2:0:0: [sda] tag#2453 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0

all identical builds - dual Xeon Platinum 8160, Dell R640's, 768GB RAM, 8x Enterprise SAS SSD, Dell M.2 Boss card (2x M.2 for boot), dual port 10G NIC.

Can you expand more on the drive array - how exactly is it configured, and is this the same hardware (setup) as you have with the "other 10 nodes", or is anything different there?

(You can also compare whether similar log entries appear on your no-issues cluster...)
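
Even generic output like the following would already tell a lot (nothing controller-specific, just what the kernel sees):

Code:
lsblk -o NAME,MODEL,SIZE,TRAN,ROTA,TYPE             # how the disks are presented to the OS
journalctl -k | grep -i -e megaraid -e "mfi stat"   # any further controller complaints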
 