[SOLVED] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [khugepaged:114]

ukro · Oct 2, 2021

Greetings,
remote server
ssh working, vms/ct working. Just one CT with connected host storage not working.
I cant stop/start any CT or VM. It is a remote server, possible hardware restart in a day +-.

Any ideas what happened?

Code:

Message from syslogd@matrix at Oct  2 01:38:50 ...
 kernel:[6609484.653892] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [khugepaged:114]

Message from syslogd@matrix at Oct  2 01:39:30 ...
 kernel:[6609524.654195] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]

Message from syslogd@matrix at Oct  2 01:39:30 ...
 kernel:[6609524.654195] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]

Message from syslogd@matrix at Oct  2 01:39:58 ...
 kernel:[6609552.654406] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]

Message from syslogd@matrix at Oct  2 01:39:58 ...
 kernel:[6609552.654406] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]

It is disabled!?

Code:

root@matrix:~# grep -i HugePages_Total /proc/meminfo
HugePages_Total:       0

Code:

top - 01:42:52 up 76 days, 12:01,  2 users,  load average: 22.96, 22.42, 21.31
Tasks: 420 total,   6 running, 409 sleeping,   0 stopped,   5 zombie
%Cpu(s):  0.3 us,  6.7 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32026.9 total,   1521.8 free,  28937.3 used,   1567.9 buff/cache
MiB Swap:   8192.0 total,   5892.2 free,   2299.8 used.   1970.7 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  114 root      39  19       0      0      0 R 100.0   0.0 266:08.88 khugepaged
  113 root      25   5       0      0      0 S   9.3   0.0   7531:01 ksmd
18606 root      20   0 9272480   4.0g   4844 S   4.3  12.7  75729:15 kvm
23619 root      20   0   16628   7828   6748 S   0.3   0.0   0:00.03 sshd
25904 root      20   0   11696   4028   3208 R   0.3   0.0   0:00.01 top
    1 root      20   0  172112   9492   5436 S   0.0   0.0  30:27.89 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   7:43.05 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/

Code:

Filesystem            1K-blocks      Used  Available Use% Mounted on
udev                   16336928         0   16336928   0% /dev
tmpfs                   3279556    357656    2921900  11% /run
/dev/mapper/pve-root   51085808  32698520   15762572  68% /
tmpfs                  16397776     49920   16347856   1% /dev/shm
tmpfs                      5120         0       5120   0% /run/lock
tmpfs                  16397776         0   16397776   0% /sys/fs/cgroup
/dev/nvme0n1p2           523248       312     522936   1% /boot/efi
/dev/fuse                 30720        20      30700   1% /etc/pve
data                 7557464448 520702336 7036762112   7% /data

Code:

root@matrix:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-4
pve-kernel-helper: 6.4-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.12-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Code:

root@matrix:~# grep AnonHugePages /proc/meminfo
AnonHugePages:   3915776 kB

Code:

root@matrix:~# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 1912
thp_fault_alloc 9353
thp_fault_fallback 4011
thp_collapse_alloc 13526
thp_collapse_alloc_failed 72338
thp_file_alloc 0
thp_file_mapped 0
thp_split_page 9431
thp_split_page_failed 2369239777
thp_deferred_split_page 16531
thp_split_pmd 9431
thp_split_pud 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 0

After hardware reset all is good (for now)
Nothing special in graphs:

The system stopped responding and checking CPU usage as the graphs ends. So GUI was not working and the stats data was not gathered.

ukro · Oct 2, 2021

Last data that were in rsyslog before HW restart

Code:

Oct  2 08:40:30 matrix kernel: [6634784.856196] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]
Oct  2 08:40:30 matrix kernel: [6634784.856200] Modules linked in: dm_snapshot xt_multiport xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_conntrack xt_comment iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tcp_diag inet_diag nfsd auth_rpcgss nfs_acl lockd grace veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log snd_hda_codec_hdmi nfnetlink ppdev snd_hda_codec_realtek snd_hda_codec_generic snd_sof_pci snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof_intel_byt intel_rapl_msr snd_sof_intel_ipc intel_rapl_common snd_sof x86_pkg_temp_thermal snd_sof_xtensa_dsp intel_powerclamp snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi kvm_intel ledtrig_audio kvm snd_soc_core snd_compress ac97_bus crct10dif_pclmul mei_hdcp snd_pcm_dmaengine crc32_pclmul ghash_clmulni_intel i915 snd_hda_intel snd_intel_dspcfg snd_hda_codec aesni_intel drm_kms_helper crypto_simd snd_hda_core cryptd glue_helper
Oct  2 08:40:30 matrix kernel: [6634784.856217]  snd_hwdep drm intel_cstate snd_pcm snd_timer i2c_algo_bit mei_me fb_sys_fops snd syscopyarea sysfillrect eeepc_wmi sysimgblt soundcore mei asus_wmi pcspkr zfs(PO) sparse_keymap wmi_bmof mxm_wmi zunicode(PO) zzstd(O) parport_pc parport zlua(O) zavl(PO) icp(PO) mac_hid acpi_pad acpi_tad zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi wireguard ip6_udp_tunnel udp_tunnel nct6775 hwmon_vid coretemp vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c r8169 xhci_pci ahci realtek xhci_hcd libahci wmi video
Oct  2 08:40:30 matrix kernel: [6634784.856281] CPU: 6 PID: 114 Comm: khugepaged Tainted: P      D    O L    5.4.124-1-pve #1
Oct  2 08:40:30 matrix kernel: [6634784.856297] Hardware name: ASUS System Product Name/PRIME B460M-A, BIOS 1605 04/07/2021
Oct  2 08:40:30 matrix kernel: [6634784.856300] RIP: 0010:native_queued_spin_lock_slowpath+0x60/0x1c0
Oct  2 08:40:30 matrix kernel: [6634784.856302] Code: 6e f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 48 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
Oct  2 08:40:30 matrix kernel: [6634784.856304] RSP: 0018:ffffa02e404c37b0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
Oct  2 08:40:30 matrix kernel: [6634784.856305] RAX: 0000000000000141 RBX: ffff8b131c090848 RCX: 0000000000000000
Oct  2 08:40:30 matrix kernel: [6634784.856306] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8b131c090848
Oct  2 08:40:30 matrix kernel: [6634784.856307] RBP: ffffa02e404c37b0 R08: 0000000000000000 R09: ffff8b142375b738
Oct  2 08:40:30 matrix kernel: [6634784.856308] R10: ffff8b13d059ef50 R11: 0000000000000001 R12: ffff8b0d8ccef838
Oct  2 08:40:30 matrix kernel: [6634784.856309] R13: ffff8b131c090828 R14: ffff8b0d8ccef838 R15: 0000000000000000
Oct  2 08:40:30 matrix kernel: [6634784.856310] FS:  0000000000000000(0000) GS:ffff8b148f180000(0000) knlGS:0000000000000000
Oct  2 08:40:30 matrix kernel: [6634784.856312] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  2 08:40:30 matrix kernel: [6634784.856312] CR2: 00007fb82c5b70a8 CR3: 00000003bc00a004 CR4: 00000000007626e0
Oct  2 08:40:30 matrix kernel: [6634784.856313] PKRU: 55555554
Oct  2 08:40:30 matrix kernel: [6634784.856314] Call Trace:
Oct  2 08:40:30 matrix kernel: [6634784.856318]  _raw_spin_lock+0x1f/0x30
Oct  2 08:40:30 matrix kernel: [6634784.856348]  dmu_buf_replace_user+0x60/0x90 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856364]  dmu_buf_remove_user+0x10/0x20 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856387]  sa_handle_destroy+0x42/0xe0 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856412]  zfs_zinactive+0x96/0xf0 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856436]  zfs_inactive+0x88/0x220 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856438]  ? truncate_pagecache+0x5a/0x70
Oct  2 08:40:30 matrix kernel: [6634784.856463]  zpl_evict_inode+0x43/0x60 [zfs]
Oct  2 08:40:30 matrix kernel: [6634784.856465]  evict+0xca/0x1a0
Oct  2 08:40:30 matrix kernel: [6634784.856466]  dispose_list+0x39/0x50
Oct  2 08:40:30 matrix kernel: [6634784.856468]  prune_icache_sb+0x5a/0x80
Oct  2 08:40:30 matrix kernel: [6634784.856470]  super_cache_scan+0x12b/0x1b0
Oct  2 08:40:30 matrix kernel: [6634784.856472]  do_shrink_slab+0x130/0x2c0
Oct  2 08:40:30 matrix kernel: [6634784.856473]  shrink_slab+0xa9/0x2a0
Oct  2 08:40:30 matrix kernel: [6634784.856475]  shrink_node+0xcb/0x410
Oct  2 08:40:30 matrix kernel: [6634784.856476]  do_try_to_free_pages+0xcf/0x3a0
Oct  2 08:40:30 matrix kernel: [6634784.856477]  try_to_free_pages+0xee/0x1d0
Oct  2 08:40:30 matrix kernel: [6634784.856479]  __alloc_pages_slowpath+0x431/0xe30
Oct  2 08:40:30 matrix kernel: [6634784.856481]  ? __switch_to+0x85/0x480
Oct  2 08:40:30 matrix kernel: [6634784.856483]  ? __switch_to_asm+0x34/0x70
Oct  2 08:40:30 matrix kernel: [6634784.856484]  __alloc_pages_nodemask+0x2df/0x330
Oct  2 08:40:30 matrix kernel: [6634784.856486]  khugepaged_alloc_page+0x1b/0x50
Oct  2 08:40:30 matrix kernel: [6634784.856487]  khugepaged+0x1108/0x2280
Oct  2 08:40:30 matrix kernel: [6634784.856489]  ? syscall_return_via_sysret+0x10/0x7f
Oct  2 08:40:30 matrix kernel: [6634784.856490]  ? __switch_to+0x81/0x480
Oct  2 08:40:30 matrix kernel: [6634784.856491]  ? wait_woken+0x80/0x80
Oct  2 08:40:30 matrix kernel: [6634784.856494]  kthread+0x120/0x140
Oct  2 08:40:30 matrix kernel: [6634784.856495]  ? collapse_pte_mapped_thp+0x410/0x410
Oct  2 08:40:30 matrix kernel: [6634784.856496]  ? kthread_park+0x90/0x90
Oct  2 08:40:30 matrix kernel: [6634784.856497]  ret_from_fork+0x1f/0x40
Oct  2 08:40:47 matrix QEMU[18606]: kvm: terminating on signal 15 from pid 758 (/usr/sbin/qmeventd)
Oct  2 08:40:50 matrix systemd[1]: 102.scope: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped 102.scope.
Oct  2 08:40:50 matrix systemd[1]: Stopped target Graphical Interface.
Oct  2 08:40:50 matrix systemd[1]: Stopped target Timers.
Oct  2 08:40:50 matrix systemd[1]: systemd-tmpfiles-clean.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Oct  2 08:40:50 matrix systemd[1]: apt-daily-upgrade.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Daily apt upgrade and clean activities.
Oct  2 08:40:50 matrix systemd[1]: pvesr.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Proxmox VE replication runner.
Oct  2 08:40:50 matrix systemd[1]: Stopping LVM event activation on device 259:3...
Oct  2 08:40:50 matrix systemd[1]: man-db.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Daily man-db regeneration.
Oct  2 08:40:50 matrix systemd[1]: Stopping Availability of block devices...
Oct  2 08:40:50 matrix systemd[1]: Stopped target RPC Port Mapper.
Oct  2 08:40:50 matrix systemd[1]: Stopped target Sound Card.
Oct  2 08:40:50 matrix systemd[1]: pve-daily-update.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Daily PVE download activities.
Oct  2 08:40:50 matrix systemd[1]: lvm2-lvmpolld.socket: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Closed LVM2 poll daemon socket.
Oct  2 08:40:50 matrix systemd[1]: Stopped target Multi-User System.
Oct  2 08:40:50 matrix systemd[1]: Stopping PVE guests...
Oct  2 08:40:50 matrix systemd[1]: postfix.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Postfix Mail Transport Agent.
Oct  2 08:40:50 matrix systemd[1]: Stopping LXC Container Monitoring Daemon...
Oct  2 08:40:50 matrix systemd[1]: 100.scope: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped 100.scope.
Oct  2 08:40:50 matrix systemd[1]: systemd-rfkill.socket: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Oct  2 08:40:50 matrix systemd[1]: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
Oct  2 08:40:50 matrix systemd[1]: apt-daily.timer: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Daily apt download activities.
Oct  2 08:40:50 matrix systemd[1]: Stopping LSB: disk temperature monitoring daemon...
Oct  2 08:40:50 matrix systemd[1]: Stopping Self Monitoring and Reporting Technology (SMART) Daemon...
Oct  2 08:40:50 matrix systemd[1]: Removed slice qemu.slice.
Oct  2 08:40:50 matrix smartd[746]: smartd received signal 15: Terminated
Oct  2 08:40:50 matrix smartd[746]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.ST8000VN004_2M2101-WS4014NA.ata.state
Oct  2 08:40:50 matrix systemd[1]: lm-sensors.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Initialize hardware monitoring sensors.
Oct  2 08:40:50 matrix smartd[746]: Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.SAMSUNG_MZVPW256HEGL_00000-S346NY0J231222.nvme.state
Oct  2 08:40:50 matrix smartd[746]: smartd is exiting (exit status 0)
Oct  2 08:40:50 matrix systemd[1]: Stopped target Login Prompts.
Oct  2 08:40:50 matrix systemd[1]: Stopping Regular background program processing daemon...
Oct  2 08:40:50 matrix blkdeactivate[17136]: Deactivating block devices:
Oct  2 08:40:50 matrix systemd[1]: Unmounting RPC Pipe File System...
Oct  2 08:40:50 matrix systemd[1]: Stopping Postfix Mail Transport Agent (instance -)...
Oct  2 08:40:50 matrix systemd[1]: Stopping Getty on tty1...
Oct  2 08:40:50 matrix systemd[1]: Stopping Login Service...
Oct  2 08:40:50 matrix blkdeactivate[17136]:   [SKIP]: unmount of pve-swap (dm-0) mounted on [SWAP]
Oct  2 08:40:50 matrix blkdeactivate[17136]:   [SKIP]: unmount of pve-root (dm-1) mounted on /
Oct  2 08:40:50 matrix systemd[1]: smartmontools.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
Oct  2 08:40:50 matrix systemd[1]: ksmtuned.service: Main process exited, code=killed, status=15/TERM
Oct  2 08:40:50 matrix systemd[1]: ksmtuned.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Kernel Samepage Merging (KSM) Tuning Daemon.
Oct  2 08:40:50 matrix systemd[1]: lxc-monitord.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped LXC Container Monitoring Daemon.
Oct  2 08:40:50 matrix systemd[1]: getty@tty1.service: Main process exited, code=killed, status=15/TERM
Oct  2 08:40:50 matrix systemd[1]: getty@tty1.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Getty on tty1.
Oct  2 08:40:50 matrix systemd[1]: cron.service: Main process exited, code=killed, status=15/TERM
Oct  2 08:40:50 matrix systemd[1]: cron.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Regular background program processing daemon.
Oct  2 08:40:50 matrix systemd[1]: Removed slice system-getty.slice.
Oct  2 08:40:50 matrix systemd[1]: hddtemp.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped LSB: disk temperature monitoring daemon.
Oct  2 08:40:50 matrix postfix/postfix-script[17211]: stopping the Postfix mail system
Oct  2 08:40:50 matrix systemd[1]: blk-availability.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Availability of block devices.
Oct  2 08:40:50 matrix postfix/master[1139]: terminating on signal 15
Oct  2 08:40:50 matrix systemd[1]: postfix@-.service: Succeeded.
Oct  2 08:40:50 matrix systemd[1]: Stopped Postfix Mail Transport Agent (instance -).
Oct  2 08:40:50 matrix systemd[1]: Removed slice system-postfix.slice.
Oct  2 08:40:52 matrix qmeventd[755]: cleanup failed, terminating pid '18606' with SIGKILL
Oct  2 08:40:58 matrix kernel: [6634812.856421] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [khugepaged:114]

From top nothing special,
NVME/HDD speeds are ok.

TuxBRAiN · Dec 3, 2021

We are also experiencing this issue now from months, since we upgraded to Proxmox 7. Unfortunately, many users are reporting this exact same bug (curiously, most of them had the CPU locked exactly for 22 or 23 seconds), but for now there is no clear solution nor workaround...

tom · Dec 4, 2021

@TuxBRAiN please open a new thread, include all details, as this thread is about a 6.x installation.

- your hardware
- your storage
- pveversion -v
- qm config VMID

ukro · Jan 19, 2022

Updated to pve7, but had different error afterwards
solved with:
https://forum.proxmox.com/threads/b...-dereference-address-0000000000000000.102741/

Search

Search

[SOLVED] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [khugepaged:114]

ukro

Member

ukro

Member

TuxBRAiN

Active Member

tom

Proxmox Staff Member

ukro

Member

We value your privacy