Hung VM - eventual HA error, blocked tasks inside VM dmesg until hard reboot of node.

scyto

Hi all, I woke up to a VM in an indeterminate state that was causing various black holes on my internal network.

Once I figured out which VM it was, I could ping it, but I could not SSH into it, and I could not get the PVE console to connect to the VM's output.

I tried stopping the VM and couldn't; it transitioned into an indeterminate HA state. Other VMs on the same PVE node were 100% fine.

In my bleary, just-woke-up state I rebooted the PVE node; it would not reboot cleanly, so I had to do a hard power-off.

After the reboot I disabled and re-enabled the failed VM, and it started perfectly.

Looking at the historical dmesg via journalctl, I saw the following inside the VM, and I wonder what on earth could have caused it.
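
(In case it helps anyone reproduce: I pulled the old kernel messages with something like the below; this assumes persistent journald storage is enabled in the guest.)

Code:
# kernel ring buffer (dmesg) from the previous boot
journalctl -k -b -1

# or narrow to the window around the hang
journalctl -k --since "02:00" --until "02:10"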

Anyone have suggestions on where I can look?

I saw multiple different processes log this at around 2 AM, and then dmesg stopped logging entirely (the tasks all had various names; it was not just this one task that failed).


Code:
Mar 10 02:05:21 Docker01 kernel: INFO: task glfs_posixhc:16203 blocked for more than 120 seconds.
Mar 10 02:05:21 Docker01 kernel:       Not tainted 5.10.0-28-amd64 #1 Debian 5.10.209-2
Mar 10 02:05:21 Docker01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 10 02:05:21 Docker01 kernel: task:glfs_posixhc    state:D stack:    0 pid:16203 ppid:     1 flags:0x00000000
Mar 10 02:05:21 Docker01 kernel: Call Trace:
Mar 10 02:05:21 Docker01 kernel:  __schedule+0x282/0x870
Mar 10 02:05:21 Docker01 kernel:  schedule+0x46/0xb0
Mar 10 02:05:21 Docker01 kernel:  percpu_rwsem_wait+0xaf/0x190
Mar 10 02:05:21 Docker01 kernel:  ? percpu_down_write+0xe0/0xe0
Mar 10 02:05:21 Docker01 kernel:  __percpu_down_read+0x4d/0x60
Mar 10 02:05:21 Docker01 kernel:  mnt_want_write+0x77/0xa0
Mar 10 02:05:21 Docker01 kernel:  path_openat+0x1098/0x10c0
Mar 10 02:05:21 Docker01 kernel:  ? finish_task_switch+0x72/0x260
Mar 10 02:05:21 Docker01 kernel:  do_filp_open+0x99/0x140
Mar 10 02:05:21 Docker01 kernel:  ? getname_flags.part.0+0x29/0x1b0
Mar 10 02:05:21 Docker01 kernel:  ? __check_object_size+0x146/0x160
Mar 10 02:05:21 Docker01 kernel:  do_sys_openat2+0x9b/0x150
Mar 10 02:05:21 Docker01 kernel:  __x64_sys_openat+0x54/0xa0
Mar 10 02:05:21 Docker01 kernel:  do_syscall_64+0x30/0x80
Mar 10 02:05:21 Docker01 kernel:  entry_SYSCALL_64_after_hwframe+0x62/0xc7
Mar 10 02:05:21 Docker01 kernel: RIP: 0033:0x7f46fdc617e4
Mar 10 02:05:21 Docker01 kernel: RSP: 002b:00007f46eaffad70 EFLAGS: 00000297 ORIG_RAX: 0000000000000101
Mar 10 02:05:21 Docker01 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f46fdc617e4
Mar 10 02:05:21 Docker01 kernel: RDX: 0000000000000241 RSI: 00007f46eaffb130 RDI: 00000000ffffff9c
Mar 10 02:05:21 Docker01 kernel: RBP: 00007f46eaffb130 R08: 0000000000000001 R09: 00007f46f80274c0
Mar 10 02:05:21 Docker01 kernel: R10: 00000000000001a4 R11: 0000000000000297 R12: 0000000000000241
Mar 10 02:05:21 Docker01 kernel: R13: 0000000000000134 R14: 0000000000000019 R15: 00007f46eaffaf30
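
If this happens again, I'll try to grab the state of the stuck tasks before rebooting (assuming I can get any console at all next time), roughly along these lines:

Code:
# list tasks in uninterruptible sleep (state D) and what they are waiting on
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# ask the kernel to dump all blocked tasks to the kernel log (needs sysrq enabled)
echo w > /proc/sysrq-trigger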
 
On the host, this is the only thing I see in dmesg:


Code:
Mar 10 02:04:54 pve1 systemd-journald[478]: Data hash table of /var/log/journal/40ae389db54d4cb191b4475cc4b7277a/system.journal has a fill level at 75.0 (174763 of 233016 items, 92274688 file size, 527 bytes per hash ta>

This seems to be a recurring entry that is logged at this time every day.
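
To confirm that journald line is just routine housekeeping and not part of the problem, the journal files themselves can be checked, e.g.:

Code:
# how much disk the journal is using
journalctl --disk-usage

# header/metadata of the journal files (shows rotation state)
journalctl --header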
 
Nothing in the host's logs, other than the backup process seeing the hang... I wonder if the backup at 2:05 AM today somehow made the VM fail. The stack trace above (tasks blocked in mnt_want_write) would be consistent with a filesystem that was frozen for the backup and never thawed. Grasping at straws here...


Code:
Mar 10 00:02:42 pve1 pvescheduler[438441]: INFO: Starting Backup of VM 111 (qemu)
Mar 10 00:03:13 pve1 pvescheduler[438441]: INFO: Finished Backup of VM 111 (00:00:31)
Mar 10 02:02:09 pve1 pvescheduler[489435]: INFO: Starting Backup of VM 111 (qemu)
Mar 10 02:05:15 pve1 pvescheduler[489435]: VM 111 qmp command failed - VM 111 qmp command 'guest-fsfreeze-thaw' failed - got timeout
Mar 10 02:05:16 pve1 pvescheduler[489435]: INFO: Finished Backup of VM 111 (00:03:07)
Mar 10 04:01:37 pve1 pvescheduler[541610]: INFO: Starting Backup of VM 111 (qemu)
Mar 10 04:01:40 pve1 pvescheduler[541610]: VM 111 qmp command failed - VM 111 qmp command 'guest-ping' failed - got timeout
Mar 10 04:01:54 pve1 pvescheduler[541610]: INFO: Finished Backup of VM 111 (00:00:17)
Mar 10 06:01:40 pve1 pvescheduler[594047]: INFO: Starting Backup of VM 111 (qemu)
Mar 10 06:01:44 pve1 pvescheduler[594047]: VM 111 qmp command failed - VM 111 qmp command 'guest-ping' failed - got timeout
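
The fsfreeze-thaw timeout points at the QEMU guest agent, so next time I'll poke the agent directly from the host to see whether the guest's filesystems are stuck frozen (commands from memory, syntax may need checking):

Code:
# is the guest agent in VM 111 responding at all?
qm agent 111 ping

# are the guest's filesystems still frozen after the backup?
qm guest cmd 111 fsfreeze-status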
 