I have a Dell R710 running Proxmox 6.2. The host has a mix of local storage and other storage presented via iSCSI/NFS. The local storage consists of 2x500GB drives in RAID 1 for the Proxmox OS, and 2x500GB SSDs in RAID 1.
Most of the VMs run on the iSCSI share, while a few run on the local SSD storage.
I have been running this in my homelab for a while now, but I have suddenly started to have an issue I am not sure how to track down. The host can be online for about a week, then all of a sudden several of my VMs go offline.
When this happens I cannot even access their consoles in Proxmox; they usually time out, and I believe the error says "error waiting on systemd". What the affected VMs have in common is that they are all on the SSD storage, but not every VM on that storage is affected. For example, pfSense also runs on these SSDs, and it continues to work, including console access.
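When a VM is in this state, here is roughly what I check from the host to capture its state (a minimal sketch; 100 is a placeholder for the affected VMID):
Code:
# Is the VM process still considered running by Proxmox?
qm status 100

# Which storage and devices does the VM use?
qm config 100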
It is also interesting that when this happens, the I/O delay on the host shoots up to about 8% and stays there.
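Next time it happens I plan to watch per-device statistics to see which disk the delay is coming from, something like this (assumes the sysstat package is installed):
Code:
# Extended per-device stats every 5 seconds; high await/%util
# on the SSD RAID device would point at the local SSD storage
iostat -x 5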
I find this error in the syslog, but I'm not sure what it means or what next steps I should take with it.
Jun 18 03:10:26 compute1 kernel: [623467.883697] INFO: task kvm:2361 blocked for more than 241 seconds.
Jun 18 03:10:26 compute1 kernel: [623467.883732] Tainted: P IOE 5.4.34-1-pve #1
Jun 18 03:10:26 compute1 kernel: [623467.883752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 18 03:10:26 compute1 kernel: [623467.883777] kvm D 0 2361 1 0x00000000
Jun 18 03:10:26 compute1 kernel: [623467.883779] Call Trace:
Jun 18 03:10:26 compute1 kernel: [623467.883787] __schedule+0x2e6/0x700
Jun 18 03:10:26 compute1 kernel: [623467.883789] schedule+0x33/0xa0
Jun 18 03:10:26 compute1 kernel: [623467.883790] schedule_preempt_disabled+0xe/0x10
Jun 18 03:10:26 compute1 kernel: [623467.883792] __mutex_lock.isra.10+0x2c9/0x4c0
Jun 18 03:10:26 compute1 kernel: [623467.883823] ? kvm_arch_vcpu_put+0xe2/0x170 [kvm]
Jun 18 03:10:26 compute1 kernel: [623467.883825] __mutex_lock_slowpath+0x13/0x20
Jun 18 03:10:26 compute1 kernel: [623467.883826] mutex_lock+0x2c/0x30
Jun 18 03:10:26 compute1 kernel: [623467.883828] sr_block_ioctl+0x43/0xd0
Jun 18 03:10:26 compute1 kernel: [623467.883832] blkdev_ioctl+0x4c1/0x9e0
Jun 18 03:10:26 compute1 kernel: [623467.883835] block_ioctl+0x3d/0x50
Jun 18 03:10:26 compute1 kernel: [623467.883837] do_vfs_ioctl+0xa9/0x640
Jun 18 03:10:26 compute1 kernel: [623467.883838] ksys_ioctl+0x67/0x90
Jun 18 03:10:26 compute1 kernel: [623467.883840] __x64_sys_ioctl+0x1a/0x20
Jun 18 03:10:26 compute1 kernel: [623467.883843] do_syscall_64+0x57/0x190
Jun 18 03:10:26 compute1 kernel: [623467.883846] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 18 03:10:26 compute1 kernel: [623467.883848] RIP: 0033:0x7f2e40f97427
Jun 18 03:10:26 compute1 kernel: [623467.883852] Code: Bad RIP value.
Jun 18 03:10:26 compute1 kernel: [623467.883853] RSP: 002b:00007f2d75ffa098 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 18 03:10:26 compute1 kernel: [623467.883855] RAX: ffffffffffffffda RBX: 00007f2e33af4850 RCX: 00007f2e40f97427
Jun 18 03:10:26 compute1 kernel: [623467.883856] RDX: 000000007fffffff RSI: 0000000000005326 RDI: 0000000000000012
Jun 18 03:10:26 compute1 kernel: [623467.883856] RBP: 0000000000000001 R08: 0000559be29be890 R09: 0000000000000000
Jun 18 03:10:26 compute1 kernel: [623467.883857] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d74a42268
Jun 18 03:10:26 compute1 kernel: [623467.883858] R13: 0000000000000000 R14: 0000559be2ef0d20 R15: 0000559be27fc740
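The hung-task warning only reports the first blocked task, so next time this happens I plan to dump every task stuck in uninterruptible sleep, roughly like this (a sketch; sysrq may need to be enabled first):
Code:
# Temporarily allow all sysrq functions
echo 1 > /proc/sys/kernel/sysrq
# Ask the kernel to log the stacks of all blocked (D-state) tasks
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

# List processes currently in uninterruptible sleep
ps -eo pid,stat,comm | awk '$2 ~ /^D/'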
Steps I have taken so far:
I ran memtest for one full pass; no issues were found.
I ran SMART tests on all the local drives (see the smartctl sketch below); no issues were found.
I removed the SSDs from the host, checked their status in Windows, ran additional tests (no issues were found), and applied the available firmware updates.
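For reference, the SMART tests above were roughly along these lines (the /dev/sda name is a placeholder for each local drive; since the R710's drives sit behind a PERC controller, they may need to be addressed with -d megaraid,N):
Code:
# Start a long self-test, then read back the results
smartctl -t long /dev/sda
smartctl -a /dev/sda

# If the drive is behind the PERC/megaraid controller
smartctl -a -d megaraid,0 /dev/sda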
Any help with next steps, or details on what the message above might mean, would be appreciated!
Edit: If it helps, pveversion output is below:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1