Hi all.
I have a 3-node Proxmox 5.2 cluster with a Ceph Luminous backend, using KRBD devices as storage.
Under certain circumstances (after a period of high load on the server), a KVM machine freezes, regardless of the guest OS.
After stopping the machine from the web interface, the RBD device remains mapped. Even a forced unmap does not work; only a reboot clears it.
Code:
root@pve3:/etc/ceph# rbd showmapped | grep 103
10 rbd vm-103-disk-1 - /dev/rbd10
root@pve3:/etc/ceph# rbd unmap /dev/rbd10
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy
root@pve3:/etc/ceph# rbd unmap -o force /dev/rbd10
... this hangs, and only a reboot helps.
root@pve3:~# cat /proc/1846502/stack
[<ffffffff9c0388c7>] blk_mq_freeze_queue_wait+0x57/0xb0
[<ffffffff9c039d9a>] blk_mq_freeze_queue+0x1a/0x20
[<ffffffffc0ce3492>] do_rbd_remove.isra.25+0x202/0x260 [rbd]
[<ffffffffc0ce3504>] rbd_remove_single_major+0x14/0x20 [rbd]
[<ffffffff9c1ebb07>] bus_attr_store+0x27/0x30
[<ffffffff9bee303c>] sysfs_kf_write+0x3c/0x50
[<ffffffff9bee28e3>] kernfs_fop_write+0x123/0x1b0
[<ffffffff9be556ab>] __vfs_write+0x1b/0x40
[<ffffffff9be56425>] vfs_write+0xb5/0x1a0
[<ffffffff9be57ae5>] SyS_write+0x55/0xc0
[<ffffffff9c6001a1>] entry_SYSCALL_64_fastpath+0x24/0xab
[<ffffffffffffffff>] 0xffffffffffffffff
root@pve3:~# uname -a
Linux pve3 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) x86_64 GNU/Linux
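From the stack it looks like the unmap is stuck in blk_mq_freeze_queue_wait, i.e. the kernel is waiting for in-flight requests on the rbd device to drain, and they never do. For what it's worth, this is roughly what I would check before rebooting to see what still references the device (just a sketch using the image/device names from my output above, not verified on a hung node):

Code:
# is the KVM process for VM 103 really gone, or stuck in D state?
ps aux | grep -- '-id 103'

# anything still holding /dev/rbd10 open?
fuser -vm /dev/rbd10
lsof /dev/rbd10

# upper layers (partitions, LVM, ...) stacked on top of the device?
ls /sys/block/rbd10/holders/

# in-flight request count (9th field of the stat file)
cat /sys/block/rbd10/stat

# does Ceph still see a watcher on the image?
rbd status vm-103-disk-1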
Does anyone have an idea why this happens? I've found a few threads on the Ceph mailing list saying that this can occur under high VM (memory) pressure, but no solution was offered there.
P.S. I will upgrade to the latest kernel now, but I don't think it will solve the problem.