CEPH + KRBD for VM images: strange hangs

Timofey

New Member
Aug 16, 2018
Code:
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.7-pve1

We have 12 OSDs on 6 hosts; all hosts are used for both VMs and CEPH (33 VMs running). The VMs have poor performance, so we decided to do some tests with KRBD.


So...

- Mark storage1_ct (krbd) for use with VM images (just for testing; we have no containers)
- Move one image from storage1_vm to storage1_ct
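For reference, the two storage definitions in /etc/pve/storage.cfg look roughly like this (a sketch only - the exact options may differ; the pool name is taken from the /dev/rbd/storage1/... path shown further down):
Code:
rbd: storage1_vm
        content images
        krbd 0
        pool storage1
        username admin

rbd: storage1_ct
        content images,rootdir
        krbd 1
        pool storage1
        username admin
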
VM config:
Code:
root@vhost4:~# cat /etc/pve/qemu-server/165.conf
#start%3A %D0%A1%D1%82%D0%B0%D1%80%D1%82%D0%BE%D0%B2%D0%B0%D1%82%D1%8C %D0%B2%D1%80%D1%83%D1%87%D0%BD%D1%83%D1%8E
#project%3A %D0%98%D0%A1%D0%9E%D0%93%D0%94 %D0%94%D0%B0%D0%B3%D0%B5%D1%81%D1%82%D0%B0%D0%BD
#role%3A %D1%81%D0%B5%D1%80%D0%B2%D0%B5%D1%80 %D0%91%D0%94
#owner%3A %D0%A1%D0%BF%D0%B8%D1%80%D0%B8%D0%BD
#db%3A Oracle Database 12c Standard Edition Release 12.2.0.1.0 - 64bit
agent: 1
balloon: 5120
boot: cdn
bootdisk: scsi0
cores: 2
cpu: host
ide2: none,media=cdrom
memory: 7168
name: vm-227-oradb-ogd-da
net0: virtio=62:39:39:19:96:01,bridge=vmbr0
numa: 0
ostype: win10
scsi0: storage1_vm:vm-165-disk-2,cache=writeback,discard=on,size=60G
scsi2: storage1_ct:vm-165-disk-3,discard=on,size=151G
scsihw: virtio-scsi-pci
smbios1: uuid=4b66ff2a-cf95-480c-9d70-43319400a968
sockets: 1

- Do tests (2x to 7x IO improvement in database tasks) - wow! We need this now!
- Just to be sure, let's test some other things
-- Make snapshot -- passed
-- Remove snapshot -- passed
-- Make backup of the running VM -- hung!
- Waited for 30 minutes at 32% with no ceph IO, then cancelled the backup
Code:
2019-07-06 21:54:19 INFO: Starting Backup of VM 165 (qemu)
2019-07-06 21:54:19 INFO: status = running
2019-07-06 21:54:19 INFO: update VM 165: -lock backup
2019-07-06 21:54:19 INFO: VM Name: vm-227-oradb-ogd-da
2019-07-06 21:54:19 INFO: include disk 'scsi0' 'storage1_vm:vm-165-disk-2' 60G
2019-07-06 21:54:19 INFO: include disk 'scsi2' 'storage1_ct:vm-165-disk-3' 150G
2019-07-06 21:54:20 INFO: backup mode: snapshot
2019-07-06 21:54:20 INFO: ionice priority: 7
2019-07-06 21:54:20 INFO: creating archive '/mnt/pve/nas_backup/dump/vzdump-qemu-165-2019_07_06-21_54_19.vma.gz'
2019-07-06 21:54:26 INFO: started backup task '6a041bc0-cce9-44a3-ade5-0835b0b749e8'
2019-07-06 21:54:29 INFO: status: 0% (180355072/225485783040), sparse 0% (59043840), duration 3, read/write 60/40 MB/s
2019-07-06 21:55:44 INFO: status: 1% (2273312768/225485783040), sparse 0% (379441152), duration 78, read/write 27/23 MB/s
2019-07-06 21:57:19 INFO: status: 2% (4513071104/225485783040), sparse 0% (688300032), duration 173, read/write 23/20 MB/s
2019-07-06 21:58:58 INFO: status: 3% (6790578176/225485783040), sparse 0% (1219534848), duration 272, read/write 23/17 MB/s
2019-07-06 22:00:38 INFO: status: 4% (9030336512/225485783040), sparse 0% (1818075136), duration 372, read/write 22/16 MB/s
2019-07-06 22:02:10 INFO: status: 5% (11282677760/225485783040), sparse 1% (2338820096), duration 464, read/write 24/18 MB/s
2019-07-06 22:03:51 INFO: status: 6% (13530824704/225485783040), sparse 1% (2787958784), duration 565, read/write 22/17 MB/s
2019-07-06 22:05:01 INFO: status: 7% (15791554560/225485783040), sparse 1% (3808411648), duration 635, read/write 32/17 MB/s
2019-07-06 22:06:07 INFO: status: 8% (18102616064/225485783040), sparse 2% (5092950016), duration 701, read/write 35/15 MB/s
2019-07-06 22:07:20 INFO: status: 9% (20296237056/225485783040), sparse 2% (5978890240), duration 774, read/write 30/17 MB/s
2019-07-06 22:08:30 INFO: status: 10% (22570860544/225485783040), sparse 3% (7112863744), duration 844, read/write 32/16 MB/s
2019-07-06 22:09:31 INFO: status: 11% (24855445504/225485783040), sparse 3% (8441536512), duration 905, read/write 37/15 MB/s
2019-07-06 22:10:36 INFO: status: 12% (27262976000/225485783040), sparse 4% (9513435136), duration 970, read/write 37/20 MB/s
2019-07-06 22:11:27 INFO: status: 13% (29322379264/225485783040), sparse 4% (10651230208), duration 1021, read/write 40/18 MB/s
2019-07-06 22:12:37 INFO: status: 14% (31872516096/225485783040), sparse 5% (12083429376), duration 1091, read/write 36/15 MB/s
2019-07-06 22:12:55 INFO: status: 15% (33822867456/225485783040), sparse 6% (13661736960), duration 1109, read/write 108/20 MB/s
2019-07-06 22:13:02 INFO: status: 16% (36138123264/225485783040), sparse 7% (15833014272), duration 1116, read/write 330/20 MB/s
2019-07-06 22:13:05 INFO: status: 18% (40978350080/225485783040), sparse 9% (20673241088), duration 1119, read/write 1613/0 MB/s
2019-07-06 22:13:08 INFO: status: 20% (45772439552/225485783040), sparse 11% (25467330560), duration 1122, read/write 1598/0 MB/s
2019-07-06 22:13:11 INFO: status: 22% (50524585984/225485783040), sparse 13% (30219476992), duration 1125, read/write 1584/0 MB/s
2019-07-06 22:13:14 INFO: status: 24% (55314481152/225485783040), sparse 15% (35009372160), duration 1128, read/write 1596/0 MB/s
2019-07-06 22:13:17 INFO: status: 26% (60033073152/225485783040), sparse 17% (39727964160), duration 1131, read/write 1572/0 MB/s
2019-07-06 22:13:20 INFO: status: 28% (64428638208/225485783040), sparse 19% (44120440832), duration 1134, read/write 1465/1 MB/s
2019-07-06 22:13:56 INFO: status: 29% (65407942656/225485783040), sparse 19% (44120788992), duration 1170, read/write 27/27 MB/s
2019-07-06 22:15:28 INFO: status: 30% (67665068032/225485783040), sparse 19% (44170231808), duration 1262, read/write 24/23 MB/s
2019-07-06 22:16:55 INFO: status: 31% (69915181056/225485783040), sparse 19% (44170235904), duration 1349, read/write 25/25 MB/s
2019-07-06 22:19:03 INFO: status: 32% (72186462208/225485783040), sparse 19% (44170244096), duration 1477, read/write 17/17 MB/s
2019-07-06 22:49:59 ERROR: interrupted by signal
2019-07-06 22:49:59 INFO: aborting backup job
2019-07-06 22:50:00 ERROR: Backup of VM 165 failed - interrupted by signal

- Unusual syslog entry. The timing matches when the backup hung. First time I've seen it, never before in almost a year of using ceph
Code:
2019-07-06 22:19:57.015180 osd.7 osd.7 10.10.10.104:6800/1946843 2722 : cluster [ERR] 1.75 missing primary copy of 1:ae4bb006:::rbd_data.81fdd46b8b4567.0000000000000832:head, will try copies on 2,5

OSD log:
Code:
2019-07-06 22:19:57.015138 7f7e4a4c3700 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x25000, got 0x6706be76, expected 0xedf78c30, device location [0x235bdbc5000~1000], logical extent 0x3a5000~1000, object #1:ae4bb006:::rbd_data.81fdd46b8b4567.0000000000000832:head#
2019-07-06 22:19:57.015178 7f7e4a4c3700 -1 log_channel(cluster) log [ERR] : 1.75 missing primary copy of 1:ae4bb006:::rbd_data.81fdd46b8b4567.0000000000000832:head, will try copies on 2,5

ceph status HEALTH_OK
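
If the crc error comes back, the usual Luminous-era way to pin it down would be something like the following (not run here at the time, listed for completeness):
Code:
# check details beyond plain HEALTH_OK, then deep-scrub and inspect PG 1.75
ceph health detail
ceph pg deep-scrub 1.75
rados list-inconsistent-obj 1.75 --format=json-pretty
# only repair after confirming which copy is bad
ceph pg repair 1.75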


- Backup aborted, but the web interface still shows the task as running with a spinning icon
- Try to run another backup
Code:
INFO: starting new backup job: vzdump 165 --compress gzip --remove 0 --node vhost4 --storage nas_backup --mode snapshot
INFO: Starting Backup of VM 165 (qemu)
INFO: status = running
INFO: update VM 165: -lock backup
INFO: VM Name: vm-227-oradb-ogd-da
INFO: include disk 'scsi0' 'storage1_vm:vm-165-disk-2' 60G
INFO: include disk 'scsi2' 'storage1_ct:vm-165-disk-3' 150G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/nas_backup/dump/vzdump-qemu-165-2019_07_06-22_50_49.vma.gz'
ERROR: Node 'drive-scsi2' is busy: block device is in use by block job: backup
INFO: aborting backup job
ERROR: Backup of VM 165 failed - Node 'drive-scsi2' is busy: block device is in use by block job: backup
INFO: Backup job finished with errors
TASK ERROR: job errors
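
One option at this point might have been to cancel the stuck block job through the QEMU monitor (untested in this case; it assumes the monitor still responds and the job can be cancelled):
Code:
root@vhost4:~# qm monitor 165
qm> info block-jobs
qm> block_job_cancel drive-scsi2
qm> quit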

- Shutdown VM
- Try to start VM
Code:
rbd: sysfs write failed
can't unmap rbd volume vm-165-disk-3: rbd: sysfs write failed
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 165.scope already exists.
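
The 'Unit 165.scope already exists' part can usually be cleared by removing the leftover systemd scope before starting again (assuming the old kvm process is really gone; the hung krbd mapping is a separate problem):
Code:
root@vhost4:~# systemctl status 165.scope        # check whether anything is still running inside it
root@vhost4:~# systemctl stop 165.scope
root@vhost4:~# systemctl reset-failed 165.scope  # only needed if the scope is in a failed state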

- Try to move the image to another storage (just to clean up things, locks, whatever...) - hangs
Code:
create full clone of drive scsi2 (storage1_ct:vm-165-disk-3)
transferred: 0 bytes remaining: 162135015424 bytes total: 162135015424 bytes progression: 0.00 %

- Cancel it - the task is marked as stopped, but:
Code:
root@vhost4:~# ps -ef | grep 165
root     1947665       1  0 Jul06 ?        00:00:00 /usr/bin/qemu-img convert -p -n -f raw -O raw /dev/rbd/storage1/vm-165-disk-3 zeroinit:rbd:storage1/vm-165-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/storage1_vm.keyring

- Try to understand / do something!!111
Code:
root@vhost4:~# cat /proc/1947665/syscall
2 0x7f77c44f30e0 0x80800 0x0 0x7f77c48001c8 0x23901f00 0xaaaaaaab 0x7ffc3c531fc8 0x7f77d3769820

Code:
root@vhost4:~# cat /proc/1947665/stack
[<ffffffffa169276d>] __blkdev_get+0x6d/0x430
[<ffffffffa1693715>] blkdev_get+0x125/0x310
[<ffffffffa1693e71>] blkdev_open+0x81/0xf0
[<ffffffffa164e64c>] do_dentry_open+0x1fc/0x300
[<ffffffffa164fa7e>] vfs_open+0x4e/0x80
[<ffffffffa1661eb4>] path_openat+0x614/0x1490
[<ffffffffa16640b9>] do_filp_open+0x99/0x110
[<ffffffffa164fea5>] do_sys_open+0x135/0x280
[<ffffffffa165000e>] SyS_open+0x1e/0x20
[<ffffffffa1d347fb>] entry_SYSCALL_64_fastpath+0x1e/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

Code:
root@vhost4:~# rbd unmap -o force storage1/vm-165-disk-3
... hung
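
In case it helps someone hitting the same thing, these are the checks I would try before rebooting the host (they only show who still holds the device, nothing here force-releases it):
Code:
root@vhost4:~# rbd showmapped                                   # which images are mapped and on which /dev/rbdX
root@vhost4:~# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'    # processes stuck in uninterruptible sleep
root@vhost4:~# lsof /dev/rbd/storage1/vm-165-disk-3             # open handles on the mapped device
root@vhost4:~# dmesg | tail -n 50                               # recent libceph / rbd kernel messages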

- For now: the VM has been restored from backup under another VM ID, so no data loss and no downtime


Questions...
0. What is the cause of this situation? KRBD? The crc failure? The backup? A combination?
1. What have I done wrong in this situation?
2. What should I do the next time this happens?
3. Can we use KRBD for VM storage?
4. How can librbd be so bad? We achieve a 2x-7x IO improvement with krbd, which means librbd adds more overhead than a 1Gbit(!) network plus SATA disks do. Is librbd broken?

5. How can we clean up the locked VM 165 image without rebooting the host?

Thank you
 
Please upgrade your system to the latest packages, as there have been numerous updates.

0. What is the cause of this situation? KRBD? The crc failure? The backup? A combination?
Unclear, but as you noted, the cluster has performance issues. I would suspect that some parts of the OSDs are swapped out and couldn't be read in time from the disks. Please upgrade.
https://tracker.ceph.com/issues/22464

3. Can we use KRBD for VM storage?
Yes.

4. How can librbd be so bad? We achieve a 2x-7x IO improvement with krbd, which means librbd adds more overhead than a 1Gbit(!) network plus SATA disks do. Is librbd broken?
1 GbE is a bottleneck for a hyper-converged Proxmox VE + Ceph cluster, as the Ceph client traffic and the MON and OSD traffic all share the same network.

librbd and krbd work a little differently from each other.
  • librbd has its own cache, while krbd uses the kernel page cache. The default cache size is 32 MiB for librbd.
    These can be set in the ceph.conf (see the sketch after this list). http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/
  • Qemu needs a mapped device with krbd, while it can communicate directly with the cluster through librbd.
  • New features land in librbd faster than in krbd.
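
A minimal sketch of those client-side settings (example values only, not a recommendation - tune them to your memory budget):
Code:
# [client] section of ceph.conf -- example values
[client]
    rbd cache = true
    rbd cache size = 134217728               # 128 MiB per image instead of the 32 MiB default
    rbd cache max dirty = 100663296          # must stay below rbd cache size for writeback to work
    rbd cache writethrough until flush = true
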
5. How can we clean up the locked VM 165 image without rebooting the host?
You can remove the lock with 'qm unlock <vmid>', but this doesn't unmap the rbd device from an earlier interrupted run.
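For this VM that would be roughly the following (the unmap only succeeds once nothing holds the device anymore):
Code:
qm unlock 165
rbd showmapped | grep vm-165-disk-3
rbd unmap /dev/rbd/storage1/vm-165-disk-3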
 
Please upgrade your system to the latest packages, as there have been numerous updates.
Upgrade is planned, will try to speed it up...

Unclear, but as you noted, the cluster has performance issues. I would suspect that some parts of the OSDs are swapped out and couldn't be read in time from the disks. Please upgrade.
https://tracker.ceph.com/issues/22464
We hit this bug sometimes (22464), but usually all OSDs are in place and green, with adequate latency

1 GbE is a bottleneck for a hyper-converged Proxmox VE + Ceph cluster, as the Ceph client traffic and the MON and OSD traffic all share the same network.
I know, I know... :( But we have what we have, and a _very_ limited budget.

librbd and krbd work a little differently from each other.
  • librbd has its own cache, while krbd uses the kernel page cache. The default cache size is 32 MiB for librbd.
    These can be set in the ceph.conf. http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/
  • Qemu needs a mapped device with krbd, while it can communicate directly with the cluster through librbd.
  • New features land in librbd faster than in krbd.
Well, the small librbd cache can explain the performance boost with krbd. We need to rerun the tests with the kernel page cache dropped. Maybe after the upgrade.
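
(For the record, dropping the page cache before a test run would be the usual:)
Code:
root@vhost4:~# sync; echo 3 > /proc/sys/vm/drop_caches    # flush dirty pages, then drop page cache, dentries and inodes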

But it seems clear that the kernel page cache (say 10G on the host) will be much more efficient compared to the small librbd cache. We can't make the librbd cache comparably large because it is bound to each image ((

Thank you
 
We hit this bug sometimes (22464), but usually all OSDs are in place and green, with adequate latency
The read runs into a timeout and is not retried. A retry of the read was introduced in Ceph 12.2.10, which works around this issue.

Well, the small librbd cache can explain the performance boost with krbd. We need to rerun the tests with the kernel page cache dropped. Maybe after the upgrade.
The page cache is used by every process; after a drop, the fastest process will fill it up again and performance will drop once more. To get a comparable test, the memory should already be filled.

But it seems clear that the kernel page cache (say 10G on the host) will be much more efficient compared to the small librbd cache. We can't make the librbd cache comparably large because it is bound to each image ((
As said above, the page cache can be taken by any process. The librbd cache is more controllable than the page cache; otherwise one VM could hog all the memory for its IO and starve the other VMs.
 
The librbd cache is more controllable than the page cache; otherwise one VM could hog all the memory for its IO and starve the other VMs.

Our usual case is 1-2 VMs doing IO-bound tasks while the others are idle. So the page cache seems OK for our needs - let it give more memory to the more loaded VM.

A shared cache and cache flooding can produce problems and are hard to test. A test with 3 VMs tells little about 15 VMs. Maybe we will divide images between krbd and librbd storages for high load and low load. We will think about it... But upgrade first :)

Thank you
 