VM lockups with Ceph

brad_mssw

I can consistently reproduce a VM lockup in my production environment.

I install a CentOS 7 guest on a ceph rbd storage pool, and within that VM, I run:

Code:
fio --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m

It never completes: sometimes the entire VM locks up, other times just the disk (which effectively means you can't do anything in the guest anyway).

When running the same test with a VM on 'local' storage, it works as expected.

This is a new production setup that we're trying to QA. Our test lab does NOT appear to exhibit this behavior but the machines are much slower and only use 1Gb networking, whereas this new production equipment is 10Gb.

Has anyone else seen anything like this? I'm not sure where to look. I'm running the latest pve-no-subscription and have tried both ceph firefly and giant. I've also attempted to back off to older versions of qemu from the repo, fiddled with cache settings, and set aio=threads; nothing I do seems to resolve the issue.
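
For reference, the cache and aio settings I've been toggling live on the VM's virtio disk line; it looks roughly like this (the VMID, storage name, and disk size here are just placeholder examples):

Code:
# /etc/pve/qemu-server/100.conf -- VMID, storage name and size are examples
virtio0: ceph-rbd:vm-100-disk-1,cache=writeback,aio=threads,size=32G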
 
How is your cpu load on the osd nodes when doing the benchmark?

Ceph can be cpu hungry on writes (the more iops, the more cpu), so with small 4k blocks and random I/O (where writeback doesn't help), it can be really high.

What is your ceph storage config? (pool replication level? cpu? disk journal? hdd/ssd?)
 
How is your cpu load on the osd nodes when doing the benchmark?
Ceph can be cpu hungry on writes (the more iops, the more cpu), so with small 4k blocks and random I/O (where writeback doesn't help), it can be really high.

I see the CPU go up briefly before it deadlocks; after that it sits at 0% CPU, and dumping the in-flight ops from ceph itself shows 0 on all OSDs during the deadlock.
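
For what it's worth, this is how I'm dumping the in-flight ops on each OSD (via the admin socket; osd.0 is just an example id):

Code:
# run on each OSD node, once per OSD id
ceph daemon osd.0 dump_ops_in_flight
# or, equivalently, straight through the admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight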


What is your ceph storage config? (pool replication level? cpu? disk journal? hdd/ssd?)

I've got 3 nodes, each with 1 OSD in the pool; the nodes are dual-purpose (ceph + proxmox).

Ceph is using size=2 min_size=1 standard replication pool.

There are 3 OSDs total in the pool; each is a 4-drive RAID-5 array of Intel DC S3700 SSDs that shows about 2.5GB/s read/write. The journal is on the OSD itself, not on a separate device. Rados bench shows about 800MB/s write and 2500MB/s read.
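
Those rados bench numbers come from runs roughly like this (the pool name here is just an example):

Code:
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq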

Hardware:

  • 3 identical nodes
    • Supermicro 2U Chassis with SAS3 expander 216BE2C-R920LPB: http://www.supermicro.com/products/chassis/2U/216/SC216BE2C-R920LP.cfm
    • Supermicro X10 Xeon E5-2600 v3 Motherboard X10DRI-O: http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRi.cfm
    • 2x Xeon E5-2630v3 8-core, 2.4GHz CPUs
    • 8x 16GB DDR4-2133 ECC Registered RAM (128GB total)
    • Intel Dual SFP+ 10Gbps NIC X520-DA2
    • 2x Intel DC S3500 80GB SSD (boot drives raid 1)
    • 4x Intel DC S3700 400GB SSD (fast VM storage raid 5)
    • 4x WD 2.5" Velociraptor 10kRPM 1TB HD (bulk VM storage raid 5)
    • LSI MegaRAID SAS 9361-8i Raid controller
    • LSI Supercap LSICVM02 BBU for raid controller
    • LSI BBU-BRACKET-05 pci-e slot mounting bracket for BBU
    • 2x CPU heatsink SNK-P0048PS
    • 2x 0.6m miniSAS HD cables
  • 2x Juniper EX4300 switches with a 4-port 10GbE SFP+ module in each, stacked (chassis cluster) using 40GbE stacking cables.
    • Each VM node is connected to both switches via 10GbE with SFP+ Direct Attach Cables, using a cross-switch LACP (802.3ad) bond for redundancy (we can survive a switch failure) and performance (we get 20Gbps); a rough sketch of the bond config is shown below the list.
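
The bond itself is defined in /etc/network/interfaces along these lines (interface names, bridge name, and addresses are illustrative, not our exact config):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eth2 eth3
        bond-mode 802.3ad
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0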
 
Also, if you use the pvetest repository, I have added a new option to the vm config:

iothread: 1

which allows using a dedicated thread inside qemu for the virtio disks, and it's helping a little bit.
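
To turn it on, either set the flag with qm or add the line to the VM config file by hand (the VMID here is just an example):

Code:
qm set 100 -iothread 1
# if your qm version doesn't accept the option yet, add this line
# to /etc/pve/qemu-server/100.conf directly:
#   iothread: 1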

I can definitely give that a try. I noticed someone commented on the ceph ticket as well and suggested trying krbd instead of librbd; is that also in pvetest?
 
I guess pve-no-subscription is basically the same as pvetest these days, since pvetest didn't show any updates available. Passing the iothread: 1 flag in the VM configuration did show modifications to the kvm command line, so I'll give that a shot.

Regarding the pve-storage git, what's the recommended way to test that? Just check it out, apply the patch, and then how do I build the .deb, just 'make deb' or something? Then I can just install that deb and try it, right?

Thanks!
 
Also, if you use the pvetest repository, I have added a new option to the vm config:

iothread: 1

which allows using a dedicated thread inside qemu for the virtio disks, and it's helping a little bit.

Hi Spirit,
that sounds interesting!
Is it also possible to use more threads? It looks to me like ceph only gets good transfer rates with many threads...

Udo
 
You would simply copy the *.pm files to your node(s), replacing any existing files, then restart the pvedaemon, pveproxy, and pvestatd services.
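
Something along these lines, assuming a stock install layout for the PVE perl modules:

Code:
# paths assume a default PVE install; adjust to wherever the patched files are
cp PVE/Storage.pm /usr/share/perl5/PVE/Storage.pm
cp PVE/Storage/*.pm /usr/share/perl5/PVE/Storage/
service pvedaemon restart
service pveproxy restart
service pvestatd restart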
 
The ceph guys are asking if we have debug symbols available for the qemu binary. Obviously I'm using pve-qemu-kvm, but I don't see a -dbg package for it that I can install. Is there one available somewhere?
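
If there isn't a prebuilt -dbg package, I suppose I could rebuild it with symbols left in; something like the following sketch, assuming the package uses the standard Debian build tooling (the repo URL and make target may have changed):

Code:
# rebuild pve-qemu-kvm without stripping debug symbols (sketch only)
git clone git://git.proxmox.com/git/pve-qemu-kvm.git
cd pve-qemu-kvm
DEB_BUILD_OPTIONS="nostrip noopt" make deb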
 
Hi Spirit,
that sounds interesting!
Is it also possible to use more threads? It looks to me like ceph only gets good transfer rates with many threads...

Udo

Yes, I'll send a patch soon to manage 1 iothread <-> 1 disk (currently in proxmox it's 1 iothread -> n disks).

But one limitation of qemu currently is that it supports at most 1 iothread per disk.
In the future (qemu 2.3, 2.4, ...) it'll be possible to use multiple iothreads for 1 disk (n iothreads <-> 1 disk).

One thing that is possible currently: assign the same disk multiple times and do some multipathing inside the guest.

Like this I have been able to reach 90000 iops with 1 disk (3x virtio disks at 30000 iops each + iothreads + krbd).
 
Yes, I'll send a patch soon to manage 1 iothread <-> 1 disk (currently in proxmox it's 1 iothread -> n disks).

But one limitation of qemu currently is that it supports at most 1 iothread per disk.
In the future (qemu 2.3, 2.4, ...) it'll be possible to use multiple iothreads for 1 disk (n iothreads <-> 1 disk).

One thing that is possible currently: assign the same disk multiple times and do some multipathing inside the guest.

Like this I have been able to reach 90000 iops with 1 disk (3x virtio disks at 30000 iops each + iothreads + krbd).

Using krbd appears to work without locking up. So that's the good news. However, this is much slower for me than librbd. Any idea what may be going on, since you seem to be reporting the opposite?

I'd expect to be getting at least 10k iops from this, but I'm getting 1/10th of that:

Code:
[root@centostest ~]# fio --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.1.11
Starting 16 processes
Jobs: 16 (f=16): [r(16)] [100.0% done] [6624KB/0KB/0KB /s] [1656/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=16): err= 0: pid=2183: Mon Nov 17 15:44:34 2014
  read : io=328372KB, bw=5469.8KB/s, iops=1367, runt= 60034msec
    slat (usec): min=1, max=41728, avg=11689.23, stdev=15607.16
    clat (usec): min=2, max=364410, avg=175287.85, stdev=44945.91
     lat (msec): min=5, max=368, avg=186.98, stdev=46.47
    clat percentiles (msec):
     |  1.00th=[   84],  5.00th=[   93], 10.00th=[  125], 20.00th=[  129],
     | 30.00th=[  163], 40.00th=[  165], 50.00th=[  169], 60.00th=[  188],
     | 70.00th=[  204], 80.00th=[  208], 90.00th=[  241], 95.00th=[  249],
     | 99.00th=[  285], 99.50th=[  289], 99.90th=[  314], 99.95th=[  326],
     | 99.99th=[  330]
    bw (KB  /s): min=  193, max=  572, per=6.24%, avg=341.13, stdev=53.62
    lat (usec) : 4=0.01%, 10=0.01%
    lat (msec) : 10=0.02%, 20=0.01%, 50=0.45%, 100=5.07%, 250=90.86%
    lat (msec) : 500=3.58%
  cpu          : usr=0.03%, sys=0.14%, ctx=75948, majf=0, minf=389
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=82093/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=328372KB, aggrb=5469KB/s, minb=5469KB/s, maxb=5469KB/s, mint=60034msec, maxt=60034msec

Disk stats (read/write):
  vda: ios=75808/4, merge=0/0, ticks=956035/76, in_queue=956239, util=99.82%
 
Hmm, when I add --direct=1 the numbers change completely. Maybe the test is invalid without --direct=1.
Code:
 fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.1.11
Starting 16 processes
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r(1),_(15)] [100.0% done] [64368KB/0KB/0KB /s] [16.1K/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=16): err= 0: pid=2127: Mon Nov 17 18:40:06 2014
  read : io=2048.0MB, bw=106611KB/s, iops=26652, runt= 19671msec
    slat (usec): min=0, max=57939, avg=333.34, stdev=2449.63
    clat (usec): min=200, max=97262, avg=8613.45, stdev=9925.26
     lat (usec): min=209, max=101891, avg=8947.16, stdev=10161.61
    clat percentiles (usec):
     |  1.00th=[  828],  5.00th=[ 1544], 10.00th=[ 2224], 20.00th=[ 3248],
     | 30.00th=[ 4048], 40.00th=[ 4832], 50.00th=[ 5664], 60.00th=[ 6688],
     | 70.00th=[ 8096], 80.00th=[10304], 90.00th=[14656], 95.00th=[40192],
     | 99.00th=[48896], 99.50th=[51968], 99.90th=[57600], 99.95th=[59136],
     | 99.99th=[64768]
    bw (KB  /s): min= 3530, max=13343, per=6.68%, avg=7118.17, stdev=1135.35
    lat (usec) : 250=0.01%, 500=0.14%, 750=0.55%, 1000=1.13%
    lat (msec) : 2=6.36%, 4=21.09%, 10=49.70%, 20=14.28%, 50=6.02%
    lat (msec) : 100=0.73%
  cpu          : usr=0.32%, sys=1.18%, ctx=153530, majf=0, minf=371
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=2048.0MB, aggrb=106611KB/s, minb=106611KB/s, maxb=106611KB/s, mint=19671msec, maxt=19671msec

Disk stats (read/write):
  sda: ios=524135/30, merge=117/10, ticks=2877109/237, in_queue=2878631, util=99.56%
 
direct=1 simply means skipping the VM's cache and going directly to storage. Since your ceph storage likely has a much better cache than the VM, this is what to expect.

BTW, a more appropriate setting would be --rwmixread=80, so the run actually includes some writes instead of being 100% reads.
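
In other words, something like this (your original job, just with direct I/O and an 80/20 read/write mix):

Code:
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=80 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m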
 
Using krbd appears to work without locking up. So that's the good news. However, this is much slower for me than librbd. Any idea what may be going on, since you seem to be reporting the opposite?


check in your ceph.conf:

[client]
rbd_cache = true

(it's enabled by default in giant now, but for firefly you need it). And configure writeback on your virtio disk.

This helps writes by aggregating them into bigger I/Os (mostly sequential ones, or random ones when the blocks are near each other and can be aggregated).
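
On the proxmox side, writeback is just the cache setting on the disk line, e.g. (storage name, VMID and size are examples):

Code:
virtio0: ceph-rbd:vm-100-disk-1,cache=writeback,size=32G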


Note that even with that, with direct I/O and without rbd_cache, I have better performance with krbd (this is on a full-SSD setup).
 
It looks like the 3.10.0-5-pve kernel is the culprit of my lockups. The ceph guys had me back off to the 2.6.32-34-pve kernel and I can't get it to lock up. Granted, I'm now taking about a 30% performance hit, but it's stable.
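
For anyone wanting to try the same downgrade, it was roughly this (the package name matches the kernel version above; double-check what your repository actually ships):

Code:
apt-get install pve-kernel-2.6.32-34-pve
# then reboot and select the 2.6.32 kernel at the boot menu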

I just need to decide if I should go with 2.6.32, or with 3.10 plus krbd. I did notice with krbd that live snapshots don't work with your patch, Spirit.

Really though, I doubt the kernel itself is at fault, but maybe I'm wrong; I suppose there could be some weird interaction between the glibc pthreads version and the kernel version.
 
