VM lockups with Ceph

brad_mssw

I can consistently reproduce a VM lockup in my production environment.

I install a CentOS 7 guest on a ceph rbd storage pool, and within that VM, I run:

Code:
fio --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m

It never completes: sometimes the entire VM locks up, other times just the disk (which effectively means you can't do anything in the guest anyway).

When running the same test with a VM on 'local' storage, it works as expected.

This is a new production setup that we're trying to QA. Our test lab does NOT appear to exhibit this behavior but the machines are much slower and only use 1Gb networking, whereas this new production equipment is 10Gb.

Has anyone else seen anything like this? I'm not sure where to look. I'm running the latest pve-no-subscription and have tried both ceph firefly and giant. I've also attempted to back off to older versions of qemu from the repo, fiddled with cache settings, and set aio=threads; nothing I do seems to resolve the issue.
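
For reference, the cache and aio settings I've been toggling live on the VM's virtio disk line; it looks roughly like this (the VMID, storage name, and disk size here are just placeholder examples):

Code:
# /etc/pve/qemu-server/100.conf -- VMID, storage name and size are examples
virtio0: ceph-rbd:vm-100-disk-1,cache=writeback,aio=threads,size=32G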
 
How is your cpu load on the osd nodes when doing the benchmark?

Ceph can be cpu hungry on writes (the more iops, the more cpu), so with small 4k blocks and random I/O (where writeback doesn't help), it can be really high.

What is your ceph storage config? (pool replication level? cpu? disk journal? hdd/ssd?)
 
How is your cpu load on the osd nodes when doing the benchmark?
Ceph can be cpu hungry on writes (the more iops, the more cpu), so with small 4k blocks and random I/O (where writeback doesn't help), it can be really high.

I see the CPU go up briefly before it deadlocks; after that it sits at 0% CPU, and dumping the in-flight ops from ceph itself shows 0 on all OSDs during the deadlock.
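
For what it's worth, this is how I'm dumping the in-flight ops on each OSD (via the admin socket; osd.0 is just an example id):

Code:
# run on each OSD node, once per OSD id
ceph daemon osd.0 dump_ops_in_flight
# or, equivalently, straight through the admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight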


What is your ceph storage config? (pool replication level? cpu? disk journal? hdd/ssd?)

I've got 3 nodes, each with 1 OSD in the pool; the nodes are dual-purpose (ceph + proxmox).

Ceph is using size=2 min_size=1 standard replication pool.

There are 3 OSDs total in the pool; each is a 4-drive RAID-5 array of Intel DC S3700 SSDs that shows about 2.5GB/s read/write. The journal is on the OSD itself, not on a separate device. Rados bench shows about 800MB/s write and 2500MB/s read.
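
Those rados bench numbers come from runs roughly like this (the pool name here is just an example):

Code:
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq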

Hardware:

  • 3 identical nodes
    • Supermicro 2U Chassis with SAS3 expander 216BE2C-R920LPB: http://www.supermicro.com/products/chassis/2U/216/SC216BE2C-R920LP.cfm
    • Supermicro X10 Xeon E5-2600 v3 Motherboard X10DRI-O: http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRi.cfm
    • 2x Xeon E5-2630v3 8-core, 2.4GHz CPUs
    • 8x 16GB DDR4-2133 ECC Registered RAM (128GB total)
    • Intel Dual SFP+ 10Gbps NIC X520-DA2
    • 2x Intel DC S3500 80GB SSD (boot drives raid 1)
    • 4x Intel DC S3700 400GB SSD (fast VM storage raid 5)
    • 4x WD 2.5" Velociraptor 10kRPM 1TB HD (bulk VM storage raid 5)
    • LSI MegaRAID SAS 9361-8i Raid controller
    • LSI Supercap LSICVM02 BBU for raid controller
    • LSI BBU-BRACKET-05 pci-e slot mounting bracket for BBU
    • 2x CPU heatsink SNK-P0048PS
    • 2x 0.6m miniSAS HD cables
  • 2x Juniper EX4300 switches with a 4-port 10GbE SFP+ module in each, stacked (chassis cluster) using 40GbE stacking cables.
    • Each VM node is connected to both switches via 10GbE with SFP+ Direct Attach Cables, using a cross-switch LACP (802.3ad) bond for redundancy (we can survive a switch failure) and performance (we get 20Gbps); a rough sketch of the bond config is shown below the list.
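
The bond itself is defined in /etc/network/interfaces along these lines (interface names, bridge name, and addresses are illustrative, not our exact config):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eth2 eth3
        bond-mode 802.3ad
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0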
 
Also, if you use the pvetest repository, I have added a new option to the vm config:

iothread: 1

which allows using a dedicated thread inside qemu for the virtio disks, and it's helping a little bit.
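
To turn it on, either set the flag with qm or add the line to the VM config file by hand (the VMID here is just an example):

Code:
qm set 100 -iothread 1
# if your qm version doesn't accept the option yet, add this line
# to /etc/pve/qemu-server/100.conf directly:
#   iothread: 1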

I can definitely give that a try. I noticed someone commented on the ceph ticket as well and suggested trying krbd instead of librbd; is that also in pvetest?
 
I guess pve-no-subscription is basically the same as pvetest these days, since pvetest didn't show any updates available. Passing the iothread: 1 flag in the VM configuration did show modifications to the kvm command line, so I'll give that a shot.

Regarding the pve-storage git, what's the recommended way to test that? Just check it out, apply the patch, and then how do I build the .deb, just 'make deb' or something? Then I can just install that deb and try it, right?

Thanks!
 
Also, if you use the pvetest repository, I have added a new option to the vm config:

iothread: 1

which allows using a dedicated thread inside qemu for the virtio disks, and it's helping a little bit.

Hi Spirit,
that sounds interesting!
Is it also possible to use more threads? It looks to me like ceph only gets good transfer rates with many threads...

Udo
 
You would simply copy the *.pm files to your node(s), replacing any existing files, then restart the pvedaemon, pveproxy, and pvestatd services.
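
Something along these lines, assuming a stock install layout for the PVE perl modules:

Code:
# paths assume a default PVE install; adjust to wherever the patched files are
cp PVE/Storage.pm /usr/share/perl5/PVE/Storage.pm
cp PVE/Storage/*.pm /usr/share/perl5/PVE/Storage/
service pvedaemon restart
service pveproxy restart
service pvestatd restart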
 
The ceph guys are asking if we have debug symbols available for the qemu binary. Obviously I'm using pve-qemu-kvm, but I don't see a -dbg package for it that I can install. Is there one available somewhere?
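
If there isn't a prebuilt -dbg package, I suppose I could rebuild it with symbols left in; something like the following sketch, assuming the package uses the standard Debian build tooling (the repo URL and make target may have changed):

Code:
# rebuild pve-qemu-kvm without stripping debug symbols (sketch only)
git clone git://git.proxmox.com/git/pve-qemu-kvm.git
cd pve-qemu-kvm
DEB_BUILD_OPTIONS="nostrip noopt" make deb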
 
Hi Spirit,
that sounds interesting!
Is it also possible to use more threads? It looks to me like ceph only gets good transfer rates with many threads...

Udo

Yes, I'll send a patch soon to manage 1 iothread <-> 1 disk (currently in proxmox it's 1 iothread -> n disks).

But one limitation of qemu currently is that it supports at most 1 iothread per disk.
In the future (qemu 2.3, 2.4, ...) it'll be possible to use multiple iothreads for 1 disk (n iothreads <-> 1 disk).

One thing that is possible currently: assign the same disk multiple times and do some multipathing inside the guest.

Like this I have been able to reach 90000 iops with 1 disk (3x virtio disks at 30000 iops each + iothreads + krbd).
 
Yes, I'll send a patch soon to manage 1 iothread <-> 1 disk (currently in proxmox it's 1 iothread -> n disks).

But one limitation of qemu currently is that it supports at most 1 iothread per disk.
In the future (qemu 2.3, 2.4, ...) it'll be possible to use multiple iothreads for 1 disk (n iothreads <-> 1 disk).

One thing that is possible currently: assign the same disk multiple times and do some multipathing inside the guest.

Like this I have been able to reach 90000 iops with 1 disk (3x virtio disks at 30000 iops each + iothreads + krbd).

Using krbd appears to work without locking up. So that's the good news. However, this is much slower for me than librbd. Any idea what may be going on, since you seem to be reporting the opposite?

I'd expect to be getting at least 10k iops from this, but I'm getting 1/10th of that:

Code:
[root@centostest ~]# fio --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.1.11
Starting 16 processes
Jobs: 16 (f=16): [r(16)] [100.0% done] [6624KB/0KB/0KB /s] [1656/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=16): err= 0: pid=2183: Mon Nov 17 15:44:34 2014
  read : io=328372KB, bw=5469.8KB/s, iops=1367, runt= 60034msec
    slat (usec): min=1, max=41728, avg=11689.23, stdev=15607.16
    clat (usec): min=2, max=364410, avg=175287.85, stdev=44945.91
     lat (msec): min=5, max=368, avg=186.98, stdev=46.47
    clat percentiles (msec):
     |  1.00th=[   84],  5.00th=[   93], 10.00th=[  125], 20.00th=[  129],
     | 30.00th=[  163], 40.00th=[  165], 50.00th=[  169], 60.00th=[  188],
     | 70.00th=[  204], 80.00th=[  208], 90.00th=[  241], 95.00th=[  249],
     | 99.00th=[  285], 99.50th=[  289], 99.90th=[  314], 99.95th=[  326],
     | 99.99th=[  330]
    bw (KB  /s): min=  193, max=  572, per=6.24%, avg=341.13, stdev=53.62
    lat (usec) : 4=0.01%, 10=0.01%
    lat (msec) : 10=0.02%, 20=0.01%, 50=0.45%, 100=5.07%, 250=90.86%
    lat (msec) : 500=3.58%
  cpu          : usr=0.03%, sys=0.14%, ctx=75948, majf=0, minf=389
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=99.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=82093/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=328372KB, aggrb=5469KB/s, minb=5469KB/s, maxb=5469KB/s, mint=60034msec, maxt=60034msec

Disk stats (read/write):
  vda: ios=75808/4, merge=0/0, ticks=956035/76, in_queue=956239, util=99.82%
 
Hmm, when I add --direct=1 the numbers change completely. Maybe the test is invalid without --direct=1.
Code:
 fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.1.11
Starting 16 processes
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
4ktest: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r(1),_(15)] [100.0% done] [64368KB/0KB/0KB /s] [16.1K/0/0 iops] [eta 00m:00s]
4ktest: (groupid=0, jobs=16): err= 0: pid=2127: Mon Nov 17 18:40:06 2014
  read : io=2048.0MB, bw=106611KB/s, iops=26652, runt= 19671msec
    slat (usec): min=0, max=57939, avg=333.34, stdev=2449.63
    clat (usec): min=200, max=97262, avg=8613.45, stdev=9925.26
     lat (usec): min=209, max=101891, avg=8947.16, stdev=10161.61
    clat percentiles (usec):
     |  1.00th=[  828],  5.00th=[ 1544], 10.00th=[ 2224], 20.00th=[ 3248],
     | 30.00th=[ 4048], 40.00th=[ 4832], 50.00th=[ 5664], 60.00th=[ 6688],
     | 70.00th=[ 8096], 80.00th=[10304], 90.00th=[14656], 95.00th=[40192],
     | 99.00th=[48896], 99.50th=[51968], 99.90th=[57600], 99.95th=[59136],
     | 99.99th=[64768]
    bw (KB  /s): min= 3530, max=13343, per=6.68%, avg=7118.17, stdev=1135.35
    lat (usec) : 250=0.01%, 500=0.14%, 750=0.55%, 1000=1.13%
    lat (msec) : 2=6.36%, 4=21.09%, 10=49.70%, 20=14.28%, 50=6.02%
    lat (msec) : 100=0.73%
  cpu          : usr=0.32%, sys=1.18%, ctx=153530, majf=0, minf=371
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=2048.0MB, aggrb=106611KB/s, minb=106611KB/s, maxb=106611KB/s, mint=19671msec, maxt=19671msec

Disk stats (read/write):
  sda: ios=524135/30, merge=117/10, ticks=2877109/237, in_queue=2878631, util=99.56%
 
direct=1 simply means skipping the VM's cache and going directly to storage. Since your ceph storage likely has a much better cache than the VM, this is what to expect.

BTW, a more appropriate setting would be --rwmixread=80, so the run actually includes some writes instead of being 100% reads.
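
In other words, something like this (your original job, just with direct I/O and an 80/20 read/write mix):

Code:
fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=80 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m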
 
Using krbd appears to work without locking up. So that's the good news. However, this is much slower for me than librbd. Any idea what may be going on, since you seem to be reporting the opposite?


check in your ceph.conf:

[client]
rbd_cache = true

(it's enabled by default in giant now, but for firefly you need it). And configure writeback on your virtio disk.

This helps writes by aggregating them into bigger I/Os (mostly sequential ones, or random ones when the blocks are near each other and can be aggregated).
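
On the proxmox side, writeback is just the cache setting on the disk line, e.g. (storage name, VMID and size are examples):

Code:
virtio0: ceph-rbd:vm-100-disk-1,cache=writeback,size=32G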


Note that even with that, with direct I/O and without rbd_cache, I have better performance with krbd (this is on a full-SSD setup).
 
It looks like the 3.10.0-5-pve kernel is the culprit of my lockups. The ceph guys had me back off to the 2.6.32-34-pve kernel and I can't get it to lock up. Granted, I'm now taking about a 30% performance hit, but it's stable.
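
For anyone wanting to try the same downgrade, it was roughly this (the package name matches the kernel version above; double-check what your repository actually ships):

Code:
apt-get install pve-kernel-2.6.32-34-pve
# then reboot and select the 2.6.32 kernel at the boot menu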

I just need to decide if I should go with 2.6.32, or with 3.10 plus krbd. I did notice with krbd that live snapshots don't work with your patch, Spirit.

Really though, I doubt the kernel itself is at fault, but maybe I'm wrong; I suppose there could be some weird interaction between the glibc pthreads version and the kernel version.
 
