I can consistently reproduce a VM lockup in my production environment.
I install a CentOS 7 guest on a ceph rbd storage pool, and within that VM, I run:
Code:
fio --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m
It will not complete; sometimes the entire VM locks up, and other times just the disk (which means you really can't do anything after that anyway).
When running the same test with a VM on 'local' storage, it works as expected.
This is a new production setup that we're trying to QA. Our test lab does NOT appear to exhibit this behavior, but those machines are much slower and only use 1Gb networking, whereas the new production equipment is 10Gb.
Has anyone else seen anything like this? I'm not sure where to look. I'm running the latest pve-no-subscription and have tried both ceph firefly and giant. I've also tried backing off to older versions of qemu from the repo, fiddling with cache settings, and setting aio=threads, but nothing I do seems to resolve the issue.
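For reference, the cache and aio changes were applied on the VM's rbd-backed disk roughly like the sketch below (hypothetical values: VM ID 100, a storage named 'ceph_rbd', and a virtio disk; the actual IDs and pool name will differ in my setup):

Code:
# hypothetical example: toggle cache mode and aio on the rbd-backed virtio disk
qm set 100 --virtio0 ceph_rbd:vm-100-disk-1,cache=writeback,aio=threads

# the corresponding line in /etc/pve/qemu-server/100.conf then looks roughly like:
# virtio0: ceph_rbd:vm-100-disk-1,cache=writeback,aio=threads,size=32G

The VM was stopped and started (not just rebooted from inside the guest) after each change so the new QEMU drive options actually took effect.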