VM IO freezes for 15 seconds when a Ceph node reboots gracefully

Kelvin Kam
Aug 8, 2017
Hi, I have built a hyper-converged 3-node cluster with Proxmox, using Ceph for shared storage.
However, I observed that VM IO hangs for about 15 seconds when one Ceph node performs a graceful reboot. I have already tried the reboot procedure from Red Hat, setting the noout and norebalance flags, but no luck.
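For reference, the flag steps I followed were roughly this (a sketch of the standard procedure, not an exact transcript):

# before rebooting the node
ceph osd set noout
ceph osd set norebalance
# ... reboot the node and wait for it to rejoin ...
ceph osd unset noout
ceph osd unset norebalance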

Could anyone tell me if the freeze period can be decreased? Thank you very much.
 
How many monitors, managers and OSDs do you have?
3 monitors and 3 managers
7 OSDs in total, but separated into two pools with CRUSH rules; size is configured to 2/1.
Node 1: 2 OSDs (1 in the affected pool)
Node 2: 3 OSDs (2 in the affected pool)
Node 3: 3 OSDs (2 in the affected pool)
*affected pool means the pool that stores the testing VM
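The current replication settings can be confirmed per pool, e.g. (pool name is a placeholder):

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size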
 
7 OSDs in total, but separated into two pools with CRUSH rules; size is configured to 2/1.
A min_size of 1 is dangerous. In-flight data might not be written out to a PG, leaving no copy available.

However, I observed that VM IO hangs for about 15 seconds when one Ceph node performs a graceful reboot.
There will always be some time until a new primary OSD is selected. If you haven't already, try setting the disk cache to writeback. This might help minimize the effect.
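For example (placeholders, since I don't know your storage and disk names; re-specify the disk with the cache option added):

qm set <vmid> --virtio0 <storage>:<volume>,cache=writeback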
 
A min_size of 1 is dangerous. In-flight data might not be written out to a PG, leaving no copy available.
Thank you for your reply, so 3/2 is recommended even for a small setup? Would 2/2 provide better performance?

There will always be some time until a new primary OSD is selected. If you haven't already, try setting the disk cache to writeback. This might help minimize the effect.
I will try this later. But does that mean the freeze time should be shorter if the rebooting node is not assigned as the primary OSD?
 
Thank you for your reply, so 3/2 is recommended even for a small setup? Would 2/2 provide better performance?
Yes. But with 2/2, on node/OSD failure the pool will go into read-only mode until all replicas have been recovered.
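For example, to move a pool to 3/2 (pool name is a placeholder):

ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2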

I will try this later. But does that mean the freeze time should be shorter if the rebooting node is not assigned as the primary OSD?
It should be, but since Ceph has no notion of locality, the primaries are distributed across the cluster.
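If you want to check, you can map an object to its acting set (pool and object names are placeholders); the OSD marked with "p" in the output is the primary:

ceph osd map <pool> <object>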
 
There will always be some time until a new primary OSD is selected. If you haven't already, try setting the disk cache to writeback. This might help minimize the effect.
Hi Alwin, I think it only reduces the write freeze by 2-3 seconds; reads still freeze when the node initiates the reboot...
 
I don't think that you can get rid of it completely, especially in a small cluster. Can you post a config of your VM, qm config <id>?
 
I don't think that you can get rid of it completely, especially in a small cluster. Can you post a config of your VM, qm config <id>?
Please find the VM config below. It is currently a 3-node cluster and we may expand to 5 nodes in the future; I am just afraid we will face the same issue after expanding the cluster...

ceph-dcssd is the system OS disk; it is a Ceph pool (3/2) using Samsung SM863a 240G SSDs.

root@PVE01:~# qm config 100
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: WS2016
net0: virtio=4E:7E:30:05:60:CD,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=bfb0b6b1-5128-4241-9e18-2aa1030db4e0
sockets: 1
virtio0: ceph-dcssd:vm-100-disk-2,cache=writeback,size=60G
virtio1: ceph-crssd:vm-100-disk-0,cache=writeback,size=16G
virtio2: local-lvm:vm-100-disk-0,size=32G
virtio3: ceph-crssd:vm-100-disk-1,size=32G
virtio4: ceph-dcssd:vm-100-disk-1,size=32G
vmgenid: 936817f8-e536-4609-b164-ba8495fcb85c
 
Try scsi disks instead of virtio, and add discard so a trim inside the VM is passed through to Ceph. That means less data for Ceph to read.

Another option is iothread, which allows QEMU to open a dedicated IO thread per disk.
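A sketch using your VM 100 config (double-check before running; note that iothread on scsi disks needs the virtio-scsi-single controller):

# detach the virtio disk (it becomes an "unused" disk, the volume is kept)
qm set 100 --delete virtio0
# re-attach it on scsi with discard and iothread enabled
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 ceph-dcssd:vm-100-disk-2,cache=writeback,discard=on,iothread=1
qm set 100 --bootdisk scsi0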
 
Try scsi disks instead of virtio, and add discard so a trim inside the VM is passed through to Ceph. That means less data for Ceph to read.

Another option is iothread, which allows QEMU to open a dedicated IO thread per disk.
I gave SCSI and iothread a try; SCSI did provide better performance than virtio, but neither resolved the 15-second freeze issue. I am wondering if there are any timeout settings in Ceph that might help with it.
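For example, would the OSD heartbeat options be relevant here? A sketch of what I mean in ceph.conf (defaults shown; I have not verified whether lowering them is safe):

[osd]
# how long peers wait for a missed heartbeat before reporting an OSD down
osd heartbeat grace = 20
# interval between heartbeat pings to peer OSDs
osd heartbeat interval = 6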
 
I had the same issue, but after changing sysctl values in the VM and applying the settings recommended by Alwin, I am not facing it any more.
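For example, something along these lines in /etc/sysctl.conf (illustrative values only, not necessarily the ones used here; tune for your workload):

# allow the guest to buffer more dirty pages, so a short storage
# stall is absorbed instead of blocking writers immediately
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10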
 
