MySQL performance issue on Proxmox with Ceph

Mar 27, 2021
Hi community, we need some help with MySQL performance improvement.

We have recently migrated MySQL DB (about 20GB) from bare metal to VM on Proxmox.
Shortly after the DBA team started complaining about performance issues (slow queries and overall slowness).
They have also run a variety of tests with sysbench, comparing bare metal performance against a test VM.
The sysbench results on the VM are extremely poor (150K QPS on bare metal vs 1.5K QPS on the VM).
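For reference, a comparable sysbench OLTP read/write run looks roughly like this - a sketch only, the host, credentials, table count and thread count are placeholders rather than the exact parameters the DBA team used (the sbtest schema must already exist):
Code:
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
  --tables=10 --table-size=1000000 --threads=16 --time=60 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
  --tables=10 --table-size=1000000 --threads=16 --time=60 run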

We have had issues with Ceph before, so we were naturally inclined to avoid it.
The test VM was moved to a local-zfs volume (a pair of SSDs in a mirror that PVE boots from). Side note: moving the VM disk from Ceph to local-zfs caused random reboots.
However, the sysbench results were better but still far off from the physical box (150K vs 4.5K QPS).
We tested all the usual VM options (balloon=0, NUMA=1, cache=writeback, iothread=1 with VirtIO SCSI single, SSD=1, discard=on) - no noticeable difference.

Next, we went into testing Ceph IO performance using fio and landed on some interesting findings.
The fio command we used is the following (a typical RDBMS workload, according to our research):
Code:
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1

Ubuntu20.04        Jobs: 1 (f=1): [w(1)][100.0%][w=956KiB/s][w=239 IOPS][eta 00m:00s]
CentOS7            Jobs: 1 (f=1): [w(1)][100.0%][w=32.7MiB/s][w=8363 IOPS][eta 00m:00s]

To our surprise, the Ubuntu server is not doing well at all; a Debian VM returns identical results.
Our DBA team tested the DB on the CentOS VM but reported no change at all - neither in query runtimes nor in sysbench results.

Does anyone have experience with this?
What else can we do to improve the performance?
Any other testing that may be relevant?

Many thanks!
 
Hello again,
VirtIO? VioSCSI?

Please share your VM config...
 
Hi there, hope all is in order ;)

Here are the VM configs:

Ubuntu VM
Code:
agent: 1,fstrim_cloned_disks=1
balloon: 0
boot: order=scsi0;net0
cores: 32
memory: 73728
name: dev-rms113
net0: virtio=02:AB:5C:C3:42:7E,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
scsi0: ceph:vm-313-disk-0,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph:vm-313-disk-1,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsi2: ceph:vm-313-disk-2,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1bb1dd54-c3f8-48bc-b24f-f23eabca900b
sockets: 1
vmgenid: 56186688-5477-47ce-942d-9f5cf69d456f

CentOS VM
Code:
agent: 1,fstrim_cloned_disks=1
balloon: 0
boot: order=scsi0
cores: 32
memory: 65536
meta: creation-qemu=6.1.0,ctime=1643203692
name: dev-rms114
net0: virtio=2A:95:0C:92:5D:15,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
parent: base_config
scsi0: ceph:vm-314-disk-0,cache=writeback,discard=on,iothread=1,size=60G,ssd=1
scsi1: ceph:vm-314-disk-1,cache=writeback,discard=on,iothread=1,size=50G,ssd=1
scsi2: ceph:vm-314-disk-2,cache=writeback,discard=on,iothread=1,size=50G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=b2864fa7-4964-4efd-a463-f2ef9bda9a8c
sockets: 1
vmgenid: f78de09f-7b84-4e9e-aada-95f89480a4c9
 
Please try disabling the cache for the scsi0, 1 and 2 disks... set them to "no cache", because sometimes VM guests get weird on Ceph.
By design, the emulated controller on Proxmox does not really allow writeback when using Ceph; Ceph will always force a commit on all disks.
For example, 3 copies means waiting until all of them are confirmed by Ceph as "written".
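Something like this should switch the disks over - using the VMID and volume names from your config above, so double-check before applying (or do it in the GUI). The new cache mode only takes effect after a full stop/start of the VM:
Code:
qm set 313 --scsi0 ceph:vm-313-disk-0,cache=none,discard=on,iothread=1,ssd=1
qm set 313 --scsi1 ceph:vm-313-disk-1,cache=none,discard=on,iothread=1,ssd=1
qm set 313 --scsi2 ceph:vm-313-disk-2,cache=none,discard=on,iothread=1,ssd=1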

Test again then.
And please post your ceph.conf.
 
We were on defaults when we started, then began enabling various options and measuring the difference.
Just re-ran the tests with cache=none; CentOS is down to the Ubuntu numbers - conclusion: CentOS leverages cache=writeback, Ubuntu does not.
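If anyone wants to compare, the guest-side settings worth checking on both VMs are roughly the following (sdc is the fio test disk in our case; the datadir path is just the MySQL default):
Code:
# IO scheduler the guest picked for the virtual disk
cat /sys/block/sdc/queue/scheduler
# write cache mode as seen from inside the guest
cat /sys/block/sdc/queue/write_cache
# mount options (barriers, flush behaviour) of the filesystem holding the DB
findmnt -T /var/lib/mysql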

ceph.conf is at the defaults; we like to keep it that way unless changes are absolutely needed.

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.12.11/24
     fsid = 3a228350-a4e8-4cb9-820e-468230905269
     mon_allow_pool_delete = true
     mon_host = 10.10.12.11 10.10.12.12 10.10.12.21 10.10.12.22 10.10.12.10
     ms_bind_ipv4 = true
     osd_pool_default_min_size = 2
     osd_pool_default_size = 4
     public_network = 10.10.12.11/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve11]
     public_addr = 10.10.12.11

[mon.pve12]
     public_addr = 10.10.12.12

[mon.pve21]
     public_addr = 10.10.12.21

[mon.pve22]
     public_addr = 10.10.12.22

[mon.pve31]
     public_addr = 10.10.12.10

Our Ceph setup is stretched between two datacenters, which automatically means higher IO latency than a single-site cluster.
However, I do not want to deep-dive into Ceph too much, since the DBAs are also having trouble running MySQL on the local-zfs volume.

I was more hoping to hear whether anyone else has struggled with MySQL on PVE and managed to tune it accordingly.

Thanks
 
Well, if you just want to "tune" MySQL, there are plenty of parameters to extend caches and buffers.
But this has to be done in steps, with careful before/after comparison on pre-defined queries where you know the exact times.
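As a starting point, the usual suspects look like this - the values are placeholders that have to be sized to your 64-72 GB VMs and verified query by query, not recommendations:
Code:
[mysqld]
# main InnoDB cache; often 50-70% of RAM on a dedicated DB VM
innodb_buffer_pool_size = 48G
# larger redo log so checkpoints are less frequent
innodb_log_file_size = 2G
# 1 = fsync per commit (safest); 2 trades durability for fewer flushes
innodb_flush_log_at_trx_commit = 1
# O_DIRECT avoids double-buffering through the guest page cache
innodb_flush_method = O_DIRECT
# background flushing budget, roughly matched to what the storage can do
innodb_io_capacity = 2000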

If I get this right, you assigned 32 cores to the VMs? What host CPU are you using?
 
If done right, you can gain up to 170% IOPS in Ceph by setting the right parameters for the use case.

However, in the end, latency in Ceph is the key. If you have 2 DCs and the latency is high, you can do nearly whatever you want - it will never get fast in IOPS...
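To see the raw Ceph latency without any VM layer in between, run something like this directly on a PVE node (the pool name is an example; use a dedicated test pool):
Code:
# single-threaded 4K writes straight against the pool - worst case for latency
rados bench -p testpool 60 write -b 4096 -t 1 --no-cleanup
rados -p testpool cleanup
# per-OSD commit/apply latency while the bench is running
ceph osd perf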
 
Host CPU is (dual socket) Intel Xeon Platinum 8160.

What tuning would you suggest for Ceph? If there is room for improvement, I am totally interested ;)
Latency between the DCs is about 1.5ms, which is on the high side but not extreme.

I am still not convinced we have an IO problem; the IO on the prod server is minimal (the following is from the innotop command):

[innotop screenshot: IO on the prod server is minimal]
 
So you have a CPU that is not very fast single-threaded, and you assigned 32 cores across 2 NUMA nodes of 16 to the guest, I guess?
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Platinum+8160+@+2.10GHz&id=3085

And this is hyper-converged with Ceph? After latency, Ceph needs single-thread performance more than anything else to reach high IO.
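One cheap thing to check on the hosts is whether the cores actually run at full clock - a quick sketch (run as root on each PVE node; the governor change is not persistent across reboots):
Code:
# current governor per core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# switch everything to "performance"
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor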

I am not sure this can be solved by the forum; it looks more like you might need a consultant. :confused:

You can also have a look here https://media-www.micron.com/-/medi...df?la=en&rev=5d9ee3b76bff4b728e018895c384071f where Micron showed what can be done with the NVMe you are using on an EPYC Rome. There are also some parameters shown, but I am not sure if they can be adapted to your hardware...
 
Ceph is on Micron 7300 Pro NVMe
Local ZFS is on Micron 5300 Pro SSDs

This is what we have on each server.
[attachment 33937: per-server disk list]

I would have expected them to run like lightning, all enterprise with PLP.

Have you tried without spreading across the DCs? I believe the writes are synchronous and won't be acknowledged until the writes to the replica OSDs have been acknowledged too, even when that means going across to the secondary DC.

What's the purpose of spreading across the DCs - just redundancy, or more compute over there too?
 
Just a secondary thought, maybe try setting up a crush map that only includes the nodes in the one DC, create a pool with that map and see how it performs...
 
Yes, Ceph uses synchronous writes and will only ack when the writes are done on all OSDs.
We have 3 identical nodes in both DCs - so both redundancy and compute.
Re-running the test in a single DC now and will share the results shortly (the crush map already accommodates this; we just need an additional pool with 2 replicas).
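For reference, the single-DC test pool boils down to something like this (rule, bucket and pool names are examples; "dc1" assumes a per-DC bucket already exists in the crush map, as it does in our case):
Code:
# replicated rule restricted to the dc1 bucket, host as failure domain
ceph osd crush rule create-replicated dc1-only dc1 host
# test pool with 2 copies using that rule
ceph osd pool create bench-dc1 128 128 replicated dc1-only
ceph osd pool set bench-dc1 size 2
rbd pool init bench-dc1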
 
Three different fio commands - 4M and 4K block sizes with different iodepth (queue depth):
Code:
fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=16 -rw=randwrite -runtime=60 -filename=/dev/sdc
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdc

4 replicas - 2xDC
Code:
Jobs: 1 (f=1): [w(1)][100.0%][w=1469MiB/s][w=367 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=960KiB/s][w=240 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=108MiB/s][w=27.7k IOPS][eta 00m:00s]

2 replicas - 1xDC
Code:
Jobs: 1 (f=1): [w(1)][100.0%][w=2134MiB/s][w=533 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=1552KiB/s][w=388 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=208MiB/s][w=53.2k IOPS][eta 00m:00s]

My (maybe lame) conclusions:
1. The IOPS and bandwidth are there, and they are better in the single-DC setup (expected)
2. The IOPS and bandwidth only show up in the high-queue-depth scenarios
3. The CentOS7 VM from above (cache=writeback) is still outperforming the 1xDC setup

Code:
Ubuntu20.04(1xDC)     Jobs: 1 (f=1): [w(1)][100.0%][w=1552KiB/s][w=388 IOPS][eta 00m:00s]
CentOS7(writeback)    Jobs: 1 (f=1): [w(1)][100.0%][w=32.7MiB/s][w=8363 IOPS][eta 00m:00s]

It almost looks like a single VM is limited to one thread towards Ceph; not sure if this can be manipulated.
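One way to check that would be to widen the same 4k sync write over several parallel jobs and compare the aggregate - a sketch:
Code:
fio -ioengine=libaio -direct=1 -sync=1 -name=test -bs=4k -iodepth=1 -numjobs=8 -group_reporting -rw=randwrite -runtime=60 -filename=/dev/sdc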

Thanks folks, much appreciated.
 
You are losing 40 to 50% performance due to 2xDC and 4 replicas... I wouldn't expect that...
Again, this sounds like latency problems...

When I compare this to our 2xDC setup, we lose around 5% with 4 replicas across DCs.
You really need to get that latency down.

What the Ubuntu guest shows is, independent of that, really weird...
 
What latency do you have between the DCs? Are you using jumbo frames?

Here's mine:
Code:
40 packets transmitted, 40 received, 0% packet loss, time 39038ms
rtt min/avg/max/mdev = 1.041/1.101/1.152/0.029 ms
 
Code:
40 packets transmitted, 40 received, 0% packet loss, time 39935ms
rtt min/avg/max/mdev = 0.011/0.014/0.019/0.002 ms

Yes Jumbo 9000....
 
I will give jumbo frames a try... they are supported in our infra, but so far I have been hesitant.
I know "ceph loves jumbo frames" and have seen the PVE team using them while benchmarking.
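On the PVE side this should just be an MTU bump on the NIC carrying the Ceph network in /etc/network/interfaces (the interface name below is a placeholder) plus matching switch ports, verified with a don't-fragment ping:
Code:
# /etc/network/interfaces - the interface on the 10.10.12.0/24 Ceph network
auto enp65s0f0
iface enp65s0f0 inet static
        address 10.10.12.11/24
        mtu 9000

# verify end to end: 8972 bytes payload + 28 bytes headers = 9000, no fragmentation
ping -M do -s 8972 10.10.12.12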

I cannot get rid of the latency - that would mean removing one of the datacenters, which is a no-go.

From all that was shared, I guess I need to rename the post to "what is wrong with Ubuntu on PVE and why it does not react to writeback" o_O
 
