MySQL performance issue on Proxmox with Ceph

Mar 27, 2021
Hi community, I need some help improving MySQL performance.

We recently migrated a MySQL DB (about 20GB) from bare metal to a VM on Proxmox.
Shortly after, the DBA team started complaining about performance issues (slow queries and overall slowness).
They have also run a variety of tests using sysbench, comparing the bare metal performance against a test VM.
The sysbench results on the VM are extremely bad (150K QPS on bare metal vs 1,500 QPS on the VM).

We had issues with Ceph before, so our first instinct was to take it out of the picture.
The test VM was moved to a local-zfs volume (a mirrored pair of SSDs that PVE also boots from). Side note: moving the VM disk from Ceph to local-zfs caused random reboots.
The sysbench results were better but still far off the physical box (150K QPS on bare metal vs 4.5K QPS on the VM).
We tested with every VM option we could think of (balloon=0, NUMA=1, cache=writeback, iothread=1 with VirtIO SCSI single, SSD=1, discard=on) - no noticeable difference.

Next, we went into testing Ceph IO performance using fio and landed on some interesting findings.
The fio command we used is the following (typical RDBMS workload according to our research).
Code:
command: fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1

Ubuntu20.04        Jobs: 1 (f=1): [w(1)][100.0%][w=956KiB/s][w=239 IOPS][eta 00m:00s]
CentOS7            Jobs: 1 (f=1): [w(1)][100.0%][w=32.7MiB/s][w=8363 IOPS][eta 00m:00s]

To our surprise, the Ubuntu server is not doing well at all; Debian returns identical results.
Our DBA team tested the DB on the CentOS VM but reported no change at all, in either query run times or sysbench results.
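
A rough Python sketch for turning the fio status lines above into comparable numbers. The helper name `parse_fio_status` and the regular expression are my own and only cover the write fields shown here. Since the test runs with iodepth=1, only one write is in flight at a time, so 1/IOPS is a fair estimate of the time each write takes.

```python
import re

# Multipliers for the binary units fio prints in its status line.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

STATUS_RE = re.compile(
    r"\[w=(?P<bw>[\d.]+)(?P<unit>B|KiB|MiB|GiB)/s\]"
    r"\[w=(?P<iops>[\d.]+)(?P<kilo>k?) IOPS\]"
)

def parse_fio_status(line):
    """Return (bytes_per_second, iops) from a fio write status line."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("no write bandwidth/IOPS fields found")
    bw = float(m["bw"]) * UNITS[m["unit"]]
    iops = float(m["iops"]) * (1000 if m["kilo"] else 1)
    return bw, iops

ubuntu = "Jobs: 1 (f=1): [w(1)][100.0%][w=956KiB/s][w=239 IOPS][eta 00m:00s]"
centos = "Jobs: 1 (f=1): [w(1)][100.0%][w=32.7MiB/s][w=8363 IOPS][eta 00m:00s]"

for name, line in (("Ubuntu20.04", ubuntu), ("CentOS7", centos)):
    bw, iops = parse_fio_status(line)
    # One write in flight at a time, so 1/IOPS approximates the time per write.
    print(f"{name}: {bw / iops:.0f} bytes per I/O, {1000 / iops:.2f} ms per write")
```

By this estimate the Ubuntu VM spends a bit over 4 ms on each 4k write, while the CentOS VM spends roughly 0.12 ms.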

Does anyone have experience with this?
What else can we do to improve performance?
Any other testing that may be relevant?

Many thanks!
 
Hello again,
VirtIO? VioSCSI?

Please share your VM config...
 
Hi there, hope all is in order ;)

Here's the VM config

Ubuntu VM
Code:
agent: 1,fstrim_cloned_disks=1
balloon: 0
boot: order=scsi0;net0
cores: 32
memory: 73728
name: dev-rms113
net0: virtio=02:AB:5C:C3:42:7E,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
scsi0: ceph:vm-313-disk-0,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph:vm-313-disk-1,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsi2: ceph:vm-313-disk-2,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1bb1dd54-c3f8-48bc-b24f-f23eabca900b
sockets: 1
vmgenid: 56186688-5477-47ce-942d-9f5cf69d456f

CentOS VM
Code:
agent: 1,fstrim_cloned_disks=1
balloon: 0
boot: order=scsi0
cores: 32
memory: 65536
meta: creation-qemu=6.1.0,ctime=1643203692
name: dev-rms114
net0: virtio=2A:95:0C:92:5D:15,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
parent: base_config
scsi0: ceph:vm-314-disk-0,cache=writeback,discard=on,iothread=1,size=60G,ssd=1
scsi1: ceph:vm-314-disk-1,cache=writeback,discard=on,iothread=1,size=50G,ssd=1
scsi2: ceph:vm-314-disk-2,cache=writeback,discard=on,iothread=1,size=50G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=b2864fa7-4964-4efd-a463-f2ef9bda9a8c
sockets: 1
vmgenid: f78de09f-7b84-4e9e-aada-95f89480a4c9
 
Please try disabling the cache for the scsi0, scsi1 and scsi2 disks (set them to "No cache"), because VM guests sometimes behave oddly on Ceph.
By design, the emulated controller on Proxmox does not really allow write back when the disk is on Ceph. Ceph will always force a commit on all disks.
For example, with 3 copies a write has to wait until Ceph confirms all of them as written.
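
A minimal Python sketch of what changing the cache setting means for a disk line like the scsi0 entries in the config above. `set_disk_option` is a hypothetical helper; it only edits the option string and does not touch the VM.

```python
def set_disk_option(disk_value, key, value):
    """Return a Proxmox-style disk string with one option replaced or added.

    disk_value looks like 'ceph:vm-313-disk-0,cache=writeback,size=100G'.
    The first comma-separated field is the volume, the rest are key=value pairs.
    """
    volume, *options = disk_value.split(",")
    parsed = dict(opt.split("=", 1) for opt in options)
    parsed[key] = value  # existing keys keep their position, new keys go last
    return ",".join([volume] + [f"{k}={v}" for k, v in parsed.items()])

scsi0 = "ceph:vm-313-disk-0,cache=writeback,discard=on,iothread=1,size=100G,ssd=1"
print(set_disk_option(scsi0, "cache", "none"))
```

The same change has to be applied to every disk that is part of the test, otherwise the before/after numbers are not comparable.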

Then test again.
And please post your ceph.conf.
 
We were on the defaults when we started, then began enabling various options and measuring the difference.
I just re-ran the tests with cache=none and CentOS is down to the Ubuntu numbers. Conclusion: CentOS leverages cache=writeback, Ubuntu does not.

Ceph.conf is default; we like to keep it that way unless a change is absolutely needed.

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.12.11/24
     fsid = 3a228350-a4e8-4cb9-820e-468230905269
     mon_allow_pool_delete = true
     mon_host = 10.10.12.11 10.10.12.12 10.10.12.21 10.10.12.22 10.10.12.10
     ms_bind_ipv4 = true
     osd_pool_default_min_size = 2
     osd_pool_default_size = 4
     public_network = 10.10.12.11/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve11]
     public_addr = 10.10.12.11

[mon.pve12]
     public_addr = 10.10.12.12

[mon.pve21]
     public_addr = 10.10.12.21

[mon.pve22]
     public_addr = 10.10.12.22

[mon.pve31]
     public_addr = 10.10.12.10

Our Ceph setup is stretched between two datacenters, which automatically means higher IO latency than in most setups.
However, I do not want to deep dive into Ceph too much since the DBAs are also having trouble with running MySQL on the local-zfs volume.

I was hoping to hear whether anyone else has struggled with MySQL and managed to tune it accordingly.

Thanks
 
Well, if you just want to "tune" MySQL, there are plenty of parameters for enlarging caches and buffers.
But this has to be done in steps, comparing before and after on pre-defined queries where you know the exact times.
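
The before/after comparison can be as simple as timing the same callable several times and taking the median. A Python sketch under my own assumptions: `median_runtime` and `compare` are made-up helper names, and `fake_query` is a stand-in that would be replaced by a function executing one of the pre-defined queries.

```python
import statistics
import time

def median_runtime(run_query, repeats=5):
    """Run a callable several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def compare(before_s, after_s):
    """Return the relative change in percent; negative means faster."""
    return (after_s - before_s) / before_s * 100

# Stand-in workload; replace with a function that executes a real query.
def fake_query():
    sum(i * i for i in range(100_000))

baseline = median_runtime(fake_query)
print(f"baseline median: {baseline * 1000:.2f} ms")
print(f"change if a tuned run took half as long: {compare(baseline, baseline / 2):.0f}%")
```

Change one parameter at a time and re-measure, otherwise it is impossible to tell which change made the difference.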

If I get this right, you assigned 32 cores to the VMs? What host CPU are you using?
 
If done right, you can gain up to 170% in IOPS in Ceph by setting the right parameters for the use case.

However, in the end, latency is the key in Ceph. If you have 2 DCs and the latency is high, then you can do nearly whatever you want and it will never get fast in IOPS...
 
Host CPU is (dual socket) Intel Xeon Platinum 8160 - arc

What tuning would you suggest for Ceph? If there is room for improvement then I am totally interested ;)
Latency between the DCs is about 1.5ms, which is on the high side but not extreme.
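
To put the 1.5ms in perspective, here is a back-of-the-envelope Python sketch. It assumes a synchronous write at queue depth 1 has to wait for one or two inter-DC round trips (depending on where the primary OSD sits) and that everything else costs nothing. That is a deliberate simplification, and `serial_iops_ceiling` is a made-up name.

```python
def serial_iops_ceiling(rtt_ms, crossings=1):
    """Upper bound on IOPS at queue depth 1 when each write waits for
    `crossings` inter-DC round trips and nothing else takes any time."""
    return 1000.0 / (rtt_ms * crossings)

for crossings in (1, 2):
    ceiling = serial_iops_ceiling(1.5, crossings)
    print(f"{crossings} round trip(s) per write: at most {ceiling:.0f} IOPS")
```

Under these assumptions the link alone caps a single sync writer at a few hundred IOPS, before any disk or CPU time is counted.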

I am still not convinced we have an IO problem; the IO on the prod server is minimal (the following is from the innotop command)

1644051152112.png
 
So you have a CPU that is not very fast single-threaded, and assigned 32 cores to the guest, in 2 NUMA nodes of 16 I guess?
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Platinum+8160+@+2.10GHz&id=3085

And this is hyper-converged with Ceph? After latency, Ceph needs single-thread performance more than anything else to get high IO.

I am not sure if this can be solved on the forum; it looks more like you might need a consultant. :confused:

You can also have a look here https://media-www.micron.com/-/medi...df?la=en&rev=5d9ee3b76bff4b728e018895c384071f where Micron showed what can be done with the NVMe you are using on an Epyc Rome. There are also some parameters shown, but I am not sure if they can be adapted to your hardware...
 
Ceph is on Micron 7300 Pro NVMe
Local ZFS is on Micron 5300 Pro SSD's

This is what we have on each server.
View attachment 33937

I would have expected them to run like lightning, all enterprise with PLP.

Have you tried without spreading across the DC's? I believe the writes are synchronous and won't be acknowledged until writes to the replica OSDs have been acknowledged too, even if it's across to the secondary DC.

What's the purpose for spreading across DC's, just redundancy or more compute over there too?
 
Just a secondary thought, maybe try setting up a crush map that only includes the nodes in the one DC, create a pool with that map and see how it performs...
 
Yes, Ceph uses synchronous writes and will only ack when the write is done on all OSDs.
We have 3 identical nodes in both DCs - so both redundancy and compute.
Re-running the test in a single DC now and will share the results shortly (the crushmap already accommodates this, I just need an additional pool with 2 replicas).
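
A toy Python model of why the replica placement matters: since the ack only comes back once every replica has the write, the slowest replica path sets the latency. All numbers below are hypothetical placeholders (only the 1.5ms inter-DC figure comes from this thread), and `replicated_write_ms` is a made-up helper, not a description of Ceph internals.

```python
def replicated_write_ms(client_to_primary_ms, primary_to_replica_ms, commit_ms):
    """Estimate the latency of one replicated write.

    The write is acknowledged only after every replica has committed,
    so the slowest replica path decides the total.
    """
    slowest_replica = max(primary_to_replica_ms, default=0.0)
    return client_to_primary_ms + slowest_replica + commit_ms

# Hypothetical values: local hop, inter-DC hop, and time to commit on an OSD.
LOCAL_RTT, REMOTE_RTT, COMMIT = 0.1, 1.5, 0.5

single_dc = replicated_write_ms(LOCAL_RTT, [LOCAL_RTT], COMMIT)
stretched = replicated_write_ms(LOCAL_RTT, [LOCAL_RTT, REMOTE_RTT, REMOTE_RTT], COMMIT)

print(f"2 replicas, 1 DC:  {single_dc:.1f} ms per write")
print(f"4 replicas, 2 DCs: {stretched:.1f} ms per write")
```

Adding more replicas in the remote DC does not add latency in this model; having a single remote replica is already enough to pay the full inter-DC round trip.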
 
3 different fio commands: 4M and 4K block sizes with different iodepth values (queue depth, i.e. the number of I/Os kept in flight, rather than threads)
Code:
fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=16 -rw=randwrite -runtime=60 -filename=/dev/sdc
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdc

4 replicas - 2xDC
Code:
Jobs: 1 (f=1): [w(1)][100.0%][w=1469MiB/s][w=367 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=960KiB/s][w=240 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=108MiB/s][w=27.7k IOPS][eta 00m:00s]

2 replicas - 1xDC
Code:
Jobs: 1 (f=1): [w(1)][100.0%][w=2134MiB/s][w=533 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=1552KiB/s][w=388 IOPS][eta 00m:00s]
Jobs: 1 (f=1): [w(1)][100.0%][w=208MiB/s][w=53.2k IOPS][eta 00m:00s]
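
A short Python sketch that puts the two result sets side by side, using only the IOPS figures posted above (`loss_percent` is just a helper name I picked):

```python
# (test, IOPS with 4 replicas across 2 DCs, IOPS with 2 replicas in 1 DC)
results = [
    ("4M, iodepth=16", 367, 533),
    ("4k, iodepth=1, sync", 240, 388),
    ("4k, iodepth=128", 27_700, 53_200),
]

def loss_percent(stretched, single):
    """Share of the single-DC IOPS that is lost in the stretched setup."""
    return (single - stretched) / single * 100

for label, stretched, single in results:
    print(f"{label}: {loss_percent(stretched, single):.0f}% fewer IOPS when stretched")
```

That works out to roughly 31%, 38% and 48% fewer IOPS for the three tests.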

My (lame maybe) conclusion:
1. The IO and bandwidth are there, they are better in single DC setup (expected)
2. The IO and bandwidth are seen in multi-threaded scenario only
3. The CentOS7 VM from above (cache=writeback) is still outperforming 1xDC setup

Code:
Ubuntu20.04(1xDC)     Jobs: 1 (f=1): [w(1)][100.0%][w=1552KiB/s][w=388 IOPS][eta 00m:00s]
CentOS7(writeback)    Jobs: 1 (f=1): [w(1)][100.0%][w=32.7MiB/s][w=8363 IOPS][eta 00m:00s]

It almost looks like 1 VM is limited to 1 thread towards ceph, not sure if this can be manipulated.
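
One way to check that impression is Little's law (requests in flight = throughput x latency). The Python sketch below applies it to the 2xDC numbers and assumes fio actually kept the queue full at the configured iodepth, which makes the iodepth=128 figure an upper estimate.

```python
def mean_latency_ms(iodepth, iops):
    """Little's law rearranged: latency = requests in flight / throughput."""
    return iodepth / iops * 1000

runs = [
    ("4k sync, iodepth=1", 1, 240),
    ("4k, iodepth=128", 128, 27_700),
]

for label, iodepth, iops in runs:
    print(f"{label}: about {mean_latency_ms(iodepth, iops):.2f} ms per write")
```

Both runs come out at a bit over 4 ms per write, so under this assumption each individual write is equally slow in both cases and the higher IOPS come purely from having more writes in flight.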

Thanks folks much appreciated.
 
You are losing 40 to 50% performance due to 2xDC and 4 replicas... I wouldn't expect that...
Again, this sounds like a latency problem...

When I compare this to our 2xDC setup, we lose around 5% with 4 replicas across DCs.
You really should get rid of the latency.

What the Ubuntu guest shows is independent of that, and really weird...
 
What latency do you have between the DC's? Are you using jumbo frames?

Here's mine:
Code:
40 packets transmitted, 40 received, 0% packet loss, time 39038ms
rtt min/avg/max/mdev = 1.041/1.101/1.152/0.029 ms
 
Code:
40 packets transmitted, 40 received, 0% packet loss, time 39935ms
rtt min/avg/max/mdev = 0.011/0.014/0.019/0.002 ms

Yes Jumbo 9000....
 
I will give jumbo frames a try... they are supported in our infra but so far I have been hesitant.
I know "ceph loves jumbo frames" and have seen the PVE team using them while benchmarking.
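
A quick Python sketch of what the MTU changes for the block sizes tested in this thread. It assumes 40 bytes of IPv4 + TCP headers per frame and ignores Ceph's own protocol overhead, so the counts are only indicative.

```python
import math

HEADERS = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options

def frames_needed(payload_bytes, mtu):
    """Number of frames needed to carry a payload at a given MTU."""
    return math.ceil(payload_bytes / (mtu - HEADERS))

for size, label in ((4096, "4 KiB"), (4 * 1024**2, "4 MiB")):
    print(f"{label}: {frames_needed(size, 1500)} frames at MTU 1500, "
          f"{frames_needed(size, 9000)} at MTU 9000")
```

A 4 KiB write fits in one jumbo frame instead of three standard ones, and a 4 MiB write needs about six times fewer frames.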

I cannot get rid of the latency; that would mean removing one of the datacenters, which is a no go.

From all that was shared, I guess I need to rename the post to "what is wrong with Ubuntu on PVE and why it does not react to writeback" o_O