MySQL performance issue on Proxmox with Ceph

And this is not unusual for ZFS; it is faster when using 32KB+ I/O sizes, but that is not what MySQL uses for most queries, even if it works internally with 32KB I/O.
Looks like I've missed this... Agreed, we have looked at block size misalignment; this issue would be even worse with the 4M block size that Ceph uses!
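For reference, the sizes involved can be checked directly; a minimal sketch (the dataset, pool and image names are assumptions, adjust to your environment):
Code:
# InnoDB page size used by MySQL (16KB by default)
mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"
# ZFS recordsize of the dataset backing the VM disks
zfs get recordsize rpool/data
# RBD object size of a Ceph-backed VM disk ("order 22" = 4MiB objects)
rbd info cephbench/vm-100-disk-0 | grep order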
 
I did more testing today with separate/standalone server we have (PVE installed but not configured/used).
The server specs are 2x Xeon E5-2698 v4, 512GB RAM and 3x Samsung PM9A3 3.84TB NVMe.

The tests were done with the same fio command from above:
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/nvme2n1 -sync=1

Running the command from the PVE host returned (as expected)
Code:
Jobs: 1 (f=1): [w(1)][100.0%][w=237MiB/s][w=60.5k IOPS][eta 00m:00s]

Next, on each drive we created a volume - LVM, LVM-thin and ZFS respectively (ZFS ARC limited to 32GB).
We then created a new Ubuntu VM with a hard drive on each volume; the same command returned the following results:

Code:
VM LVM         Jobs: 1 (f=1): [w(1)][100.0%][w=40.9MiB/s][w=10.5k IOPS][eta 00m:00s]
VM LVM-thin    Jobs: 1 (f=1): [w(1)][100.0%][w=5929KiB/s][w=1482 IOPS][eta 00m:00s]
VM ZFS         Jobs: 1 (f=1): [w(1)][100.0%][w=7003KiB/s][w=1750 IOPS][eta 00m:00s]

Conclusions:
- The LVM volume performed best; however, this is still about 5 times less than what the physical drive can do.
- The LVM test is very much comparable to the CentOS7/Ceph results from above.
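For reference, the three backing stores could be created roughly like this (a sketch only; device, VG and pool names are assumptions):
Code:
# plain LVM
vgcreate vg_lvm /dev/nvme0n1
pvesm add lvm lvm-test --vgname vg_lvm

# LVM-thin
vgcreate vg_thin /dev/nvme1n1
lvcreate --type thin-pool -l 100%FREE -n thinpool vg_thin
pvesm add lvmthin lvmthin-test --vgname vg_thin --thinpool thinpool

# ZFS, single disk; cap ARC at 32GB at runtime (persist via /etc/modprobe.d/zfs.conf)
zpool create -o ashift=12 zfstest /dev/nvme2n1
pvesm add zfspool zfs-test --pool zfstest
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max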
 
Hi,
I'm getting around 1800 IOPS.

fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1

Code:
qemuguestvm# fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc -sync=1
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
^Cbs: 1 (f=1): [w(1)][80.0%][w=7219KiB/s][w=1804 IOPS][eta 00m:12s]
fio: terminating on signal 2

test: (groupid=0, jobs=1): err= 0: pid=15516: Sun Feb  6 18:38:53 2022
  write: IOPS=2029, BW=8117KiB/s (8312kB/s)(380MiB/47946msec); 0 zone resets
    slat (usec): min=6, max=1435, avg=10.41, stdev= 6.47
    clat (nsec): min=1451, max=14291k, avg=480037.12, stdev=239194.88
     lat (usec): min=122, max=14317, avg=490.85, stdev=239.56
    clat percentiles (usec):
     |  1.00th=[  131],  5.00th=[  139], 10.00th=[  147], 20.00th=[  165],
     | 30.00th=[  545], 40.00th=[  562], 50.00th=[  570], 60.00th=[  578],
     | 70.00th=[  586], 80.00th=[  603], 90.00th=[  619], 95.00th=[  644],
     | 99.00th=[  857], 99.50th=[ 1090], 99.90th=[ 2212], 99.95th=[ 2900],
     | 99.99th=[ 6390]
   bw (  KiB/s): min= 6608, max=25248, per=100.00%, avg=8124.68, stdev=4163.71, samples=95
   iops        : min= 1652, max= 6312, avg=2031.16, stdev=1040.93, samples=95
  lat (usec)   : 2=0.01%, 100=0.01%, 250=26.54%, 500=0.17%, 750=71.38%
  lat (usec)   : 1000=1.28%
  lat (msec)   : 2=0.50%, 4=0.11%, 10=0.02%, 20=0.01%
  cpu          : usr=0.90%, sys=1.43%, ctx=194637, majf=0, minf=286
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,97292,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=8117KiB/s (8312kB/s), 8117KiB/s-8117KiB/s (8312kB/s-8312kB/s), io=380MiB (399MB), run=47946-47946msec

Disk stats (read/write):
  sdc: ios=48/97188, merge=0/0, ticks=3/46998, in_queue=0, util=0.00%

This is an 18-OSD SSD cluster (3 nodes), Intel S4610 drives,

with 3x replication in 1 DC, Octopus version.

ping latency between nodes:
Code:
26493 packets transmitted, 26493 received, 0% packet loss, time 972ms
rtt min/avg/max/mdev = 0.009/0.010/0.076/0.003 ms, ipg/ewma 0.036/0.010 ms

This is with 3GHz CPU cores on both the client and the Ceph storage (having fast frequencies is really important).




(With 1ms latency across DCs, you can't theoretically reach more than 1000 IOPS at queue_depth=1, and in reality it is lower because you also have the CPU processing time.)
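To spell out the arithmetic (illustrative figures):
Code:
# at iodepth=1 with sync writes, every IO waits for the previous one to complete,
# so the upper bound is simply 1 / per-IO latency:
#   1.0 ms round trip -> 1 / 0.0010 s = 1000 IOPS
#   0.5 ms            -> 2000 IOPS
#   0.1 ms            -> 10000 IOPS
# real numbers are lower because CPU and OSD processing add to every IO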
 
@hepo have you got any further with this?

Have you tried running a rados benchmark against a newly created pool?

We're also finding writes are heavily restricted when running on a VM itself.
 
@spirit those are pretty good stats.

We've just replaced our cluster (3 nodes, 1 DC) with 3 x 3.84TB Samsung PM893's (with another 9 on order, so 4 per node).

We have Intel Xeon E5-2697 configured in performance mode so 2.3GHz / 3.6GHz turbo per core.

Running the same FIO benchmark within a VM we're getting ~290 IOPS. Surely the fact we only have three disks at the moment can't be the problem? Not a clue what is going on here. Have you done any kind of Ceph optimisation?

UPDATE

When skipping the synchronous writes flag I get ~5000 IOPS. Still, I'm pretty sure I should be getting more than this?
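In other words, the same fio command but without the -sync=1 flag, something like (device path is an assumption):
Code:
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdc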
 
Yes, more testing was done... we were mostly focusing on enabling jumbo frames (MTU=9000) and can definitely see more stable and faster speeds.
This is migrating VM disk from one Ceph pool to another which would load the reads and write of the Ceph cluster simultaneously:
[attached screenshot: migration throughput graph]
That 1.1GiB/s is the max throughput of the link between the datacenters.
Without jumbo frames this graph is much more erratic.
The same goes for rebalance/recovery speeds.
Jumbo frames made no difference in the fio tests inside the VM.
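For anyone wanting to reproduce the jumbo frames change, a minimal sketch (interface name and peer address are assumptions; make it persistent by adding "mtu 9000" to the interface stanza in /etc/network/interfaces):
Code:
# raise the MTU on the Ceph-facing interface
ip link set dev ens18 mtu 9000
# verify jumbo frames pass end to end (8972 = 9000 - 28 bytes IP/ICMP headers)
ping -M do -s 8972 <other-node-ip>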

Rados bench tests were done with the following commands:
- rados bench -p cephbench 100 write -b 4M -t 16 --no-cleanup
- rados bench -p cephbench 100 seq -t 16
- rados -p cephbench cleanup

WRITES
2xDC / 4 replica
Code:
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   80      16     35651     35635   1781.53      1996   0.0262882   0.0359118
   81      16     36090     36074   1781.21      1756   0.0370569     0.03592
   82      16     36578     36562   1783.29      1952   0.0188674   0.0358769
   83      16     37047     37031   1784.41      1876   0.0338395   0.0358552
   84      16     37511     37495   1785.26      1856   0.0291933   0.0358387
   85      16     37981     37965   1786.37      1880   0.0355501   0.0358147

1xDC / 2 replica
Code:
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   80      16     55023     55007   2750.03      2776   0.0182776   0.0232641
   81      16     55710     55694   2750.01      2748   0.0293555   0.0232659
   82      16     56354     56338   2747.88      2576   0.0853413   0.0232795
   83      16     57033     57017   2747.49      2716   0.0224716   0.0232872
   84      16     57727     57711   2747.83      2776   0.0248156   0.0232839
   85      16     58411     58395   2747.69      2736   0.0239619   0.0232852

READS
2xDC / 4 replica
Code:
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   80      16     31822     31806   1590.04      1624   0.0191366     0.03978
   81      16     32226     32210   1590.35      1616   0.0342244   0.0397798
   82      16     32634     32618   1590.86      1632   0.0481717    0.039771
   83      16     32995     32979   1589.09      1444   0.0668168   0.0397993
   84      15     33372     33357   1588.17      1512   0.0591213   0.0398374
   85      16     33765     33749   1587.93      1568   0.0626353   0.0398346

1xDC / 2 replica
Code:
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   80      15     32191     32176   1608.51      1600   0.0155326   0.0392819
   81      16     32587     32571   1608.15      1580   0.0111232   0.0392905
   82      16     32987     32971   1608.05      1600   0.0270539   0.0392936
   83      16     33410     33394   1609.06      1692  0.00840267   0.0392713
   84      16     33828     33812   1609.81      1672   0.0357919   0.0392608
   85      16     34232     34216   1609.87      1616  0.00693911   0.0392537

Obviously the single-DC results are much better, but two datacenters is a core requirement for our environment, and to be honest the normal Ceph cluster load is peanuts compared to what the benchmarks show.

With regard to the IO tests inside the VM, we have not found any way to improve them, and the results are somewhat terrifying (even in a single DC).
The biggest mystery is how differently OSes react to cache=writeback, i.e. Ubuntu shows no noticeable difference while CentOS7 is much faster.

The following results are from VM disk in single DC Ceph pool with 2 replicas (cephbench).
The command once again: fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdb -sync=1
The VM config:
[attached screenshot: VM hardware configuration]

With cache=writeback
Code:
Ubuntu20.04   Jobs: 1 (f=1): [w(1)][100.0%][w=1404KiB/s][w=351 IOPS][eta 00m:00s]
CentOS7       Jobs: 1 (f=1): [w(1)][100.0%][w=46.0MiB/s][w=11.8k IOPS][eta 00m:00s]

With cache=none (default)
Code:
Ubuntu20.04   Jobs: 1 (f=1): [w(1)][100.0%][w=1641KiB/s][w=410 IOPS][eta 00m:00s]
CentOS7       Jobs: 1 (f=1): [w(1)][100.0%][w=1800KiB/s][w=450 IOPS][eta 00m:00s]
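For reference, the per-disk cache mode can be switched from the PVE host with qm; a sketch (VM ID, storage and volume names are placeholders):
Code:
# switch the test disk to writeback (re-specify the volume with the new cache mode)
qm set 101 --scsi1 cephbench:vm-101-disk-1,cache=writeback
# back to the default
qm set 101 --scsi1 cephbench:vm-101-disk-1,cache=none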

I don't want to promote CentOS7 in our environment due to its EOL in 2024, but the difference is just so big.

I would be very interested in discussing other performance tuning that can be done on Ceph.
So far I've seen bits and pieces (e.g. disabling debugging) but have not seen a well-written article that explains the hows and whys.


@chrispage1, "When skipping the synchronous writes flag" - what is this and how do you set it up?
@spirit, what VM specs and setup (writeback?) did you use for your test, and what OS?
@itNGO, what Ceph tuning have you done in your cluster?

PS - we have some results on the DB performance issue: it turned out that a recent change in the DB structure had a side effect on a number of queries (they ran much slower than before). We will also add more disks to the VM and place the data and binlogs on separate drives; based on the testing above, this is expected to provide more IO.
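The planned data/binlog split would look something like this in my.cnf (a sketch; the mount points for the extra virtual disks are assumptions):
INI:
[mysqld]
# data files on one virtual disk
datadir = /data/mysql
# binary logs on a separate virtual disk
log_bin = /binlog/mysql-bin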

Cheers all and thanks for your time once again!
 
@itNGO, what Ceph tuning have you done in your cluster?
Well, I can give you a small overview: we changed several values in sysctl.conf for Ceph and network latency, installed and use tuned to change the power profile of our servers, and set kernel parameters to disable C-states on the CPUs.

But each value has to be tested one by one, and also in combination, to see whether it makes things better or worse... we spend about 6 weeks at 10 hours a day on EVERY new Ceph cluster in our datacenter, which often has different hardware, to get the right values for the overall planned workload.

This is just for information; it is not usable as a simple copy/paste, so don't do that. And never in production... some changes will kill nodes if not all nodes get a fresh reboot at the same time....
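As an illustration of the tuned/C-state part only (a sketch; the profile and kernel parameters are examples and, as said above, need testing on your own hardware):
Code:
apt install tuned
tuned-adm profile network-latency   # or latency-performance
tuned-adm active
# C-states are disabled via kernel parameters in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="intel_idle.max_cstate=0 processor.max_cstate=1"
# then: update-grub && reboot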


INI:
sysctl.conf
# Controls the default maximum size of a message queue
kernel.msgmax = 65536

# ZFS adjustments
vfs.zfs.write_limit_override=1073741824

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

# Ceph Networking Optimizations
net.core.rmem_default = 56623104
net.core.rmem_max = 56623104
net.core.wmem_default = 56623104
net.core.wmem_max = 56623104

net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000

net.ipv4.tcp_max_tw_buckets = 10000
#net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104
net.core.somaxconn = 5000
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_tw_buckets = 262144
net.core.optmem_max = 4194304
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_ecn = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.tcp_fin_timeout = 10
net.core.netdev_budget = 600
net.ipv4.tcp_fastopen = 3

kernel.pid_max = 4194303
vm.zone_reclaim_mode = 0
vm.swappiness = 99
vm.min_free_kbytes = 513690 # assuming 64GB of RAM total
vm.dirty_ratio = 40
vm.vfs_cache_pressure = 1000
vm.dirty_ratio = 20
net.netfilter.nf_conntrack_max = 10000000
net.nf_conntrack_max = 10000000
vm.dirty_background_ratio = 3
fs.file-max = 524288

INI:
ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cephx_sign_messages = false
         cephx_require_signatures = false
         cephx_cluster_require_signatures = false
         cluster_network = 10.255.179.12/24
         fsid = 378dde03-3f1b-42e5-962d-76b9ddb0f990
         mon_allow_pool_delete = true
         mon_host = 10.255.179.12 10.255.179.10 10.255.179.11
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         osd_memory_target = 6442450944
         bluefs_buffered_io=true       
         public_network = 10.255.179.12/24
         mon_cluster_log_file_level = info
         debug asok = 0/0
         debug auth = 0/0
         debug bdev = 0/0
         debug bluefs = 0/0
         debug bluestore = 0/0
         debug buffer = 0/0
         debug civetweb = 0/0
         debug client = 0/0
         debug compressor = 0/0
         debug context = 0/0
         debug crush = 0/0
         debug crypto = 0/0
         debug dpdk = 0/0
         debug eventtrace = 0/0
         debug filer = 0/0
         debug filestore = 0/0
         debug finisher = 0/0
         debug fuse = 0/0
         debug heartbeatmap = 0/0
         debug javaclient = 0/0
         debug journal = 0/0
         debug journaler = 0/0
         debug kinetic = 0/0
         debug kstore = 0/0
         debug leveldb = 0/0
         debug lockdep = 0/0
         debug mds = 0/0
         debug mds balancer = 0/0
         debug mds locker = 0/0
         debug mds log = 0/0
         debug mds log expire = 0/0
         debug mds migrator = 0/0
         debug memdb = 0/0
         debug mgr = 0/0
         debug mgrc = 0/0
         debug mon = 0/0
         debug monc = 0/0
         debug ms = 0/0
         debug none = 0/0
         debug objclass = 0/0
         debug objectcacher = 0/0
         debug objecter = 0/0
         debug optracker = 0/0
         debug osd = 0/0
         debug paxos = 0/0
         debug perfcounter = 0/0
         debug rados = 0/0
         debug rbd = 0/0
         debug rbd mirror = 0/0
         debug rbd replay = 0/0
         debug refs = 0/0
         debug reserver = 0/0
         debug rgw = 0/0
         debug rocksdb = 0/0
         debug striper = 0/0
         debug throttle = 0/0
         debug timer = 0/0
         debug tp = 0/0
         debug xio = 0/0
         bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target>
         ms_type = async
         ms_crc_data = false

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.RZB-BPVE1]
         public_addr = 10.255.179.10

[mon.RZB-BPVE2]
         public_addr = 10.255.179.11

[mon.RZB-MPVE2]
         public_addr = 10.255.179.12

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring
         rbd cache = True
         rbd cache size = 335544320
         rbd cache max dirty = 134217728
         rbd cache max dirty age = 30
         rbd cache writethrough until flush = False
         rbd cache max dirty object = 2
         rbd cache target dirty = 235544320

[mon.RZB-APVE1]
         public_addr = 10.255.182.11

[mon.RZB-APVE2]
         public_addr = 10.255.182.12

[mon.RZB-MPVE1]
         public_addr = 10.255.181.10

[OSD]
         bdev_aio_max_queue_depth = 1024
         osd journal size = 20000
         osd max write size = 512
         osd client message size cap = 2147483648
         osd deep scrub stride = 131072
         osd op threads = 16
         osd disk threads = 4
         osd map cache size = 1024
         osd map cache bl size = 128
         osd recovery op priority = 2
         osd recovery max active = 10
         osd max backfills = 4
         osd min pg log entries = 30000
         osd max pg log entries = 100000
         osd mon heartbeat interval = 40
         ms dispatch throttle bytes = 1048576000
         objecter inflight ops = 819200
         osd op log threshold = 50
         osd crush chooseleaf type = 0
         journal max write bytes = 1073714824
         journal max write entries = 10000
         journal queue max ops = 50000
         journal queue max bytes = 10485760000

[osd.0]
         public_addr = 10.255.182.11
         cluster_addr = 10.255.182.11

Also, some values in ceph.conf were changed, as shown above....
 
@spirit, what VM specs and setup (writeback?) did you use for your test, and what OS?
Yes, I'm using writeback (since Octopus it really improves small writes without any impact on reads; previous versions had a global lock performance penalty).

The VMs use Debian 11 (kernel 5.10).

Also, for better latency, I'm forcing the CPU to 100% frequency with this GRUB config:

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable"
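(Note: the kernel line only takes effect after regenerating the GRUB config and rebooting; a sketch of applying and checking it:)
Code:
# apply the change from /etc/default/grub
update-grub
reboot
# after the reboot, check which cpuidle / cpufreq drivers are active
cat /sys/devices/system/cpu/cpuidle/current_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver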

my ceph.conf

Code:
[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 perf = true
 mutex_perf_counter = false
 throttler_perf_counter = false
 
Running the same FIO benchmark within a VM we're getting ~290 IOPS. Surely the fact we only have three disks at the moment can't be the problem?
Mmm, yes, maybe. The PG number could have an impact too with a small number of OSDs.

Maybe you could try creating multiple OSDs per disk to compare?

Code:
ceph-volume lvm batch --osds-per-device X /dev/sdX
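Checking the PG count of the pool you are benchmarking might also be worth it; a sketch (pool name and target pg_num are placeholders):
Code:
ceph osd pool get <pool> pg_num
# raise it if it is very low for the number of OSDs, or let the autoscaler handle it
ceph osd pool set <pool> pg_num 128
ceph osd pool set <pool> pg_autoscale_mode on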
 
Looks like I've missed this... Agreed, we have looked at block size misalignment; this issue would be even worse with the 4M block size that Ceph uses!
....

Hi,

IMO you are going in the wrong direction. Your goal is to get the best performance you can; whether it is Ceph (I do not use it) or ZFS (I do use it) does not matter. I also see some fio tests in this thread that have nothing to do with MySQL.

I cannot speak to how Ceph can be optimized for better MySQL results, but I can for ZFS. And I think the basics can be translated to Ceph (only a supposition).

So let's go to the basics of how MySQL works:

- MySQL uses 16k blocks for any InnoDB database (sync), so any larger block size is a highway to bad performance (4M ???? = hara-kiri)

... see more on this: https://www.percona.com/blog/2017/12/07/hands-look-zfs-with-mysql/

I can give more specifics on MySQL/ZFS, but without knowing the Ceph basics I cannot say more.
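As a rough illustration of the kind of ZFS layout that article describes (a sketch; pool/dataset names and property values are assumptions, check the article for the reasoning):
Code:
zfs create tank/mysql
# dataset for InnoDB data, recordsize matched to the 16k page size
zfs create -o recordsize=16k -o primarycache=metadata -o logbias=throughput tank/mysql/data
# redo/binary logs are mostly sequential writes, so a larger recordsize is commonly used
zfs create -o recordsize=128k tank/mysql/log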

Good luck / Bafta!
 
Also, about MySQL: you should use something like
innodb_flush_log_at_trx_commit = 2 (to flush transactions once per second)

if you want the benefit of writeback on Ceph.

innodb_flush_log_at_trx_commit = 1 flushes on every transaction, so it can be slow with Ceph.
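In my.cnf terms, something like (a sketch):
INI:
[mysqld]
# 2 = write the redo log on each commit but fsync it only about once per second;
#     up to ~1 second of transactions can be lost on a crash
innodb_flush_log_at_trx_commit = 2
# 1 = (default) fsync on every commit: fully durable, but every commit pays the full write latency
#innodb_flush_log_at_trx_commit = 1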
 
I really like the collaboration ;)

Some reading and testing is required here... at first look I am comfortable with disabling the debug messages and will look into the rbd cache for the clients.

@itNGO disclaimer acknowledged! We have a test cluster where we can test the config, though unfortunately not performance.
Forgot to answer one of your questions - yes, we run Ceph on top of PVE.

@spirit disabling the C-states should be done by default in PVE; however, I cannot see such settings in the GRUB config and need to research this, thanks.
Also, innodb_flush_log_at_trx_commit = 2 is something we are aware of but consider risky. Indeed, it makes a huge difference in sysbench tests.
Going down that route... barrier=0 on ext4 also boosts sysbench and fio results very well.

@guletz we are big fans of Percona and actively use PMM.
I recall reading a Percona article on Ceph but need to refresh my memory.
Indeed, the 4M block size is not ideal for a MySQL workload. For me, the resilience of Ceph prevails.

Thanks all, I will post updates as I have them.
 
@spirit disabling the C-states should be done by default in PVE; however, I cannot see such settings in the GRUB config and need to research this, thanks.
Proxmox forces the governor to performance, but does not touch C-states. (Disabling them can use a lot of CPU.)
Also, innodb_flush_log_at_trx_commit = 2 is something we are aware of but consider risky. Indeed, it makes a huge difference in sysbench tests.
Yes, it can be risky, depending on your application.
Going down that route... barrier=0 on ext4 also boosts sysbench and fio results very well.
Disabling barriers is really risky, as you can end up with filesystem corruption.
 
Indeed, the 4M block size is not ideal for a MySQL workload. For me, the resilience of Ceph prevails.
Hi,

It is not about "ideal"; 4M is far, far away from anything decent for MySQL/Percona.

Resilience for Ceph? How about data safety with Ceph? There is no checksumming at runtime, so how will Ceph save you?

You will have resilience but not data safety, so you cannot be sure that the data you read is the same as what was written in the past. At most, Ceph will tell you, after your integrity-check task runs, that not all of your data is OK.

Maybe I am wrong, because I only read from time to time about how Ceph is able to do new things.


Good Luck / Bafta!
 
