ceph storage all pgs snaptrim every night slowing down vms

abzsol · Jun 18, 2020

Hi, we take snaphost of vms every night at midnight and remove the 16th snapshot. After removing snapshot all pgs go in snaptrim status and this goes for 9/10 hours and the vms are unusable until it finish.

root@pve01:~# ceph -s
cluster:
id: 7ced7402-a929-461a-bd40-53f863fa46ab
health: HEALTH_OK

services:
mon: 3 daemons, quorum pve02,pve03,pve01 (age 5d)
mgr: pve01(active, since 5d), standbys: pve03, pve02
osd: 12 osds: 12 up (since 5d), 12 in (since 5d)

data:
pools: 1 pools, 512 pgs
objects: 4.58M objects, 8.2 TiB
usage: 27 TiB used, 15 TiB / 42 TiB avail
pgs: 486 active+clean+snaptrim_wait
24 active+clean+snaptrim
2 active+clean+scrubbing+deep+snaptrim_wait

io:
client: 3.1 MiB/s rd, 3.6 MiB/s wr, 2.10k op/s rd, 166 op/s wr

another things: we added 3 disks, one for each node last week and the new disks are not gpt and have a highest used space on the osds.
the cluster was created on version 5.2 and the latest disk added with 6.1 or 6.2

could be this the problem?

abzsol · Jun 18, 2020

This is the situation after quite 10 hours.

Can anyone help us please?
Thanks

Dominic · Jun 18, 2020

Hi!

we take snaphost of vms every night at midnight and remove the 16th snapshot

That means every night one snapshot creation and one deletion per VM? How many VMs are we talking about?

How fast is your network? Did you maybe do a benchmark of your Ceph storage at some time?

we added 3 disks, one for each node last week

Have the problems appeared immediately after that?

This is the situation after quite 10 hours.

=> It takes the whole day and most pgs are still in snaptrim state when the new snapshots happen?

abzsol · Jun 18, 2020

Dominic said:
That means every night one snapshot creation and one deletion per VM? How many VMs are we talking about?

yes, one for each vms. 14vms total.

Dominic said:
How fast is your network? Did you maybe do a benchmark of your Ceph storage at some time?

network is 10Gbe bonded in active/backup to 2 different switch. i did benchmark on first install before going production with crystal disk mark within a windows vms and all was good.

Dominic said:
Have the problems appeared immediately after that?

yes and no, after adding the disk it start backfilling and all was working fine for 2 days

Dominic said:
It takes the whole day and most pgs are still in snaptrim state when the new snapshots happen?

Did this yesterday night and finished yesterday morning, tonight right after snapshot all pgs go yellow and in snaptrim but fortunatly is not impacting too much the vms today. I did this command that maybe helped:

Code:

ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 3'

ceph.conf (i think that the line [osd] should be deleted when i upgraded to nautilus. can i remove it?)

Code:

[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.5.0/24
         fsid = 7ced7402-a929-461a-bd40-53f863fa46ab
         mon_allow_pool_delete = true
mon_host = 192.168.5.12 192.168.5.13 192.168.5.11
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.5.0/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

abzsol · Jun 18, 2020

Could be all caused by the fact that the new disk has a different format? Not GPT and LVM.

in this thread https://forum.proxmox.com/threads/ceph-luminous-to-nautilus-upgrade-issues.56268/ seems it has to destroy and recreate all "old" osd originally created in luminous.

abzsol · Jun 18, 2020

EDIT: Adding benchmark

root@pve01:~# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve01_1646222
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 197 181 723.904 724 0.0406462 0.0835088
2 16 389 373 745.91 768 0.051436 0.0846006
3 16 565 549 731.925 704 0.0530815 0.0829333
4 16 737 721 720.926 688 0.0313293 0.0861381
5 16 925 909 727.122 752 0.0315973 0.087306
6 16 1117 1101 733.92 768 0.0755781 0.08633
7 15 1305 1290 737.062 756 0.0441932 0.0864622
8 16 1497 1481 740.419 764 0.0601034 0.0857883
9 16 1681 1665 739.92 736 0.0555269 0.0859927
10 16 1865 1849 739.521 736 0.0968004 0.0859251
Total time run: 10.0556
Total writes made: 1866
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 742.276
Stddev Bandwidth: 27.5487
Max bandwidth (MB/sec): 768
Min bandwidth (MB/sec): 688
Average IOPS: 185
Stddev IOPS: 6.88719
Max IOPS: 192
Min IOPS: 172
Average Latency(s): 0.0861962
Stddev Latency(s): 0.0651585
Max latency(s): 1.29398
Min latency(s): 0.0256866

hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 288 272 1087.83 1088 0.0244231 0.0412122
2 16 587 571 1141.85 1196 0.233822 0.0541319
3 16 884 868 1157.19 1188 0.32743 0.0529549
4 16 1188 1172 1171.86 1216 0.074967 0.053297
5 16 1493 1477 1181.46 1220 0.0618044 0.0505108
6 16 1785 1769 1179.18 1168 0.020421 0.0530125
Total time run: 6.30899
Total reads made: 1866
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1183.07
Average IOPS: 295
Stddev IOPS: 12.1559
Max IOPS: 305
Min IOPS: 272
Average Latency(s): 0.0532738
Max latency(s): 1.26634
Min latency(s): 0.0149414

hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 344 328 1311.78 1312 0.0535415 0.0462263
2 16 624 608 1215.8 1120 0.0200653 0.0463566
3 15 940 925 1233.15 1268 0.0270318 0.049781
4 15 1256 1241 1240.83 1264 0.158066 0.0502829
5 16 1538 1522 1217.43 1124 0.0146232 0.0509896
6 16 1873 1857 1237.83 1340 0.0317591 0.0498054
7 15 2183 2168 1238.7 1244 0.0552959 0.0506502
8 15 2505 2490 1244.84 1288 0.00386829 0.0504596
9 16 2808 2792 1240.73 1208 0.00581728 0.0495815
10 16 3108 3092 1236.64 1200 0.0121912 0.0507447
Total time run: 10.1045
Total reads made: 3109
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1230.74
Average IOPS: 307
Stddev IOPS: 18.492
Max IOPS: 335
Min IOPS: 280
Average Latency(s): 0.0512282
Max latency(s): 0.845722
Min latency(s): 0.00347601

rbd: bench-write is deprecated, use rbd bench --io-type write ...
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 64688 64963.07 266088746.99
2 125152 62708.66 256854685.59
3 186384 62215.54 254834847.72
4 246000 61564.82 252169519.23
elapsed: 4 ops: 262144 ops/sec: 61076.62 bytes/sec: 250169825.54

rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=62.2MiB/s][w=15.9k IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=1650209: Thu Jun 18 11:47:44 2020
write: IOPS=7485, BW=29.2MiB/s (30.7MB/s)(1024MiB/35020msec); 0 zone resets
slat (nsec): min=1318, max=931629, avg=5428.69, stdev=4531.63
clat (usec): min=786, max=1592.0k, avg=4268.40, stdev=38473.29
lat (usec): min=793, max=1592.0k, avg=4273.83, stdev=38473.32
clat percentiles (usec):
| 1.00th=[ 1188], 5.00th=[ 1401], 10.00th=[ 1516],
| 20.00th=[ 1663], 30.00th=[ 1778], 40.00th=[ 1876],
| 50.00th=[ 1975], 60.00th=[ 2089], 70.00th=[ 2245],
| 80.00th=[ 2474], 90.00th=[ 3228], 95.00th=[ 4883],
| 99.00th=[ 16909], 99.50th=[ 95945], 99.90th=[ 231736],
| 99.95th=[1317012], 99.99th=[1551893]
bw ( KiB/s): min= 48, max=65136, per=100.00%, avg=33258.57, stdev=19665.86, samples=63
iops : min= 12, max=16284, avg=8314.63, stdev=4916.46, samples=63
lat (usec) : 1000=0.08%
lat (msec) : 2=52.27%, 4=40.71%, 10=5.17%, 20=0.86%, 50=0.28%
lat (msec) : 100=0.13%, 250=0.40%, 500=0.02%, 750=0.01%, 1000=0.01%
cpu : usr=5.97%, sys=3.28%, ctx=162147, majf=0, minf=9756
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=29.2MiB/s (30.7MB/s), 29.2MiB/s-29.2MiB/s (30.7MB/s-30.7MB/s), io=1024MiB (1074MB), run=35020-35020msec

Disk stats (read/write):
dm-1: ios=1/1282, merge=0/0, ticks=0/20, in_queue=20, util=0.63%, aggrios=25/757, aggrmerge=0/619, aggrticks=1/110, aggrin_queue=0, aggrutil=0.72%
sde: ios=25/757, merge=0/619, ticks=1/110, in_queue=0, util=0.72%

abzsol · Dec 1, 2020

Hi, updated to 6.3 and ceph 15.2.6. Same problem. when a snapshot get deleted all vms get unresponsive until all pgs reach "active+clean".

Tried

Code:

ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 3'

but still the same. We have disabled the snapshot for now.

Dominic · Dec 1, 2020

1 second maximum latency could well be the the source of your problem. So it might help to search the Ceph log files for latency problems and upgrade your hardware to avoid them.

abzsol · Dec 1, 2020

Dominic said:
1 second maximum latency could well be the the source of your problem. So it might help to search the Ceph log files for latency problems and upgrade your hardware to avoid them.

Hi @Dominic and thanks for your reply,
Hardware is very powerful and when it's running normally, it has no performance problem. Only when a snapshot get deleted, it freezes.

3 nodes:

2x Xeon(R) Silver 4114
128 GB RAM
2x300GB SAS for OS
4x 3.8 TB SSD (Intel SSD D3-S4510)
1x 1gbit ring0 to switch 1
1x 1gbit ring1 to switch 2
2x 10GB SFP+ Bonded Active/Passive with openvswitch to switch1 and switch2, on these bond we have:
- 1 ovs bridge for ceph
- 1 ovs bridge for vms
- 1 ovs bridge for accessing the proxmox GUI on the 10GBIT BOND

abzsol · Dec 7, 2020

hi @Dominic, could my problem be related to this? https://forum.proxmox.com/threads/c...le-snaptrim-regression-on-ceph-14-2-10.74103/

Code:

ceph tell osd.0 config show | grep _buffered
"bluefs_buffered_io": "false",

proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 15.2.6-pve1
ceph-fuse: 15.2.6-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

UPDATE: after changing the bluefs_buffered_io to true with the command: ceph tell 'osd.*' injectargs '--bluefs_buffered_io true' now the cluster not freeze the vms during snaptrim, tried removing 2 snapshot 4 months old. Vms slow down but not freeze anymore.

Is this a bug? If i understand correctly this problem introduced with ceph 14.2.10 and then resolved with 14.2.11 restoring the value to true. Does this still apply to octopus?

Thanks

Dominic · Dec 7, 2020

You can find in the changelog of Ceph 14.2.10 that
bluefs_buffered_io has been set to false by default.

common/options: Disable bluefs_buffered_io by default again. said:
When this option is enabled, bluefs will in some cases perform buffered reads. This allows the kernel page cache to act as a secondary cache for things like RocksDB compaction. For example, if the rocksdb block cache isn't large enough to hold blocks from the compressed SST files itself, they can be read from page cache instead of from the disk. This option previously was enabled by default, however in some test cases it appears to cause excessive swap utilization by the linux kernel and a large negative performance impact after several hours of run time. Please exercise caution when enabling

And it seems that it hasn't been touched since. There is no word about activating it again in the Ceph Nautilus changelog, and in the Ceph Octopus source code it is still set to false.

You can also use

Code:

ceph daemon osd.0 config help bluefs_buffered_io

to check the default.

abzsol · Dec 7, 2020

Dominic said:
You can find in the changelog of Ceph 14.2.10 that
bluefs_buffered_io has been set to false by default.

And it seems that it hasn't been touched since. There is no word about activating it again in the Ceph Nautilus changelog, and in the Ceph Octopus source code it is still set to false.

You can also use

Code:

ceph daemon osd.0 config help bluefs_buffered_io

to check the default.

Thanks for your reply, the default is false. We'll separate cluster and public network and reset to false and see if problem is solved

David Herselman · Dec 11, 2020

It appears there is motivation to submit a pull request to once again change the default back to 'true' again.
Whilst there are some workloads that benefit from this having been disabled, the majority of people may ultimately prefer to enable this again, especially smaller clusters comprising of spinners.

Perhaps the best way to handle this is to raise awareness and let those managing their clusters compare performance with this either on or off...

abzsol · Dec 23, 2020

abzsol said:
Thanks for your reply, the default is false. We'll separate cluster and public network and reset to false and see if problem is solved

We have now separated VMS, ceph public and ceph private traffic:

2x LACP 1gbit for vms
2x 10 gbit active/backup for ceph private
2x 10 gbit active/backup for ceph public

but the problem persist. if we remove a snapshot all vms became unavailable until the snaptrim finish.

iostat -xd

abzsol · Dec 26, 2020

this seem to be solved changing power profile from HPE ILO from "dynamic power savings mode" to "os control mode"

https://forum.proxmox.com/threads/s...interruptions-solved-but-want-to-share.81357/

Search

Search

ceph storage all pgs snaptrim every night slowing down vms

abzsol

Active Member

abzsol

Active Member

Dominic

Proxmox Retired Staff

abzsol

Active Member

abzsol

Active Member

abzsol

Active Member

abzsol

Active Member

Dominic

Proxmox Retired Staff

abzsol

Active Member

abzsol

Active Member

Dominic

Proxmox Retired Staff

abzsol

Active Member

David Herselman

Renowned Member

abzsol

Active Member

abzsol

Active Member