Ceph monitor latency skyrockets after upgrade

Mikepop

Hi, I've noticed monitor latency skyrocket after the update in all my clusters.
[screenshots of the monitor latency graphs]

Any way to check why, or how to improve/fix it?

The config is the same on all three nodes, and I've noticed this latency increase in three different clusters:

# pveversion --verbose
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-17-pve: 4.15.18-43
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.1-pve2
ceph-fuse: 14.2.1-pve2
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-4
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

root@ac101:~# ceph -s
  cluster:
    id:     6230ef03-9b53-4f90-8c8a-2273a0d0c7ae
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ac101,ac102,ac103 (age 2d)
    mgr: ac101(active, since 2d), standbys: ac103, ac102
    osd: 24 osds: 24 up, 24 in

  data:
    pools:   2 pools, 576 pgs
    objects: 823.25k objects, 3.1 TiB
    usage:   9.3 TiB used, 33 TiB / 42 TiB avail
    pgs:     576 active+clean

  io:
    client: 33 KiB/s rd, 4.1 MiB/s wr, 3 op/s rd, 250 op/s wr

Regards
 
Did the latency between the OSDs change too? Has network latency in general changed?
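A quick way to compare both (a rough sketch, assuming the node names ac101/ac102/ac103 from your ceph -s output; adjust to your setup):
Code:
# Per-OSD commit/apply latency as reported by Ceph
ceph osd perf

# Raw round-trip time between the cluster nodes (run from each node)
ping -c 10 ac102
ping -c 10 ac103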
 
Did you restart the MONs one-by-one already? What does the following command show (maybe the keys changed)?
Code:
ceph daemon mon.<ID> perf dump
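For example, to pull out just the rocksdb latency counters (assuming jq is installed and using the MON ID ac101 from your status output; counter names can differ between Ceph releases):
Code:
ceph daemon mon.ac101 perf dump | jq '.rocksdb'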
 
Is there anything in the Ceph logs (e.g. ceph-mon)? Maybe there is some background task running. Another thought: is the MON DB on an SSD that might need an fstrim?
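A rough sketch for checking both points, assuming the default Proxmox/Ceph paths and the MON ID ac101; adjust to your node names:
Code:
# Recent MON log entries
tail -n 50 /var/log/ceph/ceph-mon.ac101.log

# Where the MON DB lives and how large it is
du -sh /var/lib/ceph/mon/ceph-ac101/store.db
findmnt --target /var/lib/ceph/mon/ceph-ac101

# Trim the filesystem holding the MON DB (usually the root FS on PVE)
fstrim -v /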
 
Hi Alwin, how can I check that? I don't see any errors in the MON logs, just rocksdb entries and latency histograms. All disks are enterprise SSDs. I can't tell if fstrim is needed, but the latency increase was right after the update on all three clusters.
root@int101:/var/log/ceph# fstrim --verbose --all
/var/lib/ceph/osd/ceph-10: 96.3 MiB (100982784 bytes) trimmed on /dev/sdd1
/var/lib/ceph/osd/ceph-1: 96.3 MiB (100982784 bytes) trimmed on /dev/sdc1
/var/lib/ceph/osd/ceph-11: 96.3 MiB (100982784 bytes) trimmed on /dev/sdf1
/var/lib/ceph/osd/ceph-23: 96.3 MiB (100982784 bytes) trimmed on /dev/sdg1
/var/lib/ceph/osd/ceph-0: 96.3 MiB (100982784 bytes) trimmed on /dev/sdb1
/var/lib/ceph/osd/ceph-13: 96.3 MiB (100982784 bytes) trimmed on /dev/sdj1
/var/lib/ceph/osd/ceph-22: 96.3 MiB (100982784 bytes) trimmed on /dev/sde1
/var/lib/ceph/osd/ceph-12: 96.3 MiB (100982784 bytes) trimmed on /dev/sdh1
/: 13 GiB (13996597248 bytes) trimmed on /dev/mapper/pve-root
root@int101:/var/log/ceph# journalctl -u fstrim
-- Logs begin at Tue 2019-07-23 12:16:16 CEST, end at Wed 2019-07-31 21:08:00 CEST. --
-- No entries --
root@int101:/var/log/ceph#
root@int101:/var/log/ceph#
root@int101:/var/log/ceph# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Wed 2019-07-31 21:09:00 CEST 21s left Wed 2019-07-31 21:08:00 CEST 38s ago pvesr.timer pvesr.service
Thu 2019-08-01 00:00:00 CEST 2h 51min left Wed 2019-07-31 00:00:00 CEST 21h ago logrotate.timer logrotate.ser
Thu 2019-08-01 00:00:00 CEST 2h 51min left Wed 2019-07-31 00:00:00 CEST 21h ago man-db.timer man-db.servic
Thu 2019-08-01 02:48:03 CEST 5h 39min left Wed 2019-07-31 13:06:55 CEST 8h ago apt-daily.timer apt-daily.ser
Thu 2019-08-01 03:11:33 CEST 6h left Wed 2019-07-31 03:32:55 CEST 17h ago pve-daily-update.timer pve-daily-upd
Thu 2019-08-01 06:38:20 CEST 9h left Wed 2019-07-31 06:25:46 CEST 14h ago apt-daily-upgrade.timer apt-daily-upg
Thu 2019-08-01 10:02:43 CEST 12h left Wed 2019-07-31 20:13:46 CEST 54min ago certbot.timer certbot.servi
Thu 2019-08-01 12:32:00 CEST 15h left Wed 2019-07-31 12:32:00 CEST 8h ago systemd-tmpfiles-clean.timer systemd-tmpfi

8 timers listed.


Regards
 
/: 13 GiB (13996597248 bytes) trimmed on /dev/mapper/pve-root
This is a good portion, but I suspect not the whole SSD. ;) Did you do this on all nodes with MONs? And if so, could you restart all MONs (one at a time)?

After that, does the latency change? If not, we need to dig deeper and maybe add some debug flags.
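If it helps, a minimal sketch of restarting the MONs one at a time (MON ID ac101 taken from the ceph -s output above; jq assumed for the quorum check):
Code:
systemctl restart ceph-mon@ac101
# Wait until this MON has rejoined quorum before moving to the next node
ceph quorum_status | jq '.quorum_names'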
 
I've restarted all nodes at least twice since the upgrade to version 6, on the other clusters too, but there's no change in monitor latency, I'm afraid.

Regards
 
