Ceph monitor latency skyrockets after upgrade

Mikepop

Hi, I've noticed monitor latency skyrocket after the update in all my clusters.
[screenshots of the monitor latency graphs]

Any way to check why, or how to improve/fix it?

The config is the same on all three nodes, and I've noticed this latency increase in three different clusters:

# pveversion --verbose
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-17-pve: 4.15.18-43
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.1-pve2
ceph-fuse: 14.2.1-pve2
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-4
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

root@ac101:~# ceph -s
  cluster:
    id:     6230ef03-9b53-4f90-8c8a-2273a0d0c7ae
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ac101,ac102,ac103 (age 2d)
    mgr: ac101(active, since 2d), standbys: ac103, ac102
    osd: 24 osds: 24 up, 24 in

  data:
    pools:   2 pools, 576 pgs
    objects: 823.25k objects, 3.1 TiB
    usage:   9.3 TiB used, 33 TiB / 42 TiB avail
    pgs:     576 active+clean

  io:
    client: 33 KiB/s rd, 4.1 MiB/s wr, 3 op/s rd, 250 op/s wr

Regards
 
Did the latency between the OSDs change too? Has network latency in general changed?
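A quick way to compare both (a rough sketch, assuming the node names ac101/ac102/ac103 from your ceph -s output; adjust to your setup):
Code:
# Per-OSD commit/apply latency as reported by Ceph
ceph osd perf

# Raw round-trip time between the cluster nodes (run from each node)
ping -c 10 ac102
ping -c 10 ac103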
 
Did you restart the MONs one-by-one already? What does the following command show (maybe the keys changed)?
Code:
ceph daemon mon.<ID> perf dump
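For example, to pull out just the rocksdb latency counters (assuming jq is installed and using the MON ID ac101 from your status output; counter names can differ between Ceph releases):
Code:
ceph daemon mon.ac101 perf dump | jq '.rocksdb'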
 
Is there anything in the Ceph logs (e.g. ceph-mon)? Maybe there is some background task running. Another thought: is the MON DB on an SSD that might need an fstrim?
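A rough sketch for checking both points, assuming the default Proxmox/Ceph paths and the MON ID ac101; adjust to your node names:
Code:
# Recent MON log entries
tail -n 50 /var/log/ceph/ceph-mon.ac101.log

# Where the MON DB lives and how large it is
du -sh /var/lib/ceph/mon/ceph-ac101/store.db
findmnt --target /var/lib/ceph/mon/ceph-ac101

# Trim the filesystem holding the MON DB (usually the root FS on PVE)
fstrim -v /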
 
Hi Alwin, how can I check that? I don't see any errors in the MON logs, just rocksdb entries and latency histograms. All disks are enterprise SSDs. I can't tell if fstrim is needed, but the latency increase was right after the update on all three clusters.
root@int101:/var/log/ceph# fstrim --verbose --all
/var/lib/ceph/osd/ceph-10: 96.3 MiB (100982784 bytes) trimmed on /dev/sdd1
/var/lib/ceph/osd/ceph-1: 96.3 MiB (100982784 bytes) trimmed on /dev/sdc1
/var/lib/ceph/osd/ceph-11: 96.3 MiB (100982784 bytes) trimmed on /dev/sdf1
/var/lib/ceph/osd/ceph-23: 96.3 MiB (100982784 bytes) trimmed on /dev/sdg1
/var/lib/ceph/osd/ceph-0: 96.3 MiB (100982784 bytes) trimmed on /dev/sdb1
/var/lib/ceph/osd/ceph-13: 96.3 MiB (100982784 bytes) trimmed on /dev/sdj1
/var/lib/ceph/osd/ceph-22: 96.3 MiB (100982784 bytes) trimmed on /dev/sde1
/var/lib/ceph/osd/ceph-12: 96.3 MiB (100982784 bytes) trimmed on /dev/sdh1
/: 13 GiB (13996597248 bytes) trimmed on /dev/mapper/pve-root
root@int101:/var/log/ceph# journalctl -u fstrim
-- Logs begin at Tue 2019-07-23 12:16:16 CEST, end at Wed 2019-07-31 21:08:00 CEST. --
-- No entries --
root@int101:/var/log/ceph#
root@int101:/var/log/ceph#
root@int101:/var/log/ceph# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Wed 2019-07-31 21:09:00 CEST 21s left Wed 2019-07-31 21:08:00 CEST 38s ago pvesr.timer pvesr.service
Thu 2019-08-01 00:00:00 CEST 2h 51min left Wed 2019-07-31 00:00:00 CEST 21h ago logrotate.timer logrotate.ser
Thu 2019-08-01 00:00:00 CEST 2h 51min left Wed 2019-07-31 00:00:00 CEST 21h ago man-db.timer man-db.servic
Thu 2019-08-01 02:48:03 CEST 5h 39min left Wed 2019-07-31 13:06:55 CEST 8h ago apt-daily.timer apt-daily.ser
Thu 2019-08-01 03:11:33 CEST 6h left Wed 2019-07-31 03:32:55 CEST 17h ago pve-daily-update.timer pve-daily-upd
Thu 2019-08-01 06:38:20 CEST 9h left Wed 2019-07-31 06:25:46 CEST 14h ago apt-daily-upgrade.timer apt-daily-upg
Thu 2019-08-01 10:02:43 CEST 12h left Wed 2019-07-31 20:13:46 CEST 54min ago certbot.timer certbot.servi
Thu 2019-08-01 12:32:00 CEST 15h left Wed 2019-07-31 12:32:00 CEST 8h ago systemd-tmpfiles-clean.timer systemd-tmpfi

8 timers listed.


Regards
 
/: 13 GiB (13996597248 bytes) trimmed on /dev/mapper/pve-root
This is a good portion, but I suspect not the whole SSD. ;) Did you do this on all nodes with MONs? And if so, could you restart all MONs (one at a time)?

After that, does the latency change? If not, we need to dig deeper and maybe add some debug flags.
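If it helps, a minimal sketch of restarting the MONs one at a time (MON ID ac101 taken from the ceph -s output above; jq assumed for the quorum check):
Code:
systemctl restart ceph-mon@ac101
# Wait until this MON has rejoined quorum before moving to the next node
ceph quorum_status | jq '.quorum_names'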
 
I've restarted all nodes at least twice since the upgrade to version 6, on the other clusters too, but there's no change in monitor latency, I'm afraid.

Regards
 
