3-node cluster with a Ceph RBD backend. Note: this should not be due to Proxmox or Ceph updates, since neither changed when this issue started to occur.
I was doing scheduled monthly updates to all our devices and software when the read/write times on all of the VMs increased considerably (to over 20 ms), and I cannot pinpoint the cause.
The only updates I was able to complete were some packages for our pfSense routers (which shouldn't have anything to do with this issue), our UniFi switches, and some of the VMs, most of which I have since shut down to try to find the culprit.
Since most of the services are write heavy, I will focus on those.
What I cannot understand is why w_await (the average time, in milliseconds, for write requests issued to the device to be served, including both the time spent in the queue and the time spent servicing them) differs so much on every VM from the latency Ceph reports for the cluster.
Example average w_await from selected VMs (using iostat -x 5):
- 25 - 100 ms
- 15 - 40 ms
- 10 - 350 ms
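For reference, this is roughly how those guest-side numbers are collected. It assumes the virtual disk shows up as /dev/sda inside the guest; adjust the device name accordingly.
Code:
# Extended per-device statistics every 5 seconds; w_await is the column
# of interest ("sda" is an assumption, replace it with the guest's disk):
iostat -x 5 sda

# Watch only the w_await value over time. The column position differs
# between sysstat versions, so it is looked up from the header line:
iostat -x 5 sda | awk '/^Device/ {for (i=1; i<=NF; i++) if ($i == "w_await") c = i} $1 == "sda" {print $c}'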
However, when I check the latency on the Ceph cluster, here is what the OSDs are showing (using ceph osd perf and also the Zabbix Ceph plugin):
0 - 2 ms.
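These are the kinds of Ceph-side checks that number comes from, plus a couple of related views. The pool name "vm-pool" below is only a placeholder, and rbd perf image iostat assumes the rbd_support mgr module is available.
Code:
# Per-OSD commit/apply latency, roughly the same data the Zabbix plugin polls:
ceph osd perf

# Per-RBD-image latency/IOPS view ("vm-pool" is a placeholder pool name;
# requires the rbd_support mgr module):
rbd perf image iostat vm-pool

# Cluster-wide warnings and per-pool client I/O, in case a single OSD or
# pool is misbehaving without showing up in the averages:
ceph health detail
ceph osd pool stats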
How can I go about diagnosing high latency on the VMs, when the backend itself apparently has no latency issues at all?
Code:
Node 'VMHost2'
Day (maximum)
CPU usage 2.20% of 24 CPU(s)
IO delay 0.13%
Load average 0.33,0.52,0.53
RAM usage 20.21% (25.44 GiB of 125.87 GiB)
KSM sharing 0 B
HD space (/) 25.89% (24.32 GiB of 93.93 GiB)
SWAP usage 0.00% (0 B of 8.00 GiB)
CPU(s) 24 x Intel(R) Xeon(R) CPU X5679 @ 3.20GHz (2 Sockets)
Kernel Version Linux 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z)
PVE Manager Version pve-manager/8.0.3/bbf3993334bfa916
Repository Status Proxmox VE updates Non production-ready repository enabled!
Package versions:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-5.15: 7.4-4
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: not correctly installed
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.2
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.1
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
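To try to narrow down which layer is adding the latency, the kind of test I have in mind is a single-threaded 4 KiB direct-write run inside one of the guests, plus the same pattern directly against the pool from one of the nodes, to see whether the 0 - 2 ms the OSDs report also holds at the RADOS layer. This is just a sketch; the pool name "vm-pool" and the test file path are placeholders, not my real names.
Code:
# Inside a guest: 4 KiB random writes, queue depth 1, direct I/O, so the
# result is comparable to per-request write latency rather than throughput.
fio --name=guest-write-lat --filename=/root/fio-lat-testfile --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1 --time_based --runtime=60

# On a Proxmox node: the same pattern straight at the RADOS layer,
# bypassing QEMU/virtio entirely ("vm-pool" is a placeholder pool name).
rados -p vm-pool bench 60 write -b 4096 -t 1 --no-cleanup

# Remove the benchmark objects afterwards.
rados -p vm-pool cleanup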