LAMP VMs - PVE7 - Kernel 5.11

TwiX

Hi,

I recently upgraded a 6-node PVE cluster to PVE 7:

proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-6
pve-kernel-helper: 7.0-6
pve-kernel-5.4: 6.4-5
pve-kernel-5.0: 6.0-11
pve-kernel-5.11.22-3-pve: 5.11.22-6
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.2.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-5
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.8-1
proxmox-backup-file-restore: 2.0.8-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.2-4
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Code:
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
performance
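
The check above only looks at cpu2. A minimal sketch for summarizing the governor of every CPU on a node follows; the helper name governor_summary is hypothetical, and it assumes the standard cpufreq sysfs layout used in the command above.

```shell
#!/bin/sh
# governor_summary (hypothetical helper): count how many CPUs use each
# cpufreq scaling governor. The optional argument overrides the sysfs base
# directory, which is mainly useful for testing.
governor_summary() {
    base=${1:-/sys/devices/system/cpu}
    # One governor name per CPU; cpufreq may be absent (e.g. inside a VM),
    # in which case nothing is printed.
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

governor_summary
```

If every CPU is on the same governor, a single line with the CPU count and the governor name is printed.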

Users are reporting issues (lag, browsers frozen for about 1 min) with several LAMP VMs.
Everything looks OK inside the VMs, and resource usage on the PVE nodes looks fine as well.

I'm wondering whether the problem could be the kernel; 5.11 is the only PVE 7 kernel available.
What would happen if I booted the last 5.4.128 (PVE 6) kernel?
 
I can confirm the issue affects all LAMP servers (Debian-based) since the PVE 7 upgrade.

I don't know whether it's related to the kernel or to QEMU.
I will move a VM to PVE 6.4 to confirm.
 
Hi,

Can you post the configuration of one of the affected VMs, so we can see which specific options are set? Use: qm config VMID

Also, what exact test are you using to determine the regression, and which regression is it?
 
Hi,

Of course :

Code:
qm config 20089

agent: 1
bootdisk: scsi0
cores: 4
cpu: host
description: Dolibarr
ide2: none,media=cdrom
memory: 8192
name: dc-doli-merc02
net0: virtio=36:7C:34:B8:B1:9E,bridge=vmbr0,tag=20
numa: 1
onboot: 1
ostype: l26
scsi0: kvm_pool:vm-20089-disk-0,cache=writeback,discard=on,size=75G
scsihw: virtio-scsi-pci
smbios1: uuid=9f3a5887-716b-460e-8050-f2dd348b2bda
sockets: 2
tablet: 0

What I want to try is moving this VM to another cluster (still on v6.4 + Ceph Octopus) to see whether the issue is related to PVE 7 (kernel/QEMU/Ceph).
 
We also noticed relatively higher CPU usage and more iowait with PVE 7.0 / Ceph Pacific:

1629814807662.png
 
I now see aio=io_uring on the disks.

I guess it was threads previously.
Where can I change the aio value?
 
Hi,

I have set aio=threads on all VM disks.

I'll keep you posted.
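
For anyone wanting to make the same change, here is a minimal sketch of rewriting a disk line so that it carries aio=threads. The helper name with_aio_threads is hypothetical; the disk string is the scsi0 value of VM 20089 posted earlier in this thread. Applying the result with qm set, as in the trailing comment, is an assumption about the workflow and not something confirmed in this thread.

```shell
#!/bin/sh
# with_aio_threads (hypothetical helper): take a disk string as printed by
# `qm config` and return it with aio=threads, replacing any existing aio= option.
with_aio_threads() {
    # Drop an existing ",aio=..." option, then append aio=threads.
    printf '%s\n' "$1" | sed -e 's/,aio=[^,]*//g' -e 's/$/,aio=threads/'
}

# Example with the scsi0 line of VM 20089 from this thread:
disk='kvm_pool:vm-20089-disk-0,cache=writeback,discard=on,size=75G'
new_disk=$(with_aio_threads "$disk")
echo "$new_disk"

# Assumption: on the PVE node the result would then be applied with
#   qm set 20089 --scsi0 "$new_disk"
# and the VM most likely has to be stopped and started again before the
# new aio setting is used.
```

The helper only edits the string, so it can be checked on any machine before touching a real VM configuration.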

I also saw this :(
1629878414834.png
 
I also saw this
Note that the io_uring-related issue, which was a kernel bug, has been fixed since kernel package pve-kernel-5.11.22-3-pve version 5.11.22-6, so io_uring should no longer cause any crashes.

Note also that switching from the 5.4 to the 5.11 kernel, and from QEMU 5.2 to 6.0 with io_uring, can affect how the load is measured or where it occurs (user space vs. kernel space).

You have not yet told us how you actually measure the performance regression, or how and where the freezes show up.
In a LAMP stack, would it be the client's browser using the LAMP applications? A browser frozen for over a minute sounds more like a client issue than a (LAMP) server issue.
Is there higher latency when connecting to the HTTP servers, or lower bandwidth?
 
Hi,

Thanks for your answer.

First, aio=threads seems to fix the issue. No complaints this morning!

I was pretty confident that the issue was related to PVE 7. The complaints involved all the relatively heavily loaded LAMP servers on the upgraded cluster.
MySQL queries were incredibly slow and some transactions triggered deadlocks.
Our Zabbix monitoring of MySQL showed periods where MySQL appeared to be unavailable.

On the same hypervisor

Before PVE 7 upgrade :
1629882318645.png

After PVE 7 upgrade :
1629882207934.png


So yesterday I decided to move 2 VMs to another cluster with the same hardware that was still on PVE 6.4.
Everything seems to work as expected this morning.

I also saw that PVE 7 now uses the new io_uring engine by default, so I set aio=threads for all remaining PVE 7 VMs as well.
Only 2 issues this morning (maybe unrelated). Yesterday, it happened every 5 min on almost every LAMP VM on that cluster.

I still see somewhat higher iowait, but as you said, that may be related to how the load is measured with the new kernel/QEMU.
 

Attachment: 1629883538705.png

Do you think it is possible to boot one node with the old 5.4.128 kernel?
I just want to check whether iowait is better.
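
To compare iowait between kernels with the same method each time, a minimal sketch based on two samples of the aggregate "cpu" line in /proc/stat could look like this. The helper name iowait_pct is hypothetical, and this is only one possible way of measuring, not the one behind the graphs in this thread.

```shell
#!/bin/sh
# iowait_pct (hypothetical helper): given two aggregate "cpu" lines from
# /proc/stat, print the percentage of CPU time spent in iowait between them.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields after the "cpu" label:
        # user nice system idle iowait irq softirq steal (guest times are part of user)
        { total = 0; for (i = 2; i <= 9; i++) total += $i; t[NR] = total; w[NR] = $6 }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Sample the host twice, 1 second apart (only if /proc/stat is readable).
if [ -r /proc/stat ]; then
    first=$(grep '^cpu ' /proc/stat)
    sleep 1
    second=$(grep '^cpu ' /proc/stat)
    echo "iowait over 1s: $(iowait_pct "$first" "$second")%"
fi
```

Running the same script on a node under each kernel, with a comparable VM load, gives numbers that are at least computed the same way, even if the kernels account for the load differently.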
 
Do you think it is possible to boot on the old 5.4.128 kernel for one node ?
Do you use ZFS, and have you already upgraded any rpool? If so, the older kernel may fail to import the rpool.
But no harm done; you can just boot the 5.11 kernel again.

I'd say it's pretty safe to try booting into the older kernel, but I would keep the async handler set to aio instead of io_uring, as the 5.4 kernel's io_uring support is considerably less complete than that of 5.11.