Various kernel problems with > 6.1.2-1

quanto11

Member
Dec 11, 2021
Hello everyone,

since kernel version 6.1.x we have been seeing various stability problems in our Proxmox/Ceph cluster. We have been running Ceph since version 17.2.4, first on kernel 5.15, then 5.19, and now 6.1.10-1. Up to kernel 6.1.2-1 everything ran without problems. With 6.1.6-1 we had a complete crash with a kernel error 5-10 minutes after the host started, so we went back to 6.1.2-1. With 6.1.10-1, after about 2-3 days virtual machines start hanging independently of each other and can no longer be migrated; the logs contain numerous "bad crc in data" messages. We use KRBD because it performs significantly better.

The cluster runs entirely on the same Proxmox software version and on identical, up-to-date firmware/BIOS. Ceph runs on a 3-host setup with a 3/2 replica; only RBD is used. The load of the individual hosts is balanced so that all of them run at roughly the same level: average CPU load 3-7%, RAM 55-65%.
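
For reference, a minimal sketch of how we look for the kRBD checksum errors and pin the last known-good kernel again (standard commands; assuming proxmox-boot-tool manages the boot entries):

# search the kernel log of the current boot for the libceph/krbd checksum errors
journalctl -k -b | grep -i "bad crc"
dmesg -T | grep -i "libceph"

# list installed kernels and pin the last version that was stable for us
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.1.2-1-pve
reboot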

Some information about our system (per host):

2x AMD EPYC 7413
H12DSi-N6
256 GB RAM
4x Kioxia CM6-V (dedicated NVMe pool)
8x Toshiba MG09 (dedicated HDD pool)
1x 100 Gbit Mellanox (ConnectX-6 Dx EN, FW 22.35.1012, Ceph traffic, full mesh routed simple; see the sketch after this list)
2x 10 Gbit Mellanox (ConnectX-4 Lx, FW 14.25.0017, 1x VM traffic, 1x HA cluster traffic)
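
For context, a rough sketch of what the routed-simple full mesh for the Ceph network looks like, shown for pve1 (interface names ens19/ens20 are placeholders; this is not copied from our hosts, just the standard pattern from the Proxmox full-mesh wiki with our addresses filled in):

# /etc/network/interfaces on pve1 (10.26.15.50), one 100G port per neighbour
auto ens19
iface ens19 inet static
        address 10.26.15.50/24
        up ip route add 10.26.15.51/32 dev ens19
        down ip route del 10.26.15.51/32

auto ens20
iface ens20 inet static
        address 10.26.15.50/24
        up ip route add 10.26.15.52/32 dev ens20
        down ip route del 10.26.15.52/32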

proxmox-ve: 7.3-1 (running kernel: 6.1.10-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-6.1: 7.3-4
pve-kernel-helper: 7.3-4
pve-kernel-5.15: 7.3-2
pve-kernel-5.19: 7.2-15
pve-kernel-6.1.10-1-pve: 6.1.10-1
pve-kernel-6.1.6-1-pve: 6.1.6-1
pve-kernel-6.1.2-1-pve: 6.1.2-1
pve-kernel-6.1.0-1-pve: 6.1.0-1
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.19.7-2-pve: 5.19.7-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

Please ignore the health warning (3 pgs not deep-scrubbed in time; otherwise always OK, see the scrub note after the status output below):
cluster:
id: 8f515fd6-628a-4a4b-bca7-ad03c981189d
health: HEALTH_WARN
3 pgs not deep-scrubbed in time

services:
mon: 3 daemons, quorum pve1,pve2,pve3 (age 2d)
mgr: pve3(active, since 2d), standbys: pve2, pve1
osd: 36 osds: 36 up (since 2d), 36 in (since 10w)

data:
pools: 3 pools, 641 pgs
objects: 20.59M objects, 78 TiB
usage: 234 TiB used, 193 TiB / 428 TiB avail
pgs: 641 active+clean

io:
client: 309 KiB/s rd, 465 MiB/s wr, 19 op/s rd, 513 op/s wr
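
As mentioned, the warning is not the issue here; for reference, a minimal sketch of how the overdue PGs can be listed and kicked manually (standard Ceph commands, the PG ID is a placeholder):

ceph health detail                 # lists the PGs that are overdue for deep scrub
ceph pg deep-scrub <pgid>          # e.g. ceph pg deep-scrub 2.1f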

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.26.15.50/24
fsid = 8f515fd6-628a-4a4b-bca7-ad03c981189d
mon_allow_pool_delete = true
mon_host = 10.26.15.50 10.26.15.51 10.26.15.52
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.26.15.50/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.pve1]
public_addr = 10.26.15.50

[mon.pve2]
public_addr = 10.26.15.51

[mon.pve3]
public_addr = 10.26.15.52

If further logs are needed, we will provide them. Thanks.
 

Attachments

  • CRC State.png (645.6 KB)
  • syslog Kernel 6.1.10.1.zip (901.5 KB)
  • Syslog 6.1.6.1 Kernel Crash.txt (292.1 KB)

Another problem we have been observing since kernel 6.1.10-1:

Certain machines that previously had no problems can no longer be backed up. In addition, after the backup aborts with an error, the affected machines no longer respond and cannot be reached; they have to be stopped via the shell (see the shell sketch after the backup log below). Another point: the problem only occurs through the scheduled backup job; if the VM is backed up manually via its context menu, the backup completes without issues. Here is the message:

VMID: 107 | Name: VMNAME | Status: FAILED | Duration: 00:15:54 | Error: VM 107 qmp command 'cont' failed - unable to connect to VM 107 qmp socket - timeout after 450 retries

INFO: Starting Backup of VM 107 (qemu)
INFO: Backup started at 2023-02-15 02:27:03
INFO: status = running
INFO: VM Name: VMNAME
INFO: include disk 'virtio0' 'VMs:vm-107-disk-1' 40G
INFO: include disk 'efidisk0' 'VMs:vm-107-disk-0' 528K
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/107/2023-02-15T01:27:03Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 107 qmp command 'guest-fsfreeze-thaw' failed - got timeout
ERROR: VM 107 qmp command 'backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 107 qmp command 'backup-cancel' failed - unable to connect to VM 107 qmp socket - timeout after 5983 retries
INFO: resuming VM again
ERROR: Backup of VM 107 failed - VM 107 qmp command 'cont' failed - unable to connect to VM 107 qmp socket - timeout after 450 retries
INFO: Failed at 2023-02-15 02:42:57
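
For completeness, a rough sketch of what stopping such a hung VM from the shell can look like (VMID 107 as in the log above; take the exact commands as a sketch, the kill is only a last resort):

# check whether the guest agent still answers at all
qm guest cmd 107 ping
qm guest cmd 107 fsfreeze-status

# remove the stale backup lock and force the VM off
qm unlock 107
qm stop 107 --skiplock 1

# last resort: kill the KVM process directly
kill -9 $(cat /var/run/qemu-server/107.pid)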

agent: 1,fstrim_cloned_disks=1
balloon: 0
bios: ovmf
boot: order=virtio0;ide2;net0
cores: 4
efidisk0: VMs:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-q35-5.1
memory: 4096
meta: creation-qemu=7.0.0,ctime=1661765507
name: ca
net0: virtio=06:44:B9:C3:3D:89,bridge=vmbr1,tag=20
numa: 0
onboot: 1
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=f81f4155-c2c5-42b9-a9b3-b82216c22bda
sockets: 1
virtio0: VMs:vm-107-disk-1,discard=on,size=40G
vmgenid: f1276277-770a-4d76-b3c9-2fbc4c92232b

Feb 12 03:26:25 pve1 pvestatd[6364]: VM 107 qmp command failed - VM 107 qmp command 'query-proxmox-support' failed - unable to connect to VM 107 qmp socket - timeout after 51 retries
Feb 12 03:26:25 pve1 pvestatd[6364]: status update time (8.225 seconds)
Feb 12 03:26:27 pve1 systemd[1]: Stopping User Manager for UID 0...
Feb 12 03:26:27 pve1 systemd[826052]: Stopped target Main User Target.
Feb 12 03:26:27 pve1 systemd[826052]: Stopped target Basic System.
Feb 12 03:26:27 pve1 systemd[826052]: Stopped target Paths.
Feb 12 03:26:27 pve1 systemd[826052]: Stopped target Sockets.
Feb 12 03:26:27 pve1 systemd[826052]: Stopped target Timers.
Feb 12 03:26:27 pve1 systemd[826052]: dirmngr.socket: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Closed GnuPG network certificate management daemon.
Feb 12 03:26:27 pve1 systemd[826052]: gpg-agent-browser.socket: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Feb 12 03:26:27 pve1 systemd[826052]: gpg-agent-extra.socket: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Feb 12 03:26:27 pve1 systemd[826052]: gpg-agent-ssh.socket: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Feb 12 03:26:27 pve1 systemd[826052]: gpg-agent.socket: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Closed GnuPG cryptographic agent and passphrase cache.
Feb 12 03:26:27 pve1 systemd[826052]: Removed slice User Application Slice.
Feb 12 03:26:27 pve1 systemd[826052]: Reached target Shutdown.
Feb 12 03:26:27 pve1 systemd[826052]: systemd-exit.service: Succeeded.
Feb 12 03:26:27 pve1 systemd[826052]: Finished Exit the Session.
Feb 12 03:26:27 pve1 systemd[826052]: Reached target Exit the Session.
Feb 12 03:26:27 pve1 systemd[1]: user@0.service: Succeeded.
Feb 12 03:26:27 pve1 systemd[1]: Stopped User Manager for UID 0.
Feb 12 03:26:27 pve1 systemd[1]: Stopping User Runtime Directory /run/user/0...
Feb 12 03:26:27 pve1 systemd[1]: run-user-0.mount: Succeeded.
Feb 12 03:26:27 pve1 systemd[1]: user-runtime-dir@0.service: Succeeded.
Feb 12 03:26:27 pve1 systemd[1]: Stopped User Runtime Directory /run/user/0.
Feb 12 03:26:27 pve1 systemd[1]: Removed slice User Slice of UID 0.
Feb 12 03:26:34 pve1 pvestatd[6364]: VM 107 qmp command failed - VM 107 qmp command 'query-proxmox-support' failed - unable to connect to VM 107 qmp socket - timeout after 51 retries
Feb 12 03:26:34 pve1 pvestatd[6364]: status update time (8.239 seconds)
Feb 12 03:26:36 pve1 pve-ha-lrm[826524]: VM 107 qmp command failed - VM 107 qmp command 'query-status' failed - unable to connect to VM 107 qmp socket - timeout after 51 retries
Feb 12 03:26:36 pve1 pve-ha-lrm[826524]: VM 107 qmp command 'query-status' failed - unable to connect to VM 107 qmp socket - timeout after 51 retries#012
Feb 12 03:26:44 pve1 pvestatd[6364]: VM 107 qmp command failed - VM 107 qmp command 'query-proxmox-support' failed - unable to connect to VM 107 qmp socket - timeout after 51 retries
 
