QEMU/KVM process crashed with a segmentation fault on CPU 7 | Backup Failure.

SamerXtn (New Member, Italy)
Hello,

I have a nightly backup of a VM with two disks totaling 2.3 TB. It starts at 00:00 and takes around 5 hours to complete, and it had been running normally. Last night the backup failed; when I went to check, I found the VM down. Checking the logs, I found:
Code:
root@proxmox:~# journalctl --since "2025-02-04 03:00:00" --until "2025-02-04 06:00:00" | grep -i "error\|warning\|fail\|oom\|critical"
Feb 04 03:45:24 proxmox pvestatd[2491]: VM 111 qmp command failed - VM 111 qmp command 'query-proxmox-support' failed - unable to connect to VM 111 qmp socket - timeout after 51 retries
Feb 04 05:03:22 proxmox kernel: kvm[4507]: segfault at 4 ip 000065551de7e597 sp 000077d45d0d0f08 error 6 in qemu-system-x86_64[65551dd82000+6a4000] likely on CPU 7 (core 3, socket 0)
Feb 04 05:03:23 proxmox pvescheduler[1487087]: VM 111 qmp command failed - VM 111 not running
Feb 04 05:03:23 proxmox pvescheduler[1487087]: VM 111 qmp command failed - VM 111 not running
Feb 04 05:03:23 proxmox pvescheduler[1487087]: VM 111 qmp command failed - VM 111 not running
Feb 04 05:03:29 proxmox pvescheduler[1487087]: ERROR: Backup of VM 111 failed - VM 111 not running
Feb 04 05:03:29 proxmox pvescheduler[1487087]: INFO: Backup job finished with errors
Feb 04 05:03:29 proxmox pvescheduler[1487087]: job errors
root@proxmox:~#

root@proxmox:~# dmesg -T | grep "2025-02-04"
root@proxmox:~# ls -la /var/crash/
ls: cannot access '/var/crash/': No such file or directory
root@proxmox:~# journalctl -k --since "2025-02-04 03:00:00" --until "2025-02-04 06:00:00"
Feb 04 05:02:32 proxmox kernel: usb 2-3.1: reset SuperSpeed USB device number 3 using xhci_hcd
Feb 04 05:03:22 proxmox kernel: kvm[4507]: segfault at 4 ip 000065551de7e597 sp 000077d45d0d0f08 error 6 in qemu-system-x86_64[65551dd82000+6a4000] likely o>
Feb 04 05:03:22 proxmox kernel: Code: 8d 3d ad ee 60 00 e8 f8 40 f0 ff 0f 1f 84 00 00 00 00 00 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 8b 87 b8 0>
Feb 04 05:03:22 proxmox kernel: fwbr111i0: port 2(tap111i0) entered disabled state
Feb 04 05:03:22 proxmox kernel: tap111i0 (unregistering): left allmulticast mode
Feb 04 05:03:22 proxmox kernel: fwbr111i0: port 2(tap111i0) entered disabled state
Feb 04 05:03:23 proxmox kernel: fwbr111i0: port 1(fwln111i0) entered disabled state
Feb 04 05:03:23 proxmox kernel: vmbr0: port 3(fwpr111p0) entered disabled state
Feb 04 05:03:23 proxmox kernel: fwln111i0 (unregistering): left allmulticast mode
Feb 04 05:03:23 proxmox kernel: fwln111i0 (unregistering): left promiscuous mode
Feb 04 05:03:23 proxmox kernel: fwbr111i0: port 1(fwln111i0) entered disabled state
Feb 04 05:03:23 proxmox kernel: fwpr111p0 (unregistering): left allmulticast mode
Feb 04 05:03:23 proxmox kernel: fwpr111p0 (unregistering): left promiscuous mode
Feb 04 05:03:23 proxmox kernel: vmbr0: port 3(fwpr111p0) entered disabled state

As you can see, the critical event is:
Code:
Feb 04 05:03:22 proxmox kernel: kvm[4507]: segfault at 4 ip 000065551de7e597 sp 000077d45d0d0f08 error 6 in qemu-system-x86_64[65551dd82000+6a4000] likely on CPU 7 (core 3, socket 0)

This caused the VM's network interfaces to shut down:
  • tap111i0 disabled
  • fwbr111i0 ports disabled
  • vmbr0 port 3 disabled
and the backup job then failed because the VM was no longer running.

VM config:

Code:
/etc/pve/qemu-server/111.conf
agent: enabled=1
bios: ovmf
boot: order=sata0;ide0
cores: 6
cpu: x86-64-v2-AES
efidisk0: local-lvm:vm-111-disk-2,efitype=4m,pre-enrolled-keys=1,size=4M
ide0: local:iso/Virtio-Win-0.1.262_Drivers_08-2024.iso,media=cdrom,size=708140K
ide2: none,media=cdrom
memory: 32768
meta: creation-qemu=9.0.2,ctime=1734169526
name: WindowsServer
net0: virtio=BC:24:11:04:7D:F3,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-111-disk-0,size=487808M
sata1: local-lvm:vm-111-disk-1,size=1907200M
scsihw: virtio-scsi-single
smbios1: uuid=adea1593-fafd-4281-ac54-322f7eed920d
sockets: 1
usb0: host=2-3.1
vmgenid: c4cd6be6-36bd-4315-bb97-610eb0e6706d
 
Hello,
there is no common reason for qemu-system-x86_64 to segfault. Most likely some piece of your hardware is becoming faulty and should be replaced. Start by checking the memory with a program like memtest86.
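If you want to try that, a minimal sketch on a Debian-based PVE node could look like the following (assuming the node boots via GRUB; on UEFI systems it may be simpler to boot a memtest86+/MemTest86 USB stick instead):
Code:
# Install memtest86+ from the Debian repositories and add it to the boot menu
apt install memtest86+
update-grub

# Reboot, select the memtest86+ entry in GRUB and let it run for a few full passes.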
 
This can either be a hardware issue (CPU/RAM/mainboard/disk ;)) or a bug in QEMU. Could you also post the output of "pveversion -v"? How long was the VM running before it crashed?
 
Hi,
to further debug the issue, please run apt install pve-qemu-kvm-dbgsym gdb systemd-coredump libproxmox-backup-qemu0-dbgsym. The next time a crash happens, you can run coredumpctl -1 gdb and then, at the GDB prompt, thread apply all backtrace. This will obtain a backtrace of the crash.
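For reference, the same steps as shell commands (taken directly from the instructions above; a core dump is only available once systemd-coredump has caught a crash):
Code:
# Install debug symbols, GDB and the core dump handler
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump libproxmox-backup-qemu0-dbgsym

# After the next crash, open the most recent core dump in GDB
coredumpctl -1 gdb

# Then, at the (gdb) prompt, collect a backtrace of all threads
thread apply all backtrace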
 
This can either be a hardware issue (CPU/RAM/mainboard/disk ;)) or a bug in QEMU. Could you also post the output of "pveversion -v"? How long was the VM running before it crashed?


Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.0 (running version: 8.3.0/c1689ccb1065a83b)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.2.9
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.1
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.1
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.0
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
 
I noticed that there's a USB reset just before the crash.

That's the USB disk where the backup is stored.


Code:
Feb 04 05:02:32 proxmox kernel: usb 2-3.1: reset SuperSpeed USB device number 3 using xhci_hcd
Feb 04 05:03:22 proxmox kernel: kvm[4507]: segfault at 4 ip 000065551de7e597 sp 000077d45d0d0f08 error 6 in qemu-system-x86_64[65551dd82000+6a4000] likely o>

Could this be the reason for the crash?
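One way to narrow this down could be to check whether the USB disk has been resetting regularly (not only right before this crash) and to look at its SMART data. A sketch (2-3.1 is taken from the log above; /dev/sdX is a placeholder for the actual backup disk):
Code:
# Look for earlier resets of the same USB device
journalctl -k --since "-30d" | grep "usb 2-3.1"

# Check SMART health of the backup disk (replace /dev/sdX with the real device;
# some USB bridges need an extra option such as -d sat)
smartctl -a /dev/sdX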
 
That does make it slightly more likely that it is actually a bug in QEMU. If you run into it again, the backtrace as described by fiona will hopefully tell us more!
 
I stumbled upon a similar issue today in my homelab; maybe this information helps somehow.

From PBS backup job, when the VM stopped:
INFO: Starting Backup of VM 112 (qemu)
INFO: Backup started at 2025-02-22 00:00:16
INFO: status = running

INFO: VM Name: prometheus-03
INFO: include disk 'virtio0' 'zfs-02:vm-112-disk-1' 32G
INFO: include disk 'virtio1' 'zfs-02:vm-112-disk-2' 250G
INFO: include disk 'efidisk0' 'zfs-02:vm-112-disk-0' 128K
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/112/2025-02-21T23:00:16Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 112 not running

ERROR: VM 112 qmp command 'backup' failed - client closed connection
INFO: aborting backup job
ERROR: VM 112 not running
INFO: resuming VM again
ERROR: Backup of VM 112 failed - VM 112 not running
INFO: Failed at 2025-02-22 00:00:17
INFO: Backup job finished with errors
TASK ERROR: job errors

So it seems like the VM stopped around fs-freeze/fs-thaw(?).
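If it really is related to the freeze/thaw path, it might be possible to trigger it without a backup running. A sketch, assuming the guest agent is responding and that the fsfreeze-* command names of qm guest cmd apply here (freezing blocks all writes in the guest, so thaw immediately afterwards):
Code:
# Reproduce the freeze/thaw sequence the backup performs
qm guest cmd 112 fsfreeze-freeze
qm guest cmd 112 fsfreeze-status
qm guest cmd 112 fsfreeze-thaw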

Also from PBS log for VM 112:
2025-02-22T00:00:17+01:00: download 'index.json.blob' from previous backup.
2025-02-22T00:00:17+01:00: register chunks in 'drive-efidisk0.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: download 'drive-efidisk0.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: created new fixed index 1 ("vm/112/2025-02-21T23:00:16Z/drive-efidisk0.img.fidx")
2025-02-22T00:00:17+01:00: register chunks in 'drive-virtio0.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: download 'drive-virtio0.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: created new fixed index 2 ("vm/112/2025-02-21T23:00:16Z/drive-virtio0.img.fidx")
2025-02-22T00:00:17+01:00: register chunks in 'drive-virtio1.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: download 'drive-virtio1.img.fidx' from previous backup.
2025-02-22T00:00:17+01:00: created new fixed index 3 ("vm/112/2025-02-21T23:00:16Z/drive-virtio1.img.fidx")
2025-02-22T00:00:17+01:00: add blob "/mnt/datastore/zbackup-01/vm/112/2025-02-21T23:00:16Z/qemu-server.conf.blob" (669 bytes, comp: 669)
2025-02-22T00:00:17+01:00: backup ended and finish failed: backup ended but finished flag is not set.
2025-02-22T00:00:17+01:00: removing unfinished backup
2025-02-22T00:00:17+01:00: removing backup snapshot "/mnt/datastore/zbackup-01/vm/112/2025-02-21T23:00:16Z"
2025-02-22T00:00:17+01:00: TASK ERROR: backup ended but finished flag is not set.

From journalctl on the PVE node:
2025-02-22T00:00:16+0100 pve-53 pvescheduler[2512943]: INFO: Starting Backup of VM 112 (qemu)
2025-02-22T00:00:17+0100 pve-53 promtail[1389]: ts=2025-02-21T23:00:17.20644884Z caller=log.go:168 level=info msg="Re-opening truncated file /var/log/vzdump/qemu-112.log ..."
2025-02-22T00:00:17+0100 pve-53 promtail[1389]: ts=2025-02-21T23:00:17.206481691Z caller=log.go:168 level=info msg="Successfully reopened truncated /var/log/vzdump/qemu-112.log"
2025-02-22T00:00:17+0100 pve-53 kernel: kvm[707201]: segfault at 78 ip 00005a4858930914 sp 00007fff3db62450 error 6 in qemu-system-x86_64[5a485844a000+6a4000] likely on CPU 6 (core 12, socket 0)
2025-02-22T00:00:17+0100 pve-53 kernel: Code: 00 00 00 00 66 90 83 47 78 01 31 ff c3 66 0f 1f 84 00 00 00 00 00 41 54 55 53 48 89 fb e8 64 9e d7 ff 84 c0 0f 84 eb 00 00 00 <83> 6b 78 01 0f 85 d2 00 00 0>

2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: VM 112 qmp command failed - VM 112 qmp command 'backup' failed - client closed connection
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: VM 112 qmp command failed - VM 112 not running
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: VM 112 qmp command failed - VM 112 not running
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: VM 112 qmp command failed - VM 112 not running
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: ERROR: Backup of VM 112 failed - VM 112 not running
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: INFO: Backup job finished with errors
2025-02-22T00:00:17+0100 pve-53 kernel: zd224: p1 p14 p15 p16
2025-02-22T00:00:17+0100 pve-53 perl[2512943]: skipping disabled matcher 'default-matcher'
2025-02-22T00:00:17+0100 pve-53 qmeventd[1404]: read: Connection reset by peer
2025-02-22T00:00:17+0100 pve-53 kernel: vmbr0: port 4(tap112i0) entered disabled state
2025-02-22T00:00:17+0100 pve-53 kernel: tap112i0 (unregistering): left allmulticast mode
2025-02-22T00:00:17+0100 pve-53 kernel: vmbr0: port 4(tap112i0) entered disabled state
2025-02-22T00:00:17+0100 pve-53 kernel: zd240: p1 p9
2025-02-22T00:00:17+0100 pve-53 systemd[1]: 112.scope: Deactivated successfully.
2025-02-22T00:00:17+0100 pve-53 systemd[1]: 112.scope: Consumed 14h 50min 21.198s CPU time.
2025-02-22T00:00:17+0100 pve-53 perl[2512943]: notified via target `esod-discord-backup`
2025-02-22T00:00:17+0100 pve-53 pvescheduler[2512943]: job errors
2025-02-22T00:00:17+0100 pve-53 qmeventd[2513508]: Starting cleanup for 112
2025-02-22T00:00:17+0100 pve-53 qmeventd[2513508]: Finished cleanup for 112

root@pve-53:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.1 (running version: 8.3.1/fb48e850ef9dde27)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.2
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

VM conf:
agent: 1
bios: ovmf
boot: order=virtio0
cicustom: vendor=local:snippets/vendor.yaml
cipassword: REDACTED
ciuser: esod
cores: 2
cpu: host
efidisk0: zfs-02:vm-112-disk-0,pre-enrolled-keys=0,size=128K
ipconfig0: ip=dhcp
machine: q35
memory: 10240
meta: creation-qemu=9.0.2,ctime=1734377987
name: prometheus-03
net0: virtio=BC:24:11:A1:4C:A3,bridge=vmbr0,tag=11
numa: 0
ostype: l26
scsi1: zfs-02:vm-112-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=453a9c67-cffe-4959-81c1-0c99698b41cb
sockets: 2
sshkeys: REDACTED
vga: serial0
virtio0: zfs-02:vm-112-disk-1,discard=on,size=32G
virtio1: zfs-02:vm-112-disk-2,discard=on,iothread=1,size=250G
vmgenid: dab51f8c-705b-4f7d-b589-31526656d6e6

VM uptime was ~3.9 weeks based on the latest report from node_exporter.

I guess the main issue is the segfault (kernel/kvm/qemu/hw).

PS: I recall playing around with ZFS unmap/trim/discard yesterday (successfully), e.g. running zpool trim within the guest OS (the guest also uses ZFS), and also enabling "thin provisioning" on zfs-02 (the PVE datastore), including removing the previous refreservation on the VMs to decrease used space. I think I also cancelled a storage migration job between two local PVE datastores (ZFS) yesterday, at around 50% of the migration. The VM still ran fine until the backup job started a bit later, and it has been offline since (not critical to me, as this is a "test environment" separate from my production environment).
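For context, the operations described above would look roughly like this (a sketch only; the pool and dataset names are guesses based on the storage/disk names above, adjust to the actual layout):
Code:
# Inside the guest (which also uses ZFS): trim the guest pool
zpool trim rpool

# On the PVE host: drop the refreservation of a zvol to thin-provision it
zfs set refreservation=none zfs-02/vm-112-disk-1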

I only have two other VMs on the same PVE node; they seem to be running just fine, and the PBS backups for those VMs (same host) ran successfully. Only this one VM crashed.
 
Hi,
Please try to obtain a backtrace:
Hi,
to further debug the issue, please run apt install pve-qemu-kvm-dbgsym gdb systemd-coredump libproxmox-backup-qemu0-dbgsym. The next time a crash happens, you can run coredumpctl -1 gdb and then, at the GDB prompt, thread apply all backtrace. This will obtain a backtrace of the crash.
 
I have got a similar log:

Code:
Jun 23 13:24:08 pve pvestatd[1599]: VM 109 qmp command failed - VM 109 qmp command 'query-proxmox-support' failed - client closed connection
Jun 23 13:24:08 pve kernel: kvm[718259]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007ffdc5cac108 error 14 in qemu-system-x86_64[56cb0c556000+335000] likely on CPU 12 (core 25, socket 0)
Jun 23 13:24:08 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:08 pve systemd[1]: 109.scope: Deactivated successfully.

All the VMs on this node crashed with basically the same log, one after another within a 2-second window. This node is backing up to PBS, and I was running some restore tests from PBS on the PVE node, creating a different VM. I will investigate further as time allows.
 
Hi,
How many VMs are we talking about? If all VMs crash at the same time, it's much more likely a kernel or hardware issue. Make sure you have the latest BIOS updates and CPU microcode installed: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

Could you share the full system logs surrounding the time of the issue?

What kernel version are you using? Note that there is also a 6.14 opt-in kernel: https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.164497/
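To answer those questions, something along these lines could be run on the node (a sketch; the proxmox-kernel-6.14 package name is assumed from the proxmox-kernel-6.x naming seen in the pveversion output above, see the linked announcement for the exact instructions):
Code:
# Running kernel version
uname -r

# Currently loaded CPU microcode revision
grep -m1 microcode /proc/cpuinfo

# Optionally opt in to the newer kernel (package name assumed, see announcement)
apt install proxmox-kernel-6.14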
 
Hi,

We are talking about 10 VMs, fortunately in a test environment (for my PBS restore speedup tests).
Code:
Jun 23 13:01:19 pve sshd[2237098]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 23 13:03:20 pve qmrestore[2237894]: <root@pam> starting task UPID:pve:002225C7:02236063:68593478:qmrestore:111:root@pam:
Jun 23 13:05:01 pve CRON[2238731]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 23 13:05:01 pve CRON[2238732]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 23 13:05:01 pve CRON[2238731]: pam_unix(cron:session): session closed for user root
Jun 23 13:07:36 pve qmrestore[2237894]: <root@pam> end task UPID:cmis-pve:002225C7:02236063:68593478:qmrestore:111:root@pam: OK
Jun 23 13:08:13 pve qmrestore[2240486]: <root@pam> starting task UPID:cmis-pve:00222FE7:0223D2BC:6859359D:qmrestore:111:root@pam:
Jun 23 13:08:27 pve pvestatd[1599]: auth key pair too old, rotating..
Jun 23 13:12:45 pve qmrestore[2240486]: <root@pam> end task UPID:cmis-pve:00222FE7:0223D2BC:6859359D:qmrestore:111:root@pam: OK
Jun 23 13:14:28 pve pvedaemon[802195]: <root@pam> successful auth for user 'root@pam'
Jun 23 13:15:01 pve CRON[2244749]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 23 13:15:01 pve CRON[2244750]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 23 13:15:01 pve CRON[2244749]: pam_unix(cron:session): session closed for user root
Jun 23 13:17:01 pve CRON[2245512]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 23 13:17:01 pve CRON[2245513]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 23 13:17:01 pve CRON[2245512]: pam_unix(cron:session): session closed for user root
Jun 23 13:17:50 pve qmrestore[2245821]: <root@pam> starting task UPID:cmis-pve:002244CF:0224B3FC:685937DE:qmrestore:111:root@pam:
Jun 23 13:21:30 pve sshd[2247433]: Accepted publickey for root from 10.212.134.200 port 56184 ssh2: <REDACTED>
Jun 23 13:21:30 pve sshd[2247433]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 23 13:21:30 pve systemd-logind[1402]: New session 729 of user root.
░░ Subject: A new session 729 has been created for user root
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ Documentation: sd-login(3)
░░
░░ A new session with the ID 729 has been created for the user root.
░░
░░ The leading process of the session is 2247433.
Jun 23 13:21:30 pve systemd[1]: Started session-729.scope - Session 729 of User root.
░░ Subject: A start job for unit session-729.scope has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit session-729.scope has finished successfully.
░░
░░ The job identifier is 7327.
Jun 23 13:21:30 pve sshd[2247433]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 23 13:21:31 pve sshd[2247433]: Received disconnect from 10.212.134.200 port 56184:11: disconnected by user
Jun 23 13:21:31 pve sshd[2247433]: Disconnected from user root 10.212.134.200 port 56184
Jun 23 13:21:31 pve sshd[2247433]: pam_unix(sshd:session): session closed for user root
Jun 23 13:21:31 pve systemd-logind[1402]: Session 729 logged out. Waiting for processes to exit.
Jun 23 13:21:31 pve systemd[1]: session-729.scope: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit session-729.scope has successfully entered the 'dead' state.
Jun 23 13:21:31 pve systemd-logind[1402]: Removed session 729.
░░ Subject: Session 729 has been terminated
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ Documentation: sd-login(3)
░░
░░ A session with the ID 729 has been terminated.
Jun 23 13:21:57 pve qmrestore[2245821]: <root@pam> end task UPID:cmis-pve:002244CF:0224B3FC:685937DE:qmrestore:111:root@pam: OK
Jun 23 13:24:03 pve qmrestore[2248333]: <root@pam> starting task UPID:cmis-pve:00224E8E:022545DE:68593953:qmrestore:111:root@pam:
Jun 23 13:24:07 pve kernel: kvm[718191]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007fffaff921c8 error 14 in qemu-system-x86_64[6440937cc000+335000] likely on CPU 26 (core 11, socket 0)
Jun 23 13:24:07 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:07 pve pvestatd[1599]: VM 103 qmp command failed - VM 103 qmp command 'query-proxmox-support' failed - client closed connection
Jun 23 13:24:07 pve kernel: kvm[717516]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007ffc2983af58 error 14 in qemu-system-x86_64[56a2a83bd000+335000] likely on CPU 4 (core 2, socket 0)
Jun 23 13:24:07 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:07 pve systemd[1]: 103.scope: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 103.scope has successfully entered the 'dead' state.
Jun 23 13:24:07 pve systemd[1]: 103.scope: Consumed 8h 53min 34.624s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 103.scope completed and consumed the indicated resources.
Jun 23 13:24:07 pve pvestatd[1599]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - client closed connection
Jun 23 13:24:07 pve kernel: kvm[718534]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007ffcec66fe18 error 14 in qemu-system-x86_64[62ca38ba5000+335000] likely on CPU 8 (core 9, socket 0)
Jun 23 13:24:07 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:07 pve systemd[1]: 102.scope: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 102.scope has successfully entered the 'dead' state.
Jun 23 13:24:07 pve systemd[1]: 102.scope: Consumed 8h 56min 8.654s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 102.scope completed and consumed the indicated resources.
Jun 23 13:24:07 pve pvestatd[1599]: VM 108 qmp command failed - VM 108 qmp command 'query-proxmox-support' failed - client closed connection
Jun 23 13:24:07 pve kernel: kvm[718459]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007ffe6557f028 error 14 in qemu-system-x86_64[630a75fdb000+335000] likely on CPU 8 (core 9, socket 0)
Jun 23 13:24:07 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:07 pve systemd[1]: 108.scope: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 108.scope has successfully entered the 'dead' state.
Jun 23 13:24:07 pve systemd[1]: 108.scope: Consumed 8h 47min 28.120s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 108.scope completed and consumed the indicated resources.
Jun 23 13:24:07 pve pvestatd[1599]: VM 107 qmp command failed - VM 107 qmp command 'query-proxmox-support' failed - client closed connection
Jun 23 13:24:07 pve kernel: kvm[703322]: segfault at b3ad0 ip 00000000000b3ad0 sp 00007ffdcd59d288 error 14 in qemu-system-x86_64[57634bc75000+335000] likely on CPU 24 (core 9, socket 0)
Jun 23 13:24:07 pve kernel: Code: Unable to access opcode bytes at 0xb3aa6.
Jun 23 13:24:07 pve systemd[1]: 107.scope: Deactivated successfully.
...

I am using kernel 6.8.12-11-pve (Linux cmis-pve), and intel-microcode 3.20250512.1~deb12u1 should be in use.
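To confirm that this microcode is actually applied at boot, the kernel log can be checked (a sketch):
Code:
# The kernel logs the microcode revision it loads at boot
journalctl -k -b | grep -i microcode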
 
Please verify that it doesn't happen without your patches!