segfault in backup process and VM shutting down during backup

Jul 6, 2024
Hello,
I am having a strange issue with my backup tasks. Each time a scheduled backup runs, there is a chance everything goes haywire.
On this server, on-premise-2, the task completed successfully, but afterwards all status icons show question marks. dmesg has:
Code:
[854411.622392] apps.plugin[2061063]: segfault at 1 ip 00005cc2d94b6ba0 sp 00007fff3d0cb948 error 6 in apps.plugin[5cc2d94b6000+8e000] likely on CPU 17 (core 1, socket 0)
[854411.622427] Code: Unable to access opcode bytes at 0x5cc2d94b6b76.
[856133.168067] pvestatd[1389]: segfault at c ip 0000770e542d1000 sp 00007ffe5f8fd7e8 error 4 in libcrypto.so.3[770e542c5000+27c000] likely on CPU 17 (core 1, socket 0)
[856133.168085] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <8b> 04 25 0c 00 00 00 0f 0b 8b 04 25 0c 00 00 00 0f 0b 31 c0 31 d2
[856532.497543] pve-firewall[1386]: segfault at 0 ip 00005f8470aa1320 sp 00007ffd5fa80bc8 error 6 in perl[5f84709f4000+195000] likely on CPU 3 (core 3, socket 0)
[856532.497565] Code: Unable to access opcode bytes at 0x5f8470aa12f6.
(attached screenshot: strange.png)
Running systemctl restart pvestatd fixes that.
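The manual workaround can be sketched as a small check, not an official tool: scan the kernel log for the segfault lines quoted above and restart pvestatd if one of the affected daemons (or apps.plugin, which appears to come from a monitoring agent) crashed.

```shell
# Sketch of the manual workaround: if one of the daemons seen segfaulting
# above has crashed since boot, restart pvestatd so the status icons recover.
if dmesg | grep -qE '(pvestatd|pve-firewall|apps\.plugin)\[[0-9]+\]: segfault'; then
    systemctl restart pvestatd
fi
```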

On another server, the backup task was shown as terminated in the dashboard because the VM was not running. However, I didn't stop the VM myself: running the backup task is what stops it. I tried disabling the QEMU guest agent, but the issue remains.
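For reference, disabling the guest agent in the VM config can be done with the standard qm CLI (a sketch; 101 is the VMID from the log below):

```shell
# Turn the QEMU guest agent option off for VM 101, then verify the config.
qm set 101 --agent 0
qm config 101 | grep agent
```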

Here is the full log:
Code:
INFO: starting new backup job: vzdump 101 --fleecing '1,storage=local' --quiet 1 --mode snapshot --notes-template '{{guestname}}' --storage backup-ssd --prune-backups 'keep-last=7'
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2025-05-11 02:56:03
INFO: status = running
INFO: VM Name: legacy-data
INFO: include disk 'scsi0' 'local:101/vm-101-disk-0.qcow2' 3547088M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: special config section found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/101/2025-05-11T00:56:03Z'
INFO: removing (old) fleecing image 'local:101/vm-101-fleece-0.qcow2'
Formatting '/var/lib/vz/images/101/vm-101-fleece-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=3719391346688 lazy_refcounts=off refcount_bits=16
INFO: drive-scsi0: attaching fleecing image local:101/vm-101-fleece-0.qcow2 to QEMU
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: started backup task '0716fd38-2ca0-437c-bbb9-be3b731c59a7'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:   0% (1012.0 MiB of 3.4 TiB) in 3s, read: 337.3 MiB/s, write: 300.0 MiB/s
INFO:   1% (34.8 GiB of 3.4 TiB) in 3m 4s, read: 191.4 MiB/s, write: 189.5 MiB/s
INFO:   2% (69.4 GiB of 3.4 TiB) in 4m 50s, read: 333.8 MiB/s, write: 155.7 MiB/s
INFO:   3% (105.1 GiB of 3.4 TiB) in 6m 57s, read: 288.2 MiB/s, write: 163.0 MiB/s
INFO:   4% (138.7 GiB of 3.4 TiB) in 8m 18s, read: 424.4 MiB/s, write: 141.8 MiB/s
INFO:   5% (173.6 GiB of 3.4 TiB) in 10m 32s, read: 266.5 MiB/s, write: 142.9 MiB/s
INFO:   6% (207.9 GiB of 3.4 TiB) in 11m 51s, read: 445.8 MiB/s, write: 189.4 MiB/s
INFO:   7% (242.5 GiB of 3.4 TiB) in 12m 57s, read: 536.1 MiB/s, write: 180.5 MiB/s
INFO:   8% (277.5 GiB of 3.4 TiB) in 14m 32s, read: 376.9 MiB/s, write: 147.0 MiB/s
INFO:   9% (311.8 GiB of 3.4 TiB) in 18m 48s, read: 137.2 MiB/s, write: 125.2 MiB/s
INFO:  10% (346.4 GiB of 3.4 TiB) in 23m 10s, read: 135.5 MiB/s, write: 135.0 MiB/s
INFO:  11% (381.1 GiB of 3.4 TiB) in 26m 52s, read: 159.7 MiB/s, write: 156.5 MiB/s
INFO:  12% (415.7 GiB of 3.4 TiB) in 30m 48s, read: 150.4 MiB/s, write: 150.0 MiB/s
INFO:  13% (450.5 GiB of 3.4 TiB) in 34m 46s, read: 149.5 MiB/s, write: 148.8 MiB/s
INFO:  14% (485.0 GiB of 3.4 TiB) in 38m 28s, read: 159.3 MiB/s, write: 158.7 MiB/s
INFO:  15% (519.9 GiB of 3.4 TiB) in 42m 30s, read: 147.7 MiB/s, write: 140.8 MiB/s
INFO:  16% (554.6 GiB of 3.4 TiB) in 43m 45s, read: 473.6 MiB/s, write: 174.0 MiB/s
INFO:  17% (589.0 GiB of 3.4 TiB) in 45m 15s, read: 391.4 MiB/s, write: 177.6 MiB/s
INFO:  18% (623.5 GiB of 3.4 TiB) in 49m 7s, read: 152.5 MiB/s, write: 147.8 MiB/s
INFO:  19% (659.1 GiB of 3.4 TiB) in 50m 15s, read: 535.2 MiB/s, write: 198.1 MiB/s
INFO:  20% (692.8 GiB of 3.4 TiB) in 52m 17s, read: 283.2 MiB/s, write: 167.9 MiB/s
INFO:  21% (727.5 GiB of 3.4 TiB) in 55m 40s, read: 174.9 MiB/s, write: 143.1 MiB/s
INFO:  22% (763.0 GiB of 3.4 TiB) in 57m 41s, read: 300.8 MiB/s, write: 188.1 MiB/s
INFO:  23% (796.8 GiB of 3.4 TiB) in 59m 41s, read: 287.9 MiB/s, write: 161.0 MiB/s
INFO:  24% (831.6 GiB of 3.4 TiB) in 1h 3m 47s, read: 144.7 MiB/s, write: 138.7 MiB/s
INFO:  25% (866.1 GiB of 3.4 TiB) in 1h 8m 17s, read: 131.0 MiB/s, write: 130.5 MiB/s
INFO:  26% (900.7 GiB of 3.4 TiB) in 1h 12m 4s, read: 156.0 MiB/s, write: 155.5 MiB/s
INFO:  27% (935.3 GiB of 3.4 TiB) in 1h 16m 18s, read: 139.6 MiB/s, write: 139.0 MiB/s
INFO:  28% (970.0 GiB of 3.4 TiB) in 1h 20m 23s, read: 145.0 MiB/s, write: 144.8 MiB/s
INFO:  29% (1004.6 GiB of 3.4 TiB) in 1h 24m 20s, read: 149.8 MiB/s, write: 149.2 MiB/s
INFO:  30% (1.0 TiB of 3.4 TiB) in 1h 29m, read: 126.6 MiB/s, write: 126.5 MiB/s
INFO:  31% (1.0 TiB of 3.4 TiB) in 1h 33m 45s, read: 124.4 MiB/s, write: 124.2 MiB/s
INFO:  32% (1.1 TiB of 3.4 TiB) in 1h 37m 32s, read: 156.2 MiB/s, write: 155.9 MiB/s
INFO:  33% (1.1 TiB of 3.4 TiB) in 1h 41m 33s, read: 147.2 MiB/s, write: 147.2 MiB/s
INFO:  34% (1.2 TiB of 3.4 TiB) in 1h 45m 30s, read: 150.2 MiB/s, write: 149.9 MiB/s
INFO:  35% (1.2 TiB of 3.4 TiB) in 1h 49m 18s, read: 155.5 MiB/s, write: 155.2 MiB/s
INFO:  36% (1.2 TiB of 3.4 TiB) in 1h 54m 8s, read: 121.8 MiB/s, write: 121.5 MiB/s
INFO:  37% (1.3 TiB of 3.4 TiB) in 1h 58m 4s, read: 150.6 MiB/s, write: 139.2 MiB/s
INFO:  38% (1.3 TiB of 3.4 TiB) in 2h 2m 4s, read: 147.4 MiB/s, write: 132.5 MiB/s
INFO:  39% (1.3 TiB of 3.4 TiB) in 2h 5m 16s, read: 185.0 MiB/s, write: 138.9 MiB/s
INFO:  40% (1.4 TiB of 3.4 TiB) in 2h 8m 31s, read: 182.0 MiB/s, write: 137.5 MiB/s
INFO:  41% (1.4 TiB of 3.4 TiB) in 2h 11m 17s, read: 218.8 MiB/s, write: 140.9 MiB/s
INFO:  42% (1.4 TiB of 3.4 TiB) in 2h 12m 4s, read: 748.2 MiB/s, write: 310.0 MiB/s
INFO:  43% (1.5 TiB of 3.4 TiB) in 2h 13m 10s, read: 539.6 MiB/s, write: 176.4 MiB/s
INFO:  44% (1.5 TiB of 3.4 TiB) in 2h 15m 6s, read: 300.0 MiB/s, write: 180.8 MiB/s
INFO:  45% (1.5 TiB of 3.4 TiB) in 2h 17m 20s, read: 264.1 MiB/s, write: 146.6 MiB/s
INFO:  46% (1.6 TiB of 3.4 TiB) in 2h 21m 42s, read: 135.4 MiB/s, write: 135.3 MiB/s
INFO:  47% (1.6 TiB of 3.4 TiB) in 2h 26m 9s, read: 133.0 MiB/s, write: 132.9 MiB/s
INFO:  48% (1.6 TiB of 3.4 TiB) in 2h 28m 37s, read: 239.5 MiB/s, write: 135.8 MiB/s
INFO:  49% (1.7 TiB of 3.4 TiB) in 2h 32m 39s, read: 146.4 MiB/s, write: 146.3 MiB/s
INFO:  50% (1.7 TiB of 3.4 TiB) in 2h 37m 30s, read: 122.0 MiB/s, write: 121.9 MiB/s
INFO:  51% (1.7 TiB of 3.4 TiB) in 2h 42m, read: 131.7 MiB/s, write: 131.6 MiB/s
INFO:  52% (1.8 TiB of 3.4 TiB) in 2h 46m 26s, read: 133.0 MiB/s, write: 132.7 MiB/s
INFO:  53% (1.8 TiB of 3.4 TiB) in 2h 48m 9s, read: 345.0 MiB/s, write: 150.6 MiB/s
INFO:  54% (1.8 TiB of 3.4 TiB) in 2h 52m 2s, read: 152.2 MiB/s, write: 141.0 MiB/s
INFO:  55% (1.9 TiB of 3.4 TiB) in 2h 56m 17s, read: 139.0 MiB/s, write: 131.2 MiB/s
INFO:  56% (1.9 TiB of 3.4 TiB) in 2h 58m 11s, read: 319.9 MiB/s, write: 160.2 MiB/s
INFO:  57% (1.9 TiB of 3.4 TiB) in 2h 59m 51s, read: 358.9 MiB/s, write: 159.9 MiB/s
INFO:  58% (2.0 TiB of 3.4 TiB) in 3h 49s, read: 587.9 MiB/s, write: 170.2 MiB/s
INFO:  59% (2.0 TiB of 3.4 TiB) in 3h 2m 50s, read: 292.5 MiB/s, write: 162.2 MiB/s
INFO:  60% (2.0 TiB of 3.4 TiB) in 3h 4m 22s, read: 388.6 MiB/s, write: 163.4 MiB/s
INFO:  61% (2.1 TiB of 3.4 TiB) in 3h 5m 32s, read: 508.2 MiB/s, write: 171.7 MiB/s
INFO:  62% (2.1 TiB of 3.4 TiB) in 3h 7m 4s, read: 389.7 MiB/s, write: 149.3 MiB/s
INFO:  63% (2.1 TiB of 3.4 TiB) in 3h 10m 22s, read: 175.9 MiB/s, write: 138.5 MiB/s
INFO:  64% (2.2 TiB of 3.4 TiB) in 3h 13m 20s, read: 208.7 MiB/s, write: 139.1 MiB/s
INFO:  65% (2.2 TiB of 3.4 TiB) in 3h 14m 12s, read: 650.2 MiB/s, write: 168.7 MiB/s
INFO:  66% (2.2 TiB of 3.4 TiB) in 3h 15m 49s, read: 383.2 MiB/s, write: 159.9 MiB/s
INFO:  67% (2.3 TiB of 3.4 TiB) in 3h 16m 44s, read: 622.4 MiB/s, write: 198.6 MiB/s
INFO:  68% (2.3 TiB of 3.4 TiB) in 3h 18m 8s, read: 416.8 MiB/s, write: 167.8 MiB/s
INFO:  69% (2.3 TiB of 3.4 TiB) in 3h 19m 7s, read: 599.4 MiB/s, write: 168.9 MiB/s
INFO:  70% (2.4 TiB of 3.4 TiB) in 3h 20m 28s, read: 438.8 MiB/s, write: 162.8 MiB/s
INFO:  71% (2.4 TiB of 3.4 TiB) in 3h 22m 58s, read: 236.4 MiB/s, write: 151.5 MiB/s
INFO:  72% (2.4 TiB of 3.4 TiB) in 3h 27m 10s, read: 145.6 MiB/s, write: 138.1 MiB/s
INFO:  73% (2.5 TiB of 3.4 TiB) in 3h 28m 37s, read: 404.8 MiB/s, write: 177.1 MiB/s
INFO:  74% (2.5 TiB of 3.4 TiB) in 3h 30m 34s, read: 296.0 MiB/s, write: 164.7 MiB/s
INFO:  75% (2.5 TiB of 3.4 TiB) in 3h 34m 2s, read: 169.6 MiB/s, write: 133.2 MiB/s
INFO:  76% (2.6 TiB of 3.4 TiB) in 3h 36m 57s, read: 202.6 MiB/s, write: 160.5 MiB/s
INFO:  77% (2.6 TiB of 3.4 TiB) in 3h 40m 6s, read: 187.7 MiB/s, write: 141.8 MiB/s
INFO:  78% (2.6 TiB of 3.4 TiB) in 3h 42m 48s, read: 219.9 MiB/s, write: 162.3 MiB/s
INFO:  79% (2.7 TiB of 3.4 TiB) in 3h 45m 11s, read: 251.1 MiB/s, write: 146.9 MiB/s
INFO:  80% (2.7 TiB of 3.4 TiB) in 3h 46m 26s, read: 481.3 MiB/s, write: 222.4 MiB/s
INFO:  81% (2.7 TiB of 3.4 TiB) in 3h 48m 23s, read: 292.5 MiB/s, write: 152.1 MiB/s
INFO:  82% (2.8 TiB of 3.4 TiB) in 3h 50m 43s, read: 253.8 MiB/s, write: 157.8 MiB/s
ERROR: VM 101 not running
INFO: aborting backup job
ERROR: VM 101 not running
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 not running
INFO: Failed at 2025-05-11 06:48:22
INFO: Backup job finished with errors
ERROR: could not notify via target `mail-to-root`: could not notify via endpoint(s): mail-to-root: no recipients provided for the mail, cannot send it.
TASK ERROR: job errors


dmesg on that server shows the following, again a segfault:
Code:
[Sun May 11 06:48:36 2025] proxmox-backup-[3880855]: segfault at 0 ip 0000771982f493a0 sp 000076fa5c3e5ff8 error 6 in libzstd.so.1.5.4[771982f49000+a1000] likely on CPU 24 (core 8, socket 0)
[Sun May 11 06:48:36 2025] Code: Unable to access opcode bytes at 0x771982f49376.

We had already retried the backup earlier because of a similar failure (that time it failed only a few minutes after it started). Same segfault error:
Code:
[Sun May 11 02:42:42 2025] apps.plugin[3245195]: segfault at 0 ip 000064ca3be77240 sp 00007ffd9ee64d68 error 6 in apps.plugin[64ca3be69000+8e000] likely on CPU 27 (core 11, socket 0)
[Sun May 11 02:42:42 2025] Code: Unable to access opcode bytes at 0x64ca3be77216.
[Sun May 11 02:43:12 2025] proxmox-backup-[3869209]: segfault at 18 ip 000077031c1b94e0 sp 000076e3c58d0e88 error 6 in libzstd.so.1.5.4[77031c1b9000+a1000] likely on CPU 3 (core 3, socket 0)
[Sun May 11 02:43:12 2025] Code: Unable to access opcode bytes at 0x77031c1b94b6.


Here is my version information:
Code:
pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
amd64-microcode: 3.20240820.1~deb12u1
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20250211.1~deb12u1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: not correctly installed
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0

journalctl shows:
Code:
May 11 06:48:22 on-premise-1 kernel: proxmox-backup-[3880855]: segfault at 0 ip 0000771982f493a0 sp 000076fa5c3e5ff8 error 6 in libzstd.so.1.5.4[771982f49000+a>
May 11 06:48:22 on-premise-1 kernel: Code: Unable to access opcode bytes at 0x771982f49376.
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: VM 101 qmp command failed - VM 101 not running
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: VM 101 qmp command failed - VM 101 not running
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: VM 101 qmp command failed - VM 101 not running
May 11 06:48:22 on-premise-1 kernel: vmbr0: port 1(tap101i0) entered disabled state
May 11 06:48:22 on-premise-1 kernel: tap101i0 (unregistering): left allmulticast mode
May 11 06:48:22 on-premise-1 kernel: vmbr0: port 1(tap101i0) entered disabled state
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: ERROR: Backup of VM 101 failed - VM 101 not running
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: INFO: Backup job finished with errors
May 11 06:48:22 on-premise-1 perl[3880741]: could not notify via target `mail-to-root`: could not notify via endpoint(s): mail-to-root: no recipients provided >
May 11 06:48:22 on-premise-1 pvescheduler[3880741]: job errors
May 11 06:48:23 on-premise-1 qmeventd[4021145]: Starting cleanup for 101
May 11 06:48:23 on-premise-1 qmeventd[4021145]: Finished cleanup for 101
May 11 06:48:29 on-premise-1 systemd[1]: 101.scope: Deactivated successfully.
May 11 06:48:29 on-premise-1 systemd[1]: 101.scope: Consumed 8h 6min 29.387s CPU time.

I briefly considered hardware, because pvestatd and pve-firewall also segfaulted when the backup job failed. That would be odd, since they are separate pieces of software; all servers have ECC memory and I don't see any errors reported there. The systems have been running stable, and the issue only occurs while a backup job runs.

Another reason I think it's not hardware related is that it happened on 2 out of 3 servers; a hardware fault hitting two independent machines in the same way is very unlikely.
I suspect a kernel bug or something similar. Maybe it doesn't like the fleecing option, or it is unstable under I/O load? I'm happy to enable debugging or pin an older kernel version. Let me know!

If I take the backup from inside the VM, everything works fine. I've been doing that with alternative software (full backups for the past months), saturating the 1G port for around 3-4 hours per day with no issues at all. I don't know if it matters that the VM runs a different kernel version (RHEL 9).
Hello,
The CPU is a 7950X3D.

The latest microcode should already be in use:
Code:
sudo dmesg | grep 'microcode:'
[    1.777015] microcode: Current revision: 0x0a601209

# apt install amd64-microcode
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
amd64-microcode is already the newest version (3.20240820.1~deb12u1).
The following package was automatically installed and is no longer required:
  linux-image-6.1.0-28-amd64
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

The latest BIOS should also be in use:
Code:
dmidecode -t bios -q
BIOS Information
    Vendor: American Megatrends International, LLC.
    Version: 20.11
    Release Date: 10/24/2024

I have the new Hetzner AX102 servers (not the ones affected by the motherboard replacement issue).
I should add that when I got them, I ran memtester for 2 passes (around 30 hours).

Do you recommend switching to the latest 6.14 kernel, or to an earlier kernel?

I find it really strange that these issues only happen when I take the backup from the Proxmox side like this. When backing up inside the VM there are no issues in dmesg, and I've been doing that for a month, while here I can replicate it on almost every backup. Don't you agree? Inside the VM I use kernel 5.14.0 and rsync for the backup.
Quote: "Do you recommend switching to the latest 6.14 kernel, or to an earlier kernel?"
If the issue started happening after a recent kernel update, then it would be good to boot the previous kernel. Otherwise, I'd recommend trying the 6.14 kernel.
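Concretely, both options could look like this (a sketch: the package and boot-tool names are the standard ones on PVE 8, but the pinned version is only a placeholder, not taken from this thread):

```shell
# Option A: install the newer opt-in kernel series.
apt install proxmox-kernel-6.14

# Option B: pin a previously working kernel (placeholder version shown).
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.8.12-9-pve

# Remove the pin again once done testing.
proxmox-boot-tool kernel unpin
```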
Quote: "I find it really strange that these issues only happen when I take the backup from the Proxmox side like this, while taking the backup inside the VM causes no issues in dmesg, and I've been doing that for a month. Here I can replicate it on almost every backup, don't you agree? Inside the VM I use kernel 5.14.0 and rsync for the backup."
Well, the backup inside the VM is a virtualized workload and thus quite a different kind of workload from a backup running on the host itself.

You could also try setting a bandwidth limit or reducing the number of workers (both can be set in the advanced options of the backup job) to see whether the issue depends on the load on the system.
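The same two knobs can also be passed on the command line when running the backup manually; a sketch with example values (both are documented vzdump options, bwlimit is in KiB/s):

```shell
# One-off backup of VM 101 with a ~100 MiB/s read limit and fewer
# concurrent I/O workers, to see whether reduced load avoids the crash.
vzdump 101 --storage backup-ssd --mode snapshot \
    --bwlimit 102400 \
    --performance max-workers=4
```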