PVE crash while running a backup job

jeanmars · Sep 9, 2024

Hi,
I have an intermittent issue where PVE "crashes" while executing a backup job: the host still answers ping, but it is impossible to open an SSH or web console session on PVE or on any VM. There are not many traces:
Sep 09 01:15:17 ganymede systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
Sep 09 01:15:47 ganymede fstrim[1643546]: /boot/efi: 1021.6 MiB (1071276032 bytes) trimmed on /dev/nvme0n1p2
Sep 09 01:15:47 ganymede fstrim[1643546]: /: 85.2 GiB (91440758784 bytes) trimmed on /dev/pve/root
Sep 09 01:15:47 ganymede systemd[1]: fstrim.service: Deactivated successfully.
Sep 09 01:15:47 ganymede systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
Sep 09 01:17:01 ganymede CRON[1643895]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 09 01:17:01 ganymede CRON[1643896]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 09 01:17:01 ganymede CRON[1643895]: pam_unix(cron:session): session closed for user root
Sep 09 02:06:17 ganymede pvestatd[1038]: status update time (8.147 seconds)
Sep 09 02:06:45 ganymede pvestatd[1038]: status update time (5.690 seconds)
Sep 09 02:07:06 ganymede pve-firewall[1031]: firewall update time (7.428 seconds)
Sep 09 02:07:06 ganymede pvestatd[1038]: status update time (6.411 seconds)
Sep 09 02:08:44 ganymede pvestatd[1038]: status update time (5.133 seconds)
Sep 09 02:09:57 ganymede pvestatd[1038]: status update time (8.292 seconds)
Sep 09 02:10:05 ganymede pvescheduler[1654054]: <root@pam> starting task UPID:ganymede:00193D29:03240ABE:66DE3CDD:vzdump:100:root@pam:
Sep 09 02:10:05 ganymede pvescheduler[1654057]: INFO: starting new backup job: vzdump 100 --storage FX6712X-SUN_PVE --notes-template '{{guestname}}' --mailto xxxx@gmail.com --compress zstd --mode snapshot --mailnotification always --quiet 1 --prune-backups 'keep-last=4'
Sep 09 02:10:05 ganymede pvescheduler[1654057]: INFO: Starting Backup of VM 100 (qemu)
Sep 09 02:14:34 ganymede pvestatd[1038]: status update time (5.309 seconds)
Sep 09 02:15:45 ganymede pve-firewall[1031]: firewall update time (6.519 seconds)
Sep 09 02:15:48 ganymede pvestatd[1038]: status update time (18.744 seconds)
-- Reboot --
Sep 09 09:12:28 ganymede kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Sep 09 09:12:28 ganymede kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet
Sep 09 09:12:28 ganymede kernel: KERNEL supported cpus:
Sep 09 09:12:28 ganymede kernel: Intel GenuineIntel

In the task list, I can only see an unexpected error on the backup job (attachment).

This does not happen every time, but it happened 5 weeks ago and again yesterday. The only way to recover is to hard reboot PVE.
Any idea what's going on or how I can troubleshoot this?
Thanks,
Jean

gfngfn256 · Sep 9, 2024

Double-click that line of the VM/CT 100 - Backup to see a more detailed report of the Backup-job. Maybe you'll see at what stage the error occurred.

jeanmars · Sep 9, 2024

Hi,
well, not that instructive, backup stopped at 20%:

Code:

INFO: starting new backup job: vzdump 100 --storage FX6712X-SUN_PVE --notes-template '{{guestname}}' --mailto xxxx@gmail.com --compress zstd --mode snapshot --mailnotification always --quiet 1 --prune-backups 'keep-last=4'
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-09-09 02:10:05
INFO: status = running
INFO: VM Name: deb11-docker
INFO: include disk 'virtio0' 'xdata:100/vm-100-disk-0.qcow2' 200G
INFO: include disk 'virtio1' 'xdata:100/vm-100-disk-1.qcow2' 512G
INFO: exclude disk 'virtio2' 'xdata:100/vm-100-disk-2.qcow2' (backup=no)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/FX6712X-SUN_PVE/dump/vzdump-qemu-100-2024_09_09-02_10_05.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '5d1a2602-88d0-420e-8bc0-db3973e0f56d'
INFO: resuming VM again
INFO:   0% (399.8 MiB of 712.0 GiB) in 3s, read: 133.2 MiB/s, write: 59.9 MiB/s
INFO:   1% (7.8 GiB of 712.0 GiB) in 20s, read: 447.3 MiB/s, write: 20.6 MiB/s
INFO:   2% (14.5 GiB of 712.0 GiB) in 34s, read: 489.4 MiB/s, write: 5.3 MiB/s
INFO:   3% (21.7 GiB of 712.0 GiB) in 51s, read: 434.1 MiB/s, write: 12.0 MiB/s
INFO:   4% (29.6 GiB of 712.0 GiB) in 55s, read: 2.0 GiB/s, write: 2.6 MiB/s
INFO:   5% (36.8 GiB of 712.0 GiB) in 59s, read: 1.8 GiB/s, write: 826.0 KiB/s
INFO:   6% (43.7 GiB of 712.0 GiB) in 1m 4s, read: 1.4 GiB/s, write: 93.6 KiB/s
INFO:   7% (50.6 GiB of 712.0 GiB) in 1m 8s, read: 1.7 GiB/s, write: 500.0 KiB/s
INFO:   8% (57.0 GiB of 712.0 GiB) in 1m 12s, read: 1.6 GiB/s, write: 83.0 KiB/s
INFO:   9% (64.1 GiB of 712.0 GiB) in 1m 16s, read: 1.8 GiB/s, write: 104.0 KiB/s
INFO:  10% (72.3 GiB of 712.0 GiB) in 1m 21s, read: 1.6 GiB/s, write: 496.8 KiB/s
INFO:  11% (80.1 GiB of 712.0 GiB) in 1m 25s, read: 1.9 GiB/s, write: 205.0 KiB/s
INFO:  12% (85.5 GiB of 712.0 GiB) in 1m 38s, read: 421.7 MiB/s, write: 26.7 MiB/s
INFO:  13% (92.6 GiB of 712.0 GiB) in 3m 35s, read: 62.3 MiB/s, write: 28.3 MiB/s
INFO:  14% (99.7 GiB of 712.0 GiB) in 4m 44s, read: 106.1 MiB/s, write: 31.3 MiB/s
INFO:  15% (107.0 GiB of 712.0 GiB) in 4m 48s, read: 1.8 GiB/s, write: 3.6 MiB/s
INFO:  16% (114.9 GiB of 712.0 GiB) in 4m 52s, read: 2.0 GiB/s, write: 1.2 MiB/s
INFO:  17% (123.3 GiB of 712.0 GiB) in 4m 56s, read: 2.1 GiB/s, write: 3.7 MiB/s
INFO:  18% (130.5 GiB of 712.0 GiB) in 4m 59s, read: 2.4 GiB/s, write: 1.6 MiB/s
INFO:  19% (136.2 GiB of 712.0 GiB) in 5m 2s, read: 1.9 GiB/s, write: 1.2 MiB/s
INFO:  20% (142.6 GiB of 712.0 GiB) in 5m 6s, read: 1.6 GiB/s, write: 3.3 MiB/s

The device the backup is written to (FX6712X-SUN_PVE) is responding OK.
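
The progress lines in the log above can be turned into per-step timings to see where the backup slowed down before it stopped. This is only a rough sketch, assuming the progress line format shown above; the function names (`parse_progress`, `slow_steps`) and the 30-second threshold are made up for illustration and are not part of vzdump.

```python
import re

# Matches vzdump progress lines such as:
# INFO:  13% (92.6 GiB of 712.0 GiB) in 3m 35s, read: 62.3 MiB/s, write: 28.3 MiB/s
PROGRESS = re.compile(
    r"INFO:\s+(\d+)% \(([\d.]+) ([KMGT]iB) of [\d.]+ [KMGT]iB\) "
    r"in ((?:\d+h )?(?:\d+m )?\d+s)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0, "TiB": 1024.0 * 1024.0}


def elapsed_seconds(text):
    """Convert an elapsed-time string like '3m 35s' to seconds."""
    factors = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * factors[u] for n, u in re.findall(r"(\d+)([hms])", text))


def parse_progress(lines):
    """Return (percent, MiB done, elapsed seconds) for each progress line."""
    points = []
    for line in lines:
        m = PROGRESS.search(line)
        if m:
            pct, amount, unit, elapsed = m.groups()
            mib = float(amount) * UNIT_TO_MIB[unit]
            points.append((int(pct), mib, elapsed_seconds(elapsed)))
    return points


def slow_steps(points, min_seconds=30):
    """Yield (from %, to %, seconds, MiB/s) for steps taking at least min_seconds."""
    for (p0, mib0, t0), (p1, mib1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt >= min_seconds:
            yield p0, p1, dt, (mib1 - mib0) / dt
```

Fed with the log above, only two steps exceed 30 seconds: 12% to 13% (117 s) and 13% to 14% (69 s). So the source disks were already reading slowly a few minutes before the job died at 20%.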

jzxkkk · Sep 9, 2024

I have the same problem

I have been keeping the system up to date with apt for the past month, and then this issue appeared. I completely wiped the system and reinstalled it from the ISO, but the problem occurred again last night. The log only shows the backup starting, and there are no further entries after that.
My version is PVE-Manager/8.2.2/9355359cd7afbae4 (running kernel: 6.8.4-2-PVE)

gfngfn256 · Sep 9, 2024

jeanmars said:
well, not that instructive, backup stopped at 20%:

Since your server appears to really just hard reboot without warning, you can probably assume it's HW related. I'd start with thermals. Then check storage, RAM & PSU.
On both occasions, was it this specific backup job (VM 100) that caused the reboot? Is this job one of the more work-intensive ones for this node?
If none of the above yields any results, you could then try an older kernel, although I doubt this is your problem.
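
One cheap thing to look at before opening the case: the journal excerpt in the first post already has pvestatd reporting status updates taking 5 to 18 seconds in the minutes before the hang. Reading that as a sign of stalled I/O is my interpretation, not something the log states. A minimal sketch for pulling those values out of a saved journal dump follows; the function names are made up, and the `year` parameter is an assumption because the short journal timestamp format does not include the year.

```python
import re
from datetime import datetime

# Matches lines such as:
# Sep 09 02:15:48 ganymede pvestatd[1038]: status update time (18.744 seconds)
STATUS = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ pvestatd\[\d+\]: "
    r"status update time \(([\d.]+) seconds\)"
)


def status_update_times(lines, year=2024):
    """Return (timestamp, seconds) for each pvestatd 'status update time' line.

    The year is supplied by the caller (assumption: all lines are from one year).
    Month names are parsed with %b, which expects English abbreviations
    under the default C/English locale.
    """
    samples = []
    for line in lines:
        m = STATUS.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            samples.append((stamp, float(m.group(2))))
    return samples


def worst(samples):
    """Return the sample with the longest update time, or None if there are none."""
    return max(samples, key=lambda s: s[1]) if samples else None
```

If the longest values cluster right before each hang, that would be one more reason to look at storage first.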

jeanmars · Sep 9, 2024

Hi,

gfngfn256 said:
Since your server appears to really just hard reboot without warning, you can probably assume it's HW related. I'd start with thermals. Then check storage, RAM & PSU.
On both occasions, was it this specific backup job (VM 100) that caused the reboot? Is this job one of the more work-intensive ones for this node?
If none of the above yields any results, you could then try an older kernel, although I doubt this is your problem.

Sorry, I explained that badly: PVE does NOT reboot on its own. I have to reboot it manually with the power switch, as there is no way to connect to it via SSH or the web console. According to the power sensor, there is a power consumption peak for 15 min after the backup starts, but then it comes back to regular consumption, which looks OK for an AMD Ryzen 5 5625U with Radeon graphics (so I don't think there could be a hardware issue):

Right now, VM 100 is the only VM running on my PVE, so it is hard to tell whether it is the one causing this issue. Anyway, whatever the VM is running, I believe the host should not behave like that.

gfngfn256 · Sep 9, 2024

I don't see your proof that it is not a HW issue. During the period that the node is inaccessible, do you have any proof of ANY activity on the node? Does the VM 100 itself contain any logs/activity for that period?
From your graph, it appears to use minimal power during that period, so I'm guessing zero (or next-to-zero) activity. That would indicate nothing is working.

jeanmars said:
there is no way to connect to it via SSH or web console

Maybe connect a monitor to it, so that you can see what happens when it goes down.

jeanmars said:
AMD Ryzen 5 5625U

I assume that is a mini pc. Go through my checklist above.

jeanmars · Sep 9, 2024

Hi,
You're right; I can't find any log activity after 2:15, when PVE becomes inaccessible.
Connecting a monitor would be difficult but possible. The hardest part is reproducing the problem.
In the past, I had an issue with an SSD and I was still able to see logs from PVE; here there is nothing, so it might be the RAM. Is there any way to check the RAM directly from PVE?

jzxkkk · Sep 10, 2024

Have you used anything related to Windows and Q35? I saw another post that seems to describe Proxmox VE crashing when running multiple Windows virtual machines with machine types such as Q35 or i440fx.

jeanmars · Sep 10, 2024

Hi,

No, the only VM running on this PVE is a Debian one. To me it looks like either a hardware issue or a Proxmox bug.

jeanmars · Oct 16, 2024

Hi all,

Against all odds, the SSD was to blame. It became completely unresponsive a week ago. I say "against all odds" because every SMART test reported no issues at all. I suspected the RAM, but after booting Memtest86+ (nice idea to include it in the Proxmox boot, BTW), the RAM turned out to be OK.
Now that I know the root cause, I should have paid more attention to the fact that the MySQL DB was responding quite slowly, but I was not sure whether that was new.
Anyway, problem solved with a new SSD. No problem on Proxmox side.
Cheers,
Jean
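
For anyone landing here with the same symptoms (SMART clean, but a database on the disk feels slow): a crude way to spot a struggling SSD is to time small synchronous writes. The sketch below is only an illustration under my own assumptions; the function names, sample count, and write size are arbitrary, and it is not a replacement for SMART tests or a proper benchmark.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=20, size=4096):
    """Append `size` bytes and fsync, `samples` times; return latencies in seconds.

    A temporary file is created in `directory` and removed afterwards,
    so point it at a directory on the disk you want to probe.
    """
    payload = os.urandom(size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to the device, not just the page cache
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return min, median, and max latency in milliseconds."""
    ordered = sorted(latencies)
    return {
        "min_ms": ordered[0] * 1000,
        "median_ms": ordered[len(ordered) // 2] * 1000,
        "max_ms": ordered[-1] * 1000,
    }
```

What counts as "too slow" depends on the drive, so the numbers are mostly useful when compared over time or against another disk in the same machine.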
