Proxmox randomly reboots while running a backup job

simplix

New Member
May 2, 2023
Hi all,

I have the problem that my Proxmox server is rebooting at night, but I can't find a concrete reason in the log files.
It does not happen every night, but when it does, a scheduled backup is running at that moment.

Here is the syslog. Any tips on how to further isolate and fix the problem?

I'm running:
Linux 6.5.11-4-pve (2023-11-20T10:19Z)
pve-manager/8.1.3

Code:
Jan 17 07:30:02 serenity pvescheduler[2689639]: INFO: Starting Backup of VM 210 (qemu)
Jan 17 07:30:04 serenity qm[2777770]: <root@pam> starting task UPID:serenity:002A62F4:0100B7AC:65A773EC:qmpause:210:root@pam:
Jan 17 07:30:04 serenity qm[2777844]: suspend VM 210: UPID:serenity:002A62F4:0100B7AC:65A773EC:qmpause:210:root@pam:
Jan 17 07:30:04 serenity qm[2777770]: <root@pam> end task UPID:serenity:002A62F4:0100B7AC:65A773EC:qmpause:210:root@pam: OK
Jan 17 07:30:07 serenity pvescheduler[2689639]: VM 210 qmp command failed - VM 210 qmp command 'guest-ping' failed - got timeout
Jan 17 07:31:00 serenity kernel: overlayfs: fs on '/var/lib/docker/overlay2/l/6EZ6I2VGELMEDN2JVIJ46BA475' does not support file handles, falling back to xino=off.
Jan 17 07:31:00 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered blocking state
Jan 17 07:31:00 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered disabled state
Jan 17 07:31:00 serenity kernel: veth5758b10: entered allmulticast mode
Jan 17 07:31:00 serenity kernel: veth5758b10: entered promiscuous mode
Jan 17 07:31:00 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered blocking state
Jan 17 07:31:00 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered disabled state
Jan 17 07:31:00 serenity kernel: vethbdf4bf7: entered allmulticast mode
Jan 17 07:31:00 serenity kernel: vethbdf4bf7: entered promiscuous mode
Jan 17 07:31:00 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered blocking state
Jan 17 07:31:00 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered forwarding state
Jan 17 07:31:00 serenity kernel: eth0: renamed from veth20f7f55
Jan 17 07:31:00 serenity kernel: eth1: renamed from veth7f0e88f
Jan 17 07:31:00 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered blocking state
Jan 17 07:31:00 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered forwarding state
Jan 17 07:31:05 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered disabled state
Jan 17 07:31:05 serenity kernel: veth20f7f55: renamed from eth0
Jan 17 07:31:06 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered disabled state
Jan 17 07:31:06 serenity kernel: vethbdf4bf7 (unregistering): left allmulticast mode
Jan 17 07:31:06 serenity kernel: vethbdf4bf7 (unregistering): left promiscuous mode
Jan 17 07:31:06 serenity kernel: br-bf0106f40fa3: port 4(vethbdf4bf7) entered disabled state
Jan 17 07:31:06 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered disabled state
Jan 17 07:31:06 serenity kernel: veth7f0e88f: renamed from eth1
Jan 17 07:31:06 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered disabled state
Jan 17 07:31:06 serenity kernel: veth5758b10 (unregistering): left allmulticast mode
Jan 17 07:31:06 serenity kernel: veth5758b10 (unregistering): left promiscuous mode
Jan 17 07:31:06 serenity kernel: br-1b38c62a2e77: port 10(veth5758b10) entered disabled state
-- reboot --
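
For completeness, the lines above are from the boot before the crash. A sketch of how to pull them, assuming the journal is persistent:

Code:
# list recorded boots to find the one that ended in the crash
journalctl --list-boots

# show the journal of the previous boot, jumping to the end
journalctl -b -1 -e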
 
Hello, is this a cluster? Do you have any guests using HA? Are you using Docker on the server? Note that Docker is incompatible with Proxmox VE; the usual recommendation is to run it inside a VM.
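
Both are quick to check from the shell, for example:

Code:
# cluster membership; a standalone node reports that no cluster is configured
pvecm status

# HA resource overview; empty output means no guest is HA-managed
ha-manager status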
 
No cluster, no HA.
No Docker on the Proxmox host (it does run in an LXC container with keyctl=1 and nesting=1, though).

I just ran a backup again, and Proxmox stopped responding while backing up VM 210.
So it seems to be specific to this VM (the VM itself runs without problems apart from backups).
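
In case the VM's settings matter, here is a sketch of how to dump them with the standard qm tooling (disk layout, guest agent, machine type, etc.):

Code:
# print the configuration of VM 210
qm config 210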

Here is the log of the backup up to the point where the server stops responding:
Code:
INFO: starting new backup job: vzdump 210 --notification-mode auto --notes-template '{{guestname}}' --node serenity --mode snapshot --storage elements --remove 0 --compress zstd
INFO: Starting Backup of VM 210 (qemu)
INFO: Backup started at 2024-01-17 13:37:23
INFO: status = running
INFO: VM Name: Cloud
INFO: include disk 'scsi0' 'local:210/vm-210-disk-0.qcow2' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'SSD:210/vm-210-disk-0.qcow2' (not included into backup)
INFO: creating vzdump archive '/mnt/elements/dump/vzdump-qemu-210-2024_01_17-13_37_23.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '16b7cdd8-179a-46fd-8cb6-c50583b1ddfb'
INFO: resuming VM again
INFO:   0% (663.4 MiB of 100.0 GiB) in 3s, read: 221.1 MiB/s, write: 190.0 MiB/s
INFO:   1% (1.2 GiB of 100.0 GiB) in 6s, read: 201.8 MiB/s, write: 162.4 MiB/s
 
OK, this time the host partly came back a few moments later, so I could read the backup log.
It is running into input/output errors during the backup of the VM ("Eingabe-/Ausgabefehler" in the log below is the German locale's "Input/output error").

Code:
INFO: starting new backup job: vzdump 210 --notification-mode auto --notes-template '{{guestname}}' --node serenity --mode snapshot --storage elements --remove 0 --compress zstd
INFO: Starting Backup of VM 210 (qemu)
INFO: Backup started at 2024-01-17 13:37:23
INFO: status = running
INFO: VM Name: Cloud
INFO: include disk 'scsi0' 'local:210/vm-210-disk-0.qcow2' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'SSD:210/vm-210-disk-0.qcow2' (not included into backup)
INFO: creating vzdump archive '/mnt/elements/dump/vzdump-qemu-210-2024_01_17-13_37_23.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '16b7cdd8-179a-46fd-8cb6-c50583b1ddfb'
INFO: resuming VM again
INFO:   0% (663.4 MiB of 100.0 GiB) in 3s, read: 221.1 MiB/s, write: 190.0 MiB/s
INFO:   1% (1.2 GiB of 100.0 GiB) in 6s, read: 201.8 MiB/s, write: 162.4 MiB/s
INFO:   1% (1.9 GiB of 100.0 GiB) in 6m 19s, read: 1.9 MiB/s, write: 1.4 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
unable to open file '/etc/pve/nodes/serenity/qemu-server/210.conf.tmp.52859' - Input/output error
ERROR: Backup of VM 210 failed - job failed with err -5 - Input/output error
INFO: Failed at 2024-01-17 13:43:43
Can't exec "cp": Eingabe-/Ausgabefehler at /usr/share/perl5/PVE/VZDump.pm line 1298.
INFO: Backup job finished with errors
Can't exec "hostname": Eingabe-/Ausgabefehler at /usr/share/perl5/PVE/VZDump.pm line 429.
 
Maybe my NVMe is dying:

SMART passed, but with "Media and Data Integrity Errors: 413".

And the following in the log:
Code:
Jan 17 14:18:35 serenity kernel: nvme0n1: I/O Cmd(0x2) @ LBA 371154584, 136 blocks, I/O Error (sct 0x2 / sc 0x81) MORE
Jan 17 14:18:35 serenity kernel: critical medium error, dev nvme0n1, sector 371154584 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
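
For reference, a sketch of the checks behind the numbers above (smartmontools and, alternatively, nvme-cli; both report the media error counter):

Code:
# full SMART report, including "Media and Data Integrity Errors"
smartctl -a /dev/nvme0

# the same counters via nvme-cli
nvme smart-log /dev/nvme0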
 
Hello, yes, that's possible. But in principle that wouldn't hang the whole system, it would just block applications that need I/O. Please also run a memory test to rule out memory issues.
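
As a sketch, memtester from the Debian repositories can do a quick in-place check; a full memtest86+ run from the boot menu is more thorough, since it also covers memory that the running kernel occupies:

Code:
# install memtester and test 2 GiB of RAM for 3 passes (adjust to free RAM)
apt install memtester
memtester 2048M 3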
 
