We have an intermittent and seemingly random problem in our environment that has been going on for a while. Sometimes VMs in our backup schedules that use 'stop' mode fail to stop correctly. They are not backed up and look like they are hanging (but still shown 'green') in the Proxmox GUI. They have to be manually stopped and started to get them up and running again.
The task log for the backup always looks like this:
112: 2025-04-15 23:20:55 INFO: Starting Backup of VM 112 (qemu)
112: 2025-04-15 23:20:55 INFO: status = running
112: 2025-04-15 23:20:55 INFO: backup mode: stop
112: 2025-04-15 23:20:55 INFO: ionice priority: 7
112: 2025-04-15 23:20:55 INFO: VM Name: ****
112: 2025-04-15 23:20:55 INFO: include disk 'scsi0' 'Ceph-VMs:vm-112-disk-0' 200G
112: 2025-04-15 23:20:55 INFO: stopping virtual guest
112: 2025-04-15 23:30:56 INFO: VM quit/powerdown failed
112: 2025-04-15 23:30:56 ERROR: Backup of VM 112 failed - command 'qm shutdown 112 --skiplock --keepActive --timeout 600' failed: exit code 255
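To make the timing explicit, here is a minimal sketch (plain Python, nothing Proxmox-specific) that measures how long the backup job waited between "stopping virtual guest" and the failure message. The helper names `parse_line` and `shutdown_wait_seconds` are made up for illustration, and the parsing assumes the `VMID: YYYY-MM-DD HH:MM:SS LEVEL: message` layout seen in the excerpt above.

```python
from datetime import datetime

# Two lines copied from the task log above.
TASK_LOG = """\
112: 2025-04-15 23:20:55 INFO: stopping virtual guest
112: 2025-04-15 23:30:56 INFO: VM quit/powerdown failed
"""


def parse_line(line):
    """Split a task-log line into (vmid, timestamp, message).

    Assumes the 'VMID: YYYY-MM-DD HH:MM:SS LEVEL: message' layout
    seen in the excerpt; other layouts are not handled.
    """
    vmid, rest = line.split(": ", 1)
    stamp, message = rest[:19], rest[20:]
    return vmid, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message


def shutdown_wait_seconds(log_text):
    """Return {vmid: seconds} between the stop request and the failure."""
    started, waits = {}, {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        vmid, ts, msg = parse_line(line)
        if "stopping virtual guest" in msg:
            started[vmid] = ts
        elif "VM quit/powerdown failed" in msg and vmid in started:
            waits[vmid] = (ts - started[vmid]).total_seconds()
    return waits


if __name__ == "__main__":
    for vmid, secs in shutdown_wait_seconds(TASK_LOG).items():
        print(f"VM {vmid}: waited {secs:.0f}s before giving up")
```

For VM 112 this gives 601 seconds, which lines up with the `--timeout 600` in the failing `qm shutdown` command. So the job appears to wait out the full timeout, even though the guest's own journal (see the next excerpt) stops within a second or so of the shutdown request.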
The journalctl logs for the Linux VMs (the majority of our VMs, although this has happened to Win10 as well) always look good, i.e. they show the shutdown being requested and all services shutting down in an orderly fashion, right down to the journal itself. As you can see, this takes a few seconds at most, starting with:
Apr 15 23:20:56 **** qemu-ga[524]: info: guest-ping called
Apr 15 23:20:56 **** qemu-ga[524]: info: guest-shutdown called, mode: (null)
Apr 15 23:20:56 **** systemd-logind[608]: Creating /run/nologin, blocking further logins...
Apr 15 23:20:56 **** systemd-logind[608]: System is powering down (hypervisor initiated shutdown).
..
..
Apr 15 23:20:56 **** systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Apr 15 23:20:56 **** systemd-journald[291]: Received SIGTERM from PID 1 (systemd-shutdow).
Apr 15 23:20:56 **** systemd-journald[291]: Journal stopped
-- Boot 11029d25257044bda3206956f8325cb5 --
[startup logs when technician reboots in the morning]
Additional info:
- The VMs are Alma 9.5, Ubuntu 22 or 24, as well as a few Win10
- The latest version of qemu-guest-agent is installed and running
- The same VM might work fine for several cycles (the one in the logs had been backed up twice per week for 6 weeks without issue)
- When it happens, it usually affects more than one VM, around 3-6 out of ~20 in the schedule
- PBS version: 3.3.2, PVE version: 8.3.3
I am at the end of my rope and need ways to better understand the process, or ways to get at additional logs that might help.