PBS fails to shut down VMs correctly when running a scheduled backup in 'STOP' mode

horned1

New Member
Oct 23, 2023
We have an intermittent, seemingly random problem in our environment that has been going on for a while. Sometimes VMs in our backup schedules that use 'STOP' mode fail to stop correctly: they are not backed up and appear to hang (but still show 'green') in the Proxmox GUI. They need to be manually stopped and started to get them running again.

The task log for the backup always looks like this:
112: 2025-04-15 23:20:55 INFO: Starting Backup of VM 112 (qemu)
112: 2025-04-15 23:20:55 INFO: status = running
112: 2025-04-15 23:20:55 INFO: backup mode: stop
112: 2025-04-15 23:20:55 INFO: ionice priority: 7
112: 2025-04-15 23:20:55 INFO: VM Name: ****
112: 2025-04-15 23:20:55 INFO: include disk 'scsi0' 'Ceph-VMs:vm-112-disk-0' 200G
112: 2025-04-15 23:20:55 INFO: stopping virtual guest
112: 2025-04-15 23:30:56 INFO: VM quit/powerdown failed
112: 2025-04-15 23:30:56 ERROR: Backup of VM 112 failed - command 'qm shutdown 112 --skiplock --keepActive --timeout 600' failed: exit code 255
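One thing the log above allows is measuring how long the backup job waited before giving up. The following is a minimal sketch, not part of any Proxmox tooling: the function name `shutdown_waits` and the regular expression are my own, and it assumes the task log keeps the `vmid: timestamp LEVEL: message` layout shown above.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the task log above:
#   "<vmid>: <YYYY-MM-DD HH:MM:SS> <LEVEL>: <message>"
LINE_RE = re.compile(
    r"^(\d+): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (INFO|ERROR): (.*)$"
)


def shutdown_waits(log_text):
    """Map each VMID to the seconds between 'stopping virtual guest'
    and 'VM quit/powerdown failed' (hypothetical helper)."""
    started = {}
    waits = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        vmid, stamp, _level, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if message == "stopping virtual guest":
            started[vmid] = when
        elif message == "VM quit/powerdown failed" and vmid in started:
            waits[vmid] = (when - started[vmid]).total_seconds()
    return waits


SAMPLE = """\
112: 2025-04-15 23:20:55 INFO: backup mode: stop
112: 2025-04-15 23:20:55 INFO: stopping virtual guest
112: 2025-04-15 23:30:56 INFO: VM quit/powerdown failed
"""

print(shutdown_waits(SAMPLE))  # {'112': 601.0}
```

For VM 112 this gives 601 seconds, which matches the `--timeout 600` in the failed command: the job waited out the full timeout rather than failing early.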

The journalctl logs for the Linux VMs (the majority of our VMs, although this has also happened to Win10 guests) always look fine, i.e. they show the shutdown hook and all services being shut down in an orderly fashion, right down to journald itself. As you can see, this takes a few seconds at most.

Starting with:

Apr 15 23:20:56 **** qemu-ga[524]: info: guest-ping called
Apr 15 23:20:56 **** qemu-ga[524]: info: guest-shutdown called, mode: (null)
Apr 15 23:20:56 **** systemd-logind[608]: Creating /run/nologin, blocking further logins...
Apr 15 23:20:56 **** systemd-logind[608]: System is powering down (hypervisor initiated shutdown).
..
..
Apr 15 23:20:56 **** systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Apr 15 23:20:56 **** systemd-journald[291]: Received SIGTERM from PID 1 (systemd-shutdow).
Apr 15 23:20:56 **** systemd-journald[291]: Journal stopped
-- Boot 11029d25257044bda3206956f8325cb5 --
[startup logs when technician reboots in the morning]

Additional info:
- The VMs are Alma 9.5, Ubuntu 22 or 24, as well as a few Win10
- Latest version of qemu-guest-agent (running)
- The same VM might work fine for several cycles (the one in the logs has been backed up twice per week for 6 weeks without issue)
- When it happens, it usually affects more than one VM, around 3-6 out of ~20 in the schedule
- PBS version: 3.3.2, PVE version: 8.3.3

I am at the end of my rope and need ways to better understand the process, or ways to get at additional logs that might help.
 
Hi,
you could try attaching a serial console to one of the Linux VMs and check whether there is further output that is not written to the systemd journal because journald has already exited by then. That might tell you more about what is not working. Also, how long does a regular shutdown take? Do you perhaps have services (e.g. network-attached storage) that might take a long time during shutdown?
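To put a number on how long a regular shutdown takes, something like the following could be run on the PVE node. It is only a rough sketch: `build_shutdown_cmd` and `timed_run` are hypothetical helper names, and the command is the `qm shutdown <vmid> --timeout <seconds>` call from the task log, without the `--skiplock --keepActive` options that the backup job adds.

```python
import subprocess
import time


def build_shutdown_cmd(vmid, timeout=600):
    """Build the qm shutdown command as an argument list (hypothetical helper).

    Mirrors the command in the task log, minus --skiplock and --keepActive.
    """
    return ["qm", "shutdown", str(vmid), "--timeout", str(timeout)]


def timed_run(cmd, runner=subprocess.run):
    """Run cmd and return (exit code, elapsed seconds).

    runner is injectable so the timing logic can be exercised
    without a PVE host; by default it is subprocess.run.
    """
    start = time.monotonic()
    result = runner(cmd)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed
```

Calling `timed_run(build_shutdown_cmd(112))` on the node would shut down VM 112 and report the exit code and wall-clock duration. A healthy guest should return well before the timeout, while the failing case in the log above ended with exit code 255 after the full 600 seconds.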
 
Hi,
Thanks for the console idea; it's possible we will see something there that we don't see in the logs. I'll make sure this is checked next time it happens.

This happens to a lot of VMs, some more complex than others, but the one in the example is a very simple VM that only runs a very simple service (an offline repo) and shuts down in seconds. There are no errors from the server after the shutdown hook, and no external filesystems or logged-in users that could pose an issue.