QEMU Guest Agent Does not Restart After Apt Update

aav_waob

Jul 21, 2023
Hello!

We have a couple Proxmox VM servers that we are managing at our organization. Everything has been working fine except that in the last couple of months we've had an issue with the QEMU Guest Agent service shutting down after receiving upgrades via apt. Rebooting the VMs restarts the qemu-ga service, but with a modest array of VMs, some of which require extra steps to safely reboot, this is very tedious. We've looked through our logs and configurations and I'm more than happy to post some of those here if they would be helpful.

Would anyone have any pointers or suggestions?

Thanks!
 
Since you didn't say which guest OSes are involved or (if *nix-based) which repositories they use, we are left to presume that this is not related to any PVE repositories or PVE hypervisor functions. Have you inquired with the OS vendor/community? If you looked through the logs, was there any indication of why the process did not start? On modern Linux distros, "systemctl" and "journalctl" can provide a wealth of information.
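
For anyone following along, the systemctl and journalctl output mentioned here could be collected roughly like this. It is a minimal sketch, assuming the unit is named qemu-guest-agent.service (as it appears later in this thread); the helper name qga_diag and the default time window are made up for illustration.

```shell
#!/bin/sh
# Collect basic state and logs for the guest agent unit.
# Assumption: the unit is called qemu-guest-agent.service.
UNIT="qemu-guest-agent.service"

qga_diag() {
    # Current state plus the most recent log lines for the unit.
    # systemctl exits non-zero for an inactive unit, so ignore the status.
    systemctl status "$UNIT" --no-pager || true
    # Journal entries for the unit since the given time (default: yesterday).
    journalctl -u "$UNIT" --since "${1:-yesterday}" --no-pager || true
}

# Usage (inside the guest, ideally right after an upgrade):
#   qga_diag "2023-07-21 13:00"
```

Running it once before and once after the upgrade makes it easy to see whether the unit was only stopped or also failed to start.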

Besides being more specific in your problem description and examining the logs, I am not certain anyone can provide other advice.

Additionally, how often are you seeing upgrades of the QEMU Guest Agent? Or is this a once-per-VM occurrence?


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
My apologies for the sparse info. We are running mostly Ubuntu VMs, most of them on 22.04 LTS. We have a few Windows VMs as well, but these have not had similar issues. We are running PVE 7.

We get updates for QEMU Guest agent maybe once a month or so.

I'll have a look at journalctl, but I do know that in syslog I see a message stating that the QEMU guest agent has been successfully shut down, which seems to suggest that apt is shutting it down prior to updating but the service is just never started after it's been updated.

I'll post the relevant logs, but it might take me a while to get them together.
 
Okay, I grabbed an old VM backup with an old version of QEMU guest agent on it and went through apt update again. Here's the output of systemctl status qemu-guest-agent.service before the update:

● qemu-guest-agent.service - QEMU Guest Agent
     Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
     Active: active (running) since Fri 2023-07-21 13:27:14 EDT; 2min 32s ago
   Main PID: 782 (qemu-ga)
      Tasks: 2 (limit: 4556)
     Memory: 1.2M
        CPU: 40ms
     CGroup: /system.slice/qemu-guest-agent.service
             └─782 /usr/sbin/qemu-ga

Jul 21 13:27:14 QEMU-GA-TESTING systemd[1]: Started QEMU Guest Agent.
Jul 21 13:28:37 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:28:47 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:28:58 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:29:08 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:29:19 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:29:29 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:29:40 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called

And after running apt update and upgrade:

○ qemu-guest-agent.service - QEMU Guest Agent
     Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
     Active: inactive (dead) since Fri 2023-07-21 13:32:10 EDT; 1min 44s ago
   Main PID: 782 (code=exited, status=0/SUCCESS)
        CPU: 77ms

Jul 21 13:31:05 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:31:15 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:31:26 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:31:37 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:31:47 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:31:58 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:32:08 QEMU-GA-TESTING qemu-ga[782]: info: guest-ping called
Jul 21 13:32:10 QEMU-GA-TESTING systemd[1]: Stopping QEMU Guest Agent...
Jul 21 13:32:10 QEMU-GA-TESTING systemd[1]: qemu-guest-agent.service: Deactivated successfully.
Jul 21 13:32:10 QEMU-GA-TESTING systemd[1]: Stopped QEMU Guest Agent.
 
Restarting qemu guest agent with systemctl works, but we'd prefer not to have to set up a task to do this for all our VMs if there's another way to fix the issue (also in case there are other, less visible side effects to the apt upgrade).

The old, pre-QEMU upgrade VM which we cloned from a backup is running with 1:6.2+dfsg-2ubuntu6.6 and our (more) up-to-date VMs are running with 1:6.2+dfsg-2ubuntu6.11
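
As a side note, the two version strings differ only in the last component, and ubuntu6.11 is the newer one because Debian version ordering compares numeric parts as numbers. A quick sketch of how this could be checked on a guest, assuming dpkg is available (it is on Ubuntu); is_older and installed_qga_version are just illustrative helper names.

```shell
#!/bin/sh
# Succeeds if version $1 sorts before version $2 under Debian version rules.
is_older() {
    dpkg --compare-versions "$1" lt "$2"
}

# Print the installed version of the guest agent package.
installed_qga_version() {
    dpkg-query -W -f='${Version}\n' qemu-guest-agent
}

# Usage:
#   is_older "1:6.2+dfsg-2ubuntu6.6" "1:6.2+dfsg-2ubuntu6.11" && echo "6.6 is older"
#   installed_qga_version
```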

Journalctl just gives the same output as systemctl status before and after the upgrade and then a deluge of guest-ping logs after the service is restarted (via systemctl or a reboot). I can paste that here too if it would be helpful.

I will get an apt upgrade output tomorrow.
 
Okay, here is the output of an apt upgrade. I ran
Bash:
sudo apt upgrade | tee -a apt-update
and then removed all the formatting characters.
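
In case it saves someone the manual cleanup: the formatting characters in a log captured this way are typically ANSI escape sequences, and they can be filtered out while capturing. This is a rough sketch that only covers the common CSI sequences (colors, cursor movement) and leaves other control characters, such as carriage returns from progress output, alone; strip_ansi is a hypothetical helper name.

```shell
#!/bin/sh
# Remove ANSI CSI escape sequences (colors, cursor movement) from stdin.
strip_ansi() {
    # Build a literal ESC character portably instead of relying on \x1b.
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;?]*[A-Za-z]//g"
}

# Example: a colored line comes out as plain text.
printf '\033[1;32mSetting up qemu-guest-agent\033[0m\n' | strip_ansi
```

With that in place the capture could look like: sudo apt upgrade | strip_ansi | tee -a apt-update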
 

Attachments

  • apt-update.log
    45.3 KB
You need to go back a step. During an install you should have seen:
root@Copy-of-VM-vm3000:/home/ubuntu# systemctl enable qemu-guest-agent
Synchronizing state of qemu-guest-agent.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable qemu-guest-agent
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
.wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
instance name specified.

The key phrase: "This means they are not meant to be enabled using systemctl."

A basic Google search leads to: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1883009
The gist is that qemu-guest-agent is not started as an enabled systemd unit but in response to a udev event.
Unfortunately for you, that udev event normally happens only once: at boot.
So when you upgrade the agent and the upgrade process tries to restart it, the agent stays down.
You can retrigger it via:
udevadm control --reload-rules && udevadm trigger
root@Copy-of-VM-vm3000:/home/ubuntu# systemctl status qemu-guest-agent
○ qemu-guest-agent.service - QEMU Guest Agent
Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
Active: inactive (dead)

Jul 25 16:06:50 Copy-of-VM-vm3000 systemd[1]: Started QEMU Guest Agent.
Jul 25 16:07:06 Copy-of-VM-vm3000 systemd[1]: Stopping QEMU Guest Agent...
Jul 25 16:07:06 Copy-of-VM-vm3000 systemd[1]: qemu-guest-agent.service: Deactivated successfully.
Jul 25 16:07:06 Copy-of-VM-vm3000 systemd[1]: Stopped QEMU Guest Agent.
root@Copy-of-VM-vm3000:/home/ubuntu# udevadm control --reload-rules && udevadm trigger
root@Copy-of-VM-vm3000:/home/ubuntu# systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
Loaded: loaded (/lib/systemd/system/qemu-guest-agent.service; static)
Active: active (running) since Tue 2023-07-25 16:15:25 UTC; 2s ago
Main PID: 974 (qemu-ga)
Tasks: 2 (limit: 573)
Memory: 380.0K
CPU: 3ms
CGroup: /system.slice/qemu-guest-agent.service
└─974 /usr/sbin/qemu-ga

Jul 25 16:15:25 Copy-of-VM-vm3000 systemd[1]: Started QEMU Guest Agent.
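
Until a fixed package lands, the check-and-retrigger steps from the transcript above could be wrapped in a small script, for example to run by hand after an upgrade. This is only a sketch: it assumes the unit name qemu-guest-agent.service, it needs root, and the function names are made up. It uses the same udevadm commands as the transcript and does not try to narrow the trigger to a specific device.

```shell
#!/bin/sh
# Re-trigger udev if the guest agent is not running (run as root).
# Assumption: the unit is called qemu-guest-agent.service.
UNIT="qemu-guest-agent.service"

# Succeeds when the given state string means the agent is NOT running.
needs_retrigger() {
    [ "$1" != "active" ]
}

retrigger_qga() {
    # "systemctl is-active" prints a state such as "active" or "inactive"
    # and exits non-zero when the unit is not active, hence "|| true".
    state=$(systemctl is-active "$UNIT" 2>/dev/null) || true
    if needs_retrigger "$state"; then
        echo "$UNIT is '$state', re-triggering udev"
        udevadm control --reload-rules && udevadm trigger
    else
        echo "$UNIT is already running"
    fi
}

# Usage (as root, inside the guest):
#   retrigger_qga
```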

If you feel strongly about this - reopen the bug and report your experience to maintainers.


 
So qemu-guest-agent staying shut down after upgrading is standard behavior?
It seems like a packaging/upgrade problem to me, but I had never come across it until now, nor does it affect our daily routine. It is good to know and understand.

Make a good, detailed report with useful output and the powers that be will consider it.

IMHO, the agent interaction is only available to Cloud administrators in enterprise, not VPS consumers. People running cloud infrastructures don't run "apt upgrade" at will; they employ Ansible playbooks that might take such qemu-guest-agent quirks into consideration.
They also test the upgrades prior to deploying into production, to prevent issues like these.


 
The agent interaction is only available to Cloud administrators in enterprise, not VPS consumers.
In our case we are the administrators and do care about agent interaction (rebooting, shutting down, etc. without having to manually or programmatically log in). When the guest agent goes down we lose this ability.

People running cloud infrastructures don't run "apt upgrade" at will
Neither do we -- we use unattended-upgrades (which of course does an apt upgrade under the hood) for non-mission-critical VMs and barring this issue with qemu-guest-agent it hasn't given us any trouble. We could set up an automated task or run Ansible playbooks against all our VMs to restart qemu-guest-agent when it goes down, but that seems like a duct-tape solution to a problem that may be deeper than it appears, hence my previous question.
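
For what it's worth, if a stopgap is needed anyway, one variant that avoids a separate scheduled task is a dpkg post-invoke hook in the APT configuration, so the check runs after each dpkg invocation. I would expect that to cover runs started by unattended-upgrades as well, but that is an assumption worth testing on a non-critical VM first. Sketch only: the file name 99-qga-restart is hypothetical, and the hook simply starts the unit if it is not active (starting it with systemctl is what worked for us manually).

```shell
#!/bin/sh
# Print an APT configuration snippet that starts the guest agent again
# after dpkg runs, if it is not active. DPkg::Post-Invoke is a standard
# APT option; the unit name qemu-guest-agent.service is an assumption.
qga_apt_hook() {
    cat <<'EOF'
DPkg::Post-Invoke {
    "systemctl is-active --quiet qemu-guest-agent.service || systemctl start qemu-guest-agent.service || true";
};
EOF
}

# Install as root (hypothetical file name):
#   qga_apt_hook > /etc/apt/apt.conf.d/99-qga-restart
```

The trailing "|| true" keeps a failed start, for example on a VM without the agent channel, from turning into an APT error.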

They also test the upgrades prior to deploying into production, to prevent issues like these.
So do we -- as I mentioned, rebooting did restart the QEMU guest agent service, so the issue is not earth-shattering. We were able to go ahead with upgrades fairly easily on more important VMs after upgrading less mission-critical ones, but we're curious whether there may be more going on than meets the eye.

Make a good, detailed report with useful output and the powers that be will consider it.
Where would I do this? What output specifically should I include?
 
The cloud images for Debian 12 Bookworm also have this mess out of the box. I've spent the last week trying to figure out why qemu-guest-agent kept getting turned off.
 
Out of curiosity, would anyone know why Proxmox does not fall back to ACPI signals when the QEMU guest agent is down?
 
So, it seems like this was a bug in qemu guest agent. The good news is it should be getting patched in an upcoming release.
Fresh off the presses - the discussion literally happened yesterday :)

Technically it's a breaking change in "debhelper 12", not in qemu-guest-agent.

As I said earlier, while you are waiting for a fix you can trigger qga restart without reboot via: udevadm control --reload-rules && udevadm trigger


 