BUG? Proxmox stopped all guest VMs on clicking shutdown at node

Hello everyone,

Today I pressed the "Shutdown" button in the upper right corner after selecting my single node on the left-hand side. No HA, no Ceph, just a single Proxmox 7.2-7 machine.

I expected the system to start shutting down all guests and then shut itself down afterwards, like it has always done with the "Reboot" button in the past.

Instead it stopped all VMs (as if pulling their power plug) and then shut itself down properly...

[Screenshot: syslog excerpt showing the guests being stopped during shutdown]

I had never used this button before, because in the past I always used the "Reboot" button, and that one normally shuts down the guests properly. I can't test any of this now because it's my production machine and I don't want to harm my VMs through this behaviour.

All guests have a custom boot order and boot delay set, but no shutdown timeout; those fields all show the greyed-out value "default" (which should be 180 s?!)
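If it helps, as far as I understand these options end up as a single "startup" line in the VM config (e.g. /etc/pve/qemu-server/<vmid>.conf). A sketch with example values of what I have set - no "down" value, so the shutdown timeout stays at its default:
Code:
# excerpt from a VM config - values are just examples
# order = start/shutdown priority, up = boot delay in seconds
# no down=... (shutdown timeout), so the greyed-out default applies
startup: order=1,up=30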

Did I miss something in the docs, is it a bug, or can you explain to me why such critical behaviour happens without any warning?

Thanks,
Christian
 
please post the full log starting at the time you pressed the shutdown button - that the qemu systemd scopes are stopped near the end of the shutdown sequence doesn't mean the guest itself was hard-stopped, that is always the case..
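something like this should capture the relevant window from the journal (adjust the timestamps to when you pressed the button):
Code:
# example - dump the journal around the shutdown into a file you can attach
journalctl --since "2022-09-19 15:40" --until "2022-09-19 15:55" > shutdown.log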
 
Hi Fabian,

please find the full log attached. Maybe you will find something in it, but there was no relevant content before the section I posted - that's why I posted this issue here. I pressed the "Shutdown" button and the host machine immediately started shutting down. No shutdown of any guest was ever attempted. I also checked all guests and every one of them reported an improper shutdown in its boot log.

I also found this in the full log below:
Code:
Sep 19 15:48:29 sg-pve-01-REDACTED systemd[1]: Stopping PVE guests...
Sep 19 15:48:30 sg-pve-01-REDACTED pve-guests[3669022]: <root@pam> starting task UPID:sg-pve-01-REDACTED:0037FC97:010CD925:6328732E:stopall::root@pam:
Sep 19 15:48:30 sg-pve-01-REDACTED pve-guests[3669143]: all VMs and CTs stopped
Sep 19 15:48:30 sg-pve-01-REDACTED pve-guests[3669022]: <root@pam> end task UPID:sg-pve-01-REDACTED:0037FC97:010CD925:6328732E:stopall::root@pam: OK
Sep 19 15:48:30 sg-pve-01-REDACTED systemd[1]: pve-guests.service: Succeeded.
Sep 19 15:48:30 sg-pve-01-REDACTED systemd[1]: Stopped PVE guests.
Sep 19 15:48:30 sg-pve-01-REDACTED systemd[1]: pve-guests.service: Consumed 1.158s CPU time.
 

Attachments

  • daemon.log (37.3 KB)
can't confirm this behaviour here - when I press the shutdown button, the guests are shut down by pve-guests.service being stopped. in your log, no guests were running anymore at that point. the shutdown button just triggers a shutdown (like the reboot button triggers a reboot - both are the same API endpoint).
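for illustration, both buttons call the same node status endpoint, just with a different command parameter - roughly equivalent to:
Code:
# what the "Shutdown" button triggers
pvesh create /nodes/localhost/status --command shutdown
# what the "Reboot" button triggers
pvesh create /nodes/localhost/status --command reboot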

is there anything else about your system that is special? any customization done to units/services/.. ?
 
Hi Fabian,

did you set a custom boot delay and boot order for all your VMs as I mentioned? I can only assume it has to do with that, because all VMs were up and running when I hit the button.

Here is a screenshot of the configuration I mean. It's set for all but two VMs, which were not running at that time:

[Screenshot: VM options showing the start/boot order and boot delay settings]

And I didn't do anything other than click the node's "Shutdown" button. No bulk stop beforehand, no bulk shutdown beforehand, nothing. That's why I'm so concerned about this.

Now I have changed the settings like this to hopefully prevent it in the future – but I can't test it again, as the machine is now part of a cluster and, as I already said, it's a production host:
[Screenshot: updated start/shutdown settings]

This is the task history for the node:
[Screenshot: task history of the node]
The marked entry is the one which was logged after clicking the button.

While going through those logs I also noticed that I can't display any details for this particular event. I get this error message:
[Screenshot: error message when opening the task details]
Maybe this has to do with the single host having since been joined to a cluster? It's still very strange, though, because the status of all other tasks in the history can be displayed fine.
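If it helps, I guess I could also try to read the task log directly from the CLI instead of the GUI, something along these lines (using the UPID from the syslog above):
Code:
# hypothetical check - list recent tasks and dump the stopall task log directly
pvenode task list
pvenode task log UPID:sg-pve-01-REDACTED:0037FC97:010CD925:6328732E:stopall::root@pam: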

is there anything else about your system that is special? any customization done to units/services/.. ?
No. It's a completely fresh installation of Proxmox 7.4 without custom changes.
 
the bootorder/bootdelay is only relevant inside pve-guests (at which point no guests were running, hence also no delay to be applied)

No. It's a completely fresh installation of Proxmox 7.4 without custom changes.

I assume you mean 7.2 here ;)

could you do the following:

Code:
systemctl status qemu.slice
systemctl list-dependencies --all qemu.slice
systemctl list-dependencies --all --reverse qemu.slice
 
I assume you mean 7.2 here ;)
Sure, I mixed it up somehow. It's 7.2-7 :)

at which point no guests were running, hence also no delay to be applied
Where does this assumption come from? All guests except two were running at the time I clicked the button. Or are you referring to what the log says?

systemctl status qemu.slice

● qemu.slice
Loaded: loaded
Active: active since Mon 2022-09-19 15:59:01 CEST; 19h ago
Tasks: 68
Memory: 43.0G
CPU: 8h 38min 45.138s
CGroup: /qemu.slice
├─100.scope
│ └─4188 /usr/bin/kvm -id 100 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var>
├─101.scope
│ └─93657 /usr/bin/kvm -id 101 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/va>
├─102.scope
│ └─47149 /usr/bin/kvm -id 102 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/v>
├─104.scope
│ └─17449 /usr/bin/kvm -id 104 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/r>
├─105.scope
│ └─111376 /usr/bin/kvm -id 105 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/>
├─106.scope
│ └─137636 /usr/bin/kvm -id 106 -name sg-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/>
├─107.scope
│ └─4019 /usr/bin/kvm -id 107 -name sweet-REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/1>
└─200.scope
└─71486 /usr/bin/kvm -id 200 -name REDACTED,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/va>

Sep 19 15:59:01 sg-pve-01-REDACTED systemd[1]: Created slice qemu.slice.

systemctl list-dependencies --all qemu.slice

qemu.slice
● └─-.slice

systemctl list-dependencies --all --reverse qemu.slice

qemu.slice
● ├─100.scope
● ├─101.scope
● ├─102.scope
● ├─104.scope
● ├─105.scope
● ├─106.scope
● ├─107.scope
● └─200.scope
 
Where does this assumption come from? All guests except two were running at the time I clicked the button. Or are you referring to what the log says?
yeah, at the point where the guest shutdown is supposed to happen (pve-guests.service being stopped) all guests were already stopped. I didn't mean to say that that was the case when you pressed the button. stopping of guests is also not done differently for node shutdown vs reboot (except for the HA case) - in both cases pve-guests.service is stopped, which will (in turn) execute vzdump --stop (to cancel any running backup tasks) and then pvesh --nooutput create /nodes/localhost/stopall to shut down, and then stop, all running non-HA guests.

edit: also, normally qmeventd should log about cleaning up after VMs exit.. I suspect the VM processes somehow just got killed as part of the shutdown - the question is *why*
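to rule out anything odd with the unit itself, and to see whether qmeventd was involved, you could check something like:
Code:
# show the unit definition actually in use (including the ExecStop commands)
systemctl cat pve-guests.service
# check qmeventd and its log around the shutdown
systemctl status qmeventd.service
journalctl -u qmeventd.service --since "2022-09-19 15:40"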
 
If I can help you with any further testing, please let me know. I have another host available which is not used in production, but it doesn't have the same hardware/configuration as the one where this happened. Maybe it can still help?

So is it true that the "Reboot" and the "Shutdown" buttons are handled the same from a "VM shutdown" perspective? I don't remember ever hitting that button before, because I never had to shut down the node completely; it was always a reboot I performed.
 
yes, they should be handled the same (except for HA resources, where the result depends on the combination of policy and whether it's a shutdown or reboot).
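for completeness: for HA resources that behaviour is controlled by the shutdown_policy in /etc/pve/datacenter.cfg, e.g. something like:
Code:
# datacenter.cfg excerpt - example value, not a recommendation
ha: shutdown_policy=migrate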
 
no, unless you find a way to reproduce it in a test environment where shutdown tests are possible..
 
Just had the same issue on 'Virtual Environment 6.2-4'.
Ceph cluster of 3 nodes; I moved all my VMs so I could shut down the 3rd node for maintenance, pressed the 'shutdown' button, and as it went down a 'stopall' task was issued.
When I powered the 3rd node back on, 'startall' was issued by the same 3rd node.
The same happens when rebooting the 3rd node via the console.

Previously I did the same with the 1st and 2nd nodes, with no such issue.
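For reference, the stopall/startall entries show up in the node's task list; something like this should list them:
Code:
# list recent tasks on the affected node and look for stopall/startall entries
pvesh get /nodes/localhost/tasks --limit 20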
 