Proxmox VE shutdown does not poweroff machine

Hi,
I'm running the latest Proxmox version and I've noticed that when I run "systemctl poweroff", the PVE box almost immediately clears the screen and shows only the cursor in the top-left corner. The machine does not power off completely, even though the running VMs can be shut down in less than 20 seconds when I shut them down "by hand". What is the correct way to shut down the system completely, without having to force it after some time by pushing the physical button (and hoping everything has been stopped correctly)? Is it normal that it immediately shows a black screen with a blinking cursor instead of the classic Linux shutdown sequence, where it stops every service and then powers off the machine?
 

l.wagner

Proxmox Staff Member
Staff member
Oct 3, 2022
39
16
8
Hi,

by default, Proxmox tries to shut down guests by sending an ACPI shutdown (which is basically the same thing as pressing the physical power button on a machine), or by sending a shutdown command to the qemu-guest-agent running inside the virtual machine, depending on the setting of "QEMU guest agent" in the VM configuration.

From my experience, the phenomenon that you've described can happen if
  • the guest ignores the ACPI signal
  • "QEMU guest agent" is enabled for the VM, but the guest agent is not actually running in the VM
If "QEMU guest agent" is enabled in the configuration, make sure that it is installed and active in the guest: apt install qemu-guest-agent and systemctl enable --now qemu-guest-agent in the case of Debian-based systems.
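A quick way to compare the two sides is to read the agent option from the VM config and then ping the agent from the host. This is only a sketch: agent_enabled is a made-up helper name, it assumes the option shows up in qm config as a line like "agent: 1" or "agent: enabled=1,...", and VMID 103 is just one of the IDs mentioned in this thread.

```shell
# Hypothetical helper: reads `qm config <vmid>` output on stdin and reports
# whether the "QEMU guest agent" option is switched on for that VM.
# Assumption: the option appears as "agent: 1" or "agent: enabled=1,...".
agent_enabled() {
    awk -F': *' '
        $1 == "agent" { if ($2 ~ /^(1|enabled=1)(,|$)/) on = 1 }
        END { print (on ? "enabled" : "disabled") }
    '
}

# Usage on the PVE host:
#   qm config 103 | agent_enabled
#   qm guest cmd 103 ping && echo "agent answers"
```

If the first command prints "enabled" but the ping fails, the VM is in the second situation listed above: the option is set, but no agent is answering inside the guest.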

Hope this helps.
 
I've tried shutting down "by hand" the 2 VMs and 2 CTs running on my PVE host, and it takes less than 40 seconds:
Code:
time pve-bulk shutdown --vm-list=103,104 --ct-list=101,105
ct101 : shutdown has succeeded
ct105 : shutdown has succeeded
vm103 : shutdown has succeeded
vm104 : shutdown has succeeded

real    0m39.423s
user    0m1.684s
sys     0m0.204s
(pve-bulk calls qm/pct shutdown)
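
To see which guest accounts for most of that time, the same qm/pct shutdown calls can be timed one guest at a time. The sketch below is only an illustration: timed_shutdown and the QM/PCT override variables are made-up names and are not part of PVE or pve-bulk.

```shell
# Hypothetical helper: shut guests down one at a time and print how long
# each one takes. QM and PCT can be overridden for a dry run; by default
# the real qm and pct commands are used.
timed_shutdown() {
    kind=$1; shift            # "vm" or "ct"
    case $kind in
        vm) tool=${QM:-qm} ;;
        ct) tool=${PCT:-pct} ;;
        *)  echo "usage: timed_shutdown vm|ct ID..." >&2; return 2 ;;
    esac
    for id in "$@"; do
        start=$(date +%s)
        if "$tool" shutdown "$id"; then result=succeeded; else result=failed; fi
        end=$(date +%s)
        echo "$kind$id : shutdown has $result after $((end - start))s"
    done
}

# Usage with the IDs from this post:
#   timed_shutdown ct 101 105
#   timed_shutdown vm 103 104
```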

These are the VMs/CTs that were running when I did the shutdown. Is there any log I can check to see what happens?
 

l.wagner

Proxmox Staff Member
Hey,

so, yesterday I tried to shut down a PVE host with a running VM using systemctl poweroff - I did not observe any issues there.

These are the VMs/CTs that were running when I did the shutdown. Is there any log I can check to see what happens?

You can look at the task log in the PVE UI; there should be a task "Stop all VMs and Containers", which may contain some info on what is going on. Furthermore, you could check the output of journalctl --boot=-1 -e to get the system log messages from the previous boot.
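
Since the log of the previous boot can be long, it may help to narrow it down to lines about stopping guests. This is a rough sketch: guest_stop_lines is a made-up helper name, and the search patterns (including the pve-guests unit name and the stopall task type) are my guess at what is worth looking for, not a definitive list.

```shell
# Hypothetical helper: keep only log lines that look related to guest
# shutdown. Reads log text on stdin, so it works with journalctl output
# or a saved file. The patterns are assumptions; adjust as needed.
guest_stop_lines() {
    grep -E -i 'pve-guests|stopall|shutdown|timeout|timed out'
}

# Usage on the PVE host:
#   journalctl --boot=-1 --no-pager | guest_stop_lines
#   journalctl --boot=-1 --no-pager -u pve-guests.service
```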

Hope this helps.
 
Hi Wagner, does the shutdown procedure execute a stop or a shutdown on the CTs/VMs? I ask because the shutdown, which should take longer, finishes in 40 seconds, while a stop of the running CTs and VMs gets stuck on stopping CT 101 for minutes.
I thought that the PVE shutdown procedure used the shutdown command to cleanly shut down running CTs and VMs, and not the stop command, which may damage a VM's filesystem with an immediate stop. If it uses the stop command, we should always do a clean shutdown of everything running on the host before launching the shutdown of PVE itself, right?
BTW, if the PVE shutdown runs a stop on everything instead of a shutdown, my problem could be due to the fact that CT 101 does not stop immediately (SIX minutes to stop it, 2 seconds to shut it down):

Code:
root@pve:~# time pct shutdown 101

real    0m2.862s
user    0m0.461s
sys     0m0.034s
root@pve:~# pct start 101
root@pve:~# pct status 101
status: running
root@pve:~# time pct stop 101

real    6m6.311s
user    0m0.500s
sys     0m0.075s

Here is the config of CT 101:
Code:
root@pve:~# cat /etc/pve/lxc/101.conf
#**Proxmox Backup Server**
arch: amd64
cores: 6
features: nesting=1
hostname: ProxmoxBackupServer
memory: 4096
mp1: /mnt/pve/sata_disk/pbs_backups,mp=/pbs_backups
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.0.0.254,hwaddr=CA:CE:16:FC:86:01,ip=10.0.0.5/24,type=veth
onboot: 1
ostype: debian
rootfs: local-lvm:vm-101-disk-0,size=32G
searchdomain: mynetwork
startup: order=1
swap: 512
 

l.wagner

Proxmox Staff Member
does the shutdown procedure execute a stop or a shutdown on the CTs/VMs?
It does perform a shutdown, followed by a stop after a certain timeout (I believe it was two or three minutes) if the guest did not shut down properly.
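
The same "shutdown first, stop as a fallback" sequence can be reproduced by hand for a single container. The following is a minimal sketch, not the code PVE runs at host shutdown: shutdown_then_stop and the PCT override variable are made-up names, the 180 second default is only my recollection of the timeout, and it assumes that pct shutdown exits with a non-zero status when the timeout expires.

```shell
# Hypothetical helper: try a clean shutdown first, fall back to a hard stop.
# PCT can be overridden for a dry run; by default the real pct command is used.
shutdown_then_stop() {
    ctid=$1
    grace=${2:-180}   # seconds to wait for the clean shutdown (assumed value)
    if "${PCT:-pct}" shutdown "$ctid" --timeout "$grace"; then
        echo "ct$ctid: clean shutdown"
    elif "${PCT:-pct}" stop "$ctid"; then
        echo "ct$ctid: forced stop"
    else
        echo "ct$ctid: stop failed" >&2
        return 1
    fi
}

# Usage:
#   shutdown_then_stop 101 180
# pct can also combine both steps itself:
#   pct shutdown 101 --timeout 180 --forceStop 1
```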

What really strikes me as odd is the time it takes to stop your PBS container.
Can you reliably reproduce this behaviour? Does this happen for other containers as well? Are there any tasks running in the GUI before you attempt stopping the container? Because afaik stopping the container will first try to wait for the running tasks to finish before the stop command is executed.
 
Hi Wagner,
I've cloned the container onto another bridge, mounting a different folder (to avoid messing with the backups), and the clone stopped normally. This evening I'll try to restore the same container 101 and see whether it stops in a few seconds. No tasks were running when I tried the stops, and the slow stop happens deterministically. Shutdown, on the other hand, is fast.

If it can be of any help, I ran strace lxc-stop -n 101 --kill and it gets stuck here:

Code:
connect(3, {sa_family=AF_UNIX, sun_path=@"/var/lib/lxc/101/command"}, 27) = 0
getpid()                                = 2624506
getuid()                                = 0
getgid()                                = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=2624506, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3)                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/var/lib/lxc/101/command"}, 27) = 0
getpid()                                = 2624506
getuid()                                = 0
getgid()                                = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=2624506, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3,

// WAIT SOME MINUTES, the recvmsg is completed //

recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

I'll attach the whole output of strace.
 

Attachments

  • lxc-stop-log.txt
    14.2 KB · Views: 1
Small update: I did the following tests with CT 101, without restoring from backups:
- I removed the NFS datastore that the PBS container points to => nothing changed
- I moved the net0 interface from vmbr0 to vmbr1, where nothing can reach it (PVE cannot poll it) => it stops in a few seconds

So it seems that the very slow stop happens when the container is on vmbr0; maybe it is stuck on some network connection (from PVE?). As soon as I launch the stop command I can no longer reach it over the network, so it seems strange that it is stuck on something related to networking.

BTW, even if stopping CT 101 takes this long, it should not affect the host shutdown procedure, since a normal shutdown of CT 101 takes a few seconds (this is really something I don't understand; shutdown should be slower than stop). If there is something else I can try, just tell me and I'll test it ;)
 

l.wagner

Proxmox Staff Member
So it seems that the very slow stop happens when the container is on vmbr0; maybe it is stuck on some network connection (from PVE?). As soon as I launch the stop command I can no longer reach it over the network, so it seems strange that it is stuck on something related to networking.
Could you provide me with an overview of your setup? What kind of networks do you have, what is their purpose?

BTW, even if stopping CT 101 takes this long, it should not affect the host shutdown procedure, since a normal shutdown of CT 101 takes a few seconds (this is really something I don't understand; shutdown should be slower than stop). If there is something else I can try, just tell me and I'll test it ;)
I did some research regarding the exact differences between shutdown and stop. For LXC containers, shutdown sends a SIGPWR signal to the init process of the container. So in the case of PBS, systemd will gracefully stop all services and then terminate. stop on the other hand simply kills all processes running inside the container.
I suspect that there might be a process in the "D" state (uninterruptible sleep [1]), which prevents it from being killed. You could check the output of ps aux to see if there are any processes in this state.

[1] https://unix.stackexchange.com/questions/16738/when-a-process-will-go-to-d-state
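
To make D-state processes easier to spot than in plain ps aux, the output can be filtered on the STAT column. A small sketch, assuming the procps version of ps and the column layout shown in the usage comment; d_state_only is a made-up helper name. It can be run inside the container or on the PVE host while the stop is hanging.

```shell
# Hypothetical helper: keep the header line plus every process whose STAT
# column starts with D (uninterruptible sleep).
# Expects the column layout of: ps -eo pid,stat,wchan:32,args
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage:
#   ps -eo pid,stat,wchan:32,args | d_state_only
```

The WCHAN column shows the kernel function the process is sleeping in, which may hint at what it is waiting for.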
 
Could you provide me with an overview of your setup? What kind of networks do you have, what is their purpose?
Hi,
I'm using vmbr0 as the main bridge to connect VMs to my home network. It is VLAN-aware since one of the VMs tags its traffic. There are also two other bridges used for internal communication between some VMs/CTs. The fact that moving the network interface of PBS CT 101 to another vmbr makes it stop almost immediately made me think that it is kept alive by a connection from outside, for example the connection from PVE. So I tried leaving PBS on vmbr0 but blocking all communication from the outside with the firewall of CT 101, and nothing changed. Then I set the firewall output policy to REJECT and rebooted CT 101: with no outgoing communication allowed, stopping CT 101 is almost immediate, so some network communication is blocking the stop process. As soon as I launch pct stop 101, the lxc-start process goes into the D state (while inside the PBS CT there are no D-state processes before launching the stop):

Code:
root@pve:~# ps aux | grep "lxc.*101" | grep -v grep
root     3192525  0.0  0.0   3952  3208 ?        Ds   11:47   0:00 /usr/bin/lxc-start -F -n 101
root     3193084  0.0  0.0  91256  7988 ?        S    11:47   0:00 /usr/bin/termproxy 5900 --path /vms/101 --perm VM.Console -- /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193088  0.0  0.0   2280   588 pts/1    Ss+  11:47   0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193089  0.0  0.0   2412    84 ?        Ss   11:47   0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193090  0.0  0.0   3884  2640 pts/3    Ss+  11:47   0:00 lxc-console -n 101 -e -1
root     3193102  0.0  0.0   3884  2804 pts/0    S+   11:47   0:00 lxc-stop -n 101 --kill

This is my /etc/network/interfaces file

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
    ethernet-wol g
    post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.4/24
    gateway 10.0.0.254
    bridge-ports eno1 eno2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

iface wlp0s20f3 inet manual

iface eno2 inet manual

auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr999
iface vmbr999 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0


I did some research regarding the exact differences between shutdown and stop. For LXC containers, shutdown sends a SIGPWR signal to the init process of the container. So in the case of PBS, systemd will gracefully stop all services and then terminate. stop on the other hand simply kills all processes running inside the container.
I suspect that there might be a process in the "D" state (uninterruptible sleep [1]), which prevents it from being killed. You could check the output of ps aux to see if there are any processes in this state.

[1] https://unix.stackexchange.com/questions/16738/when-a-process-will-go-to-d-state

No processes are in D state within PBS:

Code:
root@ProxmoxBackupServer:~# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2 165652  9832 ?        Ss   Nov28   0:00 /sbin/init
root          53  0.0  0.2  32004 10936 ?        Ss   Nov28   0:00 /lib/systemd/systemd-journald
systemd+      64  0.0  0.1  16052  6896 ?        Ss   Nov28   0:00 /lib/systemd/systemd-networkd
systemd+      79  0.0  0.2  24112 11752 ?        Ss   Nov28   0:00 /lib/systemd/systemd-resolved
message+      82  0.0  0.0   8092  4176 ?        Ss   Nov28   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofor
root          83  0.0  0.0 220800  3492 ?        Ssl  Nov28   0:00 /usr/sbin/rsyslogd -n -iNONE
root          84  0.0  0.1  14008  6916 ?        Ss   Nov28   0:00 /lib/systemd/systemd-logind
root         126  0.0  0.4 512040 20320 ?        Ssl  Nov28   1:41 /usr/lib/x86_64-linux-gnu/proxmox-backup/proxmox-backup-
backup       154  0.0  0.8 1325144 34460 ?       Ssl  Nov28   1:52 /usr/lib/x86_64-linux-gnu/proxmox-backup/proxmox-backup-
root         238  0.0  0.0   3748  2376 ?        Ss   Nov28   0:00 /usr/sbin/cron -f
root         241  0.0  0.0   2508  1712 pts/0    Ss+  Nov28   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 1
root         242  0.0  0.0   2508  1652 pts/1    Ss+  Nov28   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud tty1 1152
root         243  0.0  0.0   2508  1676 pts/2    Ss+  Nov28   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud tty2 1152
root         311  0.0  0.1  40052  4808 ?        Ss   Nov28   0:00 /usr/lib/postfix/sbin/master -w
postfix      313  0.0  0.1  40356  6788 ?        S    Nov28   0:00 qmgr -l -t unix -u
postfix      989  0.0  0.1  40308  6720 ?        S    09:09   0:00 pickup -l -t unix -u -c
root        1029  0.0  0.2  14520  8824 ?        Ss   10:21   0:00 sshd: root@pts/3
root        1032  0.0  0.1  15068  7664 ?        Ss   10:21   0:00 /lib/systemd/systemd --user
root        1033  0.0  0.0 168608  2656 ?        S    10:21   0:00 (sd-pam)
root        1042  0.0  0.0   4824  4056 pts/3    Ss   10:21   0:00 -bash
root        1049  0.0  0.0   6760  2944 pts/3    R+   10:22   0:00 ps aux
 
@l.wagner I ran a test by launching the container directly with "strace lxc-start" and then running the pct stop 101 command. What makes the stop wait for minutes is the lxc-start process in the D state, which is stuck on closing a file descriptor. In my test it was FD 5:

Code:
strace /usr/bin/lxc-start -F -n 101 2>&1 | grep "open\|close" | tee strace.log
[...]
openat(AT_FDCWD, "/proc/3197648/ns/net", O_RDONLY|O_CLOEXEC) = 5
close(5)                                = 0
openat(AT_FDCWD, "/run/lxc//var/lib/lxc/monitor-fifo", O_WRONLY|O_NONBLOCK) = 5
close(5)                                = 0
close(5)                                = 0
close(5)                                = 0
close(5)                                = 0
openat(AT_FDCWD, "/run/lxc//var/lib/lxc/monitor-fifo", O_WRONLY|O_NONBLOCK) = 5
close(5)                                = 0
close(5)                                = 0
close(5)

I can't really tell whether the problem is closing /run/lxc//var/lib/lxc/monitor-fifo, since I can see several close(5) calls that complete after the open that returns 5, but I'm definitely no expert in using strace. BTW, as soon as FD 5 closes, the stop completes.
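
One way to pin down which call actually blocks is to let strace record how long each syscall takes instead of grepping for open/close. The sketch below relies on strace's -T (time spent in each syscall), -y (print the path behind each file descriptor) and -f (follow child processes) options; slow_syscalls is a made-up helper name that keeps only the calls slower than a threshold, and the 5 second threshold is arbitrary.

```shell
# Hypothetical helper: print strace lines whose syscall took at least MIN
# seconds. Expects output produced with `strace -T`, where each completed
# call ends in <seconds>, for example: close(5) = 0 <0.000021>
slow_syscalls() {
    min=${1:-1}
    awk -v min="$min" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            if (secs + 0 >= min + 0) print
        }
    '
}

# Usage:
#   strace -f -tt -T -y -e trace=openat,close -o strace.log /usr/bin/lxc-start -F -n 101
#   (run pct stop 101 in another shell and wait for it to finish)
#   slow_syscalls 5 < strace.log
```

With -y the slow close() should show the path or socket behind the descriptor, which would say whether FD 5 really is the monitor-fifo at that point or something else that reused the same number.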
 

l.wagner

Proxmox Staff Member
I did some experiments, but I could not really reproduce the issue on my end. Maybe some other users have an idea what could be going on here.
 