Fast CT shutdown, very slow (minutes) CT stop

Hi,
In another thread I asked for help with why Proxmox does not fully shut down and stays stuck for minutes on a black screen with a blinking cursor. I’ve noticed this strange behavior:

Code:
root@pve:~# time pct shutdown 101

real    0m2.862s
user    0m0.461s
sys     0m0.034s
root@pve:~# pct start 101
root@pve:~# pct status 101
status: running
root@pve:~# time pct stop 101

real    6m6.311s
user    0m0.500s
sys     0m0.075s

I expected the Proxmox VE shutdown procedure to run a shutdown on every VM and CT, but maybe it runs a less clean stop instead, and this could be why it takes ages: stopping CT 101 is very slow. Is there a way to find out where it gets stuck? In which script can I see what PVE executes on shutdown?

The config of CT 101 follows:
Code:
root@pve:~# cat /etc/pve/lxc/101.conf
#**Proxmox Backup Server**
arch: amd64
cores: 6
features: nesting=1
hostname: ProxmoxBackupServer
memory: 4096
mp1: /mnt/pve/sata_disk/pbs_backups,mp=/pbs_backups
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.0.0.254,hwaddr=CA:CE:16:FC:86:01,ip=10.0.0.5/24,type=veth
onboot: 1
ostype: debian
rootfs: local-lvm:vm-101-disk-0,size=32G
searchdomain: mynetwork
startup: order=1
swap: 512
 
Regardless of whether you stop the container via `pct shutdown`, the shutdown button in the web GUI, or a host shutdown, the same process takes place inside the container. /var/log/syslog in the container shows what happens. Typical output:


Code:
Dec  1 09:25:13 debct systemd[1]: Received SIGRTMIN+3.
Dec  1 09:25:13 debct systemd[1]: Removed slice system-modprobe.slice.
Dec  1 09:25:13 debct systemd[1]: Stopped target Graphical Interface.
Dec  1 09:25:13 debct systemd[1]: Stopped target Multi-User System.
Dec  1 09:25:13 debct systemd[1]: Stopped target Login Prompts.
Dec  1 09:25:13 debct systemd[1]: Stopped target Remote Encrypted Volumes.
Dec  1 09:25:13 debct systemd[1]: Stopped target Timers.
Dec  1 09:25:13 debct systemd[1]: apt-daily-upgrade.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Daily apt upgrade and clean activities.
Dec  1 09:25:13 debct systemd[1]: apt-daily.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Daily apt download activities.
Dec  1 09:25:13 debct systemd[1]: e2scrub_all.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Periodic ext4 Online Metadata Check for All Filesystems.
Dec  1 09:25:13 debct systemd[1]: logrotate.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Daily rotation of log files.
Dec  1 09:25:13 debct systemd[1]: man-db.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Daily man-db regeneration.
Dec  1 09:25:13 debct systemd[1]: systemd-tmpfiles-clean.timer: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Dec  1 09:25:13 debct systemd[1]: Stopped target System Time Synchronized.
Dec  1 09:25:13 debct systemd[1]: Stopped target System Time Set.
Dec  1 09:25:13 debct systemd[1]: Reached target Unmount All Filesystems.
Dec  1 09:25:13 debct systemd[1]: Stopping Console Getty...
Dec  1 09:25:13 debct systemd[1]: Stopping Container Getty on /dev/tty1...
Dec  1 09:25:13 debct systemd[1]: Stopping Container Getty on /dev/tty2...
Dec  1 09:25:13 debct systemd[1]: Stopping Regular background program processing daemon...
Dec  1 09:25:13 debct systemd[1]: Stopping D-Bus System Message Bus...
Dec  1 09:25:13 debct systemd[1]: postfix.service: Succeeded.
Dec  1 09:25:13 debct systemd[1]: Stopped Postfix Mail Transport Agent.
Dec  1 09:25:13 debct systemd[1]: Stopping Postfix Mail Transport Agent (instance -)...

As can be seen, the OS on its own shuts down within 1 second. If it takes longer, something special is configured in the container.
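
A rough way to check this on your own container is to pull the timestamps of the shutdown sequence out of its syslog. This is only a sketch: `shutdown_span` is a made-up helper name, and it assumes the classic syslog line format shown above, with the time of day in the third field.

```shell
# Hypothetical helper: read syslog lines on stdin and print the time at which
# systemd received SIGRTMIN+3 plus the time of the last systemd[1] line after it.
# If the log holds several shutdowns, only the most recent one is reported.
shutdown_span() {
    awk '
        index($0, "systemd[1]: Received SIGRTMIN+3") { start = $3; last = $3; next }
        start != "" && index($0, "systemd[1]:")      { last = $3 }
        END { if (start != "") print "start=" start, "last=" last }
    '
}

# Possible usage on the PVE host, once the container is running again:
#   pct exec 101 -- cat /var/log/syslog | shutdown_span
```

If the two timestamps are only a second or two apart, the delay is probably not inside the guest's own shutdown sequence.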
 
Hi, the shutdown process is really fast, but when I stop CT 101 it gets stuck for 3-6 minutes. I've noticed that the lxc-start process stays in D state after the stop is issued, and it's waiting for a file descriptor to be closed:

Code:
root@pve:~# ps aux | grep "lxc.*101" | grep -v grep
root     3192525  0.0  0.0   3952  3208 ?        Ds   11:47   0:00 /usr/bin/lxc-start -F -n 101
root     3193084  0.0  0.0  91256  7988 ?        S    11:47   0:00 /usr/bin/termproxy 5900 --path /vms/101 --perm VM.Console -- /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193088  0.0  0.0   2280   588 pts/1    Ss+  11:47   0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193089  0.0  0.0   2412    84 ?        Ss   11:47   0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole101 -r winch -z lxc-console -n 101 -e -1
root     3193090  0.0  0.0   3884  2640 pts/3    Ss+  11:47   0:00 lxc-console -n 101 -e -1
root     3193102  0.0  0.0   3884  2804 pts/0    S+   11:47   0:00 lxc-stop -n 101 --kill

I launched the lxc-start command by hand under strace, and this is where it gets stuck as soon as I stop CT 101:

Code:
strace /usr/bin/lxc-start -F -n 101 2>&1 | grep "open\|close" | tee strace.log
[...]
openat(AT_FDCWD, "/proc/3197648/ns/net", O_RDONLY|O_CLOEXEC) = 5
close(5)                                = 0
openat(AT_FDCWD, "/run/lxc//var/lib/lxc/monitor-fifo", O_WRONLY|O_NONBLOCK) = 5
close(5)                                = 0
close(5)                                = 0
close(5)                                = 0
close(5)                                = 0
openat(AT_FDCWD, "/run/lxc//var/lib/lxc/monitor-fifo", O_WRONLY|O_NONBLOCK) = 5
close(5)                                = 0
close(5)                                = 0
close(5)
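
A variant of the same trace that might give more detail, using strace's own filter instead of grep. This is just a sketch: `trace_ct_start` is a hypothetical wrapper and the log file name is arbitrary. The flags are standard strace options: -f follows child processes, -tt prints wall-clock timestamps, -T prints the time spent in each call, -e trace= limits the output to the listed calls, and -o writes to a file.

```shell
# Hypothetical wrapper: start a container in the foreground under strace,
# logging only openat/close calls with timestamps and per-call durations.
trace_ct_start() {
    ctid="$1"
    strace -f -tt -T -e trace=openat,close \
        -o "strace-$ctid.log" \
        /usr/bin/lxc-start -F -n "$ctid"
}

# Possible usage on the PVE host (the container must be stopped first):
#   trace_ct_start 101
#   tail strace-101.log    # after issuing the stop from another shell
```

With timestamps in the log, the gap before the last call returns should show how long that call was blocked.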

If I block network traffic from CT 101 (with an outgoing ACL), it stops in a few seconds.

So, to summarize:
1) CT 101 shutdown is very fast
2) CT 101 stop takes minutes, with the lxc-start process stuck closing a file descriptor
3) CT 101 stop is very fast if I start the container with its interface moved to an isolated bridge or with outgoing traffic blocked (it is a PBS container, so there could be some communication with PVE that blocks the stop process)

For a bit of history: I started looking at how CTs and VMs handle the shutdown/stop procedure because I noticed that when I shut down my PVE host, it almost immediately shows a blank screen with the cursor at the top but then doesn't power off the machine (see this thread). BTW, since the PVE shutdown procedure should run a shutdown on VMs and CTs and not a stop, the slow stop of CT 101 shouldn't be the cause of that problem.