qmeventd sends SIGKILL before the VM has shut down its NICs

auranext

Well-Known Member
Jun 5, 2018
Hi,
We have been using Proxmox for many years.
We have servers running different versions: PVE 5.3, PVE 6.2 and PVE 7.0.
We virtualize routers, which means a VM can have 10 to 20 NICs.
In this context we have noticed a problem:
after running the qm shutdown command, the QEMU process receives a SIGKILL from qmeventd before it has finished its NIC shutdown sequence.
This behavior exists in PVE 7.0 but not in PVE 6.2 or PVE 5.3.
Reading the source code, we can see that a hardcoded 5-second timeout was introduced in qmeventd somewhere between PVE 6.2 and PVE 7.0:
line 481 -> alarm(5);
https://github.com/proxmox/qemu-server/blob/master/qmeventd/qmeventd.c

The 5-second timeout is not mentioned in the changelog:
https://github.com/proxmox/qemu-server/blob/master/debian/changelog

In our case the 5-second timeout is far too short. Isn't it a problem for specific cases, especially where some network state needs to be cleaned up, as with SDN?
In addition, because it is hardcoded, there is no way to adjust it to the number of NICs a VM has.

Any help is appreciated ;)


regards,

maxime

thx
 
I am not saying there couldn't be a bug, but I think you are misinterpreting the 'alarm(5)' line.
When the VM is stopped, we send a 'SIGTERM' to the QEMU process (but only if it has already told us that it shut down),
and after 5 seconds a 'SIGKILL' (which is what the alarm(5) you mentioned triggers).

So at this point the VM should already be shut down, but maybe QEMU sends the shutdown signal before executing the NIC scripts... I'll have to check.
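
The SIGTERM-then-SIGKILL sequence described in this reply can be illustrated with a small sketch. This is not the qmeventd code (which is written in C and uses alarm(5)); it is a Python approximation of the same pattern, and the names stop_with_grace, spawn_fake_vm and the child script are hypothetical. The child stands in for a QEMU process whose teardown takes a configurable amount of time.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a QEMU process: on SIGTERM it spends
# sys.argv[1] seconds on "teardown" before exiting cleanly.
CHILD = """
import signal, sys, time
def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # simulated NIC teardown work
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""

def spawn_fake_vm(teardown_seconds: float) -> subprocess.Popen:
    """Start the stand-in process and wait until its handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(teardown_seconds)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # blocks until the child prints "ready"
    proc.stdout.close()
    return proc

def stop_with_grace(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "exited after SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed with SIGKILL"
```

With a fixed grace period, any teardown that takes longer than the grace period ends in SIGKILL regardless of how much work is still pending, which is the situation reported in this thread.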
 
Hi,

Thank you for this explanation.
On my side I will investigate a little more to clarify my observations.
In the meantime I will wait for the result of your check.
 
I can reproduce the following behavior on a node running PVE 7.2-11.
Invoking a shutdown (from the PVE web UI) on a VM with 13 interfaces and a running, fully operational qemu-guest-agent results in qmeventd sending SIGKILL. So it seems that qmeventd doesn't give the VM a chance to stop its network devices properly.

Sep 19 17:41:20 pve7-0 QEMU[67633]: kvm: terminating on signal 15 from pid 36174 (/usr/sbin/qmeventd)
Sep 19 17:41:25 pve7-0 qmeventd[36174]: cleanup failed, terminating pid '67633' with SIGKILL
Sep 19 17:41:26 pve7-0 qmeventd[68565]: Starting cleanup for 100
Sep 19 17:41:26 pve7-0 qmeventd[68565]: Finished cleanup for 100

The same VM (a clone) with only 3 NICs has enough time to stop its NICs properly, so qmeventd does not send SIGKILL.

Sep 19 17:54:00 pve7-0 QEMU[70219]: kvm: terminating on signal 15 from pid 36174 (/usr/sbin/qmeventd)
Sep 19 17:54:01 pve7-0 qmeventd[36174]: read: Connection reset by peer
Sep 19 17:54:02 pve7-0 qmeventd[70751]: Starting cleanup for 101
Sep 19 17:54:02 pve7-0 qmeventd[70751]: Finished cleanup for 101
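
To make the timing in the two log excerpts explicit, here is a minimal Python sketch that measures the delay between the SIGTERM line and the next log line. The log lines are copied from the excerpts above; the function name seconds_after_sigterm is hypothetical, and the year 2000 is an arbitrary placeholder because syslog timestamps carry no year.

```python
from datetime import datetime

LOG = """\
Sep 19 17:41:20 pve7-0 QEMU[67633]: kvm: terminating on signal 15 from pid 36174 (/usr/sbin/qmeventd)
Sep 19 17:41:25 pve7-0 qmeventd[36174]: cleanup failed, terminating pid '67633' with SIGKILL
Sep 19 17:54:00 pve7-0 QEMU[70219]: kvm: terminating on signal 15 from pid 36174 (/usr/sbin/qmeventd)
Sep 19 17:54:01 pve7-0 qmeventd[36174]: read: Connection reset by peer
"""

def seconds_after_sigterm(log_text):
    """For each SIGTERM line, return (delay to next line in s, was it SIGKILL?)."""
    results = []
    term_time = None
    for line in log_text.splitlines():
        # Assumption: placeholder year, since the syslog prefix has none.
        stamp = datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")
        if "terminating on signal 15" in line:
            term_time = stamp
        elif term_time is not None:
            delay = int((stamp - term_time).total_seconds())
            results.append((delay, "SIGKILL" in line))
            term_time = None
    return results

print(seconds_after_sigterm(LOG))
```

For the 13-NIC VM the SIGKILL arrives exactly 5 seconds after the SIGTERM, while the 3-NIC VM closes its connection after about 1 second, which matches the hardcoded alarm(5) discussed earlier in the thread.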
 
Mhmm, sorry, I cannot really reproduce it.

I started a VM with more than 20 NICs and could start and shut it down properly without any issue (qmeventd started the cleanup only after all NICs were removed).

I also tried a custom VM with a script on a netdev that simply sleeps for 30 seconds, but that also triggers qmeventd only after the script has finished.

Can you post the versions (pveversion -v), the VM config (qm config ID) and the complete syslog from the time you shut down until the 'Finished cleanup' line?
 
I have just uploaded the debug data you requested.
Did you try with a sleep 1 in the bridge-down script?

It is not really related, but we found a small bug that causes the cleanup procedure to process net devices incorrectly:
/usr/share/perl5/PVE/CLI/qm.pm +812

#next if $opt !~ m/^net(\d)+$/;
next if $opt !~ m/^net(\d+)$/;
 

Attachments

  • cleanup-failed.tar
    20 KB · Views: 2
/usr/share/perl5/PVE/CLI/qm.pm +812

#next if $opt !~ m/^net(\d)+$/;
next if $opt !~ m/^net(\d+)$/;
huh? did you manually edit that?

what bug exactly?

I have just uploaded the debug data you requested
Did you try with a sleep 1 in bridge-down script ?
i have setup a vm with a custom '-netdev tap,....,script=foo.sh,downscript=foodown.sh' which only have a 'sleep 30' in them and it worked without problems
 
# the original line
next if $opt !~ m/^net(\d)+$/;

# we have corrected the line for testing
next if $opt !~ m/^net(\d+)$/;

The bug is that the "+" quantifier must be inside the capturing parentheses, but it is placed after them.
So for a net label like net12, the group captures only a single digit instead of the full index 12.
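
The difference between the two patterns can be checked with a short sketch. It uses Python's re module rather than Perl, so it only approximates what qm.pm does; the helper name captured_index is hypothetical. In Python, a capture group that repeats keeps only its last repetition, and as far as I know Perl behaves the same way, so the single digit that survives is the last one, not the first.

```python
import re

ORIGINAL = re.compile(r"^net(\d)+$")  # quantifier outside the group
PROPOSED = re.compile(r"^net(\d+)$")  # quantifier inside the group

def captured_index(pattern, opt):
    """Return what group 1 captures for a config key, or None if no match."""
    match = pattern.match(opt)
    return match.group(1) if match else None

for key in ("net3", "net12"):
    print(key, captured_index(ORIGINAL, key), captured_index(PROPOSED, key))
```

Both patterns accept net12 as a valid key, so the "next if" filter itself is unaffected; only the captured interface index differs, and only for interfaces numbered 10 and above.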
 
i have setup a vm with a custom '-netdev tap,....,script=foo.sh,downscript=foodown.sh' which only have a 'sleep 30' in them and it worked without problems
With a sleep 30 you should notice that after 5 seconds qmeventd sends a SIGKILL to the QEMU process and then invokes qm cleanup, but the cleanup does not call tap_unplug. (To see the SIGKILL in the logs you must set /etc/pve/.debug to 1.)
After a qm shutdown, if qmeventd sends SIGKILL, we would expect the cleanup procedure to call the tap_unplug function.
The question is why it does not.
It does not because the clean argument is set to 1, so it considers that the cleanup has already been done.
So the real question is why the clean argument is set to 1.
 
OK, after looking at it again I can reproduce it (somewhat); yesterday I used 'qm stop' instead of 'qm shutdown', which has (unexpectedly, for me at least) different behaviour.

I can see that the 5s timeout is probably too short for some setups, and there is also quite some potential for improvement in that code in general, so we're looking at how we can improve it. In the meantime, could you open a bug report so that we can track it better? https://bugzilla.proxmox.com
 
