KVM VMs are restarted erratically

madz

Apr 29, 2013
Hi all !
We are running a two-node HA cluster with a quorum disk and around 30 LVM-based VMs per node. Two or three times a day, some of the VMs
are shut down and restarted by Proxmox. Which VMs are affected seems to be random.

/var/log/cluster/rgmanager.log shows
Code:
Apr 29 12:07:32 rgmanager [pvevm] got empty cluster VM list
at the beginning of the sequence.
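
To see how often and when this message turns up, the log can be tallied per hour. The following is only a minimal sketch: it assumes every relevant line has exactly the format of the log line quoted above, and the function name `count_empty_vmlist` and all sample lines except the first are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted from rgmanager.log:
#   Apr 29 12:07:32 rgmanager [pvevm] got empty cluster VM list
# Assumption: the real log uses this exact layout for every occurrence.
PATTERN = re.compile(
    r"^(?P<day>\w{3}\s+\d{1,2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"rgmanager \[pvevm\] got empty cluster VM list"
)

def count_empty_vmlist(lines):
    """Count 'got empty cluster VM list' messages per (day, hour)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            counts[(match.group("day"), match.group("hour"))] += 1
    return counts

# Illustrative input: only the first line comes from the real log,
# the others are invented to exercise the counter.
sample = [
    "Apr 29 12:07:32 rgmanager [pvevm] got empty cluster VM list",
    "Apr 29 12:07:33 rgmanager [pvevm] got empty cluster VM list",
    "Apr 29 15:41:02 rgmanager [pvevm] got empty cluster VM list",
    "Apr 29 15:41:05 rgmanager some other message",
]

for (day, hour), n in sorted(count_empty_vmlist(sample).items()):
    print(f"{day} {hour}:00  {n}")
```

On a real node the lines would come from `open("/var/log/cluster/rgmanager.log")` instead of the sample list.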

To track this down, I extracted the code responsible for the message into a small Perl script that fetches the list every second, but
the script always gets it correctly.

Any hints what to do or try ?

Thanks a lot for proxmox !
 
Any hints in syslog?

Nothing unusual in syslog and the other system logs.
Restart statistics show that the "first" 15 machines (by VM ID) are almost never affected,
whereas the rest are, in no particular order. The VMs themselves do not differ significantly.

I tried to trace what the pvevm script does with Devel::Trace, and where it gets its vmlist from,
but lost the thread somewhere in Cluster.pm. vmlist seems to be populated by some ipcc call, and I can't
find the "other end" of this call. I suspect there might be some kind of timeout when populating vmlist,
which depends on the number of running VMs ...

Next I'll try to put half of the VMs on the other node to see if this happens when the number of VMs is lower...

Thanks for your reply !
 
Re: KVM VMs are restarted erratically [SOLVED]

Went back to basics after my deep dive into Perl and C ...

The rgmanager man page says:
status_child_max - Maximum number of status check threads (default =
5). It is not recommended that this ever be changed. This simply
controls how many instances of clustat queries may be outstanding on a
single node at any given time.
It seems that the default value is appropriate for up to approx. 15 VMs.
I doubled the value and now Proxmox works like a charm with 30 VMs
per node.
Thanks for proxmox !
Martin
 
Re: KVM VMs are restarted erratically [SOLVED]

Hey madz

I ran into the same issue for the first time today.

I have 3 nodes with 110 KVM VMs. One of them hosts 60, and today that node logged "rgmanager [pvevm] got empty cluster VM list", so I think it's the same problem.
I want to try a value of 15.
Does this look correct?

Code:
  <rm status_child_max="15" >
    <pvevm autostart="1" vmid="120"/>
...
    <pvevm autostart="1" vmid="129"/>
  </rm>
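
One way to double-check such an edit is to parse the `<rm>` section and print the `status_child_max` value next to the number of `pvevm` entries. This is only a sketch under assumptions: the helper name `summarize_rm` is hypothetical, and the fragment below is a shortened stand-in for the snippet above with the elided entries left out, not a complete cluster.conf.

```python
import xml.etree.ElementTree as ET

def summarize_rm(xml_text):
    """Return status_child_max and the number of <pvevm> entries under <rm>."""
    root = ET.fromstring(xml_text)
    # Accept either a bare <rm> fragment or a document that contains <rm>.
    rm = root if root.tag == "rm" else root.find(".//rm")
    if rm is None:
        raise ValueError("no <rm> element found")
    return {
        # None means the attribute is absent, so rgmanager's default applies.
        "status_child_max": rm.get("status_child_max"),
        "pvevm_count": len(rm.findall("pvevm")),
    }

# Shortened stand-in for the config above; the elided VMs are omitted.
fragment = """
<rm status_child_max="15">
  <pvevm autostart="1" vmid="120"/>
  <pvevm autostart="1" vmid="129"/>
</rm>
"""

print(summarize_rm(fragment))
```

Parsing also catches stray characters after the closing tag, since `ET.fromstring` raises `ParseError` on malformed XML.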