Problems with a VM that is in HA

cesarpk · May 26, 2014

Hi all

If somebody can help me, I will be very grateful.

I have a serious problem with a VM that is in HA:

1- The VM crashes about once a week, or becomes super slow after about a week:
- I can't connect to the VM by ssh (when the VM is super slow)
- When I connect by ssh to the PVE host proxmox7 (while the VM is super slow), I see that the VM is using all of my host's cores, at 162% CPU

I then run "qm stop 112" from the CLI (because the "shutdown" option gives me a timeout). Once the VM is no longer running, I run htop on the PVE host proxmox7 and see that some other process is still using a lot of CPU.

2- The PVE host shows these messages in the "Syslog" tab of the PVE GUI:

Code:

May 26 10:45:01 kvm7 /USR/SBIN/CRON[15429]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 26 10:45:07 proxmox7 pmxcfs[2903]: [status] notice: received log
May 26 10:45:33 proxmox7 rgmanager[15461]: [pvevm] VM 112 is running
May 26 10:45:43 proxmox7 rgmanager[15489]: [pvevm] VM 112 is running
May 26 10:46:23 proxmox7 rgmanager[15545]: [pvevm] VM 112 is running
May 26 10:46:43 proxmox7 rgmanager[15580]: [pvevm] VM 112 is running
May 26 10:47:03 proxmox7 rgmanager[15615]: [pvevm] VM 112 is running
May 26 10:47:43 proxmox7 rgmanager[15670]: [pvevm] VM 112 is running
May 26 10:47:53 proxmox7 rgmanager[15698]: [pvevm] VM 112 is running
May 26 10:48:23 proxmox7 rgmanager[15746]: [pvevm] VM 112 is running
May 26 10:49:03 proxmox7 rgmanager[15795]: [pvevm] VM 112 is running
May 26 10:49:13 proxmox7 rgmanager[15823]: [pvevm] VM 112 is running
May 26 10:49:53 proxmox7 rgmanager[15878]: [pvevm] VM 112 is running
May 26 10:50:13 proxmox7 rgmanager[15913]: [pvevm] VM 112 is running
May 26 10:50:33 proxmox7 rgmanager[15954]: [pvevm] VM 112 is running
May 26 10:51:14 proxmox7 rgmanager[16003]: [pvevm] VM 112 is running
May 26 10:51:23 proxmox7 rgmanager[16037]: [pvevm] VM 112 is running
May 26 10:51:53 proxmox7 rgmanager[16079]: [pvevm] VM 112 is running
May 26 10:52:23 proxmox7 rgmanager[16127]: [pvevm] VM 112 is running
May 26 10:52:33 proxmox7 rgmanager[16155]: [pvevm] VM 112 is running
May 26 10:53:03 proxmox7 rgmanager[16197]: [pvevm] VM 112 is running
May 26 10:53:23 proxmox7 rgmanager[16238]: [pvevm] VM 112 is running
May 26 10:53:43 proxmox7 rgmanager[16273]: [pvevm] VM 112 is running
May 26 10:54:13 proxmox7 rgmanager[16315]: [pvevm] VM 112 is running
May 26 10:54:23 proxmox7 rgmanager[16349]: [pvevm] VM 112 is running
May 26 10:54:44 proxmox7 rgmanager[16384]: [pvevm] VM 112 is running
...etc..etc...etc
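
The spacing of these "VM 112 is running" lines can be measured directly from the timestamps. The following is only a minimal Python sketch: the function name `status_intervals` is hypothetical, the sample is the first five rgmanager lines from the log above, and the year is an assumption because syslog timestamps do not include one.

```python
import re
from datetime import datetime

# First five rgmanager lines copied from the syslog excerpt above.
SAMPLE_LOG = """\
May 26 10:45:33 proxmox7 rgmanager[15461]: [pvevm] VM 112 is running
May 26 10:45:43 proxmox7 rgmanager[15489]: [pvevm] VM 112 is running
May 26 10:46:23 proxmox7 rgmanager[15545]: [pvevm] VM 112 is running
May 26 10:46:43 proxmox7 rgmanager[15580]: [pvevm] VM 112 is running
May 26 10:47:03 proxmox7 rgmanager[15615]: [pvevm] VM 112 is running
"""

# Captures the timestamp and the VM id of a "[pvevm] VM <id> is running" line.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ rgmanager\[\d+\]: "
    r"\[pvevm\] VM (\d+) is running"
)


def status_intervals(log_text, vmid, year=2014):
    """Return the gaps in seconds between consecutive status lines for one VM.

    The year is an assumption: syslog lines do not carry it.
    """
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) == str(vmid):
            stamps.append(
                datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(status_intervals(SAMPLE_LOG, 112))
```

For the five sample lines the gaps are 10, 40, 20 and 20 seconds, so the message is logged at short, fairly regular intervals.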

3- When I configured this VM for HA, it did not boot automatically (with "service rgmanager" and "join_fence" started), so I had to click "Start" in the PVE GUI to start the VM.

- This is the relevant part of my cluster.conf:

Code:

<rm>
    <pvevm autostart="1" vmid="112" domain="VM-Mail"/>
    <pvevm autostart="1" vmid="113" domain="VM-Order"/>
    <failoverdomains>
        <failoverdomain name="VM-Mail" restricted="1" ordered="1" nofailback="1">
            <failoverdomainnode name="proxmox7" priority="1"/>
            <failoverdomainnode name="proxmox8" priority="10"/>
        </failoverdomain>
        <failoverdomain name="VM-Order" restricted="1" ordered="1" nofailback="1">
            <failoverdomainnode name="proxmox8" priority="1"/>
            <failoverdomainnode name="proxmox7" priority="10"/>
        </failoverdomain>
    </failoverdomains>
</rm>
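
To double-check which node each HA VM should prefer under this fragment, it can be parsed with Python's standard `xml.etree.ElementTree`. This is a sketch only: `preferred_nodes` is a hypothetical helper, and it assumes that in an ordered failover domain a lower `priority` number means a more preferred node.

```python
import xml.etree.ElementTree as ET

# The <rm> fragment from the cluster.conf shown above.
RM_FRAGMENT = """
<rm>
    <pvevm autostart="1" vmid="112" domain="VM-Mail"/>
    <pvevm autostart="1" vmid="113" domain="VM-Order"/>
    <failoverdomains>
        <failoverdomain name="VM-Mail" restricted="1" ordered="1" nofailback="1">
            <failoverdomainnode name="proxmox7" priority="1"/>
            <failoverdomainnode name="proxmox8" priority="10"/>
        </failoverdomain>
        <failoverdomain name="VM-Order" restricted="1" ordered="1" nofailback="1">
            <failoverdomainnode name="proxmox8" priority="1"/>
            <failoverdomainnode name="proxmox7" priority="10"/>
        </failoverdomain>
    </failoverdomains>
</rm>
"""


def preferred_nodes(rm_xml):
    """Map each pvevm vmid to its failover nodes, most preferred first.

    Assumption: a lower priority number means a more preferred node.
    """
    root = ET.fromstring(rm_xml)
    domains = {}
    for domain in root.iter("failoverdomain"):
        nodes = sorted(
            domain.findall("failoverdomainnode"),
            key=lambda node: int(node.get("priority")),
        )
        domains[domain.get("name")] = [node.get("name") for node in nodes]
    # A pvevm that references an unknown domain gets an empty node list.
    return {
        vm.get("vmid"): domains.get(vm.get("domain"), [])
        for vm in root.findall("pvevm")
    }


if __name__ == "__main__":
    for vmid, nodes in preferred_nodes(RM_FRAGMENT).items():
        print(vmid, nodes)
```

Under that assumption, VM 112 resolves to proxmox7 first and VM 113 to proxmox8 first, and every `domain` attribute matches a defined failover domain.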

4- This is the configuration of my PVE Nodes:
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-29-pve: 2.6.32-126
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

5- This is the configuration of my VM:
boot: c
bootdisk: virtio0
cores: 4
cpu: host
ide2: none,media=cdrom
memory: 8192
name: Centos63-x64-Mail
net0: virtio=86:D6:B3:2F:0A:40,bridge=vmbr1
net1: virtio=D2:DF:CF:14:DB:6B,bridge=vmbr0
ostype: l26
sockets: 1
virtio0: Storage_HA-01-proxmox7-proxmox8:vm-112-disk-1,cache=directsync,size=571756M

General notes:
- The servers are Dell
- My RAID controller is a hardware controller (MegaRAID SAS 2008) without cache memory, but I don't think that is why the VM hangs
- Fencing isn't the problem for me (at least for now).
- The VM with problems runs CentOS 6.3 with its original kernel; when this VM ran on PVE 2.3 it never had problems, even when it was in HA
- My PVE hosts and the VM have used the deadline I/O scheduler since I moved to PVE 3.2 (before that, the PVE host and the VM were configured with cfq)
- The LVM volume group "pve" has no free space on my PVE host, but Tom (a Proxmox staff member) told me in the past that free space in this VG is only needed for live backups of CTs running in this VG.

My Questions:
1- Why does my VM hang or become super slow after about a week, and how can I fix it?
2- Why doesn't rgmanager start this VM when I reboot the affected PVE node, or when I first set it up in HA? (Please see my cluster.conf above, and keep in mind that fencing is in manual mode only, i.e. it needs human interaction, so the VM will never have to run on the other node until I apply the manual fence.) Could it be that I should not have two "pvevm" entries with different "domain" directives in my cluster.conf file? Or does PVE have a bug?
3- Is it normal for rgmanager to log the "VM 112 is running" message so repeatedly? If it is a problem, how can I fix it?
4- Which is better for the hardware BIOS configuration: power saving controlled by the hardware or by the OS kernel?

Best regards
Cesar

cesarpk · May 27, 2014

Can nobody help me with the problems in my previous post? ...

acidrop · May 27, 2014

Hello

Although I am not an expert, I would try these:
1. Change the CPU type on the VM from "host" to "kvm64" or "qemu64".
2. When the VM is slow, ssh in or open a VNC console on the VM and investigate which process is using the CPU.
3. Most probably the VM auto-start problems come from your use of manual fencing?

e100 · May 27, 2014

HA services can be set to disabled, this will prevent them from starting automatically.
This state can happen automatically if certain errors occur.

What is the output of clustat?

Also, when you add a VM to HA you must ensure the VM is stopped at the time you add it.

Did you increase the config_version when manually changing the cluster.conf?

I agree with acidrop, human fencing might be part of your issue too.

The problem with your VM being slow is likely something in the VM itself.
What sort of IO (disk and network) is reported in the Proxmox GUI when this issue happens?

cesarpk · May 28, 2014

e100 said:
HA services can be set to disabled, this will prevent them from starting automatically.
This state can happen automatically if certain errors occur.

Many thanks for your help e100 ...

(you are a great partner and a master!!!)

e100 said:
What is the output of clustat?

This:

Code:

Cluster Status for apollo @ Wed May 28 02:32:49 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 proxmox4                                                        1 Online
 proxmox3                                                        2 Online
 proxmox7                                                        3 Online, Local, rgmanager
 proxmox2                                                        5 Online
 proxmox1                                                        6 Online
 proxmox8                                                        7 Online, rgmanager

 Service Name                                                   Owner (Last)                                                    State
 ------- ----                                                     ----- ------                                                     -----
 pvevm:112                                                        proxmox7                                                        started
 pvevm:113                                                        proxmox8                                                        started
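
Since an HA service can end up in a state other than "started", the service section of this clustat output is the part worth checking. A minimal Python sketch for pulling it out follows; `parse_services` is a hypothetical helper, and it assumes each service line has exactly three whitespace-separated columns as in the output above.

```python
# Service section copied from the clustat output above (columns shortened).
SAMPLE_CLUSTAT = """\
 Service Name          Owner (Last)          State
 ------- ----          ----- ------          -----
 pvevm:112             proxmox7              started
 pvevm:113             proxmox8              started
"""


def parse_services(clustat_text):
    """Return {service: (owner, state)} for every pvevm service line.

    Assumption: a service line is "pvevm:<vmid> <owner> <state>".
    """
    services = {}
    for line in clustat_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("pvevm:"):
            services[parts[0]] = (parts[1], parts[2])
    return services


def not_started(services):
    """List the services whose state is anything other than 'started'."""
    return [name for name, (_, state) in services.items() if state != "started"]


if __name__ == "__main__":
    parsed = parse_services(SAMPLE_CLUSTAT)
    print(parsed)
    print("not started:", not_started(parsed))
```

For the output above, both services are reported as started on their expected owners, so the list of services that are not started is empty.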

e100 said:
Also, when you add a VM to HA you must ensure the VM is stopped at the time you add it.

Yes, of course.

e100 said:
Did you increase the config_version when manually changing the cluster.conf?

Yes, of course.

e100 said:
I agree with acidrop, human fencing might be part of your issue too.

Previously, in this same PVE cluster, when I had only one VM in HA with human fencing configured on PVE 2.3 nodes, I ran several HA tests and fencing did its job very well, so I am sure that fencing isn't the problem. In my mini test lab, human fencing also worked perfectly every time with only two PVE nodes.

I think my problem is in the "failover domain" configuration; this is the first time I have tried two VMs configured in HA together with "failoverdomains".

Also, in my opinion we should not mix the two things: the purpose of fencing is very different from that of rgmanager, which should start the HA VMs when the autostart directive is enabled and HA is configured for the first time, regardless of whether fencing is configured well or badly (for example, fencing with a PDU where the user/password is wrong).

e100 said:
The problem with your VM being slow is likely something in the VM itself.

The strange thing is that the same VM previously ran on PVE 2.3 and never had this problem. The problems began only after I migrated it to PVE 3.2 (using dd and an extra USB hard disk), about a week after the migration.

But immediately after this migration, the following changes were also applied while the VM was off:
1- The virtual disk cache mode: from "none" (on the old PVE 2.3 node) to "directsync" (on the new PVE 3.2 node) ... maybe "directsync" has a bug on PVE 3.2?
2- On the PVE 3.2 host, I ran lvresize to grow the volume that holds the VM's virtual disk
3- Booting the VM from a live CD, I resized the VM's partition and file system to the maximum possible size

And finally, after starting the VM, I changed the I/O scheduler configuration from cfq to deadline.

e100 said:
What sort of IO (disk and network) is reported in the Proxmox GUI when this issue happens?

I was not present when the problem occurred; I was on the phone with my partner (it happened during his shift), giving him instructions, which finally ended with a "reset" of the physical server. Next time, if you want, I can capture screenshots of the PVE GUI for you.

Awaiting your help. Many thanks again, and see you soon.

Best regards
Cesar

Re-edited: just as a test, I changed the virtual disk cache setting from "directsync" to "writethrough".
Question: why, after I run "qm shutdown <ID of the VM>" from the CLI and the VM is turned off, does my PVE host still show very high CPU usage?

cesarpk · May 28, 2014

acidrop said:
Hello

Although I am not an expert, I would try these:
1. Change the CPU type on the VM from "host" to "kvm64" or "qemu64".
2. When the VM is slow, ssh in or open a VNC console on the VM and investigate which process is using the CPU.
3. Most probably the VM auto-start problems come from your use of manual fencing?

Thanks acidrop for your suggestions, but I can't believe that the problem is the CPU. My processor is an Intel Xeon E5-2407 @ 2.20GHz (quad core) and the server is a Dell. And if the problem is in the VM, why does my PVE host still show very high CPU usage after I run "qm shutdown <ID of the VM>" from the CLI and the VM is turned off?

As for fencing, please see my reply to e100 in this thread.

Any idea or suggestions will be welcome

Best regards
Cesar

Re-edited: But maybe I am wrong, and changing the CPU type would be better... I think I should exhaust every possibility to get this solved, and maybe downgrading the kernel would be better.

e100 · May 28, 2014

The CFQ -> deadline and cache=none->directsync could possibly be related to your performance issues.

On fast battery backed RAID arrays directsync vs none is nearly the same.
Never benchmarked the two on slower storage where I would expect to see a difference.

Nothing in your setup jumps out as being wrong, so really all I can suggest is to try and narrow down the possibilities.
Try cache=none and change back to CFQ; if that helps, change back to deadline and see if that makes it worse, and so on.
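
When switching schedulers back and forth like this, it helps to confirm which one is actually active before each test. On Linux, /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. A small Python sketch, where the helper names and the device name `sda` are illustrative assumptions:

```python
from pathlib import Path


def active_scheduler(scheduler_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device):
    """Read the active I/O scheduler of a block device from sysfs."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # "sda" is only an example device name; adjust it for the host or the VM.
    try:
        print(read_scheduler("sda"))
    except OSError as error:
        print(f"could not read scheduler: {error}")
```

Running the same check on the host and inside the VM before each change makes it easier to tell which combination was really being tested.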

cesarpk · Jun 9, 2014

I finally found the solution to my problems. Since I saw some strange things along the way, I want to share my experience in the hope that it helps somebody else.

Any comment will be welcome

1- About the high CPU usage: I changed the cache mode from directsync to writethrough, and the problem has not appeared in the 2 weeks since.
My conclusion: a bug in qemu when the storage backend isn't fast and directsync is enabled

2- About my cluster.conf configuration: today, after reducing the VM's RAM a bit, I turned the VM off, and it was then started automatically as I expected (without any other configuration change).
My conclusion: a bug in pve-manager

Best regards
Cesar
