Live migration - Resume OPs just make VMs burn CPU in wasteland

stefws

Hints appreciated: after a live migration, the resume operation just leaves the VM burning one vCPU at 100%, which renders the VM useless and unresponsive :confused:

root@node1:~# pveversion
pve-manager/3.3-15/0317e201 (running kernel: 2.6.32-37-pve)
root@node1:~# uname -a
Linux node1 2.6.32-37-pve #1 SMP Fri Jan 30 06:16:52 CET 2015 x86_64 GNU/Linux
root@node1:~# dpkg -l | egrep qemu\|pve\|ceph
ii ceph-common 0.87-1~bpo70+1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 0.87-1~bpo70+1 amd64 FUSE-based client for the Ceph distributed file system
ii clvm 2.02.98-pve4 amd64 Cluster LVM Daemon for lvm2
ii corosync-pve 1.4.7-1 amd64 Standards-based cluster framework (daemon and modules)
ii dmsetup 2:1.02.77-pve4 amd64 Linux Kernel Device Mapper userspace library
ii fence-agents-pve 4.0.10-2 amd64 fence agents for redhat cluster suite
ii libcephfs1 0.87-1~bpo70+1 amd64 Ceph distributed file system client library
ii libcorosync4-pve 1.4.7-1 amd64 Standards-based cluster framework (libraries)
ii libcurl3-gnutls:amd64 7.29.0-1~bpo70+1.ceph amd64 easy-to-use client-side URL transfer library (GnuTLS flavour)
ii libdevmapper-event1.02.1:amd64 2:1.02.77-pve4 amd64 Linux Kernel Device Mapper event support library
ii libdevmapper1.02.1:amd64 2:1.02.77-pve4 amd64 Linux Kernel Device Mapper userspace library
ii liblvm2app2.2:amd64 2.02.98-pve4 amd64 LVM2 application library
ii libopenais3-pve 1.1.4-3 amd64 Standards-based cluster framework (libraries)
ii libpve-access-control 3.0-16 amd64 Proxmox VE access control library
ii libpve-common-perl 3.0-22 all Proxmox VE base library
ii libpve-storage-perl 3.0-28 all Proxmox VE storage management library
ii lvm2 2.02.98-pve4 amd64 Linux Logical Volume Manager
ii novnc-pve 0.4-7 amd64 HTML5 VNC client
ii openais-pve 1.1.4-3 amd64 Standards-based cluster framework (daemon and modules)
ii pve-cluster 3.0-15 amd64 Cluster Infrastructure for Proxmox Virtual Environment
ii pve-firewall 1.0-17 amd64 Proxmox VE Firewall
ii pve-firmware 1.1-3 all Binary firmware code for the pve-kernel
ii pve-kernel-2.6.32-32-pve 2.6.32-136 amd64 The Proxmox PVE Kernel Image
ii pve-kernel-2.6.32-37-pve 2.6.32-146 amd64 The Proxmox PVE Kernel Image
ii pve-libspice-server1 0.12.4-3 amd64 SPICE remote display system server library
ii pve-manager 3.3-15 amd64 The Proxmox Virtual Environment
ii pve-qemu-kvm 2.1-12 amd64 Full virtualization on x86 hardware
ii python-ceph 0.87-1~bpo70+1 amd64 Python libraries for the Ceph distributed filesystem
ri qemu-server 3.3-14 amd64 Qemu Server Tools
ii redhat-cluster-pve 3.2.0-2 amd64 Red Hat cluster suite
ii resource-agents-pve 3.9.2-4 amd64 resource agents for redhat cluster suite
ii tar 1.27.1+pve.1 amd64 GNU version of the tar archiving utility
ii vzctl 4.0-1pve6 amd64 OpenVZ - server virtualization solution - control tools
 
Also post your VM config:

> qm config VMID

And include all information about the virtual disk storage in use.
 
For example:

root@node5:~# qm config 109
args: -chardev socket,id=serial0,path=/var/run/qemu-server/109.console,server,nowait -serial chardev:serial0
balloon: 1024
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
kvm: 0
memory: 2048
name: afn
net0: e1000=4A:F4:F3:8E:9F:29,bridge=vmbr0
net1: e1000=FE:54:FF:33:44:D6,bridge=vmbr2
ostype: l26
shares: 2000
smbios1: uuid=b1ae2986-b659-4d8d-bd01-7861552b85b1
sockets: 1
tablet: 0
virtio0: vm_images:vm-109-disk-1,cache=writethrough,size=30G

vm_images is an RBD pool on a 4x Ceph Giant cluster.

Tomorrow I'll try downgrading the VM kernel and its virtio guest drivers to see whether that has an effect... at the moment I suspect it might.
 
Does it work without the custom "args" (-serial)?
 
No, still the same issue without the custom serial. Also, both the qmp & vnc sockets are created fine, and the serial works fine once a migration succeeds.
A few times I've seen the kernel dumping on this local serial console.
Minicom output from the local serial console after high CPU usage following a live migration resume OP:
Press CTRL-A Z for help on special keys

Copying data : [100.0 %] /
Saving core complete
Restarting system.
Press any key to continue.
Press any key to continue.
Press any key to continue.

The guest was running a fully patched CentOS 6.6 on kernel 2.6.32-504-3.3.el6 when this happened, after several successful live migrations (maybe 14 successful out of 15).
Other similar CentOS 6.6 VMs never migrate successfully; they crash and burn CPU every time :(
Check this task log: https://dl.dropboxusercontent.com/u/13225502/PVE_VM_Kdumping_on_livemig.tiff
It shows multiple successful migrations of VM 110; only the last one resumed with the kernel dump shown above, burning CPU for minutes :(
Might the kernel dumping not indicate a guest-side issue, i.e. a problem with the virtio drivers maybe?
 
Patched to the latest on pvetest on Friday the 13th ;) and got these:

Code:
root@node7:~# dpkg -l | egrep pve\|qemu
ii  clvm                             2.02.98-pve4                  amd64        Cluster LVM Daemon for lvm2
ii  corosync-pve                     1.4.7-1                       amd64        Standards-based cluster framework (daemon and modules)
ii  dmsetup                          2:1.02.77-pve4                amd64        Linux Kernel Device Mapper userspace library
ii  fence-agents-pve                 4.0.10-2                      amd64        fence agents for redhat cluster suite
ii  libcorosync4-pve                 1.4.7-1                       amd64        Standards-based cluster framework (libraries)
ii  libdevmapper-event1.02.1:amd64   2:1.02.77-pve4                amd64        Linux Kernel Device Mapper event support library
ii  libdevmapper1.02.1:amd64         2:1.02.77-pve4                amd64        Linux Kernel Device Mapper userspace library
ii  liblvm2app2.2:amd64              2.02.98-pve4                  amd64        LVM2 application library
ii  libopenais3-pve                  1.1.4-3                       amd64        Standards-based cluster framework (libraries)
ii  libpve-access-control            3.0-16                        amd64        Proxmox VE access control library
ii  libpve-common-perl               3.0-24                        all          Proxmox VE base library
ii  libpve-storage-perl              3.0-30                        all          Proxmox VE storage management library
ii  lvm2                             2.02.98-pve4                  amd64        Linux Logical Volume Manager
ii  novnc-pve                        0.4-7                         amd64        HTML5 VNC client
ii  openais-pve                      1.1.4-3                       amd64        Standards-based cluster framework (daemon and modules)
ii  pve-cluster                      3.0-16                        amd64        Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-firewall                     1.0-18                        amd64        Proxmox VE Firewall
ii  pve-firmware                     1.1-3                         all          Binary firmware code for the pve-kernel
ii  pve-kernel-2.6.32-32-pve         2.6.32-136                    amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-2.6.32-37-pve         2.6.32-147                    amd64        The Proxmox PVE Kernel Image
ii  pve-libspice-server1             0.12.4-3                      amd64        SPICE remote display system server library
ii  pve-manager                      3.3-19                        amd64        The Proxmox Virtual Environment
ii  pve-qemu-kvm                     2.1-12                        amd64        Full virtualization on x86 hardware
ii  qemu-server                      3.3-17                        amd64        Qemu Server Tools
ii  redhat-cluster-pve               3.2.0-2                       amd64        Red Hat cluster suite
ii  resource-agents-pve              3.9.2-4                       amd64        resource agents for redhat cluster suite
ii  tar                              1.27.1+pve.1                  amd64        GNU version of the tar archiving utility
ii  vzctl                            4.0-1pve6                     amd64        OpenVZ - server virtualization solution - control tools

which seems much better... so far live migration has been working flawlessly again...
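
To see at a glance which packages actually changed with the update, two saved `dpkg -l` listings can be compared mechanically. This is only a minimal sketch: the function name `pkg_changes` and the temp files are made up for illustration, and the sample rows are trimmed from the two listings in this thread. Note those listings come from node1 and node7, so the comparison assumes both nodes were otherwise at the same patch level.

```shell
#!/bin/sh
# Print packages whose version differs between two saved `dpkg -l` listings.
# Usage: pkg_changes before.txt after.txt   (hypothetical helper, not a PVE tool)
pkg_changes() {
    awk '
        # Package rows start with a two-letter status such as "ii" or "ri".
        $1 !~ /^[a-zA-Z][a-zA-Z]$/ { next }
        # First file: remember the version of every package.
        NR == FNR { before[$2] = $3; next }
        # Second file: report packages whose version changed.
        ($2 in before) && before[$2] != $3 {
            printf "%s %s -> %s\n", $2, before[$2], $3
        }
    ' "$1" "$2"
}

# Sample rows copied from the two listings above (trimmed).
before=$(mktemp)
after=$(mktemp)
cat > "$before" <<'EOF'
ii pve-manager 3.3-15 amd64 The Proxmox Virtual Environment
ii pve-qemu-kvm 2.1-12 amd64 Full virtualization on x86 hardware
ri qemu-server 3.3-14 amd64 Qemu Server Tools
EOF
cat > "$after" <<'EOF'
ii  pve-manager                      3.3-19                        amd64        The Proxmox Virtual Environment
ii  pve-qemu-kvm                     2.1-12                        amd64        Full virtualization on x86 hardware
ii  qemu-server                      3.3-17                        amd64        Qemu Server Tools
EOF

pkg_changes "$before" "$after"
# pve-manager 3.3-15 -> 3.3-19
# qemu-server 3.3-14 -> 3.3-17
rm -f "$before" "$after"
```

In this sample pve-qemu-kvm stays at 2.1-12 and is therefore not reported, so only the management packages show up as changed.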
 
No, it still happens. This time I managed to see a kernel panic on the console; the VM then dumped a vmcore and rebooted, so maybe it's related to a VM virt driver/module issue vs the VM kernel... Only, how to nail it down further?

Does this call trace tell anyone anything more?

# tail -33 /var/crash/127.0.01-2015-02-19-10:22:00:

<4>general protection fault: fff2 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 0
<4>Modules linked in: ipv6 virtio_balloon i2c_piix4 i2c_core e1000 sg ext4 jbd2 mbcache virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 0, comm: swapper Not tainted 2.6.32-504.8.1.el6.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996)
<4>RIP: 0010:[<ffffffff81040f8b>] [<ffffffff81040f8b>] native_safe_halt+0xb/0x10
<4>RSP: 0018:ffffffff81a01ea8 EFLAGS: 00000246
<4>RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
<4>RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81dec228
<4>RBP: ffffffff81a01ea8 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 000000486d338d6c R11: 0000000000000000 R12: ffffffff81c09ec0
<4>R13: 0000000000000000 R14: ffffffffffffffff R15: ffffffff81de8000
<4>FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000003606285e20 CR3: 000000013d3ea000 CR4: 00000000000006f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
<4>Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a8d020)
<4>Stack:
<4> ffffffff81a01ec8 ffffffff810167ad ffffffff81a01fd8 ffffffff81c09ec0
<4><d> ffffffff81a01ef8 ffffffff81009fc6 6db6db6db6db6db7 2c2481d11dc0fcac
<4><d> 0000000000000000 6db6db6db6db6db7 ffffffff81a01f08 ffffffff8151061a
<4>Call Trace:
<4> [<ffffffff810167ad>] default_idle+0x4d/0xb0
<4> [<ffffffff81009fc6>] cpu_idle+0xb6/0x110
<4> [<ffffffff8151061a>] rest_init+0x7a/0x80
<4> [<ffffffff81c29f8f>] start_kernel+0x424/0x430
<4> [<ffffffff81c2933a>] x86_64_start_reservations+0x125/0x129
<4> [<ffffffff81c29453>] x86_64_start_kernel+0x115/0x124
<4>Code: 55 48 89 e5 0f 1f 44 00 00 fa c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 fb c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 fb f4 <c9> c3 0f 1f 00 55 48 89 e5 0f 1f 44 00 00 f4 c9 c3 0f 1f 40 00
<1>RIP [<ffffffff81040f8b>] native_safe_halt+0xb/0x10
<4> RSP <ffffffff81a01ea8>
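
When comparing several such crashes, it can help to boil each log down to the fault line plus the function names in the call trace. This is a rough sketch, assuming the kernel log has been saved to a text file; the helper name `trace_summary` is made up here, and the sample lines are copied from the trace above (trimmed to three frames).

```shell
#!/bin/sh
# Print the fault line and the function names from the "Call Trace:" section
# of a saved kernel log. Hypothetical helper, not part of kdump or crash.
trace_summary() {
    awk '
        # Fault headline, with the leading log-level tag such as <4> removed.
        /general protection fault|kernel BUG at|Oops/ {
            sub(/^<[0-9]+>/, ""); print "fault: " $0
        }
        /Call Trace:/ { in_trace = 1; next }
        # Frames look like "<4> [<ffffffff810167ad>] default_idle+0x4d/0xb0".
        in_trace && /\[<[0-9a-f]+>\]/ {
            sub(/^.*\] /, ""); sub(/\+.*$/, ""); print "frame: " $0; next
        }
        # Any other line ends the call trace section.
        in_trace { in_trace = 0 }
    ' "$1"
}

log=$(mktemp)
cat > "$log" <<'EOF'
<4>general protection fault: fff2 [#1] SMP
<4>CPU 0
<4>Call Trace:
<4> [<ffffffff810167ad>] default_idle+0x4d/0xb0
<4> [<ffffffff81009fc6>] cpu_idle+0xb6/0x110
<4> [<ffffffff8151061a>] rest_init+0x7a/0x80
<1>RIP [<ffffffff81040f8b>] native_safe_halt+0xb/0x10
EOF

trace_summary "$log"
# fault: general protection fault: fff2 [#1] SMP
# frame: default_idle
# frame: cpu_idle
# frame: rest_init
rm -f "$log"
```

For this dump the summary shows nothing but the idle loop, i.e. the fault hit while CPU 0 was sitting in the idle path.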
 
Managed to capture a console screen dump after a failed resume OP following a live migration. To me this looks like a potential VM kernel issue (kernel bug at mm/slab.c:3069):

PVEliveMigBug.png

It seems the VM never recovers after this but just burns one CPU at 100%, without even dumping a vmcore image and rebooting like some other times :/