Proxmox 4 beta refusing to start VM after VM crash

vkhera

Running 4-beta 4.0-26/5d4a615b.

I set up the first VM on this machine, which has 64GB of RAM and ZFS-backed storage. The VM was given 16GB of RAM to run FreeBSD 10.2. I'm using ZFS send/receive to copy data into the VM from another machine. After it had loaded about 14GB of the data, the VM just stopped. There was nothing in the VM's logs, nor anything obvious in the Proxmox host logs other than "process exit".
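
The pull-style pipeline doing the copy, run inside the FreeBSD guest, looks roughly like this (host and dataset names are placeholders, not the real ones):

Code:
# run inside the VM: pull a snapshot from the source server over ssh and receive it into the local pool
ssh oldserver zfs send -R tank/data@snap1 | zfs recv -v zfspool/data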

I restarted the VM after giving it 32GB of RAM (dynamically allocated), and this time it was able to copy about 25GB of data via send/receive before stopping again, with nothing notable in any logs. At this point I cannot get the VM to restart at all.

Here is everything I see in the logs from the last time it ran:

Code:
Aug 23 15:30:02 pve2.int.kcilink.com pvedaemon[4955]: start VM 103: UPID:pve2:0000135B:010ED212:55DA1F3A:qmstart:103:root@pam:
Aug 23 15:30:02 pve2.int.kcilink.com pvedaemon[32110]: <root@pam> starting task UPID:pve2:0000135B:010ED212:55DA1F3A:qmstart:103:root@pam:
Aug 23 15:30:03 pve2.int.kcilink.com kernel: device tap103i0 entered promiscuous mode
Aug 23 15:30:03 pve2.int.kcilink.com kernel: vmbr0: port 2(tap103i0) entered forwarding state
Aug 23 15:30:03 pve2.int.kcilink.com kernel: vmbr0: port 2(tap103i0) entered forwarding state
Aug 23 15:30:04 pve2.int.kcilink.com pvedaemon[5007]: starting vnc proxy UPID:pve2:0000138F:010ED2A0:55DA1F3C:vncproxy:103:root@pam:
Aug 23 15:30:04 pve2.int.kcilink.com pvedaemon[32535]: <root@pam> starting task UPID:pve2:0000138F:010ED2A0:55DA1F3C:vncproxy:103:root@pam:
Aug 23 15:30:05 pve2.int.kcilink.com kernel: kvm: zapping shadow pages for mmio generation wraparound
Aug 23 15:33:05 pve2.int.kcilink.com pvedaemon[32535]: <root@pam> successful auth for user 'root@pam'
Aug 23 15:38:33 pve2.int.kcilink.com pmxcfs[1708]: [dcdb] notice: data verification successful
Aug 23 15:42:00 pve2.int.kcilink.com systemd-timesyncd[1512]: interval/delta/delay/jitter/drift 2048s/-0.001s/0.043s/0.007s/-3ppm
Aug 23 15:43:05 pve2.int.kcilink.com pveproxy[2118]: worker exit
Aug 23 15:43:05 pve2.int.kcilink.com pveproxy[24548]: worker 2118 finished
Aug 23 15:43:05 pve2.int.kcilink.com pveproxy[24548]: starting 1 worker(s)
Aug 23 15:43:05 pve2.int.kcilink.com pveproxy[24548]: worker 28006 started
Aug 23 15:46:53 pve2.int.kcilink.com pvedaemon[32535]: worker exit
Aug 23 15:46:53 pve2.int.kcilink.com pvedaemon[2059]: worker 32535 finished
Aug 23 15:46:53 pve2.int.kcilink.com pvedaemon[2059]: starting 1 worker(s)
Aug 23 15:46:53 pve2.int.kcilink.com pvedaemon[2059]: worker 15248 started
Aug 23 15:47:52 pve2.int.kcilink.com kernel: vmbr0: port 2(tap103i0) entered disabled state
Aug 23 15:47:52 pve2.int.kcilink.com kernel:  zd16: p1 p2 p3
Aug 23 15:48:05 pve2.int.kcilink.com pvedaemon[32110]: <root@pam> successful auth for user 'root@pam'
Aug 23 15:49:37 pve2.int.kcilink.com pvedaemon[5007]: command '/bin/nc6 -l -p 5901 -w 10 -e '/usr/sbin/qm vncproxy 103 2>/dev/null'' failed: exit code 1

Now every time I try to start it this is all it does:

Code:
Aug 23 16:17:38 pve2.int.kcilink.com pvedaemon[23958]: <root@pam> starting task UPID:pve2:000063AF:01132DB3:55DA2A62:qmstart:103:root@pam:
Aug 23 16:17:38 pve2.int.kcilink.com pvedaemon[25519]: start VM 103: UPID:pve2:000063AF:01132DB3:55DA2A62:qmstart:103:root@pam:
Aug 23 16:17:38 pve2.int.kcilink.com pvedaemon[25519]: start failed: command '/usr/bin/systemd-run --scope --slice qemu --unit 103 -p 'CPUShares=1000' /usr/bin/kvm -id 103 -chardev 'socket,id=qmp,path=/var/run/qemu-server/103.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -vnc unix:/var/run/qemu-server/103.vnc,x509,password -pidfile /var/run/qemu-server/103.pid -daemonize -smbios 'type=1,uuid=676eee0b-3b28-42be-9890-8cafeb5fe1cc' -name staging -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000' -vga cirrus -cpu kvm64,+lahf_lm,+x2apic,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 16384 -object 'memory-backend-ram,size=8192M,id=ram-node0' -numa 'node,nodeid=0,cpus=0-1,memdev=ram-node0' -object 'memory-backend-ram,size=8192M,id=ram-node1' -numa 'node,nodeid=1,cpus=2-3,memdev=ram-node1' -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:fde03f29af2f' -drive 'file=/dev/zvol/tank/vm-103-disk-2,if=none,id=drive-virtio1,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' -device 'virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb' -drive 'file=/mnt/pve/filer/template/iso/FreeBSD-10.2-RELEASE-amd64-bootonly.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/dev/zvol/tank/vm-103-disk-1,if=none,id=drive-virtio0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=101' -netdev 'type=tap,id=net0,ifname=tap103i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=6E:7D:F9:87:BE:A6,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: exit code 1
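
As the failed command shows, the VM is started inside a systemd scope named after the VMID (systemd-run --scope --unit 103), so a leftover scope from the crashed instance could be inspected with something like this (my guess at a useful check, not something from the logs above):

Code:
# on the Proxmox host: see whether the scope for VM 103 is still around from the crashed run
systemctl status 103.scope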

I restarted the pvedaemon and pveproxy services on the host, but that's about the end of my ideas short of a reboot.
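
For the record, restarting those services amounted to roughly the following (assuming the standard PVE systemd unit names):

Code:
# on the Proxmox host
systemctl restart pvedaemon
systemctl restart pveproxy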

My guess is that the VM died because the ZFS ARC inside the guest ran out of memory: the ARC showed continued growth as the copy proceeded, and the VM memory usage chart in Proxmox was approaching the maximum allowed amount. Either that, or something is broken with KVM on 4.x when the guest reaches its memory limit. I want to run more tests with the ARC limited on the FreeBSD VM, but at this point nothing will restart... I will try rebooting the Proxmox server and see what happens.
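
The plan for limiting the ARC in the guest is the usual FreeBSD loader tunable, something along these lines (value in bytes; the guest needs a reboot to pick it up):

Code:
# /boot/loader.conf inside the FreeBSD 10.2 guest -- cap the ZFS ARC at 8GB
vfs.zfs.arc_max="8589934592"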
 
I was able to set the ARC max on the VM to 8GB and re-run my copy. When the ARC approaches the 8GB limit, the VM just stops. There is no kernel panic or anything on the VM's console or in its logs. This is completely repeatable. I will test on a physical machine running 10.2 to determine whether it is a FreeBSD problem or a Proxmox problem.
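
For reference, the ARC size can be watched from inside the guest with the arcstats sysctl, e.g.:

Code:
# inside the FreeBSD guest: current ARC size and the configured maximum, both in bytes
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max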
 
I ran the same test on a physical server with 16GB of RAM, FreeBSD 10.2, and the ZFS ARC set to 8GB. There was no kernel panic. I am going to try some alternate settings in Proxmox to see what I can find. I'm leaning towards this being a bug somewhere in Proxmox/KVM.
 
So one of two changes worked around the failure: I made a new ZFS snapshot on the source server, and instead of running "ssh old zfs send | zfs recv" from the VM, I ran "zfs send | ssh new zfs recv" from the source server.
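
In other words, the copy switched from the pull form shown earlier to a push run on the source server, roughly (placeholder names again):

Code:
# run on the source server: push the new snapshot into the VM over ssh
zfs send -R tank/data@snap2 | ssh vm-host zfs recv -v zfspool/data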

The whole 54GB data set copied fine this time. I'll do more tests tomorrow when I'm more awake.
 
I'm now observing this KVM "exit" on another physical host and another VM.

The VM is running FreeBSD 10.2 with ZFS. Upon attempting to zfs send a 1.5TB volume to a file server on the same LAN, after about 10GB of transfer the VM just exits. There is nothing on the VM console about a panic. The Proxmox syslog shows this:

Code:
Aug 28 16:06:20 pve1 kernel: [1310788.419742] vmbr0: port 4(tap104i0) entered disabled state
Aug 28 16:06:21 pve1 kernel: [1310789.388799]  zd48: p1 p2 p3
Aug 28 16:06:24 pve1 pvedaemon[15732]: command '/bin/nc6 -l -p 5900 -w 10 -e '/usr/sbin/qm vncproxy 104 2>/dev/null'' failed: exit code 1
Aug 28 16:06:24 pve1 pvedaemon[19782]: starting vnc proxy UPID:pve1:00004D46:07D212B0:55E0BF40:vncproxy:104:root@pam:
Aug 28 16:06:24 pve1 pvedaemon[11098]: <root@pam> starting task UPID:pve1:00004D46:07D212B0:55E0BF40:vncproxy:104:root@pam:
Aug 28 16:06:24 pve1 pveproxy[6156]: worker exit
Aug 28 16:06:24 pve1 pveproxy[8871]: worker 6156 finished
Aug 28 16:06:24 pve1 pveproxy[8871]: starting 1 worker(s)
Aug 28 16:06:24 pve1 pveproxy[8871]: worker 19786 started
Aug 28 16:06:24 pve1 qm[19785]: VM 104 qmp command failed - VM 104 not running

I can repeat this pretty much at will. I'm sending the file system with basically this pipeline: "zfs send -R zfspool@m4 | ssh -c arcfour filer zfs recv -v tank/backup/zfspool" running from the VM itself.

The disks that make up the VM's ZFS pool are provided by the Proxmox ZFS storage local to the machine. When I observed this a few days ago I was using the virtio block disk driver; now I'm using the virtio-scsi driver. I was easily able to receive this same ZFS volume into this VM without any crashing, and the write performance was pretty solid at an average of 70MB/s.
 
