We're having problems with all newer pve-qemu-kvm versions, every version newer than 2.5.19 is causing a unpredictable crash of our VM's. I've setup some tests and reproduce details, can you please check your configuration, with this test you can crash your VM in a few minutes.
My test VM details:
So, NUMA enabled and memory hotplug selected.
I tested with Debian Jessie (8.8) and Stretch (9.1), both are having the same problem, a few months ago I also tested with CentOS with the same results, earlier I didn't had time to test this thoroughly. All our Proxmox nodes are running with pve-qemu-kvm_2.5-19_amd64.deb and the problem doesn't occur then.
Enable hotplug support in your VM according https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory) To make this easy you can copy this rule (for Jessie):
If you installed Debian Stretch add the following in /etc/default/grub instead of the above udev rule:
Save the file and update Grub:
Please test with newer pve-qemu-kvm version and setup a default Debian Jessie or Stretch VM with comparable details as listed above. Make sure your test VM can send email to your own email address, on Jessie this can be done with: dpkg-reconfigure exim4-config (choose internet site and all other options default). In /etc/aliases specify your email address after root: to setup a forwarder for the root account. Now test by: echo test | mail -s test root
If you don't receive the email please fix this, the command to test will send you an email after a crash and a reboot.
Now set following cronjob with: crontab -e and reboot
Reboot your VM.
This will create a test file of 2GB after every reboot, first wait 10 seconds to give the VM time to boot. The test file is written and will be read, caches are dropped before and after the dd tests. This dd tests repeat 5 times and then reboot. Before the test starts a file /home/test_crashed_your_server is created and after the 5 successful tests it is deleted. If your VM crashes during the dd tests the file isn't removed and after the reboot the cron sends you a warning. If you didn't enable HA on the VM the VM will be stopped when crashed, please start it and you will get the email, tests aren't repeated until you clean the files in /home (test_count, tempfile, test_crashed_your_server). If email can't be sent but you'll see the file test_crashed_your_server directly after a reboot, your VM crashed.
Notes:
- VM's also crash on unpredictable moments, this test only triggers it, maybe someone can think of a better test, but this worked for me.
- It doesn't happen without memory hotplug. You could enable memory hotplug on the VM but don't enable it in the VM with the udev rule or Grub parameter in Stretch and the VM will not crash.
- It happens on all storages, tested on SATA RAID10 NFS over 1Gbit/s and Ceph full SSD cluster over redundant 10Gbit/s.
Thanks for testing!
My test VM details:
Code:
boot: dc
bootdisk: scsi0
cores: 12
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
memory: 2048
name: testserver.mydomain.com
net0: virtio=3A:A9:6A:0C:3E:2D,bridge=vmbr123
numa: 1
onboot: 1
ostype: l26
protection: 1
scsi0: sata-datastore:995/vm-995-disk-1.qcow2,size=20G
scsihw: virtio-scsi-single
smbios1: uuid=18ee0633-f00c-4d40-b037-349da8e44ea4
sockets: 1
vcpus: 2
I tested with Debian Jessie (8.8) and Stretch (9.1), both are having the same problem, a few months ago I also tested with CentOS with the same results, earlier I didn't had time to test this thoroughly. All our Proxmox nodes are running with pve-qemu-kvm_2.5-19_amd64.deb and the problem doesn't occur then.
Enable hotplug support in your VM according https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory) To make this easy you can copy this rule (for Jessie):
Code:
echo 'SUBSYSTEM=="memory", ACTION=="add", TEST=="state", ATTR{state}=="offline", ATTR{state}="online"' > /lib/udev/rules.d/80-hotplug-cpu-mem.rules
Code:
GRUB_CMDLINE_LINUX="memhp_default_state=online"
Code:
update-grub
Please test with newer pve-qemu-kvm version and setup a default Debian Jessie or Stretch VM with comparable details as listed above. Make sure your test VM can send email to your own email address, on Jessie this can be done with: dpkg-reconfigure exim4-config (choose internet site and all other options default). In /etc/aliases specify your email address after root: to setup a forwarder for the root account. Now test by: echo test | mail -s test root
If you don't receive the email please fix this, the command to test will send you an email after a crash and a reboot.
Now set following cronjob with: crontab -e and reboot
Code:
@reboot touch /home/test_count && if [ -e /home/test_crashed_your_server ]; then echo "`/bin/hostname` crashed after `wc -c < /home/test_count` tries" | mail -s "`/bin/hostname` crashed" root; elif [ `wc -c < /home/test_count` -ge 50 ]; then exit; else sleep 10; touch /home/test_crashed_your_server; for i in `seq 1 5`; do SIZE=2048; echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/home/tempfile bs=1M count=$SIZE conv=fdatasync,notrunc > /dev/null 2>&1; echo 3 > /proc/sys/vm/drop_caches; dd if=/home/tempfile of=/dev/null bs=1M count=$SIZE > /dev/null 2>&1; rm -f /home/tempfile; echo -n . >> /home/test_count; done; rm -f /home/test_crashed_your_server; /sbin/reboot; fi
This will create a test file of 2GB after every reboot, first wait 10 seconds to give the VM time to boot. The test file is written and will be read, caches are dropped before and after the dd tests. This dd tests repeat 5 times and then reboot. Before the test starts a file /home/test_crashed_your_server is created and after the 5 successful tests it is deleted. If your VM crashes during the dd tests the file isn't removed and after the reboot the cron sends you a warning. If you didn't enable HA on the VM the VM will be stopped when crashed, please start it and you will get the email, tests aren't repeated until you clean the files in /home (test_count, tempfile, test_crashed_your_server). If email can't be sent but you'll see the file test_crashed_your_server directly after a reboot, your VM crashed.
Notes:
- VM's also crash on unpredictable moments, this test only triggers it, maybe someone can think of a better test, but this worked for me.
- It doesn't happen without memory hotplug. You could enable memory hotplug on the VM but don't enable it in the VM with the udev rule or Grub parameter in Stretch and the VM will not crash.
- It happens on all storages, tested on SATA RAID10 NFS over 1Gbit/s and Ceph full SSD cluster over redundant 10Gbit/s.
Thanks for testing!
Last edited: