VM crash with memory hotplug

hansm

We're having problems with all newer pve-qemu-kvm versions: every version newer than 2.5-19 causes unpredictable crashes of our VMs. I've set up some tests and reproduction details. Can you please check your configuration? With this test you can crash your VM in a few minutes.
My test VM details:
Code:
boot: dc
bootdisk: scsi0
cores: 12
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
memory: 2048
name: testserver.mydomain.com
net0: virtio=3A:A9:6A:0C:3E:2D,bridge=vmbr123
numa: 1
onboot: 1
ostype: l26
protection: 1
scsi0: sata-datastore:995/vm-995-disk-1.qcow2,size=20G
scsihw: virtio-scsi-single
smbios1: uuid=18ee0633-f00c-4d40-b037-349da8e44ea4
sockets: 1
vcpus: 2
So, NUMA enabled and memory hotplug selected.
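For reference, the same settings can also be applied from the PVE command line (just a sketch, using the VMID from the config above):
Code:
qm set 995 --numa 1 --hotplug disk,network,usb,memory,cpu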

I tested with Debian Jessie (8.8) and Stretch (9.1) and both have the same problem. A few months ago I also tested with CentOS with the same results; earlier I didn't have time to test this thoroughly. All our Proxmox nodes are running pve-qemu-kvm_2.5-19_amd64.deb and the problem doesn't occur there.

Enable hotplug support in your VM according to https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory). To make this easy you can copy this rule (for Jessie):
Code:
echo 'SUBSYSTEM=="memory", ACTION=="add", TEST=="state", ATTR{state}=="offline", ATTR{state}="online"' > /lib/udev/rules.d/80-hotplug-cpu-mem.rules
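On Jessie the new rule should be picked up automatically, but reloading udev (or simply rebooting) makes sure of it; this is an optional extra step:
Code:
udevadm control --reload-rules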
If you installed Debian Stretch, add the following to /etc/default/grub instead of the above udev rule:
Code:
GRUB_CMDLINE_LINUX="memhp_default_state=online"
Save the file and update Grub:
Code:
update-grub
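Once the VM has been rebooted you can quickly verify that memory hotplug is active in the guest:
Code:
cat /proc/cmdline                                     # on Stretch: should contain memhp_default_state=online
grep -H . /sys/devices/system/memory/memory*/state    # all listed blocks should report "online"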

Please test with a newer pve-qemu-kvm version and set up a default Debian Jessie or Stretch VM with details comparable to those listed above. Make sure your test VM can send email to your own email address; on Jessie this can be done with dpkg-reconfigure exim4-config (choose "internet site" and leave all other options at their defaults). In /etc/aliases specify your email address after root: to set up a forwarder for the root account. Now test with: echo test | mail -s test root
If you don't receive the email, please fix this first; the cronjob below will send you an email after a crash and a reboot.
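As a compact recap of that mail setup (just a sketch; replace the address with your own and adjust if /etc/aliases already has a root: entry):
Code:
dpkg-reconfigure exim4-config                    # choose "internet site", keep the other defaults
echo 'root: you@example.com' >> /etc/aliases     # forward root's mail to your own address
echo test | mail -s test root                    # you should receive this message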

Now set up the following cronjob with crontab -e:
Code:
@reboot    touch /home/test_count && if [ -e /home/test_crashed_your_server ]; then echo "`/bin/hostname` crashed after `wc -c < /home/test_count` tries" | mail -s "`/bin/hostname` crashed" root; elif [ `wc -c < /home/test_count` -ge 50 ]; then exit; else sleep 10; touch /home/test_crashed_your_server; for i in `seq 1 5`; do SIZE=2048; echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/home/tempfile bs=1M count=$SIZE conv=fdatasync,notrunc > /dev/null 2>&1; echo 3 > /proc/sys/vm/drop_caches; dd if=/home/tempfile of=/dev/null bs=1M count=$SIZE > /dev/null 2>&1; rm -f /home/tempfile; echo -n . >> /home/test_count; done; rm -f /home/test_crashed_your_server; /sbin/reboot; fi
Reboot your VM.
This creates a 2GB test file after every reboot; it first waits 10 seconds to give the VM time to boot. The test file is written and then read back, and caches are dropped before and after each dd. The dd test repeats 5 times and then the VM reboots. Before the test starts, the file /home/test_crashed_your_server is created, and after 5 successful runs it is deleted again. If your VM crashes during the dd tests the file isn't removed, and after the reboot the cronjob sends you a warning. If you didn't enable HA on the VM, the VM will be stopped when it crashes; start it and you will get the email. Tests aren't repeated until you clean up the files in /home (test_count, tempfile, test_crashed_your_server). If email can't be sent but you see the file test_crashed_your_server directly after a reboot, your VM crashed.
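For readability, here is the same logic as a plain shell script (just a sketch of what the one-liner does; the crontab entry above is what I actually use):
Code:
#!/bin/sh
# Same logic as the @reboot one-liner above, laid out for readability.
touch /home/test_count
if [ -e /home/test_crashed_your_server ]; then
    # The marker file survived the previous boot: the dd loop never finished, i.e. the VM crashed.
    echo "$(/bin/hostname) crashed after $(wc -c < /home/test_count) tries" | mail -s "$(/bin/hostname) crashed" root
elif [ "$(wc -c < /home/test_count)" -ge 50 ]; then
    # Enough successful runs, stop testing.
    exit 0
else
    sleep 10                                  # give the VM time to finish booting
    touch /home/test_crashed_your_server      # removed again only if all 5 dd runs complete
    for i in $(seq 1 5); do
        SIZE=2048
        echo 3 > /proc/sys/vm/drop_caches
        dd if=/dev/zero of=/home/tempfile bs=1M count=$SIZE conv=fdatasync,notrunc > /dev/null 2>&1
        echo 3 > /proc/sys/vm/drop_caches
        dd if=/home/tempfile of=/dev/null bs=1M count=$SIZE > /dev/null 2>&1
        rm -f /home/tempfile
        echo -n . >> /home/test_count
    done
    rm -f /home/test_crashed_your_server
    /sbin/reboot
fi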

Notes:
- VMs also crash at unpredictable moments; this test only triggers it. Maybe someone can think of a better test, but this one worked for me.
- It doesn't happen without memory hotplug. If you enable memory hotplug for the VM in Proxmox but don't enable it inside the guest (udev rule on Jessie, Grub parameter on Stretch), the VM will not crash.
- It happens on all storage types; tested on SATA RAID10 NFS over 1 Gbit/s and a full-SSD Ceph cluster over redundant 10 Gbit/s.

Thanks for testing!
 
Bump...
I really think you need to take a few minutes to test this. It's crucial to have VMs that don't crash, and with recent pve-qemu-kvm versions VMs will crash when memory hotplug support is enabled in Proxmox and in the Linux guest OS.
 
I tested it with PVE 5.0 on a test server in our office: standalone, only 1 SATA disk for the OS, and the test VM on local-lvm.
With the above procedure (memory hotplug and test case) a clean Debian 9 install crashes directly after the first test run. I reinstalled the VM with CentOS 7, set up the cronjob and rebooted to start testing; after 20 tests the VM crashed.

As I said before, this doesn't happen with pve-qemu-kvm 2.5-19 and earlier.

I gathered strace output of the kvm process on the host while running the test case. These are the last lines strace outputs; it stays quiet after the last line. The VM isn't usable anymore: there's no real crash or kernel panic, it just hangs, you can't log in anymore, you can't even attach the console to it. It's completely dead.
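For reference, this is roughly how I attached strace to the VM's kvm process on the host (a sketch; substitute your own VMID in the pid file path):
Code:
# Follow the kvm process and all of its threads, writing the trace to a file
strace -f -o /root/kvm-test.strace -p "$(cat /var/run/qemu-server/100.pid)"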

For the Debian 9 VM:
Code:
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=933041}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=934244}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=933368}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=935821}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=933838}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=934214}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=931164}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=935833}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=934361}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=932171}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=934129}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=935054}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=0}, NULL, 8) = 0 (Timeout)
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=26, events=POLLIN}, {fd=31, events=POLLIN}], 8, {tv_sec=0, tv_nsec=937179}, NULL, 8) = 1 ([{fd=31, revents=POLLIN}], left {tv_sec=0, tv_nsec=216610})
read(31, "\1\0\0\0\0\0\0\0", 512)       = 8

CentOS 7 VM:
Code:
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc82f8000, iov_len=122880}], offset=19363287040, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc825a000, iov_len=131072}], offset=19362639872, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc83db000, iov_len=131072}], offset=19364704256, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc827a000, iov_len=122880}], offset=19362770944, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc83fb000, iov_len=122880}], offset=19364835328, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc79fc000, iov_len=16384}, {iov_base=0x7febc8200000, iov_len=106496}], offset=19362254848, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8356000, iov_len=131072}], offset=19363672064, resfd=26}]) = 1
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
read(26, "@\0\0\0\0\0\0\0", 512)        = 8
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8376000, iov_len=32768}, {iov_base=0x7fec977fc000, iov_len=90112}], offset=19363803136, resfd=26}]) = 1
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=32, events=POLLIN}], 10, {tv_sec=0, tv_nsec=0}, NULL, 8) = 2 ([{fd=9, revents=POLLIN}, {fd=26, revents=POLLIN}], left {tv_sec=0, tv_nsec=0})
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
read(26, "\2\0\0\0\0\0\0\0", 512)       = 8
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=32, events=POLLIN}], 10, {tv_sec=0, tv_nsec=0}, NULL, 8) = 1 ([{fd=9, revents=POLLIN}], left {tv_sec=0, tv_nsec=0})
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=32, events=POLLIN}], 10, {tv_sec=0, tv_nsec=11543183}, NULL, 8) = 1 ([{fd=9, revents=POLLIN}], left {tv_sec=0, tv_nsec=11540890})
read(9, "\33\0\0\0\0\0\0\0", 512)       = 8
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=32, events=POLLIN}], 10, {tv_sec=0, tv_nsec=11454465}, NULL, 8) = 1 ([{fd=25, revents=POLLIN}], left {tv_sec=0, tv_nsec=10338237})
read(25, "\1\0\0\0\0\0\0\0", 512)       = 8
io_submit(0x7feccd9cd000, 9, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8429000, iov_len=131072}], offset=19365023744, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc84a7000, iov_len=131072}], offset=19365539840, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc8525000, iov_len=131072}], offset=19366055936, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7fec978cb000, iov_len=126976}, {iov_base=0x7febc854b000, iov_len=4096}], offset=19366572032, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc85aa000, iov_len=131072}], offset=19367088128, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc8628000, iov_len=131072}], offset=19367604224, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc86a6000, iov_len=131072}], offset=19368120320, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7fec978f3000, iov_len=131072}], offset=19368636416, resfd=26}, {preadv, fildes=20, iovec=[{iov_base=0x7febc872b000, iov_len=65536}], offset=19369152512, resfd=26}]) = 9
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8449000, iov_len=131072}], offset=19365154816, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc84c7000, iov_len=131072}], offset=19365670912, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc854c000, iov_len=131072}], offset=19366703104, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc85ca000, iov_len=131072}], offset=19367219200, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8648000, iov_len=131072}], offset=19367735296, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc86c6000, iov_len=131072}], offset=19368251392, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7fec97913000, iov_len=131072}], offset=19368767488, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8469000, iov_len=131072}], offset=19365285888, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8545000, iov_len=24576}, {iov_base=0x7fec97873000, iov_len=106496}], offset=19366187008, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc85ea000, iov_len=131072}], offset=19367350272, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc86e6000, iov_len=131072}], offset=19368382464, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7fec97933000, iov_len=131072}], offset=19368898560, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc84e7000, iov_len=131072}], offset=19365801984, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7fec9788d000, iov_len=131072}], offset=19366318080, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8668000, iov_len=131072}], offset=19367866368, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7fec97953000, iov_len=57344}, {iov_base=0x7febc871b000, iov_len=65536}], offset=19369029632, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8489000, iov_len=122880}], offset=19365416960, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc856c000, iov_len=131072}], offset=19366834176, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8688000, iov_len=122880}], offset=19367997440, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8507000, iov_len=122880}], offset=19365933056, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc860a000, iov_len=122880}], offset=19367481344, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc858c000, iov_len=122880}], offset=19366965248, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7fec978ad000, iov_len=122880}], offset=19366449152, resfd=26}]) = 1
io_submit(0x7feccd9cd000, 1, [{preadv, fildes=20, iovec=[{iov_base=0x7febc8706000, iov_len=86016}, {iov_base=0x7fec978ea000, iov_len=36864}], offset=19368513536, resfd=26}]) = 1
ppoll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=23, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=32, events=POLLIN}], 10, {tv_sec=0, tv_nsec=0}, NULL, 8) = 2 ([{fd=25, revents=POLLIN}, {fd=26, revents=POLLIN}], left {tv_sec=0, tv_nsec=0})
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
read(25, "\1\0\0\0\0\0\0\0", 512)       = 8
write(2, "kvm:", 4)                     = 4
write(2, " ", 1)                        = 1
write(2, "Looped descriptor", 17)       = 17
write(2, "\n", 1)                       = 1

The kvm process keeps running on the host with 100% CPU usage. The VM itself runs with 50% CPU usage (it's a 2-core VM), but stays unusable.

I'm sure everyone has this problem. Please test it and report your results; I hope the Proxmox team will test it soon.
 
Because no one responds here and the Proxmox team also doesn't respond to our bug report at https://bugzilla.proxmox.com/show_bug.cgi?id=1107#c16, we doubt whether we can keep using PVE in the future.

I'm still trying to solve this myself but I really appreciate some help with it.

I've made some progress. Starting a VM with only 1GB of memory doesn't crash; this is the command that PVE runs:
Code:
/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=1024,slots=255,maxmem=4194304M' -object 'memory-backend-ram,id=ram-node0,size=1024M' -numa 'node,nodeid=0,cpus=0-7,memdev=ram-node0' -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'

Starting with more than 1GB of memory makes the VM crash with the method described in my first post. This is the command for the same VM with 2GB of memory:
Code:
/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=1024,slots=255,maxmem=4194304M' -object 'memory-backend-ram,id=ram-node0,size=1024M' -numa 'node,nodeid=0,cpus=0-7,memdev=ram-node0' -object 'memory-backend-ram,id=mem-dimm0,size=512M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' -object 'memory-backend-ram,id=mem-dimm1,size=512M' -device 'pc-dimm,id=dimm1,memdev=mem-dimm1,node=0' -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'

The only difference seems to be the 2 memory-backend-ram objects and pc-dimm devices. I tried changing the command slightly; the following also works:
Code:
/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=1024,slots=255,maxmem=4194304M' -object 'memory-backend-ram,id=ram-node0,size=1024M' -numa 'node,nodeid=0,cpus=0-7,memdev=ram-node0' -object 'memory-backend-ram,id=mem-dimm0,size=1024M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'

There's only one additional memory-backend-ram object and pc-dimm device.
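To make that easier to see, these are the only options that differ between the crashing 2GB command and the working variant (copied from the commands above):
Code:
# Crashes (PVE's layout: 1 GB base + two 512 MB DIMMs):
-object 'memory-backend-ram,id=mem-dimm0,size=512M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' -object 'memory-backend-ram,id=mem-dimm1,size=512M' -device 'pc-dimm,id=dimm1,memdev=mem-dimm1,node=0'
# Works (1 GB base + a single 1024 MB DIMM):
-object 'memory-backend-ram,id=mem-dimm0,size=1024M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0'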

It also works when memdev isn't used at startup and the VM is simply started with e.g. mem=2G, like this:
Code:
/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=2G,slots=255,maxmem=4194304M' -numa 'node,nodeid=0,cpus=0-7,mem=2G' -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'
Hotplugging memory in the QEMU monitor works with:
Code:
object_add memory-backend-ram,id=mem1,size=1G
device_add pc-dimm,id=dimm1,memdev=mem1
The memory is added to the VM and the VM will not crash.
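Inside the guest you can watch the hot-added memory come online (assuming the udev rule or kernel parameter from my first post is active):
Code:
grep online /sys/devices/system/memory/memory*/state | wc -l   # number of online memory blocks
free -m                                                        # the total should have grown by 1 GB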

I also compiled QEMU 2.9 from source and started that binary instead of /usr/bin/kvm. Same behaviour, so it's not a bug in pve-qemu-kvm; maybe it's just the way PVE adds the memory.

I hope the Proxmox team kicks in, tests this and fixes it :)
You can ask me for help with testing or whatever. I've been working on this problem for days/weeks now and want it solved.

Thank you!
 
Couldn't reproduce it. Ran through up to test_count 50 several times now; tried pve4 with qemu 2.7.1 and pve5 with qemu 2.9.1.
At this point my best suggestion is - since you mentioned you compiled 2.9 from source - try a full git-bisect, which is rather tedious, but for now I can't reproduce it, so I can't do that :-/.
In the meantime: I've been playing with these options and could trigger a bit of weirdness with numa + memory hotplug + virtio-net + ovmf/uefi, and am wondering if you could try replacing virtio-net with e1000, or adding ',disable-modern=true' to the 'virtio-net-pci' part of the kvm command, to see if that makes a difference.
Also, since you tested with the non-pve qemu source, you could also open a bug report with qemu directly (if you haven't done so already).
 
Thank you for your reply and testing. I'm very surprised you couldn't reproduce it; strange. The only thing I can think of is that we're using Dell hardware exclusively: I tested on Dell PowerEdge R310, R320, R420 and R610, in cluster setups with PVE 4.4 and standalone at our office on PVE 5. Possibly it's some incompatibility with Dell hardware/BIOS or something...

I tried with e1000: same behaviour. I also tried -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300,disable-modern=true': same behaviour. Thanks for thinking about the problem and giving me some options to try, it's really appreciated.

I tried with QEMU built from source, but I'm not sure whether it still uses libraries or other parts of the default install. I downloaded the source to /usr/src and compiled according to the documentation (I needed to install many additional Debian packages to make everything work). Then I could run:
Code:
/usr/src/qemu-2.9.0/x86_64-softmmu/qemu-system-x86_64 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=1024,slots=255,maxmem=4194304M' -object 'memory-backend-ram,id=ram-node0,size=1024M' -numa 'node,nodeid=0,cpus=0-7,memdev=ram-node0' -object 'memory-backend-ram,id=mem-dimm0,size=512M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' -object 'memory-backend-ram,id=mem-dimm1,size=512M' -device 'pc-dimm,id=dimm1,memdev=mem-dimm1,node=0' -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'
This gave the same problem. And the following worked perfectly:
Code:
qemu-system-x86_64 -accel kvm -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -daemonize -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '1,sockets=1,cores=8,maxcpus=8' -device 'kvm64-x86_64-cpu,id=cpu2,socket-id=0,core-id=1,thread-id=0' -nodefaults -vga std -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=4G,slots=8,maxmem=10240M' -numa 'node,nodeid=0,cpus=0-7,mem=4G' -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=threads,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'
You're right, I could open a bug report with qemu directly, but I'm not sure, and couldn't find out either, whether Proxmox adds memory to qemu in the recommended way. My previous post describes how I can run qemu correctly with memory hotplug enabled, so it can work with qemu, just not with the options that Proxmox specifies. These 2 reasons (shared libs + qemu options) kept me from opening a bug report with qemu directly.

You wrote:
I've been playing with these options and could trigger a bit of weirdness with numa + memory hotplug + virtio-net + ovmf/uefi
How could you trigger that and what did you see? Can I try the same?
try a full git-bisect, which is rather tedious
I have no problem with tedious, but I'm unfamiliar with this kind of task; it looks very developer-like :) I understand it can be used to find the commit that caused a regression, but I really have no idea where to start, so if you can point me in the right direction I will do this. I want this problem solved because it's very serious for us, and I'm shocked that you can't reproduce it; I can, every time, with a Debian guest on the first run every time. I've spent so much time on this problem already that a few more hours or days aren't a problem ;-)
 
Possibly it's some incompatibility with Dell hardware/BIOS or something...

Since it seems to take a few tries to trigger it, rather than there being a short, simple trigger command, it's very possible that the hardware in use at least strongly influences the likelihood of hitting the bug.

How could you trigger that and what did you see? Can I try the same?

OVMF looped endlessly right when booting the VM (during the splashscreen) as it seems to be incompatible with the 'modern' mode in virtio-pci (which was changed to default to on between 2.6 and 2.7, so with 2.7 and up 'disable-modern=true' is needed to fix this).
I'll have to investigate this issue further and possibly report it upstream.

I have no problem with tedious but I'm unfamiliar with these kind of tasks, it looks very developer like :) I understand this can be used to find the commit that caused a regression, but I really have no idea where to start, if you can point me in the right direction I will do this.
First you need to clone qemu from git, then you can start a `git bisect` session, which takes a good and a bad revision and then basically does a binary search for the commit introducing the issue. For that it'll check out a commit halfway between the current good & bad versions; you compile and test it, then say `git bisect good` or `git bisect bad` depending on whether the bug triggered or not. It'll then use this commit as the new good or bad endpoint and again go halfway between the new pair. Given the amount of revisions between the versions it'll take about 6-ish iterations (assuming v2.5.1 works and v2.6.0 fails).
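Once you have the clone from the outline below, you can get a feeling for the search space like this (purely optional):
Code:
git rev-list --count v2.5.1..v2.6.0    # number of commits the bisect has to narrow down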
Here's an outline of the required commands:

As root, install the required dev packages. (I'm using a ./configure line below equivalent to what we use to build the pve-qemu-kvm package, where some of the library dependencies are explicitly enabled; alternatively you can disable the ones you know you don't need.)
Code:
# apt install autotools-dev libpci-dev quilt texinfo texi2html libgnutls28-dev libsdl1.2-dev check libaio-dev uuid-dev librbd-dev libiscsi-dev libspice-protocol-dev pve-libspice-server-dev libusbredirparser-dev glusterfs-common libusb-1.0-0-dev xfslibs-dev libnuma-dev libjemalloc-dev libjpeg-dev libacl1-dev libcap-dev

Set up the git clone and prepare for building (assuming v2.6.0 already doesn't work):
Code:
$ git clone git://git.qemu.org/qemu.git
$ cd qemu
$ git bisect start v2.5.1 v2.6.0
$ ./configure --with-confsuffix=/kvm --target-list=x86_64-softmmu --prefix=/usr --datadir=/usr/share --docdir=/usr/share/doc/pve-qemu-kvm --sysconfdir=/etc --localstatedir=/var --disable-xen --enable-gnutls --enable-sdl --enable-linux-aio --enable-rbd --enable-libiscsi --disable-smartcard --audio-drv-list=alsa --enable-spice --enable-usb-redir --enable-glusterfs --enable-libusb --disable-gtk --enable-xfsctl --enable-numa --disable-strip --enable-jemalloc --disable-libnfs --disable-fdt --enable-debug-info --enable-debug --disable-werror

It should be enough to run configure once there (it'll rerun itself between revisions where it needs to).

Iteration:
1) Run `make` as user
2) Run the qemu-system-x86_64 ... command which you use to trigger the issue (you should include `-accel kvm`; the 'kvm' binary from the pve-qemu-kvm package enables this by default)
3) If the bug was triggered:
a) Run `git bisect bad`
If it worked fine:
b) Run `git bisect good`
4) If the above command tells you the commit responsible you're done, otherwise repeat from step 1.
 
Thanks for the explanation. I've done the bisect, but I don't think it gave usable information (so far).
First I tried v2.5.1 v2.6.0 in the bisect like you wrote. Every build in between worked, so I confirmed with 'git bisect good' each time; at the end it resulted in:
Code:
root@test:/usr/src/qemu# git bisect good
a58047f7fbb055677e45c9a7d65ba40fbfad4b92 is the first bad commit
commit a58047f7fbb055677e45c9a7d65ba40fbfad4b92
Author: Michael Roth <mdroth@linux.vnet.ibm.com>
Date:   Tue Mar 29 15:47:56 2016 -0500

    Update version for 2.5.1 release

    Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>

:100644 100644 437459cd94c9fa59d82c61c0bc8aa36e293b735e 73462a5a13445f66009e00988279d30e55aa8363 M      VERSION
This commit only bumps the QEMU version number.

Useless so far. I started over and did a 'git checkout tags/2.5.1.1', ran the configure command and make. I started my VM with the binary I had just built and everything works fine. Then I cleaned it all again and started over for tags/v2.6.0: now my VM crashes. I started over once more for tags/v2.6.0-rc0 and this also crashes my VM.

With this information I started a bisect again with 'git bisect start v2.5.1.1 v2.6.0-rc0'. All revisions I built were good (no crashes), and at the end:
Code:
root@test:/usr/src/qemu# git bisect good
db51dfc1fcaf0027a5f266b7def4317605848c6a is the first bad commit
commit db51dfc1fcaf0027a5f266b7def4317605848c6a
Author: Michael Roth <mdroth@linux.vnet.ibm.com>
Date:   Mon May 9 11:10:47 2016 -0500

    Update version for 2.5.1.1 release

    Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com

:100644 100644 73462a5a13445f66009e00988279d30e55aa8363 3a6d2147d6d583da05abf686c317817658ae6fbd M      VERSION
Probably I'm doing something wrong, can you help me again?
 
I must apologize. This is one of those times where I wish git revision IDs were comparable to see which direction one is going... I got the order of the 'start' command wrong. It's `git bisect start <bad> <good>`, so the two version parameters need to be swapped. (Doesn't help that the bisect terminology can also be changed in a checked-out repository to add to the confusion.)
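With the tags you used, the corrected invocation would therefore be:
Code:
git bisect start v2.6.0-rc0 v2.5.1.1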
 
Hmm... that explains a lot ;-) I repeated the steps with this new knowledge and now we have the commit which causes it, I think.
Code:
root@test:/usr/src/qemu# git bisect good
3b3b0628217e2726069990ff9942a5d6d9816bd7 is the first bad commit
commit 3b3b0628217e2726069990ff9942a5d6d9816bd7
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Sun Jan 31 11:29:01 2016 +0100

    virtio: slim down allocation of VirtQueueElements

    Build the addresses and s/g lists on the stack, and then copy them
    to a VirtQueueElement that is just as big as required to contain this
    particular s/g list.  The cost of the copy is minimal compared to that
    of a large malloc.

    When virtqueue_map is used on the destination side of migration or on
    loadvm, the iovecs have already been split at memory region boundary,
    so we can just reuse the out_num/in_num we find in the file.

    Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

:040000 040000 42931b27fc2917c6031a5c487cbc2fe33490c9a0 198a86de8b06888629ec5e0f6b90b24f5ee506cf M      hw
I think it makes sense somehow. I'm not a developer, but malloc I do know :) Memory allocation: my test always fails after writing data and flushing the memory caches, maybe at the next step when reading the data back. It may be caused by emptying the memory, or by filling it again when reading the data.

I also tested the IDE bus instead of SCSI or VIRTIO, as mentioned in https://forum.proxmox.com/threads/3000-msec-ping-and-packet-drops-with-virtio-under-load.36687/ and IDE works for me too! I really think both problems are related somehow. The symptoms aren't all that similar, but when I read that topic this weekend I saw some parallels and had to try IDE, and that works.
 
Do you have the same performance problem without "--enable-jemalloc"?

We enable it mainly for ceph/librbd performance since qemu 2.4; I just wonder if this new commit could change the behaviour.

This bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1251353

talks about jemalloc and tcmalloc before this commit; the commit seems to fix the performance with tcmalloc, but I don't know the behaviour with jemalloc.
 
I did some additional tests with a default PVE 5.0 install.
The VM config has NUMA enabled and I use vCPUs (1 socket, 8 cores, 2 vCPUs). Under Options, Memory and CPU are added to Hotplug.
I tested this VM config with a default Debian Stretch install (memory hotplug enabled in the guest via /etc/default/grub, see my earlier post) and tried all SCSI controller types and hard disk bus types. See the attached PDF for the results. The last column in the table contains results for the previously failed tests only; for those I disabled NUMA, set 1 socket / 2 cores, and disabled memory and CPU hotplug.

The problem really seems related to virtio, but it also has something to do with NUMA and/or hotplug. I suspect NUMA, because of the memory-allocation angle, but NUMA is required for memory hotplug anyway.
 

Attachments

  • Proxmox test controller and hard disk.pdf (351.7 KB)
Do you have the same performance problem without "--enable-jemalloc"?
Thank you for the suggestion. I tested it with a fresh git clone and removed --enable-jemalloc from the configure command: my test still crashes the VM at the first run. I'm not having performance issues, by the way; with the test from my first post my VM simply crashes, on Debian usually after the first run, sometimes the second, so it doesn't take long to know whether it works or not ;-) It takes at most 2 minutes to complete 3 runs of my test, which is enough to verify whether there's a problem.

More suggestions are more than welcome :)
 
If you remove the -daemonize flag from the qemu command line (in case you haven't already) and do the test directly at the bad commit 3b3b0628217, does it show any (error) output? Interestingly it seems to introduce a new error case (and at this revision simply reports it and exits - later commits change this to return instead of exiting directly, so I wonder if there's a difference in how the bug manifests there as well...).
 
If you remove the -daemonize flag from the qemu command line and do the test directly at the bad commit 3b3b0628217, does it show any (error) output?

The bisect had finished, and the only commit missing from the checked-out tree was the one that introduces the error. I'm not that familiar with git; this is what I did to apply the commit:
Code:
root@test:/usr/src/qemu# git cherry-pick 3b3b0628217
[detached HEAD beb0fb61a2] virtio: slim down allocation of VirtQueueElements
 Author: Paolo Bonzini <pbonzini@redhat.com>
 Date: Sun Jan 31 11:29:01 2016 +0100
 Committer: root <root@test.localserver>
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly. Run the
following command and follow the instructions in your editor to edit
your configuration file:

    git config --global --edit

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 51 insertions(+), 31 deletions(-)
root@test:/usr/src/qemu# git status
HEAD detached from 3724650db0
You are currently bisecting, started from branch 'master'.
  (use "git bisect reset" to get back to the original branch)

nothing to commit, working tree clean
root@test:/usr/src/qemu# make
  CC    x86_64-softmmu/hw/virtio/virtio.o
  LINK  x86_64-softmmu/qemu-system-x86_64
root@test:/usr/src/qemu#
I think I did it correctly, so I started qemu without -daemonize, and you're right, it does give some output at the crash, though I'm not sure how much it helps...
Code:
root@test:/usr/src/qemu# /usr/src/qemu/x86_64-softmmu/qemu-system-x86_64 -enable-kvm -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/100.pid -smbios 'type=1,uuid=bd7f9680-73f0-428e-a421-2e3b6ac733d8' -name test.localdomain -smp '2,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga cirrus -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 'size=1024,slots=255,maxmem=4194304M' -object 'memory-backend-ram,id=ram-node0,size=1024M' -numa 'node,nodeid=0,cpus=0-7,memdev=ram-node0' -object 'memory-backend-ram,id=mem-dimm0,size=512M' -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' -object 'memory-backend-ram,id=mem-dimm1,size=512M' -device 'pc-dimm,id=dimm1,memdev=mem-dimm1,node=0' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:8bc71019ca99' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/pve/vm-100-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=02:74:F2:CE:A1:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'

qemu-system-x86_64: Looped descriptor
root@test:/usr/src/qemu#
That's all the output.

Is it somehow possible to build qemu 2.6 or newer without this commit? As I understand it, you can revert it and a new commit will be created that does the opposite, but I couldn't get it to work, probably because of too many changes in virtio.c after this commit.
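Roughly what I tried, for reference (just a sketch; as far as I can tell the revert stops with conflicts in hw/virtio/virtio.c on newer trees):
Code:
git checkout v2.6.0
git revert 3b3b0628217    # conflicts with later changes to hw/virtio/virtio.c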
 
Building newer qemus with the above commit reverted will be difficult because there have been a couple more changes in there which would conflict.

However, the output you posted is useful and helped me spot a not-so-obvious change in the above commit that seems accidental.
I have a patch I'd like you to test; I've pushed it to a branch on github.

https://github.com/Blub/qemu/commit/7bc9ce912373b571686db231dd97e08564303fa2

You can check out the branch this way:
Reset the bisect state first:
Code:
$ git bisect reset
Add the repository and fetch its branches:
Code:
$ git remote add wbumiller https://github.com/Blub/qemu
$ git fetch wbumiller
Checkout the branch:
Code:
$ git checkout wbumiller/virtqueue-count-fix
Then build & test.
If this fixes the issue for you, I'd forward the patch to the qemu developer list for them to review and apply. (Also let me know if I should include a `Reported-by` tag with your name in the message; see the various entries in `git log` for what that would look like. I'd need a name & email address.)

Since this is based on our current 2.9.1 branch it would also be useful to verify that it fails without the patch:
Code:
$ git checkout wbumiller/extra
This one should fail.
 
YES, it works!!! Thank you! I repeated the qemu build twice and repeated my tests to be sure. I am very grateful, thanks for your help and fix!

I'm curious: now that you know the problem, the cause and the solution, can you think of a way to trigger the problem on your hardware? It still seems I'm the only one having this problem, which bothers me, because I can reproduce it every time on different hardware (all Dell) with pretty default settings.

I also tested wbumiller/extra and that does indeed fail.

Please add Reported-by tag:
Code:
Reported-by: Hans Middelhoek <h.middelhoek@ospito.nl>

When will Proxmox apply the patch to the pve-qemu-kvm packages? Directly in the next build, or only once qemu approves it and releases a version with the patch applied?

I also replied to the thread https://forum.proxmox.com/threads/3...ket-drops-with-virtio-under-load.36687/page-4 It doesn't seem very related, but their problems are also solved when they move away from virtio to IDE. I think it would be worthwhile to build a test package that can be installed with dpkg -i, so they can easily check whether your patch also solves their problem.
 
I'm curious: now that you know the problem, the cause and the solution, can you think of a way to trigger the problem on your hardware? It still seems I'm the only one having this problem, which bothers me, because I can reproduce it every time on different hardware (all Dell) with pretty default settings.

Qemu seems to be counting the buffers in the virtio device's queue wrongly, in a way which somewhat depends on your hardware and on how the guest buffers requests, which in turn can depend on various components. It's probably possible to craft a failing request by directly manipulating the virtio-block or scsi driver (or by writing a separate, independent virtio test driver), but the patch makes sense to me and works for you, so my preferred next step is to send it upstream to the people who wrote the code and should be much faster at analyzing the situation ;-)

When will Proxmox apply the patch to the pve-qemu-kvm packages? Directly in the next build, or only once qemu approves it and releases a version with the patch applied?
We'll send it upstream and begin testing a patched package internally at the same time, so that if the patch is accepted upstream, a package will already be on its way through the internal and then the external testing repositories.

I also replied to the thread https://forum.proxmox.com/threads/3...ket-drops-with-virtio-under-load.36687/page-4 It doesn't seem very related, but their problems are also solved when they move away from virtio to IDE. I think it would be worthwhile to build a test package that can be installed with dpkg -i, so they can easily check whether your patch also solves their problem.
It's unlikely to be related. We'll first wait for feedback from upstream; after that it shouldn't take long for a package to be available in the pvetest repositories.
 