Hi!
I have a PBS system (latest v2.1-1 with kernel 5.13.19-4-pve) running with ZFS raidz2 on a ProLiant DL380 Gen9 server (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 1 socket with 8 cores + 2 threads per core AKA 16 CPUs).
The system also has pve-qemu-kvm v6.1.1-1 installed (from http://download.proxmox.com/debian/pve), to execute vma create for restoring VMs for direct (import) usage in PVE environments.
This basically works fine, but for a few VMs and under certain(?) situations I get an immediate crash when running vma create:
Code:
root@pbs01:~# vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
vma: vma_writer_register_stream 'drive-virtio2' failed
Trace/breakpoint trap (core dumped)
root@pbs01:~# dmesg -T | tail -1
[Fri Feb 11 14:54:05 2022] traps: vma[3258736] trap int3 ip:7f9e73366332 sp:7ffd45559170 error:0 in libglib-2.0.so.0.6600.8[7f9e73329000+88000]
root@pbs01:~# dpkg -l libglib2.0-0\* | grep '^ii'
ii libglib2.0-0:amd64 2.66.8-1 amd64 GLib library of C routines
ii libglib2.0-0-dbgsym:amd64 2.66.8-1 amd64 debug symbols for libglib2.0-0
I have a bt full from such a coredump available, with the dbg packages installed (briefly stripped down in the paste): https://paste.grml.org/hidden/5fdfb37c/
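For anyone who wants to look at a similar core, a full backtrace of all threads can be dumped non-interactively roughly as sketched here. This is only an illustration: dump_backtrace is a made-up helper name, and the core file path has to be supplied by the caller, since where the kernel writes the core depends on the local kernel.core_pattern setting.

```shell
#!/bin/sh
# Hypothetical helper (not part of vma or PBS): print full backtraces
# for every thread recorded in a core file.
#   $1 = path to the binary that crashed (e.g. /usr/bin/vma)
#   $2 = path to the core file (location depends on kernel.core_pattern)
dump_backtrace() {
    binary=$1
    core=$2
    # -batch runs the given commands and exits without an interactive prompt;
    # "thread apply all bt full" prints locals for each frame of each thread.
    gdb -batch -ex 'thread apply all bt full' "$binary" "$core"
}
```

Redirecting the output to a file makes it easy to trim before pasting it somewhere.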
What's interesting is that this vma create ... invocation crashes when run directly, but when executed under gdb it works perfectly fine:
Code:
root@pbs01:~# gdb --args vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vma...
Reading symbols from /usr/lib/debug/.build-id/3a/977ef179bb3be80ff7c2afff3ef350aaad5e9f.debug...
(gdb) run
Starting program: /usr/bin/vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffeca05700 (LWP 3259172)]
[New Thread 0x7fffe7c4e700 (LWP 3259173)]
[New Thread 0x7fffe6f48700 (LWP 3259174)]
progress 0% 393216/416611827712 258048
[New Thread 0x7fffe6747700 (LWP 3259175)]
[New Thread 0x7fffe5f46700 (LWP 3261574)]
[New Thread 0x7fffe5745700 (LWP 3261575)]
progress 1% 4166254592/416611827712 2710716416
progress 2% 8332247040/416611827712 4807213056
progress 3% 12498370560/416611827712 6867615744
progress 4% 16664494080/416611827712 10424430592
progress 5% 20830617600/416611827712 12500537344
progress 6% 24996806656/416611827712 14348218368
progress 7% 29162930176/416611827712 17227669504
progress 8% 33329053696/416611827712 18972573696
[...]
progress 98% 408279646208/416611827712 109400383488
progress 99% 412445769728/416611827712 109400383488
progress 100% 416611827712/416611827712 109400383488
image drive-virtio0: size=34359738368 zeros=608661504 saved=33751076864
image drive-virtio1: size=274877906944 zeros=1423118336 saved=273454788608
image drive-virtio2: size=107374182400 zeros=107368603648 saved=5578752
[Thread 0x7fffe5f46700 (LWP 3261574) exited]
[Thread 0x7fffe6747700 (LWP 3259175) exited]
[Thread 0x7fffe6f48700 (LWP 3259174) exited]
[Thread 0x7fffe7c4e700 (LWP 3259173) exited]
[Thread 0x7fffeca05700 (LWP 3259172) exited]
[Thread 0x7fffecb66cc0 (LWP 3259168) exited]
[Inferior 1 (process 3259168) exited normally]
Having noticed this, I tried binding the vma create process to a single CPU (via taskset 1 vma create ...), and this also seems to work reliably.
So this looks like a bug related to threading?
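To put numbers on the pinned versus unpinned behaviour, the same command can be run several times each way and the failures counted. The following is a minimal sketch under assumptions: count_failures is a hypothetical helper, and it treats any non-zero exit status as a failure, which includes a process killed by a signal such as the trap shown above.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print how many runs failed.
#   $1   = number of runs
#   $2.. = command and its arguments
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the exit status is inspected.
        # A run killed by a signal also yields a non-zero status.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

It would be called as count_failures 5 vma create ... for the unpinned case and count_failures 5 taskset 1 vma create ... for the pinned one (taskset reads 1 as an affinity mask, which selects CPU 0). Successful runs go through the whole archive, so a small run count is advisable.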
Any ideas what's going wrong here?
I have a command line that seems to reproduce this reliably, and can also share a coredump file in private if that would help.
More than happy to provide any further information.