[SOLVED] "vma create" segfaulting in libglib-2.0.so.0.6600.8 - multi-threading bug?

mika · Feb 11, 2022

Hi!

I have a PBS system (latest v2.1-1 with kernel 5.13.19-4-pve) running with ZFS raidz2 on a ProLiant DL380 Gen9 server (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 1 socket with 8 cores + 2 threads per core AKA 16 CPUs).
The system also has pve-qemu-kvm v6.1.1-1 installed (from http://download.proxmox.com/debian/pve), to execute vma create for restoring VMs for direct (import) usage in PVE environments.
This basically works fine, but for a few VMs and under certain(?) situations I get immediate segfaults when running vma create:

Code:

root@pbs01:~# vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
vma: vma_writer_register_stream 'drive-virtio2' failed
Trace/breakpoint trap (core dumped)
root@pbs01:~# dmesg -T | tail -1
[Fri Feb 11 14:54:05 2022] traps: vma[3258736] trap int3 ip:7f9e73366332 sp:7ffd45559170 error:0 in libglib-2.0.so.0.6600.8[7f9e73329000+88000]
root@pbs01:~# dpkg -l libglib2.0-0\* | grep '^ii'
ii  libglib2.0-0:amd64        2.66.8-1     amd64        GLib library of C routines
ii  libglib2.0-0-dbgsym:amd64 2.66.8-1     amd64        debug symbols for libglib2.0-0

I have a bt full from such a coredump, taken with the dbg packages installed (slightly trimmed in the paste):

https://paste.grml.org/hidden/5fdfb37c/
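
To line the dmesg entry up with the backtrace, it can help to turn the instruction pointer into an offset inside the reported libglib mapping. The following is only a minimal sketch: the regular expression and the function name fault_offset are made up for illustration, and the offset is relative to the mapping printed in brackets, which is not necessarily the start of the library file.

```python
import re

# Matches the "ip:<hex>" or "ip <hex>" field and the trailing
# "in <library>[<base>+<size>]" part of a kernel trap/segfault line.
FAULT_RE = re.compile(
    r"ip[: ](?P<ip>[0-9a-f]+).* in (?P<lib>\S+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (library, offset of the faulting instruction inside its mapping)."""
    m = FAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised fault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    # Sanity check: the instruction pointer must lie inside the mapping.
    if not base <= ip < base + size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("lib"), ip - base

# The line from the dmesg output above (timestamp removed).
line = ("traps: vma[3258736] trap int3 ip:7f9e73366332 sp:7ffd45559170 "
        "error:0 in libglib-2.0.so.0.6600.8[7f9e73329000+88000]")
lib, offset = fault_offset(line)
print(lib, hex(offset))  # libglib-2.0.so.0.6600.8 0x3d332
```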

Interestingly, this vma create ... crashes when run directly, but works perfectly fine when run under gdb:

Code:

root@pbs01:~# gdb --args vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vma...
Reading symbols from /usr/lib/debug/.build-id/3a/977ef179bb3be80ff7c2afff3ef350aaad5e9f.debug...
(gdb) run
Starting program: /usr/bin/vma create /mnt/offsite-backup/vzdump-qemu-103-2022_01_27-00_04_36.vma -v -c /srv/restore/103/fw.conf -c /srv/restore/103/qemu-server.conf drive-virtio0=/srv/restore/103/drive-virtio0.img drive-virtio1=/srv/restore/103/drive-virtio1.img drive-virtio2=/srv/restore/103/drive-virtio2.img
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffeca05700 (LWP 3259172)]
[New Thread 0x7fffe7c4e700 (LWP 3259173)]
[New Thread 0x7fffe6f48700 (LWP 3259174)]
progress 0% 393216/416611827712 258048
[New Thread 0x7fffe6747700 (LWP 3259175)]
[New Thread 0x7fffe5f46700 (LWP 3261574)]
[New Thread 0x7fffe5745700 (LWP 3261575)]
progress 1% 4166254592/416611827712 2710716416
progress 2% 8332247040/416611827712 4807213056
progress 3% 12498370560/416611827712 6867615744
progress 4% 16664494080/416611827712 10424430592
progress 5% 20830617600/416611827712 12500537344
progress 6% 24996806656/416611827712 14348218368
progress 7% 29162930176/416611827712 17227669504
progress 8% 33329053696/416611827712 18972573696
[...]
progress 98% 408279646208/416611827712 109400383488
progress 99% 412445769728/416611827712 109400383488
progress 100% 416611827712/416611827712 109400383488
image drive-virtio0: size=34359738368 zeros=608661504 saved=33751076864
image drive-virtio1: size=274877906944 zeros=1423118336 saved=273454788608
image drive-virtio2: size=107374182400 zeros=107368603648 saved=5578752
[Thread 0x7fffe5f46700 (LWP 3261574) exited]
[Thread 0x7fffe6747700 (LWP 3259175) exited]
[Thread 0x7fffe6f48700 (LWP 3259174) exited]
[Thread 0x7fffe7c4e700 (LWP 3259173) exited]
[Thread 0x7fffeca05700 (LWP 3259172) exited]
[Thread 0x7fffecb66cc0 (LWP 3259168) exited]
[Inferior 1 (process 3259168) exited normally]

Based on this, I tried binding the vma create process to a single CPU (via taskset 1 vma create ...), and this also seems to work reliably.
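
For reference, the argument to taskset is a CPU bitmask that is read as hexadecimal, so taskset 1 means mask 0x1, which selects only CPU 0. A minimal sketch of how such a mask expands to CPU numbers follows; the helper names mask_to_cpus and pin_current_process are made up for illustration, and the pinning call is Linux only.

```python
import os

def mask_to_cpus(mask):
    """Expand a taskset-style bitmask into the set of CPU numbers it selects."""
    return {bit for bit in range(mask.bit_length()) if mask >> bit & 1}

def pin_current_process(mask):
    """Restrict the calling process to the CPUs in mask (Linux only)."""
    # pid 0 means "the calling process" for sched_setaffinity.
    os.sched_setaffinity(0, mask_to_cpus(mask))

print(mask_to_cpus(0x1))     # {0}: what "taskset 1" selects
print(mask_to_cpus(0xffff))  # CPUs 0 to 15, i.e. all 16 CPUs of this host
```

Pinning to one CPU serializes the threads onto a single core, which tends to hide races rather than fix them, so a run that only works under taskset 1 is consistent with a threading problem.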

Now this looks like a bug related to threading?
Any ideas what's going wrong here?
I seem to have a reproducible cmdline available, and can also share such a coredump file in private if that would help.
More than happy to provide any further information.

fiona · Feb 14, 2022

Hi,
thank you for the detailed report! I think I was able to reproduce the issue and sent a patch that should fix it.

mika · Feb 14, 2022

That's great news, thanks for the fast fix!

Looking forward to giving this a try!

t.lamprecht · Feb 14, 2022

FYI: A package with the fix from Fabian is now available on the pvetest repo as pve-qemu-kvm version 6.1.1-2

https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_test_repo

mika · Feb 14, 2022

I can confirm that pve-qemu-kvm v6.1.1-2 seems to be stable; I could no longer reproduce my issue.
Thanks!

thex · May 3, 2023

This should be fixed, right?
I had spontaneous reboots recently and found this in the logs when investigating.

Code:

Apr 20 02:53:41 proxmox kernel: [20761.448291] show_signal_msg: 2 callbacks suppressed
Apr 20 02:53:41 proxmox kernel: [20761.448294] kvm[15747]: segfault at 51 ip 00007f4f21328f63 sp 00007fffd99d9020 error 4 in libglib-2.0.so.0.6600.8[7f4f212f5000+88000] likely on CPU 4 (core 4, socket 0)
Apr 20 02:53:41 proxmox kernel: [20761.448312] Code: 8b 7b 18 48 85 ff 74 ae 8b 43 08 85 c0 74 a7 48 8b 33 ba 01 00 00 00 e8 cb e2 ff ff eb 98 66 0f 1f 84 00 00 00 00 00 48 8b 03 <48> 8b 68 50 eb ac 0f 1f 80 00 00 00 00 41 55 41 54 55 53 48 83 ec
Apr 20 02:53:41 proxmox kernel: [20761.467286]  zd48: p1 p2 p3 p4 p5 p6 p7 p8
Apr 20 02:53:41 proxmox kernel: [20761.502282] vmbr0: port 4(tap102i0) entered disabled state
Apr 20 02:53:41 proxmox kernel: [20761.502407] vmbr0: port 4(tap102i0) entered disabled state
Apr 20 02:53:41 proxmox systemd[1]: 102.scope: Succeeded.
Apr 20 02:53:41 proxmox systemd[1]: 102.scope: Consumed 54min 23.314s CPU time.
Apr 20 02:53:42 proxmox qmeventd[805800]: Starting cleanup for 102
Apr 20 02:53:42 proxmox qmeventd[805800]: Finished cleanup for 102

# pveversion --verbose
proxmox-ve: 7.4-1 (running kernel: 6.2.6-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.3-8
pve-kernel-5.15: 7.3-3
pve-kernel-5.19: 7.2-15
pve-kernel-5.4: 6.4-20
pve-kernel-6.2.6-1-pve: 6.2.6-1
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

fiona · May 4, 2023

Hi,

thex said:
This should be fixed right?

yes, but your error is different.

thex said:

I had spontaneous reboots recently and found this in the logs when investigating.

Code:

Apr 20 02:53:41 proxmox kernel: [20761.448291] show_signal_msg: 2 callbacks suppressed
Apr 20 02:53:41 proxmox kernel: [20761.448294] kvm[15747]: segfault at 51 ip 00007f4f21328f63 sp 00007fffd99d9020 error 4 in libglib-2.0.so.0.6600.8[7f4f212f5000+88000] likely on CPU 4 (core 4, socket 0)

It's for the kvm binary, not the vma binary. And the segfault error code is 4, not 0.
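
As background on the error code: on x86, the number after "error" in a "segfault at ..." line is the page-fault error code, which is a bit field. The sketch below decodes it; the function name decode_pf_error is made up for illustration and is not a kernel or Proxmox API. Note that the original report was a "trap int3" line rather than a page fault, so its error:0 field is not decoded the same way.

```python
def decode_pf_error(code):
    """Describe an x86 page-fault error code as printed in 'segfault ... error N'."""
    return {
        # Bit 0: set for a protection violation, clear if the page was not present.
        "cause": "protection violation" if code & 0x1 else "page not present",
        # Bit 1: set for a write access, clear for a read.
        "access": "write" if code & 0x2 else "read",
        # Bit 2: set if the fault happened in user mode.
        "mode": "user" if code & 0x4 else "kernel",
        # Bit 3: a reserved bit was set in a page-table entry.
        "reserved_bit": bool(code & 0x8),
        # Bit 4: the fault was caused by an instruction fetch.
        "instruction_fetch": bool(code & 0x10),
    }

print(decode_pf_error(4))
```

Error 4 therefore reads as a user-mode read from a page that is not mapped, which fits a dereference of a bad pointer such as the address 51 shown in the log.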

But this alone should not lead to a spontaneous reboot, and it apparently didn't, because the log goes on. When did the reboot happen? Is there anything else in the logs? I'd also suggest you run a memtest.

thex · May 13, 2023

Thanks, you are right, I should have looked a bit closer.
I will do a memtest soon, but the crashes coincide exactly with an upgrade from 6.4 to 7.4.

sillyquota · Jun 25, 2024

In the past I ran a personal cloud-type situation on a Proxmox node with 1 or 2 VM's. I had started out with Proxmox 6 and manually upgraded it to 7. This was running on slow, outdated hardware vulnerable to ALL those heartbleed/spectre/whatever-else vuln's. I had no issues what-so-ever; everything worked great. Long story short, I ran into some issues and had to shut the whole thing down for a little while.

Now I'm in a position where I can get my personal cloud experiment running again, and this time on better hardware! After a fresh install of Proxmox 8.2.4 on a Ryzen 7 I was disheartened to see that every day (for the first 2 days), around the early-morning hours, my one-and-only primary VM was crashing with the following being found in the host-node dmesg:

Code:

[203283.914684] show_signal_msg: 17 callbacks suppressed
[203283.914688] kvm[481925]: segfault at 4 ip 00007069dfb3c121 sp 00007ffe3528bb60 error 4 in libglib-2.0.so.0.7400.6[7069dfb05000+8d000] likely on CPU 3 (core 3, socket 0)
[203283.914699] Code: f3 8b 09 49 8d 34 f6 66 89 56 04 31 d2 89 0e 66 89 56 06 83 c3 01 48 89 c6 48 8b 40 10 48 85 c0 74 3f 39 68 18 7f f2 48 8b 08 <0f> b7 51 04 83 e2 c7 48 85 f6 74 c3 48 8b 36 8b 3e 39 39 75 ba 41
[203283.924175] fwbr100i0: port 2(tap100i0) entered disabled state
[203283.924355] tap100i0 (unregistering): left allmulticast mode
[203283.924360] fwbr100i0: port 2(tap100i0) entered disabled state
[203284.401555] fwbr100i0: port 1(fwln100i0) entered disabled state
[203284.401596] vmbr0: port 2(fwpr100p0) entered disabled state

I only know enough about Proxmox to get myself in trouble... I can get it very basically setup and get some VMs running. I'm not at all proficient with Proxmox so this was really sad for me because when it comes to segfaults and things, I'm completely screwed. I have no idea where to go, what to do. In these situations, all I can do is wait for new code to be released.

When I set up this VM I told it to use 16 vCPUs at only 1433% (cpulimit 14.33). In the past I never used this feature. I figured maybe it's got something to do with this feature that I've never used before, so I turned it off (set cpulimit back to undefined/unlimited).

Today is day 3 and the VM did NOT crash this morning

I know this is a little early to be claiming victory but the first 2 days it crashed at the same time (shortly after when the node automatically updates its package database) and today it did not crash. Hopefully that was the cause of my crashes and hopefully this helps you or someone else.

TL;DR: VM with 16 vCPUs crashes every morning with the above dmesg error. Re-configured the VM's cpulimit from 14.33 to <undefined> (aka unlimited). VM did not crash this morning. Hopefully the cause was cpulimit and hopefully it's fixed.

fiona · Jun 25, 2024

Hi,

Can you please share the VM configuration (qm config <ID>) and the output of pveversion -v? If you want to test a bit more, what happens if you use 14 instead of 14.33? Just a wild guess, but maybe it has to do with the value being fractional.

sillyquota · Jun 25, 2024

root@hostnode:/var/log# qm config 100

Code:

agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;ide2;net0
cores: 16
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd
ide2: none,media=cdrom
memory: 57344
meta: creation-qemu=8.1.5,ctime=1719028175
name: virtyboi
net0: virtio=BC:24:11:2A:38:0D,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: data480:100/vm-100-disk-0.qcow2,discard=on,iothread=1,size=447G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=143e3945-8aa4-4a94-a61f-c55d1225eb29
sockets: 1
vmgenid: e4adf61a-b9ee-4eb0-a0fd-9547ebc0e0c2

root@hostnode:/var/log# pveversion -v

Code:

proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx8
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

root@hostnode:/var/log# cat /proc/cpuinfo

Code:

processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 23
model        : 113
model name    : AMD Ryzen 7 3800X 8-Core Processor
stepping    : 0
microcode    : 0x8701030
cpu MHz        : 3900.000
cache size    : 512 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 8
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 16
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso
bogomips    : 7784.67
TLB size    : 3072 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
<TRUNCATED>

I'm happy to say that my VM has made it through another day without the early-morning segfault! I have set cpulimit to 14 and I will report back in the coming days with the results. Let me know if I can provide any other logs/outputs for you!

On a side note, however, I am now getting some kind of kernel panic/crash with regard to pve-ha-lrm. My original issue appears to be resolved, but I figured I'd put this here since it's a new error that popped up only after resetting my cpulimit back to unlimited, and it might be relevant to someone else's setup if they use HA. I don't need or use high-availability stuff, so it's not particularly high on my list of things to resolve.

Code:

[318419.836133] BUG: kernel NULL pointer dereference, address: 0000000000000000
[318419.836151] #PF: supervisor instruction fetch in kernel mode
[318419.836164] #PF: error_code(0x0010) - not-present page
[318419.836176] PGD 0 P4D 0
[318419.836185] Oops: 0010 [#1] PREEMPT SMP NOPTI
[318419.836197] CPU: 11 PID: 1138 Comm: pve-ha-lrm Tainted: P           O       6.8.8-1-pve #1
[318419.836215] Hardware name: Gigabyte Technology Co., Ltd. B450M DS3H/B450M DS3H-CF, BIOS F65b 09/20/2023
[318419.836234] RIP: 0010:0x0
[318419.836256] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[318419.836269] RSP: 0018:ffffb756c5157da8 EFLAGS: 00010246
[318419.836282] RAX: 0000000000000000 RBX: 000000012a05f200 RCX: 0000000000000000
[318419.836298] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[318419.836312] RBP: ffffb756c5157d98 R08: 0000000000000000 R09: 0000000000000000
[318419.836327] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[318419.836342] R13: 000000000000c350 R14: 000000000000c350 R15: ffff9d1115e95180
[318419.836357] FS:  00007df0a3b56740(0000) GS:ffff9d1ffe780000(0000) knlGS:0000000000000000
[318419.836373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[318419.836386] CR2: ffffffffffffffd6 CR3: 00000001174b6000 CR4: 0000000000350ef0
[318419.836401] Call Trace:
[318419.836409]  <TASK>
[318419.836416]  ? show_regs+0x6d/0x80
[318419.836429]  ? __die+0x24/0x80
[318419.836440]  ? page_fault_oops+0x176/0x500
[318419.836455]  ? do_user_addr_fault+0x2f9/0x6b0
[318419.836467]  ? srso_return_thunk+0x5/0x5f
[318419.836480]  ? exc_page_fault+0x83/0x1b0
[318419.836493]  ? asm_exc_page_fault+0x27/0x30
[318419.836510]  ? __pfx_hrtimer_wakeup+0x10/0x10
[318419.836524]  ? common_nsleep+0x43/0x60
[318419.836536]  ? __x64_sys_clock_nanosleep+0xe5/0x160
[318419.836549]  ? x64_sys_call+0x10c1/0x24b0
[318419.836561]  ? do_syscall_64+0x81/0x170
[318419.836572]  ? srso_return_thunk+0x5/0x5f
[318419.836583]  ? do_syscall_64+0x8d/0x170
[318419.836595]  ? srso_return_thunk+0x5/0x5f
[318419.836606]  ? syscall_exit_to_user_mode+0x89/0x260
[318419.836619]  ? srso_return_thunk+0x5/0x5f
[318419.836630]  ? do_syscall_64+0x8d/0x170
[318419.836640]  ? irqentry_exit+0x43/0x50
[318419.836650]  ? srso_return_thunk+0x5/0x5f
[318419.836662]  ? entry_SYSCALL_64_after_hwframe+0x78/0x80
[318419.836679]  </TASK>
[318419.836686] Modules linked in: tcp_diag inet_diag veth dm_crypt twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic algif_skcipher af_alg ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables libcrc32c bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc amdgpu amdxcp drm_exec gpu_sched drm_buddy intel_rapl_msr intel_rapl_common edac_mce_amd input_leds crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 radeon aesni_intel crypto_simd drm_suballoc_helper cryptd hid_generic drm_ttm_helper usbkbd ttm drm_display_helper cec rc_core usbhid i2c_algo_bit hid video gigabyte_wmi wmi_bmof rapl pcspkr k10temp mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap kvm_amd ccp kvm irqbypass efi_pstore dmi_sysfs ip_tables x_tables autofs4 xhci_pci xhci_pci_renesas crc32_pclmul r8169 e1000e i2c_piix4 realtek xhci_hcd ahci libahci
[318419.836789]  wmi gpio_amdpt [last unloaded: msr]
[318419.836947] CR2: 0000000000000000
[318419.836957] ---[ end trace 0000000000000000 ]---
[318419.836968] RIP: 0010:0x0
[318419.836977] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[318419.836991] RSP: 0018:ffffb756c5157da8 EFLAGS: 00010246
[318419.837004] RAX: 0000000000000000 RBX: 000000012a05f200 RCX: 0000000000000000
[318419.837018] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[318419.837033] RBP: ffffb756c5157d98 R08: 0000000000000000 R09: 0000000000000000
[318419.837507] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[318419.837978] R13: 000000000000c350 R14: 000000000000c350 R15: ffff9d1115e95180
[318419.838445] FS:  00007df0a3b56740(0000) GS:ffff9d1ffe780000(0000) knlGS:0000000000000000
[318419.838912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[318419.839378] CR2: ffffffffffffffd6 CR3: 00000001174b6000 CR4: 0000000000350ef0
[318419.839846] note: pve-ha-lrm[1138] exited with irqs disabled

fiona · Jun 26, 2024

sillyquota said:
root@hostnode:/var/log# qm config 100

Code:

cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd

My first guess is that the issue could also be related to the additional flags.

Do you have the latest BIOS updates/CPU microcode installed?

sillyquota said:

On a side note, however, I am now getting some kind of kernel panic/crash with regard to pve-ha-lrm. My original issue appears to be resolved, but I figured I'd put this here since it's a new error that popped up only after resetting my cpulimit back to unlimited, and it might be relevant to someone else's setup if they use HA. I don't need or use high-availability stuff, so it's not particularly high on my list of things to resolve.

Did you update the kernel recently? I'd also do a memory test when you have some downtime, just to be sure.

sillyquota · Jun 27, 2024

fiona said:
My first guess is that the issue could also be related to the additional flags.

Do you have the latest BIOS updates/CPU microcode installed?

Did you update the kernel recently? I'd also do a memory test when you have some downtime, just to be sure.

After setting cpulimit to 14, the VM has still NOT crashed the way it did before. Perhaps the issue was the fractional cpulimit (or the fractional cpulimit along with the extra flags).

My host CPU does have ibpb and ssbd flags so I wasn't sure if I really needed to add those extra flags. The Proxmox admin guide seemed to urge them so I added them anyways. +ibpb is certainly redundant and the other 2 probably are as well. Is it safe to remove +virt-ssbd;+amd-ssbd since my CPU has ssbd?
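
One way to check which of these names the host reports is to look them up in the flags line of /proc/cpuinfo. The following is only a rough sketch with made-up helper names (host_flags, report). The flag names in the VM config are the guest-facing spellings and need not match the host's spelling in /proc/cpuinfo, so a False here is not proof that the feature is missing, and it does not by itself say whether removing a flag is safe.

```python
def host_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return set(value.split())
    return set()

def report(cpuinfo_text, wanted):
    """Map each wanted flag name to whether the host lists it."""
    flags = host_flags(cpuinfo_text)
    return {flag: flag in flags for flag in wanted}

# Shortened sample based on the cpuinfo output posted earlier in this thread.
sample = "flags\t\t: fpu vme ssbd mba ibpb stibp vmmcall\n"
print(report(sample, ["ibpb", "ssbd", "virt-ssbd", "amd-ssbd"]))

# On a real host, pass the file contents instead of the sample:
# report(open("/proc/cpuinfo").read(), ["ibpb", "ssbd"])
```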

The BIOS seems to be 1 version behind what's available on the manufacturer website: "Gigabyte B450M DS3H/B450M DS3H-CF, BIOS F65b" (version F66 is available). As the hardware isn't mine, nor in my control, I've submitted a ticket to see if the NOC will flash it.
The hostnode has the following microcode installed: "amd64-microcode is already the newest version (3.20230808.1.1~deb12u1)"

I'm not sure about the kernel. I didn't do any kernel updates, but it's possible there was an automated upgrade when the server was provisioned. The first thing I did when I got access to the server was check for updates, but everything was already fully updated (which I thought was a little weird).

EDIT: The BIOS has been updated to the latest available version, F66, which added/fixed:

Update AMD AGESA V2 1.2.0.B
Fix AMD processor vulnerabilities security
Addresses potential UEFI vulnerabilities. (LogoFAIL)
