Ubuntu 22.04 VMs show "BUG: Bad page state in process *** pfn:***" and crash after some time

athurdent

The system is a Supermicro M11SDV-8C-LN4F with an Intel X710-DA2 (SR-IOV in use). It has been working fine for over a year now.
The base OS (Debian Bullseye) is also working fine: no kernel errors are logged, and there are no problems with heavy hard-disk or network I/O, e.g. when backing up VMs to a NAS.
An OPNsense VM is also not affected; it can push through gigabytes of traffic, and the write I/O from Zenarmor's database is not causing any trouble either.
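For reference, this is roughly how I check the SR-IOV and driver situation on both sides; the interface names below are placeholders, not necessarily my actual ones:
Code:
# on the Proxmox host: list the X710 virtual functions and the PF driver/firmware
lspci | grep -i "Virtual Function"
ethtool -i enp1s0f0        # placeholder PF interface name
# inside a VM: confirm the VF is driven by iavf and note the driver/firmware versions
ethtool -i ens16           # placeholder VF interface name
dmesg | grep -i iavf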

Going back through the kern.log of the Ubuntu machines, the problem seems to have started after I applied this update on the host:
Code:
Start-Date: 2022-10-16  07:55:41
Commandline: apt-get dist-upgrade
Install: pve-kernel-5.15.60-2-pve:amd64 (5.15.60-2, automatic)
Upgrade: dbus-user-session:amd64 (1.12.20-2, 1.12.24-0+deb11u1), pve-firmware:amd64 (3.5-3, 3.5-4), tzdata:amd64 (2021a-1+deb11u5, 2021a-1+deb11u7), zfs-zed:amd64 (2.1.5-pve1, 2.1.6-pve1), libnvpair3linux:amd64 (2.1.5-pve1, 2.1.6-pve1), libuutil3linux:amd64 (2.1.5-pve1, 2.1.6-pve1), libpve-storage-perl:amd64 (7.2-9, 7.2-10), libzpool5linux:amd64 (2.1.5-pve1, 2.1.6-pve1), libpve-guest-common-perl:amd64 (4.1-2, 4.1-3), libdbus-1-3:amd64 (1.12.20-2, 1.12.24-0+deb11u1), isc-dhcp-common:amd64 (4.4.1-2.3, 4.4.1-2.3+deb11u1), proxmox-backup-file-restore:amd64 (2.2.6-1, 2.2.7-1), isc-dhcp-client:amd64 (4.4.1-2.3, 4.4.1-2.3+deb11u1), proxmox-backup-client:amd64 (2.2.6-1, 2.2.7-1), libpve-http-server-perl:amd64 (4.1-3, 4.1-4), libpve-common-perl:amd64 (7.2-2, 7.2-3), pve-kernel-5.15:amd64 (7.2-11, 7.2-12), libzfs4linux:amd64 (2.1.5-pve1, 2.1.6-pve1), dbus:amd64 (1.12.20-2, 1.12.24-0+deb11u1), pve-kernel-helper:amd64 (7.2-12, 7.2-13), zfsutils-linux:amd64 (2.1.5-pve1, 2.1.6-pve1)
End-Date: 2022-10-16  07:56:36

Starting the day after the update, e.g. speedtest began causing the above-mentioned problems. The first log entry is from Oct 17th.
Code:
Oct 17 09:00:10 monitor kernel: [89066.126736] BUG: Bad page state in process speedtest  pfn:5148f
Oct 18 12:00:12 monitor kernel: [186269.143191] BUG: Bad page state in process speedtest  pfn:2650c
Oct 19 12:00:13 monitor kernel: [272671.359707] BUG: Bad page state in process speedtest  pfn:0a3e0
Oct 22 09:00:12 monitor kernel: [ 8768.674228] BUG: Bad page state in process speedtest  pfn:10a86a
Oct 22 12:00:07 monitor kernel: [19563.611729] BUG: Bad page state in process speedtest  pfn:24d08
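These lines were simply grepped out of the guests' kernel logs; something along these lines should reproduce the list (the path assumes Ubuntu's default rsyslog setup):
Code:
# inside an affected VM
grep "BUG: Bad page state" /var/log/kern.log
# or, for the current boot only
dmesg -T | grep "Bad page state"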

Example crash; various other processes (e.g. kswapd, kworker, swapper) are showing similar behaviour as well:
Code:
[Thu Oct 27 05:32:06 2022] BUG: Bad page state in process speedtest  pfn:1091e
[Thu Oct 27 05:32:06 2022] page:00000000d7d00019 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1091e
[Thu Oct 27 05:32:06 2022] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[Thu Oct 27 05:32:06 2022] raw: 000fffffc0000000 dead000000000100 dead000000000122 0000000000000000
[Thu Oct 27 05:32:06 2022] raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
[Thu Oct 27 05:32:06 2022] page dumped because: nonzero _refcount
[Thu Oct 27 05:32:06 2022] Modules linked in: tls ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common kvm_amd ccp kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd bochs drm_vram_helper drm_ttm_helper joydev input_leds ttm serio_raw drm_kms_helper cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt qemu_fw_cfg mac_hid sch_fq_codel lp parport nfsd ramoops pstore_blk reed_solomon pstore_zone auth_rpcgss nfs_acl efi_pstore lockd drm grace sunrpc ip_tables x_tables autofs4 crc32_pclmul psmouse iavf virtio_scsi i2c_piix4 pata_acpi floppy
[Thu Oct 27 05:32:06 2022] CPU: 3 PID: 7434 Comm: speedtest Not tainted 5.15.0-52-generic #58-Ubuntu
[Thu Oct 27 05:32:06 2022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[Thu Oct 27 05:32:06 2022] Call Trace:
[Thu Oct 27 05:32:06 2022]  <TASK>
[Thu Oct 27 05:32:06 2022]  show_stack+0x52/0x5c
[Thu Oct 27 05:32:06 2022]  dump_stack_lvl+0x4a/0x63
[Thu Oct 27 05:32:06 2022]  dump_stack+0x10/0x16
[Thu Oct 27 05:32:06 2022]  bad_page.cold+0x63/0x94
[Thu Oct 27 05:32:06 2022]  check_free_page_bad+0x66/0x70
[Thu Oct 27 05:32:06 2022]  free_pcppages_bulk+0x1bf/0x390
[Thu Oct 27 05:32:06 2022]  free_unref_page_commit.constprop.0+0x122/0x160
[Thu Oct 27 05:32:06 2022]  free_unref_page+0xe3/0x190
[Thu Oct 27 05:32:06 2022]  __put_page+0x77/0xe0
[Thu Oct 27 05:32:06 2022]  skb_release_data+0x10d/0x180
[Thu Oct 27 05:32:06 2022]  __kfree_skb+0x26/0x40
[Thu Oct 27 05:32:06 2022]  tcp_recvmsg_locked+0x763/0x9e0
[Thu Oct 27 05:32:06 2022]  tcp_recvmsg+0x79/0x1c0
[Thu Oct 27 05:32:06 2022]  inet_recvmsg+0x5c/0x120
[Thu Oct 27 05:32:06 2022]  ? security_socket_recvmsg+0x3d/0x60
[Thu Oct 27 05:32:06 2022]  sock_recvmsg+0x71/0x80
[Thu Oct 27 05:32:06 2022]  __sys_recvfrom+0x1a2/0x1d0
[Thu Oct 27 05:32:06 2022]  __x64_sys_recvfrom+0x24/0x30
[Thu Oct 27 05:32:06 2022]  do_syscall_64+0x5c/0xc0
[Thu Oct 27 05:32:06 2022]  ? exit_to_user_mode_prepare+0x37/0xb0
[Thu Oct 27 05:32:06 2022]  ? syscall_exit_to_user_mode+0x27/0x50
[Thu Oct 27 05:32:06 2022]  ? __x64_sys_recvfrom+0x24/0x30
[Thu Oct 27 05:32:06 2022]  ? do_syscall_64+0x69/0xc0
[Thu Oct 27 05:32:06 2022]  ? exit_to_user_mode_prepare+0x37/0xb0
[Thu Oct 27 05:32:06 2022]  ? syscall_exit_to_user_mode+0x27/0x50
[Thu Oct 27 05:32:06 2022]  ? __x64_sys_recvfrom+0x24/0x30
[Thu Oct 27 05:32:06 2022]  ? do_syscall_64+0x69/0xc0
[Thu Oct 27 05:32:06 2022]  ? do_syscall_64+0x69/0xc0
[Thu Oct 27 05:32:06 2022]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Thu Oct 27 05:32:06 2022] RIP: 0033:0x5c1d56
[Thu Oct 27 05:32:06 2022] Code: 44 24 08 75 c4 48 89 e8 49 03 47 08 e9 a8 fe ff ff 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <c3> 83 ff 0e 75 10 48 83 3e 00 ba 90 6d 5f 00 b8 fe 6e 5f 00 eb 28
[Thu Oct 27 05:32:06 2022] RSP: 002b:00007f9ff0151538 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
[Thu Oct 27 05:32:06 2022] RAX: ffffffffffffffda RBX: 00007f9ff0151850 RCX: 00000000005c1d56
[Thu Oct 27 05:32:06 2022] RDX: 0000000000008000 RSI: 000000000184bc80 RDI: 0000000000000008
[Thu Oct 27 05:32:06 2022] RBP: 00007f9ff0151660 R08: 0000000000000000 R09: 0000000000000000
[Thu Oct 27 05:32:06 2022] R10: 0000000000000020 R11: 0000000000000246 R12: 0000000000000000
[Thu Oct 27 05:32:06 2022] R13: 00000000018479c0 R14: 0000000000008000 R15: 00007f9ff0151701
[Thu Oct 27 05:32:06 2022]  </TASK>
[Thu Oct 27 05:32:06 2022] Disabling lock debugging due to kernel taint

Code:
root@epyc:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 
The logs are from the VM... did you try to correlate them with either of these two possible sources:
- kernel upgrades in the VM
- qemu upgrades on the host

Is it always logging a refcount of -1 and a network-related stack trace? Could you maybe post all of the errors & traces?
 
Thank you very much for looking into this! I've attached a few more crashes. It is always refcount -1, it seems.

I have now also upgraded the Intel card's firmware, which did not help.

QEMU upgrades would have come from Proxmox, so it seems there were none.
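For what it's worth, that can be double-checked on the host: any QEMU update would show up in the apt history (package name taken from the pveversion output above), roughly like this:
Code:
# on the Proxmox host: search current and rotated apt history for QEMU upgrades
grep -h "pve-qemu-kvm" /var/log/apt/history.log
zgrep -h "pve-qemu-kvm" /var/log/apt/history.log.*.gz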

Ubuntu kernel upgrades inside the VMs do correlate:
Code:
Start-Date: 2022-10-16  08:10:29
Commandline: apt-get dist-upgrade
Install: linux-headers-5.15.0-50-generic:amd64 (5.15.0-50.56, automatic), linux-modules-5.15.0-50-generic:amd64 (5.15.0-50.56, automatic), linux-modules-extra-5.15.0-50-generic:amd64 (5.15.0-50.56, automatic), linux-headers-5.15.0-50:amd64 (5.15.0-50.56, automatic), linux-image-5.15.0-50-generic:amd64 (5.15.0-50.56, automatic)
Upgrade: linux-headers-generic:amd64 (5.15.0.48.48, 5.15.0.50.50), linux-generic:amd64 (5.15.0.48.48, 5.15.0.50.50), linux-image-generic:amd64 (5.15.0.48.48, 5.15.0.50.50), unzip:amd64 (6.0-26ubuntu3, 6.0-26ubuntu3.1)
End-Date: 2022-10-16  08:10:49

What seems to have helped just now is installing Intel's original iavf 4.5.3 driver in the VMs. I have since tortured one VM with high network load, which would normally have surfaced the issue.
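Roughly the steps I used to build and load the out-of-tree driver, from memory; the exact file names and procedure are per Intel's README for the iavf release, so treat this as a sketch:
Code:
# inside the VM: build prerequisites for an out-of-tree module
sudo apt install build-essential linux-headers-$(uname -r)
# unpack the iavf 4.5.3 release downloaded from Intel and build/install it
tar xf iavf-4.5.3.tar.gz
cd iavf-4.5.3/src
sudo make install
# reload the module (this briefly drops the VF link) and check which version is active
sudo rmmod iavf && sudo modprobe iavf
modinfo iavf | grep -i ^version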

So maybe Ubuntu changed something in the network stack / drivers that is causing this?
 

Attachments

  • traces.txt (48 KB)
I still have one unpatched test VM on which I can reproduce the error by simply running Ookla's CLI speedtest (roughly as sketched below).
I also tried the 22.04 HWE kernel, no luck there.
If you still want to look into what seems to be an Ubuntu problem, I can keep that VM without the Intel driver and run tests. Just let me know.
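The reproducer itself is nothing special; something like this loop in the test VM usually triggers it within a few runs (the CLI flags may differ depending on the speedtest version):
Code:
# in the unpatched test VM: run speedtest until a bad-page trace shows up
while ! dmesg | grep -q "Bad page state"; do
    speedtest --accept-license --accept-gdpr
    sleep 60
done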
Thank you very much for the help though!
 
It's probably best to file that on the Ubuntu side (either in Launchpad, or, if you have some sort of support contract, there ;))
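If apport is available in the VM, a report against the kernel package can be opened straight from there; something like this should gather the relevant details automatically:
Code:
# inside the affected Ubuntu VM; opens a Launchpad bug against the kernel package
ubuntu-bug linux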
 
Hi Athurdent,

I had the same problem and had no clue.

May I ask: did updating the iavf driver fix it completely, or is the problem still there? I am going to give it a try.

Thanks.
 