PVE 8.0.3 frequent page faults

n1els_ph

Member
Sep 5, 2019
I have been setting up a new home server with a fresh Proxmox installation. I have had about five lockups today, during which the whole system hangs and the kernel reports page faults.

I don't know how to copy the text of the page fault, so I have attached it as a picture. If I can get it copied as text somehow, please let me know.
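In case it helps, the oops text can usually be recovered from the systemd journal after a reboot instead of photographing the screen, assuming journald's persistent storage is enabled on the host (a sketch, not verified on this box):

```shell
# Kernel messages from the previous boot (needs Storage=persistent in journald.conf)
journalctl -k -b -1

# Or, while the host is still up, only the high-severity kernel messages
dmesg --level=err,crit,alert,emerg
```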

PVE is running the latest version:
Code:
root@pve:~# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)

I have completed a full memtest86+ test with zero errors.

Full dmesg output is included as well, please let me know what else to add.

System is running on an Intel J6413 CPU
 

Attachments

  • Proxmox page fault.jpg (893.5 KB)
  • dmesg.txt (72.1 KB)
Code:
Oct 07 17:50:12 pve kernel: BUG: unable to handle page fault for address: ffffffffffffff72
Oct 07 17:50:12 pve kernel: #PF: supervisor read access in kernel mode
Oct 07 17:50:12 pve kernel: #PF: error_code(0x0000) - not-present page
Oct 07 17:50:12 pve kernel: PGD 777215067 P4D 777215067 PUD 777217067 PMD 0
Oct 07 17:50:12 pve kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 07 17:50:12 pve kernel: CPU: 1 PID: 1670 Comm: vhost-1626 Tainted: P           O       6.2.16-3-pve #1
Oct 07 17:50:12 pve kernel: Hardware name: Default string Default string/Default string, BIOS 5.19 05/15/2023
Oct 07 17:50:12 pve kernel: RIP: 0010:vhost_get_vq_desc+0x43c/0xaf0 [vhost]
Oct 07 17:50:12 pve kernel: Code: 0f 87 27 06 00 00 48 8b 45 80 41 8b 0e 03 08 85 ff 0f 84 5e 06 00 00 44 89 bd 48 ff ff ff 41 bc 01 00 00 00 41 89 cf 41 89 fd <48> 8b 97 70 ff ff ff be 10 00 00 00 48 8d 7d 98 e8 5f 93 d1 f3 48
Oct 07 17:50:12 pve kernel: RSP: 0018:ffffa956c3ebbc58 EFLAGS: 00010202
Oct 07 17:50:12 pve kernel: RAX: ffffa956c3ebbde4 RBX: ffff88a24e5c4ab0 RCX: 0000000000000000
Oct 07 17:50:12 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
Oct 07 17:50:12 pve kernel: RBP: ffffa956c3ebbd18 R08: 0000000000000000 R09: 0000000000000000
Oct 07 17:50:12 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Oct 07 17:50:12 pve kernel: R13: 0000000000000002 R14: ffffa956c3ebbde0 R15: 0000000000000000
Oct 07 17:50:12 pve kernel: FS:  0000000000000000(0000) GS:ffff88a99fe80000(0000) knlGS:0000000000000000
Oct 07 17:50:12 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 07 17:50:12 pve kernel: CR2: ffffffffffffff72 CR3: 000000010e822000 CR4: 0000000000352ee0
Oct 07 17:50:12 pve kernel: Call Trace:
Oct 07 17:50:12 pve kernel:  <TASK>
Oct 07 17:50:12 pve kernel:  get_tx_bufs.constprop.0+0x46/0x1e0 [vhost_net]
Oct 07 17:50:12 pve kernel:  ? newidle_balance+0x325/0x4b0
Oct 07 17:50:12 pve kernel:  handle_tx_copy+0xe0/0x700 [vhost_net]
Oct 07 17:50:12 pve kernel:  ? raw_spin_rq_unlock+0x10/0x40
Oct 07 17:50:12 pve kernel:  handle_tx+0xc0/0xd0 [vhost_net]
Oct 07 17:50:12 pve kernel:  handle_tx_kick+0x15/0x20 [vhost_net]
Oct 07 17:50:12 pve kernel:  vhost_worker+0x7b/0xd0 [vhost]
Oct 07 17:50:12 pve kernel:  ? __pfx_vhost_worker+0x10/0x10 [vhost]
Oct 07 17:50:12 pve kernel:  kthread+0xe6/0x110
Oct 07 17:50:12 pve kernel:  ? __pfx_kthread+0x10/0x10
Oct 07 17:50:12 pve kernel:  ret_from_fork+0x29/0x50
Oct 07 17:50:12 pve kernel:  </TASK>
Oct 07 17:50:12 pve kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation intel_rapl_msr soundwire_cadence intel_rapl_common snd_sof_intel_hda snd_sof_pci x86_pkg_temp_thermal intel_powerclamp snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus coretemp i915 snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine drm_buddy kvm_intel snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ttm drm_display_helper kvm irqbypass crct10dif_pclmul cec polyval_generic snd_hda_codec snd_hda_core snd_hwdep cmdlinepart rc_core snd_pcm drm_kms_helper ghash_clmulni_intel sha512_ssse3 aesni_intel i2c_algo_bit
Oct 07 17:50:12 pve kernel:  crypto_simd syscopyarea spi_nor cryptd snd_timer sysfillrect wmi_bmof snd intel_cstate intel_wmi_thunderbolt mtd pcspkr soundcore sysimgblt mei_me mei acpi_pad acpi_tad joydev input_leds mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbkbd usbhid hid nvme sdhci_pci nvme_core crc32_pclmul video r8169 realtek i2c_i801 xhci_pci ahci i2c_smbus nvme_common xhci_pci_renesas igc spi_intel_pci spi_intel cqhci xhci_hcd sdhci libahci wmi
Oct 07 17:50:12 pve kernel: CR2: ffffffffffffff72
Oct 07 17:50:12 pve kernel: ---[ end trace 0000000000000000 ]---
Oct 07 17:50:12 pve kernel: RIP: 0010:vhost_get_vq_desc+0x43c/0xaf0 [vhost]
Oct 07 17:50:12 pve kernel: Code: 0f 87 27 06 00 00 48 8b 45 80 41 8b 0e 03 08 85 ff 0f 84 5e 06 00 00 44 89 bd 48 ff ff ff 41 bc 01 00 00 00 41 89 cf 41 89 fd <48> 8b 95 70 ff ff ff be 10 00 00 00 48 8d 7d 98 e8 5f 93 d1 f3 48
Oct 07 17:50:12 pve kernel: RSP: 0018:ffffa956c3ebbc58 EFLAGS: 00010202
Oct 07 17:50:12 pve kernel: RAX: ffffa956c3ebbde4 RBX: ffff88a24e5c4ab0 RCX: 0000000000000000
Oct 07 17:50:12 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
Oct 07 17:50:12 pve kernel: RBP: ffffa956c3ebbd18 R08: 0000000000000000 R09: 0000000000000000
Oct 07 17:50:12 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Oct 07 17:50:12 pve kernel: R13: 0000000000000002 R14: ffffa956c3ebbde0 R15: 0000000000000000
Oct 07 17:50:12 pve kernel: FS:  0000000000000000(0000) GS:ffff88a99fe80000(0000) knlGS:0000000000000000
Oct 07 17:50:12 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 07 17:50:12 pve kernel: CR2: ffffffffffffff72 CR3: 000000010e822000 CR4: 0000000000352ee0
Oct 07 17:50:12 pve kernel: note: vhost-1626[1670] exited with irqs disabled
Oct 07 17:53:28 pve QEMU[1626]: kvm: ../accel/kvm/kvm-all.c:920: kvm_log_clear_one_slot: Assertion `mem->dirty_bmap' failed.
Oct 07 17:53:28 pve pvedaemon[970]: <root@pam> end task UPID:pve:000006C6:00004A38:65212994:vncproxy:101:root@pam: OK
Oct 07 17:53:37 pve pvedaemon[2441]: start VM 101: UPID:pve:00000989:0000B38F:65212AA1:qmstart:101:root@pam:
Oct 07 17:53:37 pve pvedaemon[970]: <root@pam> starting task UPID:pve:00000989:0000B38F:65212AA1:qmstart:101:root@pam:
Oct 07 17:53:38 pve lvm[2451]: /dev/dm-4 excluded: device is an LV.
Oct 07 17:53:38 pve systemd[1]: 101.scope: Deactivated successfully.
Oct 07 17:53:38 pve systemd[1]: Stopped 101.scope.
Oct 07 17:53:38 pve systemd[1]: 101.scope: Consumed 6min 43.463s CPU time.
Oct 07 17:53:58 pve pvedaemon[2441]: timeout waiting on systemd
Oct 07 17:53:58 pve pvedaemon[970]: <root@pam> end task UPID:pve:00000989:0000B38F:65212AA1:qmstart:101:root@pam: timeout waiting on systemd
 
Run Memtest on the system. It is available at the Proxmox bootup screen.

This is Linux reporting memory addressing errors.
Thanks for the reply. I did one full pass with memtest with zero errors. Since a full pass on 32GB takes a while, I left it at one pass, but I can leave it running overnight to see if anything comes up on later passes.

The only weird thing is that memtest describes my single 32GB stick as 64GB. The detected capacity is correct, though, so I'm not sure whether this could be causing any issues. What do you think?

Edit: just restarted memtest; at about 1 pass per hour I should have more data in a couple of hours. I also reseated the memory into slot 1 instead of slot 2, where it was initially.

Edit2: first pass completed again without any errors. I'll leave it running for now to get more passes.
 

Attachments

  • 20231007_003757.jpg (397.5 KB)
  • 20231008_134511.jpg (344.3 KB)
After 46 hours, memtest has completed 21 passes and is still at 0 errors.
Any suggestions on what else to try are greatly appreciated.
 

Attachments

  • 21passes.jpeg (146.3 KB)
Hi,
please provide the output of pveversion -v and the configuration of the VM that failed after the page fault for the invalid address, i.e. qm config 101. How many VMs do you have?
 
Hi Fiona, thanks for your reply.

I have two VMs running: VMID 100 runs Zentyal 7 community, and VMID 101 runs Ubuntu 22.04 with some docker containers such as sonarr, radarr, etc. I am pretty sure the problems start in VMID 101, but I'll include both.

I read somewhere that changing the CPU type may help in certain cases, so I have been trying the generic options, but I can't notice any real difference.
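For completeness, I've been switching the CPU type from the CLI; something like this (VMID 101 in my case, and the guest needs a full stop and start to pick it up):

```shell
# Switch VM 101 to the generic x86-64-v2-AES CPU model, then verify
qm set 101 --cpu x86-64-v2-AES
qm config 101 | grep '^cpu'
```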

PVE Version:
Code:
root@pve:~# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
root@pve:~#

QM 100:
Code:
root@pve:~# qm config 100
agent: 1
boot: order=scsi0;ide2;net0
cores: 1
cpu: kvm64
ide2: none,media=cdrom
memory: 8192
meta: creation-qemu=8.0.2,ctime=1695869301
name: Zentyal
net0: virtio=96:DD:4F:CD:1E:23,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-100-disk-0,iothread=1,size=15G
scsihw: virtio-scsi-single
smbios1: uuid=143286a7-70e6-4e83-a2cb-755929411534
sockets: 1
startup: order=1,up=30
vmgenid: dc59805f-3264-4c7d-898a-04ed201a0823

QM 101:
Code:
root@pve:~# qm config 101
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 24576
meta: creation-qemu=8.0.2,ctime=1696129775
name: Library
net0: virtio=06:DB:D4:D4:60:F8,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-0,iothread=1,size=32G
scsi1: Storage:vm-101-disk-0,iothread=1,size=13000G
scsihw: virtio-scsi-single
smbios1: uuid=7b28518c-43f9-4c9a-b48d-f168a0f855d8
sockets: 1
vmgenid: 41ddb35c-1b13-4405-99e9-f04560a10f55

I noticed some SMART values on my storage hard disk that looked a bit suspicious (see attachments), so I have started a long test using smartctl, which will take about 20 hours since it's a 14TB drive. This drive is used for storage only; Proxmox itself and the boot and OS partitions of the virtual machines are on the 500GB NVMe. Both drives pass SMART, but I'm testing the HDD just in case.
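For reference, the long test was started roughly like this (the device path is an example; it may differ on your system):

```shell
# Start the extended offline self-test on the 14TB storage drive
smartctl -t long /dev/sda

# Check progress and, once finished, the results
smartctl -a /dev/sda
```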
 

Attachments

  • Disks.png (33.7 KB)
  • SMART 14TB.png (75 KB)
Hi Fiona, thanks for your reply.

I have two VMs running: VMID 100 runs Zentyal 7 community, and VMID 101 runs Ubuntu 22.04 with some docker containers such as sonarr, radarr, etc. I am pretty sure the problems start in VMID 101, but I'll include both.

I read somewhere that changing the CPU type may help in certain cases, so I have been trying the generic options, but I can't notice any real difference.

PVE Version:
Code:
root@pve:~# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
What you can also try is upgrading to a more recent kernel, the latest (on the no-subscription repository) is 6.2.16-15-pve.
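Assuming the no-subscription repository is already configured in APT, a regular full upgrade should pull in the newer kernel (sketch):

```shell
apt update
apt full-upgrade
# Reboot afterwards so the new kernel actually gets used
reboot
```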
QM 100:
Code:
root@pve:~# qm config 100

QM 101:
Code:
root@pve:~# qm config 101
I don't see anything suspicious here. In particular, I would've "hoped" for some nonstandard network configuration to narrow it down, because the call trace indicates the error occurring in vhost_net.
 
What you can also try is upgrading to a more recent kernel, the latest (on the no-subscription repository) is 6.2.16-15-pve.

I don't see anything suspicious here. In particular, I would've "hoped" for some nonstandard network configuration to narrow it down, because the call trace indicates the error occurring in vhost_net.
I'll upgrade the kernel to check.

Yeah, on my Proxmox cluster at work we do a lot more fancy network stuff (which thankfully all works exactly as intended), so encountering these persistent crashes at home, where I just have a Ubiquiti router and two VMs in 192.168.1.0/24, was a bit surprising.

I am not 100% sure that all call traces came from the networking side. The hardware is all new, so it's possible something isn't right with it, but I'm struggling to find a suspect.

The motherboard is a CW-J6-NAS with the J6413 CPU integrated on the board. That would be the first component to suspect, but I can't find any issues with it (yet).
 
Right, the error in the initial screenshot is different, so it might be related to hardware, but can't be conclusive of course. Do you have latest BIOS and microcode updates installed?
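The microcode state can be checked roughly like this (on Debian-based hosts the intel-microcode package would come from the non-free-firmware component, if it isn't installed yet):

```shell
# Current microcode revision reported per CPU
grep microcode /proc/cpuinfo | sort -u

# Whether the kernel loaded a microcode update early in boot
dmesg | grep -i microcode
```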
 
Kernel is updated:

Code:
root@pve:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-15-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2: 6.2.16-15
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx5
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.26-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.5
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.9
libpve-guest-common-perl: 5.0.5
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.3-1
proxmox-backup-file-restore: 3.0.3-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.9
pve-cluster: 8.0.4
pve-container: 5.0.4
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-6
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1
root@pve:~#

Yes, as far as I know I am running the latest BIOS version for this motherboard (although vendor support seems kinda sketchy). Everything from the OS installation up to setting up the VMs went fine. If the errors come back after updating the kernel, I might set up Ubuntu or something on it so I can better test the mainboard.
 
Hi Fiona,

Since the kernel update, the VM still freezes, but it no longer drags the whole PVE environment down with it, so that's progress. I'll keep troubleshooting the VM when I have some time off work; the VM issues don't have to be a Proxmox problem, of course. If I find anything useful, I'll post it.
 
Hi Fiona,

Since the kernel update, the VM still freezes, but it no longer drags the whole PVE environment down with it, so that's progress. I'll keep troubleshooting the VM when I have some time off work; the VM issues don't have to be a Proxmox problem, of course. If I find anything useful, I'll post it.
When the VM freezes, does issuing qm status <ID> --verbose still work without errors? Is there anything in the VM's internal logs or in the host's logs?
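For example, something along these lines right after a freeze (VMID 101 here, and the time window is just a suggestion):

```shell
# Does QEMU still respond for the frozen VM?
qm status 101 --verbose

# Host-side messages around the time of the freeze
journalctl --since "10 minutes ago"
```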
 
