[SOLVED] VM crashes Node

ecotechie

Hi, I'm having an issue: when I rebuild my NixOS VM to use Ollama, the VM hangs, and a couple of seconds later the node it runs on reboots. I am using PCI passthrough for the Nvidia GPU, so the VM should have full access to it, including video output (which works well). I'm not sure what to do here, since everything else seems to work well.

Node info:
1767996001247.png

VM info:
1767996060059.png

journalctl -f output right before the crash:
Code:
Jan 09 13:44:28 pod kernel: BUG: unable to handle page fault for address: 00001953fb980808
Jan 09 13:44:28 pod kernel: #PF: supervisor read access in kernel mode
Jan 09 13:44:28 pod kernel: #PF: error_code(0x0000) - not-present page
Jan 09 13:44:28 pod kernel: PGD 0 P4D 0
Jan 09 13:44:28 pod kernel: Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
Jan 09 13:44:28 pod kernel: CPU: 9 UID: 0 PID: 220 Comm: ksmd Tainted: P      D    O       6.14.11-5-pve #2
Jan 09 13:44:28 pod kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [O]=OOT_MODULE
Jan 09 13:44:28 pod kernel: Hardware name: System76 Thelio Mira/Thelio Mira, BIOS FG Z5 10/19/2023
Jan 09 13:44:28 pod kernel: RIP: 0010:ksm_get_folio+0x44/0x1d0
Jan 09 13:44:28 pod kernel: Code: cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 47 30 49 39 c6 0f 84 29 01 00 00 4d 8b 77 30 4c 89 f0 48 c1 e0 06 48 03 05 b4 01 59 01 <48> 8b 50 08 48 89 c3 f6 c2 01 0f 85 fe 00 00 00 66 90 48 8b 43 18
Jan 09 13:44:28 pod kernel: RSP: 0018:ffffcf0280967d58 EFLAGS: 00010207
Jan 09 13:44:28 pod kernel: RAX: 00001953fb980800 RBX: ffff8eeb9089f9c0 RCX: 0000000000000005
Jan 09 13:44:28 pod kernel: RDX: 0000000000000004 RSI: 0000000000000001 RDI: ffff8eeb8568d4e9
Jan 09 13:44:28 pod kernel: RBP: ffffcf0280967d88 R08: 0000000000000000 R09: 0000000000000000
Jan 09 13:44:28 pod kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8eeb8568d4eb
Jan 09 13:44:28 pod kernel: R13: ffff8eeb8568d4e9 R14: e8000071bbee6020 R15: ffff8eeb8568d4e9
Jan 09 13:44:28 pod kernel: FS:  0000000000000000(0000) GS:ffff8eed3f080000(0000) knlGS:0000000000000000
Jan 09 13:44:28 pod kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 13:44:28 pod kernel: CR2: 00001953fb980808 CR3: 0000001029438004 CR4: 0000000000f72ef0
Jan 09 13:44:28 pod kernel: PKRU: 55555554
Jan 09 13:44:28 pod kernel: Call Trace:
Jan 09 13:44:28 pod kernel:  <TASK>
Jan 09 13:44:28 pod kernel:  remove_rmap_item_from_tree+0x74/0x150
Jan 09 13:44:28 pod kernel:  ksm_scan_thread+0x5f8/0x26a0
Jan 09 13:44:28 pod kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
Jan 09 13:44:28 pod kernel:  kthread+0xf9/0x230
Jan 09 13:44:28 pod kernel:  ? __pfx_kthread+0x10/0x10
Jan 09 13:44:28 pod kernel:  ret_from_fork+0x44/0x70
Jan 09 13:44:28 pod kernel:  ? __pfx_kthread+0x10/0x10
Jan 09 13:44:28 pod kernel:  ret_from_fork_asm+0x1a/0x30
Jan 09 13:44:28 pod kernel:  </TASK>
Jan 09 13:44:28 pod kernel: Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 nfs netfs veth ccm ebtable_filter ebtables ip_set ip6table_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables iptable_raw xt_CT iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 8021q garp mrp hid_maltron bonding tls rfcomm cmac algif_hash algif_skcipher af_alg bnep nfnetlink_log binfmt_misc uvcvideo uvc pwc videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi snd_seq_device mc joydev input_leds uas usb_storage hid_generic usbmouse usbkbd btusb btrtl btintel btbcm btmtk usbhid bluetooth hid nfsd auth_rpcgss nfs_acl lockd grace sunrpc xe drm_gpuvm gpu_sched drm_ttm_helper drm_exec drm_suballoc_helper snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic
Jan 09 13:44:28 pod kernel:  soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_hda_codec_hdmi snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_sdca snd_soc_avs sch_fq_codel snd_soc_hda_codec snd_hda_ext_core snd_soc_core x86_pkg_temp_thermal intel_powerclamp snd_compress coretemp ac97_bus snd_pcm_dmaengine snd_hda_intel kvm_intel snd_intel_dspcfg iwlmvm i915 kvm snd_intel_sdw_acpi snd_hda_codec drm_buddy polyval_clmulni mac80211 polyval_generic ghash_clmulni_intel ttm snd_hda_core sha256_ssse3 sha1_ssse3 snd_hwdep drm_display_helper libarc4 aesni_intel snd_pcm mei_hdcp mei_pxp crypto_simd intel_pmc_core snd_timer iwlwifi snd cec cryptd pmt_telemetry cmdlinepart spi_nor intel_hid rapl rc_core mei_me pmt_class cfg80211 intel_cstate mtd pcspkr gigabyte_wmi mei wmi_bmof i2c_algo_bit soundcore acpi_tad intel_vsec acpi_pad sparse_keymap mac_hid
Jan 09 13:44:28 pod kernel:  zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq nvme nvme_core ahci nvme_auth libahci r8169 realtek xhci_pci i2c_i801 spi_intel_pci i2c_mux spi_intel intel_lpss_pci i2c_smbus xhci_hcd intel_lpss idma64 vmd video wmi pinctrl_alderlake
Jan 09 13:44:28 pod kernel: CR2: 00001953fb980808
Jan 09 13:44:28 pod kernel: ---[ end trace 0000000000000000 ]---
Jan 09 13:44:28 pod kernel: RIP: 0010:0xffff8ede2ae1ddc0
Jan 09 13:44:28 pod kernel: Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 01 03 00 00 00 00 00 00 00 00 00 00 00 90 2f 40 63 70
Jan 09 13:44:28 pod kernel: RSP: 0000:ffffcf029f097e38 EFLAGS: 00010246
Jan 09 13:44:28 pod kernel: RAX: 0000000000000000 RBX: 00006120bc1f2610 RCX: 0000000000000000
Jan 09 13:44:28 pod kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 09 13:44:28 pod kernel: RBP: ffffcf029f097ea0 R08: 0000000000000000 R09: 0000000000000000
Jan 09 13:44:28 pod kernel: R10: 0000000000000000 R11: 0000000000000215 R12: ffff8ede2ae1dd80
Jan 09 13:44:28 pod kernel: R13: ffffcf029f097e30 R14: 00006120bc1f2610 R15: 0000000000000007
Jan 09 13:44:28 pod kernel: FS:  0000000000000000(0000) GS:ffff8eed3f080000(0000) knlGS:0000000000000000
Jan 09 13:44:28 pod kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 13:44:28 pod kernel: CR2: 00001953fb980808 CR3: 000000015a2e7006 CR4: 0000000000f72ef0
Jan 09 13:44:28 pod kernel: PKRU: 55555554
Jan 09 13:44:28 pod kernel: note: ksmd[220] exited with irqs disabled
Jan 09 13:44:28 pod kernel: CPU14 BANK0 CMCI storm detected

Thanks for any input you may have on this.
 
Giving a single VM 83% of the system RAM can be problematic. I would start with 30 GiB, then try 40 GiB, and I wouldn't be surprised if that works well.

Sidenote, probably not relevant for that "Oops": that CPU has eight "Performance Cores". Start with eight cores for the VM, not 26.
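
To put numbers on the RAM suggestion, here is a minimal sketch of the arithmetic. The host total of 64 GiB is a made-up value for illustration (the real figure is only in the screenshot), and vm_ram_share is a hypothetical helper name. Keep in mind that with PCI passthrough the guest's memory is pinned on the host, so the full allocation is really taken.

```python
def vm_ram_share(vm_gib: float, host_gib: float) -> float:
    """Return the VM's memory allocation as a percentage of host RAM."""
    if host_gib <= 0:
        raise ValueError("host_gib must be positive")
    return 100.0 * vm_gib / host_gib


# ASSUMPTION: hypothetical host total, NOT taken from the thread.
HOST_GIB = 64

if __name__ == "__main__":
    # The two starting points suggested above.
    for vm_gib in (30, 40):
        share = vm_ram_share(vm_gib, HOST_GIB)
        print(f"{vm_gib} GiB VM on a {HOST_GIB} GiB host: {share:.1f}% of RAM")
```

Plug in the node's real total to see how much is left for the host, ZFS and the containers.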
 
If the crash occurs frequently, why not disable KSM and test whether it still reproduces?

If that function (ksm_get_folio, running in the ksmd kernel thread) is never called, I think the problem will either go away or turn into a different issue.
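
For reference, this is how the faulting task and function can be read out of the log. It is only a sketch: the two sample lines are copied from the trace posted above, while summarize_oops and the regular expressions are my own illustrative names, not part of any tool.

```python
import re

# Two lines copied verbatim from the journalctl output in the first post.
OOPS_LINES = [
    "Jan 09 13:44:28 pod kernel: CPU: 9 UID: 0 PID: 220 Comm: ksmd "
    "Tainted: P      D    O       6.14.11-5-pve #2",
    "Jan 09 13:44:28 pod kernel: RIP: 0010:ksm_get_folio+0x44/0x1d0",
]

# "Comm:" names the task that was running when the kernel oopsed.
COMM_RE = re.compile(r"\bComm: (\S+)")
# "RIP:" with a symbol+offset/size names the function that faulted.
RIP_RE = re.compile(r"\bRIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(lines):
    """Return (task, function) from the first matching Comm and RIP lines."""
    task = function = None
    for line in lines:
        if task is None:
            match = COMM_RE.search(line)
            if match:
                task = match.group(1)
        if function is None:
            match = RIP_RE.search(line)
            if match:
                function = match.group(1)
    return task, function


if __name__ == "__main__":
    print(summarize_oops(OOPS_LINES))
```

The second RIP line in the log has no symbol name (just a raw address), so the pattern skips it on purpose.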

*Your screenshot shows that KSM sharing is not in use on your node.
However, since ksmd is running even though nothing is being shared, I speculated that it might still have an impact. I don't know whether disabling it will improve the situation.

https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
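
If you want to check what ksmd is doing before and after changing anything, a small sketch like this reads the standard KSM files under /sys/kernel/mm/ksm (run and pages_sharing are real kernel sysfs entries; the helper names read_ksm_value and describe_ksm are just made up for this example).

```python
from pathlib import Path

# Standard Linux sysfs location for KSM controls and counters.
KSM_DIR = Path("/sys/kernel/mm/ksm")

# Meaning of the "run" file according to the kernel's KSM documentation.
RUN_STATES = {
    0: "stopped (already merged pages stay merged)",
    1: "running",
    2: "stopped, with all merged pages unmerged",
}


def read_ksm_value(name, ksm_dir=KSM_DIR):
    """Return the integer stored in a KSM sysfs file, or None if unreadable."""
    try:
        return int((Path(ksm_dir) / name).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_ksm(ksm_dir=KSM_DIR):
    """Summarize whether ksmd is scanning and how many pages are shared."""
    run = read_ksm_value("run", ksm_dir)
    sharing = read_ksm_value("pages_sharing", ksm_dir)
    if run is None:
        return "KSM state unavailable (no sysfs entry or no permission)"
    state = RUN_STATES.get(run, f"in unknown state {run}")
    return f"ksmd is {state}; pages_sharing={sharing}"


if __name__ == "__main__":
    print(describe_ksm())
```

A run value of 1 with pages_sharing at 0 would match what I described: the thread is scanning but nothing is merged. The wiki page linked above covers how to actually turn KSM off on the node.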
 
I lowered the CPU count and that seems to have done it! Also deactivated KSM, since it's just one VM and many LXCs. Now to see why Ollama won't compile, but at least the node isn't crashing anymore...

Thanks!
 