Kernel 6.8.12-12-pve update breaks Ollama on AMDGPU

nyx1197

Jul 22, 2025
Machine Model: Inspur 5212H5
CPU: Intel Xeon Gold 6138
GPU: 2x AMD Radeon Instinct MI50 32GB
GPU Driver: https://repo.radeon.com/amdgpu/6.3.2/ubuntu

I run Ollama with Docker Compose inside an LXC container on PVE, and it worked well. After updating to the latest kernel, 6.8.12-12-pve, PVE reports an NMI error during startup, and Ollama fails with an error when it tries to load a model.
When I boot an older kernel, such as 6.8.12-11-pve, the problem goes away.
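
Until a fixed kernel or driver is available, one way to keep booting the working kernel is Proxmox VE's proxmox-boot-tool kernel pin <version> command. The Python sketch below is a hypothetical helper, not part of Proxmox: it assumes kernel names follow the 6.8.12-11-pve pattern, picks the newest installed kernel that is older than the broken one, and only builds the pin command instead of running it. The installed list in the example is made up for illustration.

```python
import re

# Assumption: Proxmox kernel names look like "6.8.12-11-pve"
# (upstream version, Proxmox revision, "-pve" suffix).
_PVE_KERNEL = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-pve$")


def parse_pve_kernel(version: str) -> tuple[int, ...]:
    """Turn '6.8.12-11-pve' into (6, 8, 12, 11) so versions compare numerically."""
    match = _PVE_KERNEL.match(version)
    if match is None:
        raise ValueError(f"not a Proxmox kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


def pick_fallback(installed: list[str], bad: str) -> str:
    """Return the newest installed kernel that is strictly older than the bad one."""
    bad_key = parse_pve_kernel(bad)
    older = [v for v in installed if parse_pve_kernel(v) < bad_key]
    if not older:
        raise LookupError(f"no installed kernel older than {bad}")
    return max(older, key=parse_pve_kernel)


def pin_command(version: str) -> list[str]:
    """Build, but do not run, the proxmox-boot-tool call that pins this kernel."""
    return ["proxmox-boot-tool", "kernel", "pin", version]


if __name__ == "__main__":
    # Hypothetical list; on a real host compare with "proxmox-boot-tool kernel list".
    installed = ["6.8.12-9-pve", "6.8.12-11-pve", "6.8.12-12-pve"]
    fallback = pick_fallback(installed, "6.8.12-12-pve")
    print(" ".join(pin_command(fallback)))  # proxmox-boot-tool kernel pin 6.8.12-11-pve
```

Comparing parsed tuples instead of raw strings matters here: as plain strings, "6.8.12-9-pve" sorts after "6.8.12-11-pve".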

Here is the kernel log:

Code:
kernel: [  167.335499] BUG: Bad page state in process ollama  pfn:5190b3
kernel: [  167.335526] page:000000007f2dd029 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x5190b3
kernel: [  167.335530] flags: 0x17ffffd0000020(lru|node=0|zone=2|lastcpupid=0x1fffff)
kernel: [  167.335534] page_type: 0xffffffff()
kernel: [  167.335537] raw: 0017ffffd0000020 dead000000000100 dead000000000122 0000000000000000
kernel: [  167.335540] raw: 0000000000000001 0000000000000000 ffffffffffffffff 0000000000000000
kernel: [  167.335541] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
kernel: [  167.335542] Modules linked in: nft_chain_nat nft_compat cfg80211 ebtable_filter ebtables ip6table_raw nf_conntrack_netlink xt_nat xt_tcpudp iptable_raw veth xt_conntrack xt_MASQUERADE ip6table_nat ip6table_filter ip6_tables xt_set ip_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter xfrm_user xfrm_algo scsi_transport_iscsi nf_tables nvme_fabrics nvme_keyring overlay qrtr softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink nvidia_uvm(POE) zram vhost_net vhost vhost_iotlb tap nvidia_drm(POE) intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common nvidia_modeset(POE) isst_if_common skx_edac skx_edac_common nfit ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul irdma snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha256_ssse3 snd_hda_codec sha1_ssse3 ice aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep gnss snd_pcm ib_uverbs
kernel: [  167.335616]  cmdlinepart snd_timer ucsi_ccg spi_nor rapl snd typec_ucsi acpi_ipmi intel_cstate pcspkr typec soundcore ib_core ast mei_me mtd ipmi_si intel_pch_thermal mei ipmi_devintf ioatdma zfs(PO) dca ipmi_msghandler joydev input_leds mac_hid spl(O) nvidia(POE) coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) hid_generic amddrm_buddy(OE) dm_thin_pool amdxcp(OE) drm_exec drm_suballoc_helper usbkbd usbmouse dm_persistent_data amd_sched(OE) dm_bio_prison amdkcl(OE) drm_display_helper usbhid dm_bufio hid libcrc32c cec rc_core nvme i2c_algo_bit i2c_nvidia_gpu xhci_pci crc32_pclmul i2c_ccgx_ucsi xhci_pci_renesas video nvme_core i40e spi_intel_pci ahci xhci_hcd i2c_i801 nvme_auth spi_intel lpc_ich i2c_smbus libahci wmi
kernel: [  167.335693] CPU: 13 PID: 10296 Comm: ollama Tainted: P           OE      6.8.12-12-pve #1
kernel: [  167.335697] Hardware name: Inspur AliServer Thor02-2U/YZMB-00824-101, BIOS 3.0.1 05/21/2017
kernel: [  167.335698] Call Trace:
kernel: [  167.335701]  <TASK>
kernel: [  167.335704]  dump_stack_lvl+0x76/0xa0
kernel: [  167.335713]  dump_stack+0x10/0x20
kernel: [  167.335717]  bad_page+0x76/0x120
kernel: [  167.335721]  ? _copy_to_user+0x25/0x50
kernel: [  167.335725]  __rmqueue_pcplist+0x218/0x8c0
kernel: [  167.335732]  ? __pfx_kfd_ioctl_map_memory_to_gpu+0x10/0x10 [amdgpu]
kernel: [  167.336362]  ? mas_wr_store_entry.isra.0+0x337/0x3e0
kernel: [  167.336368]  get_page_from_freelist+0x64e/0x11c0
kernel: [  167.336376]  ? change_protection+0x1301/0x1460
kernel: [  167.336383]  __alloc_pages+0x251/0x1320
kernel: [  167.336388]  ? vma_modify+0x4c/0x110
kernel: [  167.336391]  ? policy_nodemask+0xe1/0x150
kernel: [  167.336397]  alloc_pages_mpol+0x91/0x1f0
kernel: [  167.336401]  vma_alloc_folio+0x64/0xd0
kernel: [  167.336405]  do_anonymous_page+0x21e/0x740
kernel: [  167.336409]  ? __pte_offset_map+0x1c/0x1b0
kernel: [  167.336412]  __handle_mm_fault+0xbca/0xf70
kernel: [  167.336417]  handle_mm_fault+0x18d/0x380
kernel: [  167.336420]  do_user_addr_fault+0x169/0x660
kernel: [  167.336425]  exc_page_fault+0x83/0x1b0
kernel: [  167.336429]  asm_exc_page_fault+0x27/0x30
kernel: [  167.336434] RIP: 0033:0x7086229e337a
kernel: [  167.336460] Code: 2c 58 15 00 49 8d 0c 28 48 29 e8 48 83 ce 04 48 39 d3 48 89 4b 60 48 0f 45 ee 48 83 c8 01 49 83 c0 10 48 83 cd 01 49 89 68 f8 <48> 89 41 08 48 83 c4 48 4c 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3
kernel: [  167.336463] RSP: 002b:00007085d9538310 EFLAGS: 00010202
kernel: [  167.336466] RAX: 0000000000000c71 RBX: 00007085b4000020 RCX: 00007085b7172390
kernel: [  167.336468] RDX: 0000708622b38b80 RSI: 0000000000008044 RDI: 00007085b716b000
kernel: [  167.336470] RBP: 0000000000008045 R08: 00007085b716a360 R09: 000000000316b000
kernel: [  167.336472] R10: 00007085b716b000 R11: 0000000000000206 R12: 0000000000000cb0
kernel: [  167.336474] R13: 0000000000001000 R14: 00007085b716a350 R15: 0000000000008060
kernel: [  167.336477]  </TASK>
kernel: [  167.336500] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
kernel: [  167.336525] CPU: 13 PID: 10296 Comm: ollama Tainted: P    B      OE      6.8.12-12-pve #1
kernel: [  167.336543] Hardware name: Inspur AliServer Thor02-2U/YZMB-00824-101, BIOS 3.0.1 05/21/2017
kernel: [  167.336560] RIP: 0010:__rmqueue_pcplist+0xbd/0x8c0
kernel: [  167.336574] Code: 01 f8 48 89 45 a0 49 8b 07 49 39 c7 0f 84 7f 01 00 00 48 bf 22 01 00 00 00 00 ad de 49 8b 07 48 8b 08 48 8b 50 08 4c 8d 40 f8 <48> 89 51 08 48 89 0a 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 89
kernel: [  167.336607] RSP: 0000:ffffb9ec7e3fba20 EFLAGS: 00010293
kernel: [  167.336619] RAX: ffffdfe7d4642cc8 RBX: 0000000000000001 RCX: dead000000000100
kernel: [  167.336633] RDX: dead000000000122 RSI: 0000000000000000 RDI: dead000000000122
kernel: [  167.336648] RBP: ffffb9ec7e3fbad0 R08: ffffdfe7d4642cc0 R09: 0000000000000000
kernel: [  167.336662] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [  167.336676] R13: 0000000000000010 R14: ffff944caffd5c00 R15: ffff944af02bcd70
kernel: [  167.336690] FS:  00007085d953a700(0000) GS:ffff944af0280000(0000) knlGS:0000000000000000
kernel: [  167.336707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [  167.336719] CR2: 00007085b7172398 CR3: 000000038486e005 CR4: 00000000007706f0
kernel: [  167.336733] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [  167.336747] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [  167.336762] PKRU: 55555554
kernel: [  167.336769] Call Trace:
kernel: [  167.336776]  <TASK>
kernel: [  167.336782]  ? show_regs+0x6d/0x80
kernel: [  167.336794]  ? die_addr+0x37/0xa0
kernel: [  167.336803]  ? exc_general_protection+0x1dc/0x480
kernel: [  167.336818]  ? asm_exc_general_protection+0x27/0x30
kernel: [  167.336832]  ? __rmqueue_pcplist+0xbd/0x8c0
kernel: [  167.336845]  ? __pfx_kfd_ioctl_map_memory_to_gpu+0x10/0x10 [amdgpu]
kernel: [  167.337402]  ? mas_wr_store_entry.isra.0+0x337/0x3e0
kernel: [  167.337416]  get_page_from_freelist+0x64e/0x11c0
kernel: [  167.337432]  ? change_protection+0x1301/0x1460
kernel: [  167.337445]  __alloc_pages+0x251/0x1320
kernel: [  167.337458]  ? vma_modify+0x4c/0x110
kernel: [  167.337469]  ? policy_nodemask+0xe1/0x150
kernel: [  167.337481]  alloc_pages_mpol+0x91/0x1f0
kernel: [  167.337493]  vma_alloc_folio+0x64/0xd0
kernel: [  167.337505]  do_anonymous_page+0x21e/0x740
kernel: [  167.337516]  ? __pte_offset_map+0x1c/0x1b0
kernel: [  167.337527]  __handle_mm_fault+0xbca/0xf70
kernel: [  167.337540]  handle_mm_fault+0x18d/0x380
kernel: [  167.337551]  do_user_addr_fault+0x169/0x660
kernel: [  167.337563]  exc_page_fault+0x83/0x1b0
kernel: [  167.337573]  asm_exc_page_fault+0x27/0x30
kernel: [  167.337584] RIP: 0033:0x7086229e337a
kernel: [  167.337609] Code: 2c 58 15 00 49 8d 0c 28 48 29 e8 48 83 ce 04 48 39 d3 48 89 4b 60 48 0f 45 ee 48 83 c8 01 49 83 c0 10 48 83 cd 01 49 89 68 f8 <48> 89 41 08 48 83 c4 48 4c 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3
kernel: [  167.337641] RSP: 002b:00007085d9538310 EFLAGS: 00010202
kernel: [  167.337653] RAX: 0000000000000c71 RBX: 00007085b4000020 RCX: 00007085b7172390
kernel: [  167.337668] RDX: 0000708622b38b80 RSI: 0000000000008044 RDI: 00007085b716b000
kernel: [  167.337682] RBP: 0000000000008045 R08: 00007085b716a360 R09: 000000000316b000
kernel: [  167.337696] R10: 00007085b716b000 R11: 0000000000000206 R12: 0000000000000cb0
kernel: [  167.337710] R13: 0000000000001000 R14: 00007085b716a350 R15: 0000000000008060
kernel: [  167.337726]  </TASK>
kernel: [  167.337732] Modules linked in: nft_chain_nat nft_compat cfg80211 ebtable_filter ebtables ip6table_raw nf_conntrack_netlink xt_nat xt_tcpudp iptable_raw veth xt_conntrack xt_MASQUERADE ip6table_nat ip6table_filter ip6_tables xt_set ip_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter xfrm_user xfrm_algo scsi_transport_iscsi nf_tables nvme_fabrics nvme_keyring overlay qrtr softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink nvidia_uvm(POE) zram vhost_net vhost vhost_iotlb tap nvidia_drm(POE) intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common nvidia_modeset(POE) isst_if_common skx_edac skx_edac_common nfit ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul irdma snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha256_ssse3 snd_hda_codec sha1_ssse3 ice aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep gnss snd_pcm ib_uverbs
kernel: [  167.337802]  cmdlinepart snd_timer ucsi_ccg spi_nor rapl snd typec_ucsi acpi_ipmi intel_cstate pcspkr typec soundcore ib_core ast mei_me mtd ipmi_si intel_pch_thermal mei ipmi_devintf ioatdma zfs(PO) dca ipmi_msghandler joydev input_leds mac_hid spl(O) nvidia(POE) coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) hid_generic amddrm_buddy(OE) dm_thin_pool amdxcp(OE) drm_exec drm_suballoc_helper usbkbd usbmouse dm_persistent_data amd_sched(OE) dm_bio_prison amdkcl(OE) drm_display_helper usbhid dm_bufio hid libcrc32c cec rc_core nvme i2c_algo_bit i2c_nvidia_gpu xhci_pci crc32_pclmul i2c_ccgx_ucsi xhci_pci_renesas video nvme_core i40e spi_intel_pci ahci xhci_hcd i2c_i801 nvme_auth spi_intel lpc_ich i2c_smbus libahci wmi
kernel: [  167.341158] ---[ end trace 0000000000000000 ]---
kernel: [  167.403299] RIP: 0010:__rmqueue_pcplist+0xbd/0x8c0
kernel: [  167.404140] Code: 01 f8 48 89 45 a0 49 8b 07 49 39 c7 0f 84 7f 01 00 00 48 bf 22 01 00 00 00 00 ad de 49 8b 07 48 8b 08 48 8b 50 08 4c 8d 40 f8 <48> 89 51 08 48 89 0a 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 89
kernel: [  167.405023] RSP: 0000:ffffb9ec7e3fba20 EFLAGS: 00010293
kernel: [  167.405917] RAX: ffffdfe7d4642cc8 RBX: 0000000000000001 RCX: dead000000000100
kernel: [  167.406815] RDX: dead000000000122 RSI: 0000000000000000 RDI: dead000000000122
kernel: [  167.407709] RBP: ffffb9ec7e3fbad0 R08: ffffdfe7d4642cc0 R09: 0000000000000000
kernel: [  167.408602] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [  167.409488] R13: 0000000000000010 R14: ffff944caffd5c00 R15: ffff944af02bcd70
kernel: [  167.410366] FS:  00007085d953a700(0000) GS:ffff944af0280000(0000) knlGS:0000000000000000
kernel: [  167.411235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [  167.412106] CR2: 00007085b7172398 CR3: 000000038486e005 CR4: 00000000007706f0
kernel: [  167.412979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [  167.413828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [  167.414650] PKRU: 55555554
kernel: [  167.415464] note: ollama[10296] exited with preempt_count 2
kernel: [  264.527827] amdgpu 0000:69:00.0: amdgpu: qcm fence wait loop timeout expired
kernel: [  264.528676] amdgpu 0000:69:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
kernel: [  264.529527] amdgpu 0000:69:00.0: amdgpu: Failed to evict process queues
kernel: [  264.532922] amdgpu: Failed to quiesce KFD
kernel: [  264.558837] amdgpu 0000:69:00.0: amdgpu: GPU reset begin!
kernel: [  264.561356] amdgpu 0000:69:00.0: amdgpu: Dumping IP State
kernel: [  264.565570] amdgpu 0000:69:00.0: amdgpu: Dumping IP State Completed
kernel: [  264.642300] amdgpu 0000:69:00.0: amdgpu: BACO reset
kernel: [  266.503274] amdgpu 0000:69:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [  266.504258] [drm] PCIE GART of 512M enabled.
kernel: [  266.505090] [drm] PTB located at 0x0000008000000000
kernel: [  266.506052] [drm] VRAM is lost due to GPU reset!
kernel: [  266.507748] amdgpu 0000:69:00.0: amdgpu: PSP is resuming...
kernel: [  266.659119] amdgpu 0000:69:00.0: amdgpu: reserve 0x400000 from 0x87fec00000 for PSP TMR
kernel: [  266.743513] amdgpu 0000:69:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [  266.751504] [drm] kiq ring mec 2 pipe 1 q 0
kernel: [  266.797720] [drm] UVD and UVD ENC initialized successfully.
kernel: [  266.999972] [drm] VCE initialized successfully.
kernel: [  267.000970] amdgpu 0000:69:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
kernel: [  267.001910] amdgpu 0000:69:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
kernel: [  267.002767] amdgpu 0000:69:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
kernel: [  267.003607] amdgpu 0000:69:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
kernel: [  267.004452] amdgpu 0000:69:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
kernel: [  267.005294] amdgpu 0000:69:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
kernel: [  267.006134] amdgpu 0000:69:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
kernel: [  267.006970] amdgpu 0000:69:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
kernel: [  267.007803] amdgpu 0000:69:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
kernel: [  267.008629] amdgpu 0000:69:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
kernel: [  267.009465] amdgpu 0000:69:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
kernel: [  267.010301] amdgpu 0000:69:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
kernel: [  267.011135] amdgpu 0000:69:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
kernel: [  267.011966] amdgpu 0000:69:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
kernel: [  267.012795] amdgpu 0000:69:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
kernel: [  267.013616] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
kernel: [  267.014419] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
kernel: [  267.015195] amdgpu 0000:69:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
kernel: [  267.015968] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
kernel: [  267.016732] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
kernel: [  267.017503] amdgpu 0000:69:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
kernel: [  267.018267] amdgpu 0000:69:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
kernel: [  267.019028] amdgpu 0000:69:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
kernel: [  267.523747] [drm] Fence fallback timer expired on ring comp_1.0.0
kernel: [  267.533054] amdgpu 0000:69:00.0: amdgpu: GPU reset(1) succeeded!
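
The second report is a general protection fault on 0xdead000000000108. On x86-64 the kernel's list_del() overwrites a removed entry's next and prev pointers with LIST_POISON1 (0xdead000000000100) and LIST_POISON2 (0xdead000000000122), and in the fault dump RCX and RDX hold exactly those two values. The faulting address is LIST_POISON1 + 8, the offset of prev inside struct list_head, so __rmqueue_pcplist appears to have followed a list entry that had already been unlinked. Together with the refcount:-1 in the bad-page report, this looks like corruption of the per-CPU free page list (for example, a page freed twice) rather than a bug in Ollama itself, although the log alone does not show which code freed the page. The small Python sketch below only classifies addresses against these poison values; the constants assume x86-64's POISON_POINTER_DELTA and are not read from the running kernel.

```python
# Assumption: x86-64, where POISON_POINTER_DELTA (CONFIG_ILLEGAL_POINTER_VALUE) is 0xdead000000000000.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = POISON_POINTER_DELTA + 0x100  # list_del() stores this in entry->next
LIST_POISON2 = POISON_POINTER_DELTA + 0x122  # list_del() stores this in entry->prev

# struct list_head { next; prev; }: on 64-bit, prev sits 8 bytes after next.
LIST_HEAD_OFFSETS = {8: "prev"}


def explain_address(addr: int) -> str:
    """Describe a register value or fault address relative to the list poison values."""
    for name, base in (("LIST_POISON1", LIST_POISON1), ("LIST_POISON2", LIST_POISON2)):
        offset = addr - base
        if offset == 0:
            return f"{addr:#018x} is {name} itself"
        if offset in LIST_HEAD_OFFSETS:
            field = LIST_HEAD_OFFSETS[offset]
            return f"{addr:#018x} is {name} + {offset}, the ->{field} field reached through a poisoned pointer"
    return f"{addr:#018x} is not a list poison value"


if __name__ == "__main__":
    # Values copied from the general protection fault in the log above.
    for label, value in (
        ("RCX", 0xDEAD000000000100),
        ("RDX", 0xDEAD000000000122),
        ("fault", 0xDEAD000000000108),
        ("RAX", 0xFFFFDFE7D4642CC8),
    ):
        print(f"{label:>5}: {explain_address(value)}")
```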
 
The most direct and correct solution is to update your AMDGPU driver to a version compatible with kernel 6.8.12-12-pve or newer.
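
Before updating, it is worth confirming which amdgpu build is actually loaded. In the "Modules linked in:" lines, the letters in parentheses are taint flags: O means out-of-tree, E means unsigned and P means a proprietary license. The Python sketch below is a hypothetical helper, not an existing tool; it assumes you already have the log text (for example from journalctl -k) and simply collects every name(FLAGS) token.

```python
import re

# Taint letters that appear after module names in this log, e.g. "amdgpu(OE)".
TAINT_LETTERS = {
    "P": "proprietary license",
    "O": "out-of-tree",
    "E": "unsigned",
}

# Matches "name(FLAGS)" tokens; only uppercase taint letters are accepted inside the parentheses.
_MODULE_WITH_FLAGS = re.compile(r"\b([A-Za-z0-9_]+)\(([A-Z]+)\)")


def tainted_modules(log_text: str) -> dict[str, str]:
    """Map each flagged module name to its taint letters, keeping the first occurrence."""
    result: dict[str, str] = {}
    for match in _MODULE_WITH_FLAGS.finditer(log_text):
        result.setdefault(match.group(1), match.group(2))
    return result


def describe(flags: str) -> str:
    """Spell out taint letters, leaving unknown ones as-is."""
    return ", ".join(TAINT_LETTERS.get(letter, letter) for letter in flags)


if __name__ == "__main__":
    # Abbreviated excerpt of the module list from the log above.
    sample = (
        "Modules linked in: nvidia_uvm(POE) zram zfs(PO) spl(O) "
        "amdgpu(OE) amdttm(OE) amdkcl(OE) hid_generic"
    )
    for name, flags in tainted_modules(sample).items():
        print(f"{name}: {describe(flags)}")
```

On this log, amdgpu, amdttm and amdkcl all come out as out-of-tree and unsigned, which points to the driver installed from repo.radeon.com rather than the amdgpu module shipped with the Proxmox kernel.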