Machine Model: Inspur 5212H5
CPU: Intel Xeon Gold 6138
GPU: 2x AMD Radeon Instinct MI50 32GB
GPU Driver: https://repo.radeon.com/amdgpu/6.3.2/ubuntu
I deployed Ollama in an LXC container on PVE using Docker Compose, and it worked well. After updating to the latest kernel, 6.8.12-12-pve, PVE reports an NMI error during boot, and Ollama fails when it tries to load a model.
When I boot an older kernel, such as 6.8.12-11-pve, the problem goes away.
Here is the kernel log:
Code:
kernel: [ 167.335499] BUG: Bad page state in process ollama pfn:5190b3
kernel: [ 167.335526] page:000000007f2dd029 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x5190b3
kernel: [ 167.335530] flags: 0x17ffffd0000020(lru|node=0|zone=2|lastcpupid=0x1fffff)
kernel: [ 167.335534] page_type: 0xffffffff()
kernel: [ 167.335537] raw: 0017ffffd0000020 dead000000000100 dead000000000122 0000000000000000
kernel: [ 167.335540] raw: 0000000000000001 0000000000000000 ffffffffffffffff 0000000000000000
kernel: [ 167.335541] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
kernel: [ 167.335542] Modules linked in: nft_chain_nat nft_compat cfg80211 ebtable_filter ebtables ip6table_raw nf_conntrack_netlink xt_nat xt_tcpudp iptable_raw veth xt_conntrack xt_MASQUERADE ip6table_nat ip6table_filter ip6_tables xt_set ip_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter xfrm_user xfrm_algo scsi_transport_iscsi nf_tables nvme_fabrics nvme_keyring overlay qrtr softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink nvidia_uvm(POE) zram vhost_net vhost vhost_iotlb tap nvidia_drm(POE) intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common nvidia_modeset(POE) isst_if_common skx_edac skx_edac_common nfit ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul irdma snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha256_ssse3 snd_hda_codec sha1_ssse3 ice aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep gnss snd_pcm ib_uverbs
kernel: [ 167.335616] cmdlinepart snd_timer ucsi_ccg spi_nor rapl snd typec_ucsi acpi_ipmi intel_cstate pcspkr typec soundcore ib_core ast mei_me mtd ipmi_si intel_pch_thermal mei ipmi_devintf ioatdma zfs(PO) dca ipmi_msghandler joydev input_leds mac_hid spl(O) nvidia(POE) coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) hid_generic amddrm_buddy(OE) dm_thin_pool amdxcp(OE) drm_exec drm_suballoc_helper usbkbd usbmouse dm_persistent_data amd_sched(OE) dm_bio_prison amdkcl(OE) drm_display_helper usbhid dm_bufio hid libcrc32c cec rc_core nvme i2c_algo_bit i2c_nvidia_gpu xhci_pci crc32_pclmul i2c_ccgx_ucsi xhci_pci_renesas video nvme_core i40e spi_intel_pci ahci xhci_hcd i2c_i801 nvme_auth spi_intel lpc_ich i2c_smbus libahci wmi
kernel: [ 167.335693] CPU: 13 PID: 10296 Comm: ollama Tainted: P OE 6.8.12-12-pve #1
kernel: [ 167.335697] Hardware name: Inspur AliServer Thor02-2U/YZMB-00824-101, BIOS 3.0.1 05/21/2017
kernel: [ 167.335698] Call Trace:
kernel: [ 167.335701] <TASK>
kernel: [ 167.335704] dump_stack_lvl+0x76/0xa0
kernel: [ 167.335713] dump_stack+0x10/0x20
kernel: [ 167.335717] bad_page+0x76/0x120
kernel: [ 167.335721] ? _copy_to_user+0x25/0x50
kernel: [ 167.335725] __rmqueue_pcplist+0x218/0x8c0
kernel: [ 167.335732] ? __pfx_kfd_ioctl_map_memory_to_gpu+0x10/0x10 [amdgpu]
kernel: [ 167.336362] ? mas_wr_store_entry.isra.0+0x337/0x3e0
kernel: [ 167.336368] get_page_from_freelist+0x64e/0x11c0
kernel: [ 167.336376] ? change_protection+0x1301/0x1460
kernel: [ 167.336383] __alloc_pages+0x251/0x1320
kernel: [ 167.336388] ? vma_modify+0x4c/0x110
kernel: [ 167.336391] ? policy_nodemask+0xe1/0x150
kernel: [ 167.336397] alloc_pages_mpol+0x91/0x1f0
kernel: [ 167.336401] vma_alloc_folio+0x64/0xd0
kernel: [ 167.336405] do_anonymous_page+0x21e/0x740
kernel: [ 167.336409] ? __pte_offset_map+0x1c/0x1b0
kernel: [ 167.336412] __handle_mm_fault+0xbca/0xf70
kernel: [ 167.336417] handle_mm_fault+0x18d/0x380
kernel: [ 167.336420] do_user_addr_fault+0x169/0x660
kernel: [ 167.336425] exc_page_fault+0x83/0x1b0
kernel: [ 167.336429] asm_exc_page_fault+0x27/0x30
kernel: [ 167.336434] RIP: 0033:0x7086229e337a
kernel: [ 167.336460] Code: 2c 58 15 00 49 8d 0c 28 48 29 e8 48 83 ce 04 48 39 d3 48 89 4b 60 48 0f 45 ee 48 83 c8 01 49 83 c0 10 48 83 cd 01 49 89 68 f8 <48> 89 41 08 48 83 c4 48 4c 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3
kernel: [ 167.336463] RSP: 002b:00007085d9538310 EFLAGS: 00010202
kernel: [ 167.336466] RAX: 0000000000000c71 RBX: 00007085b4000020 RCX: 00007085b7172390
kernel: [ 167.336468] RDX: 0000708622b38b80 RSI: 0000000000008044 RDI: 00007085b716b000
kernel: [ 167.336470] RBP: 0000000000008045 R08: 00007085b716a360 R09: 000000000316b000
kernel: [ 167.336472] R10: 00007085b716b000 R11: 0000000000000206 R12: 0000000000000cb0
kernel: [ 167.336474] R13: 0000000000001000 R14: 00007085b716a350 R15: 0000000000008060
kernel: [ 167.336477] </TASK>
kernel: [ 167.336500] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
kernel: [ 167.336525] CPU: 13 PID: 10296 Comm: ollama Tainted: P B OE 6.8.12-12-pve #1
kernel: [ 167.336543] Hardware name: Inspur AliServer Thor02-2U/YZMB-00824-101, BIOS 3.0.1 05/21/2017
kernel: [ 167.336560] RIP: 0010:__rmqueue_pcplist+0xbd/0x8c0
kernel: [ 167.336574] Code: 01 f8 48 89 45 a0 49 8b 07 49 39 c7 0f 84 7f 01 00 00 48 bf 22 01 00 00 00 00 ad de 49 8b 07 48 8b 08 48 8b 50 08 4c 8d 40 f8 <48> 89 51 08 48 89 0a 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 89
kernel: [ 167.336607] RSP: 0000:ffffb9ec7e3fba20 EFLAGS: 00010293
kernel: [ 167.336619] RAX: ffffdfe7d4642cc8 RBX: 0000000000000001 RCX: dead000000000100
kernel: [ 167.336633] RDX: dead000000000122 RSI: 0000000000000000 RDI: dead000000000122
kernel: [ 167.336648] RBP: ffffb9ec7e3fbad0 R08: ffffdfe7d4642cc0 R09: 0000000000000000
kernel: [ 167.336662] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [ 167.336676] R13: 0000000000000010 R14: ffff944caffd5c00 R15: ffff944af02bcd70
kernel: [ 167.336690] FS: 00007085d953a700(0000) GS:ffff944af0280000(0000) knlGS:0000000000000000
kernel: [ 167.336707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 167.336719] CR2: 00007085b7172398 CR3: 000000038486e005 CR4: 00000000007706f0
kernel: [ 167.336733] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 167.336747] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [ 167.336762] PKRU: 55555554
kernel: [ 167.336769] Call Trace:
kernel: [ 167.336776] <TASK>
kernel: [ 167.336782] ? show_regs+0x6d/0x80
kernel: [ 167.336794] ? die_addr+0x37/0xa0
kernel: [ 167.336803] ? exc_general_protection+0x1dc/0x480
kernel: [ 167.336818] ? asm_exc_general_protection+0x27/0x30
kernel: [ 167.336832] ? __rmqueue_pcplist+0xbd/0x8c0
kernel: [ 167.336845] ? __pfx_kfd_ioctl_map_memory_to_gpu+0x10/0x10 [amdgpu]
kernel: [ 167.337402] ? mas_wr_store_entry.isra.0+0x337/0x3e0
kernel: [ 167.337416] get_page_from_freelist+0x64e/0x11c0
kernel: [ 167.337432] ? change_protection+0x1301/0x1460
kernel: [ 167.337445] __alloc_pages+0x251/0x1320
kernel: [ 167.337458] ? vma_modify+0x4c/0x110
kernel: [ 167.337469] ? policy_nodemask+0xe1/0x150
kernel: [ 167.337481] alloc_pages_mpol+0x91/0x1f0
kernel: [ 167.337493] vma_alloc_folio+0x64/0xd0
kernel: [ 167.337505] do_anonymous_page+0x21e/0x740
kernel: [ 167.337516] ? __pte_offset_map+0x1c/0x1b0
kernel: [ 167.337527] __handle_mm_fault+0xbca/0xf70
kernel: [ 167.337540] handle_mm_fault+0x18d/0x380
kernel: [ 167.337551] do_user_addr_fault+0x169/0x660
kernel: [ 167.337563] exc_page_fault+0x83/0x1b0
kernel: [ 167.337573] asm_exc_page_fault+0x27/0x30
kernel: [ 167.337584] RIP: 0033:0x7086229e337a
kernel: [ 167.337609] Code: 2c 58 15 00 49 8d 0c 28 48 29 e8 48 83 ce 04 48 39 d3 48 89 4b 60 48 0f 45 ee 48 83 c8 01 49 83 c0 10 48 83 cd 01 49 89 68 f8 <48> 89 41 08 48 83 c4 48 4c 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3
kernel: [ 167.337641] RSP: 002b:00007085d9538310 EFLAGS: 00010202
kernel: [ 167.337653] RAX: 0000000000000c71 RBX: 00007085b4000020 RCX: 00007085b7172390
kernel: [ 167.337668] RDX: 0000708622b38b80 RSI: 0000000000008044 RDI: 00007085b716b000
kernel: [ 167.337682] RBP: 0000000000008045 R08: 00007085b716a360 R09: 000000000316b000
kernel: [ 167.337696] R10: 00007085b716b000 R11: 0000000000000206 R12: 0000000000000cb0
kernel: [ 167.337710] R13: 0000000000001000 R14: 00007085b716a350 R15: 0000000000008060
kernel: [ 167.337726] </TASK>
kernel: [ 167.337732] Modules linked in: nft_chain_nat nft_compat cfg80211 ebtable_filter ebtables ip6table_raw nf_conntrack_netlink xt_nat xt_tcpudp iptable_raw veth xt_conntrack xt_MASQUERADE ip6table_nat ip6table_filter ip6_tables xt_set ip_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter xfrm_user xfrm_algo scsi_transport_iscsi nf_tables nvme_fabrics nvme_keyring overlay qrtr softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink nvidia_uvm(POE) zram vhost_net vhost vhost_iotlb tap nvidia_drm(POE) intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common nvidia_modeset(POE) isst_if_common skx_edac skx_edac_common nfit ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm snd_hda_codec_hdmi crct10dif_pclmul irdma snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic ghash_clmulni_intel snd_intel_sdw_acpi sha256_ssse3 snd_hda_codec sha1_ssse3 ice aesni_intel snd_hda_core crypto_simd cryptd snd_hwdep gnss snd_pcm ib_uverbs
kernel: [ 167.337802] cmdlinepart snd_timer ucsi_ccg spi_nor rapl snd typec_ucsi acpi_ipmi intel_cstate pcspkr typec soundcore ib_core ast mei_me mtd ipmi_si intel_pch_thermal mei ipmi_devintf ioatdma zfs(PO) dca ipmi_msghandler joydev input_leds mac_hid spl(O) nvidia(POE) coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) hid_generic amddrm_buddy(OE) dm_thin_pool amdxcp(OE) drm_exec drm_suballoc_helper usbkbd usbmouse dm_persistent_data amd_sched(OE) dm_bio_prison amdkcl(OE) drm_display_helper usbhid dm_bufio hid libcrc32c cec rc_core nvme i2c_algo_bit i2c_nvidia_gpu xhci_pci crc32_pclmul i2c_ccgx_ucsi xhci_pci_renesas video nvme_core i40e spi_intel_pci ahci xhci_hcd i2c_i801 nvme_auth spi_intel lpc_ich i2c_smbus libahci wmi
kernel: [ 167.341158] ---[ end trace 0000000000000000 ]---
kernel: [ 167.403299] RIP: 0010:__rmqueue_pcplist+0xbd/0x8c0
kernel: [ 167.404140] Code: 01 f8 48 89 45 a0 49 8b 07 49 39 c7 0f 84 7f 01 00 00 48 bf 22 01 00 00 00 00 ad de 49 8b 07 48 8b 08 48 8b 50 08 4c 8d 40 f8 <48> 89 51 08 48 89 0a 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 89
kernel: [ 167.405023] RSP: 0000:ffffb9ec7e3fba20 EFLAGS: 00010293
kernel: [ 167.405917] RAX: ffffdfe7d4642cc8 RBX: 0000000000000001 RCX: dead000000000100
kernel: [ 167.406815] RDX: dead000000000122 RSI: 0000000000000000 RDI: dead000000000122
kernel: [ 167.407709] RBP: ffffb9ec7e3fbad0 R08: ffffdfe7d4642cc0 R09: 0000000000000000
kernel: [ 167.408602] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [ 167.409488] R13: 0000000000000010 R14: ffff944caffd5c00 R15: ffff944af02bcd70
kernel: [ 167.410366] FS: 00007085d953a700(0000) GS:ffff944af0280000(0000) knlGS:0000000000000000
kernel: [ 167.411235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 167.412106] CR2: 00007085b7172398 CR3: 000000038486e005 CR4: 00000000007706f0
kernel: [ 167.412979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 167.413828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [ 167.414650] PKRU: 55555554
kernel: [ 167.415464] note: ollama[10296] exited with preempt_count 2
kernel: [ 264.527827] amdgpu 0000:69:00.0: amdgpu: qcm fence wait loop timeout expired
kernel: [ 264.528676] amdgpu 0000:69:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
kernel: [ 264.529527] amdgpu 0000:69:00.0: amdgpu: Failed to evict process queues
kernel: [ 264.532922] amdgpu: Failed to quiesce KFD
kernel: [ 264.558837] amdgpu 0000:69:00.0: amdgpu: GPU reset begin!
kernel: [ 264.561356] amdgpu 0000:69:00.0: amdgpu: Dumping IP State
kernel: [ 264.565570] amdgpu 0000:69:00.0: amdgpu: Dumping IP State Completed
kernel: [ 264.642300] amdgpu 0000:69:00.0: amdgpu: BACO reset
kernel: [ 266.503274] amdgpu 0000:69:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [ 266.504258] [drm] PCIE GART of 512M enabled.
kernel: [ 266.505090] [drm] PTB located at 0x0000008000000000
kernel: [ 266.506052] [drm] VRAM is lost due to GPU reset!
kernel: [ 266.507748] amdgpu 0000:69:00.0: amdgpu: PSP is resuming...
kernel: [ 266.659119] amdgpu 0000:69:00.0: amdgpu: reserve 0x400000 from 0x87fec00000 for PSP TMR
kernel: [ 266.743513] amdgpu 0000:69:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [ 266.751504] [drm] kiq ring mec 2 pipe 1 q 0
kernel: [ 266.797720] [drm] UVD and UVD ENC initialized successfully.
kernel: [ 266.999972] [drm] VCE initialized successfully.
kernel: [ 267.000970] amdgpu 0000:69:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
kernel: [ 267.001910] amdgpu 0000:69:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
kernel: [ 267.002767] amdgpu 0000:69:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
kernel: [ 267.003607] amdgpu 0000:69:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
kernel: [ 267.004452] amdgpu 0000:69:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
kernel: [ 267.005294] amdgpu 0000:69:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
kernel: [ 267.006134] amdgpu 0000:69:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
kernel: [ 267.006970] amdgpu 0000:69:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
kernel: [ 267.007803] amdgpu 0000:69:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
kernel: [ 267.008629] amdgpu 0000:69:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
kernel: [ 267.009465] amdgpu 0000:69:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
kernel: [ 267.010301] amdgpu 0000:69:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
kernel: [ 267.011135] amdgpu 0000:69:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
kernel: [ 267.011966] amdgpu 0000:69:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
kernel: [ 267.012795] amdgpu 0000:69:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
kernel: [ 267.013616] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
kernel: [ 267.014419] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
kernel: [ 267.015195] amdgpu 0000:69:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
kernel: [ 267.015968] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
kernel: [ 267.016732] amdgpu 0000:69:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
kernel: [ 267.017503] amdgpu 0000:69:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
kernel: [ 267.018267] amdgpu 0000:69:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
kernel: [ 267.019028] amdgpu 0000:69:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
kernel: [ 267.523747] [drm] Fence fallback timer expired on ring comp_1.0.0
kernel: [ 267.533054] amdgpu 0000:69:00.0: amdgpu: GPU reset(1) succeeded!
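For anyone skimming the trace, here is a minimal sketch of how I read the key lines. It assumes the log above has been saved as plain text; `summarize` is a hypothetical helper name, not part of any tool. It also assumes the usual x86_64 list-poison constants from the kernel's `include/linux/poison.h` (`LIST_POISON1` = 0xdead000000000100, `LIST_POISON2` = 0xdead000000000122). The faulting address 0xdead000000000108 is `LIST_POISON1 + 8`, and RCX/RDX hold the two poison values, which looks consistent with a list entry being unlinked a second time. I can't tell from the log alone which component is responsible.

```python
import re

# Assumption: x86_64 kernel with the default poison base 0xdead000000000000.
# list_del() writes these values into a list entry after unlinking it.
POISON_BASE = 0xDEAD000000000000
LIST_POISON1 = POISON_BASE + 0x100  # stored in entry->next
LIST_POISON2 = POISON_BASE + 0x122  # stored in entry->prev


def summarize(log_text):
    """Extract headline facts from a kernel log excerpt (hypothetical helper)."""
    info = {}

    m = re.search(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)", log_text)
    if m:
        info["process"] = m.group(1)
        info["pfn"] = int(m.group(2), 16)

    m = re.search(r"refcount:(-?\d+)", log_text)
    if m:
        info["refcount"] = int(m.group(1))

    m = re.search(r"non-canonical address (0x[0-9a-f]+)", log_text)
    if m:
        addr = int(m.group(1), 16)
        info["fault_addr"] = addr
        # On 64-bit, `prev` sits 8 bytes into struct list_head, so an
        # offset of 8 means a write to ->prev through a poisoned ->next.
        info["offset_from_list_poison1"] = addr - LIST_POISON1

    return info


if __name__ == "__main__":
    sample = (
        "kernel: [ 167.335499] BUG: Bad page state in process ollama pfn:5190b3\n"
        "kernel: [ 167.335526] page:000000007f2dd029 refcount:-1 mapcount:0\n"
        "kernel: [ 167.336500] general protection fault, probably for "
        "non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI\n"
    )
    for key, value in summarize(sample).items():
        print(key, value)
```

The same function can be pointed at a saved copy of the full log; it only looks at the first match of each pattern, which is enough for a single crash like this one.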