Opt in kernel panics

donhwyo

Member
Jan 14, 2023
I was asked to start a new thread. https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.119483/post-532379
Hope this is the correct place.

I followed these instructions to get the crash files. https://forum.proxmox.com/threads/random-proxmox-server-hang-no-vms-no-web-gui.58823/post-271632

I'm not sure what size of files I can attach. Here is what I get:
Bash:
root@pve:~# ls -la /var/crash/202302101626
total 6825736
drwxr-xr-x 2 root root       4096 Feb 10 16:36 .
drwxr-xr-x 4 root root       4096 Feb 11 10:30 ..
-rw------- 1 root root     141638 Feb 10 16:36 dmesg.202302101626
-rw-r--r-- 1 root root 6989396787 Feb 10 16:36 dump.202302101626

The dump.202302101626 file would probably take forever to upload, even zipped.

There is more info in the above thread.

Thanks
 


Adding this
Code:
GRUB_CMDLINE_LINUX="iommu=pt tsc=unstable"
to /etc/default/grub may have fixed it. Been up for over 2.5 days.
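For anyone following along, here is a minimal sketch of how one might confirm that parameters like these actually reached the running kernel. The helper name `has_kernel_param` is made up for illustration. It assumes a GRUB-booted host; as far as I know, installs that boot through systemd-boot take their command line from /etc/kernel/cmdline and need `proxmox-boot-tool refresh` instead of `update-grub`.

```shell
#!/bin/sh
# has_kernel_param CMDLINE PARAM  (hypothetical helper, not a Proxmox tool)
# Succeeds if PARAM appears as a whole word in the CMDLINE string.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After editing /etc/default/grub, regenerate the boot config and reboot (as root):
#   update-grub
#   reboot

# Then compare the running kernel's command line with what was configured.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for p in iommu=pt tsc=unstable; do
        if has_kernel_param "$cmdline" "$p"; then
            echo "$p: active"
        else
            echo "$p: NOT active"
        fi
    done
fi
```

If a parameter shows as not active after a reboot, the edit most likely went into a file the bootloader in use does not read.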

Thanks
 
Spoke too soon: it crashed early on the third day. The crash logs were filling up /, so I moved /var/crash to a bigger drive. Is it normal for /var/crash to take more than 25 GB?

Thanks
 
Is it normal for /var/crash to take more than 25Gb?
Up to the amount of memory. You can configure what it does, please see the manpage of makedumpfile in order to get the smallest dump level. Often you don't need the full dump, especially if you're not going to attach a debugger and get down the kernel debug rabbit hole.
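As a rough sketch of what the dump level means: the value passed to `makedumpfile -d` is a bitmask of page types to leave out (1 zero pages, 2 non-private cache, 4 private cache, 8 user data, 16 free pages), so 31 excludes all of them and gives the smallest dump. The helper `dump_level` and its type names below are made up for illustration. On Debian's kdump-tools package the arguments are, to my knowledge, set through `MAKEDUMP_ARGS` in /etc/default/kdump-tools; please check the comments in that file on your own system.

```shell
#!/bin/sh
# dump_level TYPE...  (hypothetical helper; the type names are illustrative)
# Combine the makedumpfile -d bits for the page types to exclude.
dump_level() {
    level=0
    for t in "$@"; do
        case "$t" in
            zero)          bit=1 ;;   # pages filled with zeros
            cache)         bit=2 ;;   # non-private cache pages
            cache-private) bit=4 ;;   # private cache pages
            user)          bit=8 ;;   # user process data
            free)          bit=16 ;;  # free pages
            *) echo "unknown page type: $t" >&2; return 1 ;;
        esac
        level=$((level | bit))
    done
    echo "$level"
}

# Excluding every excludable page type yields the smallest dump; -c compresses it.
echo "MAKEDUMP_ARGS=\"-c -d $(dump_level zero cache cache-private user free)\""
```

A lower level keeps more memory in the dump, which only pays off if somebody is going to open it in a debugger.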
 
Is there an official Proxmox way to do this? I have tried a few generic howtos but am not getting dump files.

Thanks
 
I fixed a few minor errors I saw in dmesg, and since then it crashes after a few days instead of a few hours. But it is no longer saving dumps. I changed the location of the dumps to a bigger disk. Will see if that helps.
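When dumps stop appearing, one thing that can be checked before the next crash is whether a crash kernel is loaded at all and whether the target directory has room. This is only a sketch: the helper name `kdump_armed` is made up, and /var/crash is simply the directory used earlier in this thread, so adjust it if the dumps were moved.

```shell
#!/bin/sh
# kdump_armed [FILE]  (hypothetical helper)
# Prints "yes" if a crash kernel is loaded, "no" if not,
# and "unknown" if the sysfs file cannot be read.
kdump_armed() {
    f=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$f" ] || { echo unknown; return 0; }
    if [ "$(cat "$f")" = "1" ]; then echo yes; else echo no; fi
}

echo "crash kernel loaded: $(kdump_armed)"

# Free space where dumps are written (assumed location, see above).
crash_dir=/var/crash
if [ -d "$crash_dir" ]; then
    df -h "$crash_dir"
fi

# With the Debian kdump-tools package installed, `kdump-config show`
# prints the configured dump directory and the current state.
```

A "no" here would explain a panic that leaves nothing behind, independent of how much disk space is available.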

Thanks
 
Good news: Linux 6.1.15-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.15-1 (2023-03-08T08:53Z) has been stable with no panics.

Here we go again. Bad news: both new 6.2 kernels froze with no saved info.
 
I tried the latest pve-kernel-6.2.9-1-pve and am still getting panics. Below are the ends of the two dmesg files saved in the crash folder.

Code:
[ 9437.695884] fwbr800i0: port 2(fwln800o0) entered forwarding state
[35911.105662] unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000000) at rIP: 0xffffffff884a94d4 (native_write_msr+0x4/0x30)
[35911.109392] Call Trace:
[35911.111147]  <TASK>
[35911.112544]  ? intel_idle_ibrs+0x2f/0xd0
[35911.114033]  ? ioapic_service+0x13b/0x180 [kvm]
[35911.115842]  ? ioapic_set_irq+0xd4/0x300 [kvm]
[35911.117661]  ? __pfx___pollwait+0x10/0x10
[35911.119101]  ? __pfx_pollwake+0x10/0x10
[35911.120430]  ? kvm_set_ioapic_irq+0x1f/0x30 [kvm]
[35911.122161]  ? kvm_set_irq+0xed/0x200 [kvm]
[35911.123651]  ? __pfx_kvm_set_ioapic_irq+0x10/0x10 [kvm]
[35911.125278]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35911.126877]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35911.128395]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[35911.129923]  ? _copy_to_user+0x25/0x40
[35911.131164]  ? kvm_vm_ioctl+0x2bf/0xed0 [kvm]
[35911.132576]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[35911.134109]  ? _copy_to_user+0x25/0x40
[35911.135290]  ? kvm_vm_ioctl+0x2bf/0xed0 [kvm]
[35911.136824]  ? __audit_syscall_entry+0xce/0x140
[35911.138090]  ? _copy_from_user+0x44/0x70
[35911.139329]  __x64_sys_ppoll+0xbc/0x150
[35911.140519]  do_syscall_64+0x5c/0x90
[35911.141793]  ? syscall_exit_to_user_mode+0x26/0x50
[35911.143025]  ? do_syscall_64+0x69/0x90
[35911.144182]  ? do_syscall_64+0x69/0x90
[35911.145404]  ? do_syscall_64+0x69/0x90
[35911.146586]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[35911.147727]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[35911.148854] RIP: 0033:0x7f06cf06ce26
[35911.150042] Code: 7c 24 08 e8 7c 0f f9 ff 4c 8b 54 24 18 48 8b 74 24 10 41 b8 08 00 00 00 41 89 c1 48 8b 7c 24 08 4c 89 e2 b8 0f 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 44 89 cf 89 44 24 08 e8 a6 0f f9 ff 8b 44
[35911.152555] RSP: 002b:00007ffcea5bfac0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[35911.153958] RAX: ffffffffffffffda RBX: 00005637c10d45b0 RCX: 00007f06cf06ce26
[35911.155338] RDX: 00007ffcea5bfae0 RSI: 000000000000000a RDI: 00005637c12e8ab0
[35911.156678] RBP: 00007ffcea5bfb4c R08: 0000000000000008 R09: 0000000000000000
[35911.158039] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffcea5bfae0
[35911.159384] R13: 00005637c10d45b0 R14: 00007ffcea5bfb50 R15: 0000000000000000
[35911.160749]  </TASK>
[35911.162213] BUG: kernel NULL pointer dereference, address: 00000000000000c3
[35911.164557] #PF: supervisor read access in kernel mode
[35911.169849] #PF: error_code(0x0000) - not-present page
[35911.179845] PGD 0 P4D 0
[35911.188046] Oops: 0000 [#1] PREEMPT SMP NOPTI
[35911.195003] CPU: 0 PID: 113088 Comm: kvm Kdump: loaded Tainted: P           O       6.2.9-1-pve #1
[35911.202934] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.4.1 05/04/2018
[35911.215103] RIP: 0010:intel_idle_ibrs+0x41/0xd0
[35911.225790] Code: 90 e8 33 72 09 ff 31 f6 bf 48 00 00 00 48 89 c3 48 89 f2 e8 91 bd 0d ff 90 4d 63 f4 49 83 fe 0a 77 68 4b 8d 04 76 49 8d 04 86 <41> 0f b6 7c c5 5b e8 14 b4 8d ff 89 de 48 c1 eb 20 bf 48 00 00 00
[35911.252636] RSP: 0018:ffffa068690d7a20 EFLAGS: 00010293
[35911.268634] RAX: 000000000000000d RBX: 0000000000000000 RCX: 0000000000000048
[35911.284818] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000048
[35911.301667] RBP: ffffa068690d7e48 R08: 0000000000000246 R09: 0000000000000001
[35911.312759] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[35911.314932] R13: 0000000000000000 R14: 0000000000000001 R15: ffffa068690d7ab4
[35911.327533] FS:  00007f06d023a040(0000) GS:ffff91523b600000(0000) knlGS:0000000000000000
[35911.335726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35911.345212] CR2: 00002aaaaae0e100 CR3: 00000004db7a4000 CR4: 00000000000406f0
[35911.346651] Call Trace:
[35911.360478]  <TASK>
[35911.376700]  ? ioapic_service+0x13b/0x180 [kvm]
[35911.394453]  ? ioapic_set_irq+0xd4/0x300 [kvm]
[35911.412030]  ? __pfx___pollwait+0x10/0x10
[35911.429853]  ? __pfx_pollwake+0x10/0x10
[35911.446446]  ? kvm_set_ioapic_irq+0x1f/0x30 [kvm]
[35911.449592]  ? kvm_set_irq+0xed/0x200 [kvm]
[35911.467045]  ? __pfx_kvm_set_ioapic_irq+0x10/0x10 [kvm]
[35911.484688]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35911.503279]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35911.522243]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[35911.540256]  ? _copy_to_user+0x25/0x40
[35911.559130]  ? kvm_vm_ioctl+0x2bf/0xed0 [kvm]
[35911.563711]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[35911.576424]  ? _copy_to_user+0x25/0x40
[35911.596487]  ? kvm_vm_ioctl+0x2bf/0xed0 [kvm]
[35911.615492]  ? __audit_syscall_entry+0xce/0x140
[35911.633468]  ? _copy_from_user+0x44/0x70
[35911.652234]  __x64_sys_ppoll+0xbc/0x150
[35911.668564]  do_syscall_64+0x5c/0x90
[35911.675098]  ? syscall_exit_to_user_mode+0x26/0x50
[35911.677450]  ? do_syscall_64+0x69/0x90
[35911.693803]  ? do_syscall_64+0x69/0x90
[35911.709462]  ? do_syscall_64+0x69/0x90
[35911.724106]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[35911.737792]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[35911.746477] RIP: 0033:0x7f06cf06ce26
[35911.757755] Code: 7c 24 08 e8 7c 0f f9 ff 4c 8b 54 24 18 48 8b 74 24 10 41 b8 08 00 00 00 41 89 c1 48 8b 7c 24 08 4c 89 e2 b8 0f 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 44 89 cf 89 44 24 08 e8 a6 0f f9 ff 8b 44
[35911.779710] RSP: 002b:00007ffcea5bfac0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[35911.787416] RAX: ffffffffffffffda RBX: 00005637c10d45b0 RCX: 00007f06cf06ce26
[35911.788514] RDX: 00007ffcea5bfae0 RSI: 000000000000000a RDI: 00005637c12e8ab0
[35911.788993] RBP: 00007ffcea5bfb4c R08: 0000000000000008 R09: 0000000000000000
[35911.802015] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffcea5bfae0
[35911.810826] R13: 00005637c10d45b0 R14: 00007ffcea5bfb50 R15: 0000000000000000
[35911.824588]  </TASK>
[35911.833309] Modules linked in: cfg80211 8021q garp mrp veth ipmi_si tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter dell_rbu mptctl mptbase nf_tables nfnetlink_cttimeout bonding tls softdog openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink_log nfnetlink xfs amd64_edac edac_mce_amd kvm_amd ccp kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd dcdbas ipmi_ssif cdc_acm joydev input_leds mgag200 serio_raw pcspkr drm_shmem_helper drm_kms_helper i2c_algo_bit syscopyarea sysfillrect sysimgblt k10temp fam15h_power ipmi_devintf mac_hid ipmi_msghandler acpi_power_meter zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core irqbypass
[35911.838529]  vfio_iommu_type1 vfio iommufd pci_stub drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ses enclosure uas usb_storage usbmouse usbkbd hid_generic usbhid hid mpt3sas raid_class crc32_pclmul psmouse scsi_transport_sas ohci_pci ehci_pci i2c_piix4 ohci_hcd ehci_hcd ahci bnx2 libahci [last unloaded: ipmi_si]
[35911.945428] CR2: 00000000000000c3

Code:
[  512.826439] fwbr912i0: port 2(fwln912o0) entered forwarding state
[35467.136243] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[35467.137966] BUG: unable to handle page fault for address: ffffb8ae2f5cbbc8
[35467.139495] #PF: supervisor instruction fetch in kernel mode
[35467.140898] #PF: error_code(0x0011) - permissions violation
[35467.142410] PGD 100000067 P4D 100000067 PUD 1001fc067 PMD 1a51de067 PTE 80000008ed103163
[35467.144248] Oops: 0011 [#1] PREEMPT SMP NOPTI
[35467.145505] CPU: 24 PID: 183637 Comm: kvm Kdump: loaded Tainted: P           O       6.2.9-1-pve #1
[35467.147384] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.4.1 05/04/2018
[35467.148331] RIP: 0010:0xffffb8ae2f5cbbc8
[35467.150687] Code: 00 00 40 00 00 00 00 00 00 00 00 50 66 2e ae b8 ff ff 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 <d8> bb 5c 2f ae b8 ff ff fb ac 2b c1 ff ff ff ff e8 bc 5c 2f ae b8
[35467.154194] RSP: 0018:ffffb8ae2f5cbb90 EFLAGS: 00010286
[35467.155403] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000000
[35467.157628] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b8a94f19940
[35467.159316] RBP: ffff9b8b15279a00 R08: 0000000000000000 R09: ffff9b8b164b5ff0
[35467.160762] R10: ffff9b8b15279a18 R11: 0000000000000000 R12: ffff9b8b69e05628
[35467.162476] R13: 0000000000000000 R14: 794b48ab64e62700 R15: 0000000000000000
[35467.164329] FS:  00007f57efc0c040(0000) GS:ffff9bb0fbc00000(0000) knlGS:0000000000000000
[35467.165943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35467.167409] CR2: ffffb8ae2f5cbbc8 CR3: 00000002017d0000 CR4: 00000000000406e0
[35467.169478] Call Trace:
[35467.171212]  <TASK>
[35467.172877]  ? kvm_pic_set_irq+0xfa/0x230 [kvm]
[35467.174861]  ? kvm_set_pic_irq+0x1b/0x30 [kvm]
[35467.176799]  ? kvm_set_irq+0xed/0x200 [kvm]
[35467.178336]  ? __pfx_kvm_set_ioapic_irq+0x10/0x10 [kvm]
[35467.180512]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35467.182561]  ? kvm_ioapic_set_irq+0x85/0xd0 [kvm]
[35467.184492]  ? kvm_pic_set_irq+0xfa/0x230 [kvm]
[35467.186400]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[35467.188308]  ? kvm_vm_ioctl+0x296/0xed0 [kvm]
[35467.190168]  ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm]
[35467.192066]  ? put_timespec64+0x3d/0x70
[35467.193309]  ? __audit_syscall_entry+0xce/0x140
[35467.195497]  ? __fget_light.part.0+0x8c/0xd0
[35467.196869]  ? __x64_sys_ioctl+0x95/0xd0
[35467.198873]  ? do_syscall_64+0x5c/0x90
[35467.200524]  ? _copy_to_user+0x25/0x40
[35467.201984]  ? put_timespec64+0x3d/0x70
[35467.203408]  ? __audit_syscall_entry+0xce/0x140
[35467.205326]  ? __fget_light.part.0+0x8c/0xd0
[35467.206887]  ? exit_to_user_mode_prepare+0x37/0x180
[35467.208540]  ? syscall_exit_to_user_mode+0x26/0x50
[35467.210283]  ? do_syscall_64+0x69/0x90
[35467.211877]  ? do_syscall_64+0x69/0x90
[35467.213410]  ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
[35467.215120]  </TASK>
[35467.216491] Modules linked in: tcp_diag inet_diag cfg80211 8021q garp mrp veth ipmi_si ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter dell_rbu mptctl mptbase nf_tables nfnetlink_cttimeout bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink xfs amd64_edac edac_mce_amd kvm_amd ccp kvm crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd dcdbas mgag200 drm_shmem_helper input_leds cdc_acm drm_kms_helper joydev serio_raw pcspkr i2c_algo_bit syscopyarea sysfillrect sysimgblt ipmi_ssif ipmi_devintf fam15h_power k10temp mac_hid ipmi_msghandler acpi_power_meter zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core irqbypass
[35467.216907]  vfio_iommu_type1 vfio iommufd pci_stub drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ses enclosure uas usb_storage usbmouse usbkbd hid_generic usbhid hid mpt3sas raid_class crc32_pclmul psmouse scsi_transport_sas i2c_piix4 ohci_pci ehci_pci ohci_hcd ahci bnx2 ehci_hcd libahci [last unloaded: ipmi_si]
[35467.238402] CR2: ffffb8ae2f5cbbc8
root@pve:~#

If anybody has time to look or needs more please let me know.
Thanks
 
unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000000) at rIP: 0xffffffff884a94d4 (native_write_msr+0x4/0x30)
[35911.109392] Call Trace:
Can you check whether you have the latest CPU microcode installed, and whether any power-saving options (C6 state or the like) are enabled in the UEFI/BIOS?
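A quick way to see which microcode revision the kernel reports is to read it from /proc/cpuinfo. This is a small sketch with a made-up helper name, `microcode_revisions`, and it assumes an x86 host where cpuinfo carries a `microcode` line per core.

```shell
#!/bin/sh
# microcode_revisions [FILE]  (hypothetical helper)
# Print each distinct microcode revision found in a cpuinfo-style file.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}

if [ -r /proc/cpuinfo ]; then
    microcode_revisions
fi

# The early boot log usually shows whether an update was applied:
#   dmesg | grep -i microcode
```

More than one revision in the output would mean the cores are not all running the same microcode. On Debian-based systems the AMD updates ship in the `amd64-microcode` package from the non-free component.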

[35467.136243] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[35467.137966] BUG: unable to handle page fault for address: ffffb8ae2f5cbbc8
[35467.139495] #PF: supervisor instruction fetch in kernel mode
hmm that is odd..
PowerEdge R715
This is a rather ancient server, with an AMD Opteron 6100-series CPU from ~2010?
We have no such HW in our Testlab to try reproducing this..

You could also try adding the mitigations=off kernel boot command line parameter, it could be a regression with changes there for very old CPUs.
 
Sorry, I didn't see this reply. The microcode is the newest, and I tried one from Debian testing too. No change. The processor has been upgraded at some point. It is now 32 x AMD Opteron(tm) Processor 6366 HE (2 Sockets).

Just added "mitigations=off" and changed to chrony for time sync as suggested by pve7to8. Will see how it goes.

Still worried about upgrading to PVE8. Can the latest PVE8 kernel be added to the opt-in program for PVE7?

Thanks
 
Thanks, that seems to have worked. It has been up for over a week now. I wonder what I am now vulnerable to?
You can check by pasting the following command into a shell on the system (it doesn't have to be run as root):
Code:
for f in /sys/devices/system/cpu/vulnerabilities/*; do echo "${f##*/} -" $(cat "$f"); done

How much your systems are actually vulnerable depends not only on above output, but also on what workload you host and if you do so for (untrusted) third-party users.
 
Thanks for the reply.
Code:
root@pve:~# for f in /sys/devices/system/cpu/vulnerabilities/*; do echo "${f##*/} -" $(cat "$f"); done
itlb_multihit - Not affected
l1tf - Not affected
mds - Not affected
meltdown - Not affected
mmio_stale_data - Not affected
retbleed - Vulnerable
spec_store_bypass - Vulnerable
spectre_v1 - Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
spectre_v2 - Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
srbds - Not affected
tsx_async_abort - Not affected
I can deal with that.
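To narrow output like the above down to the entries the kernel actually reports as vulnerable, the same loop can be filtered. This is just a sketch; the function name `list_vulnerable` is made up, and it matches only statuses that begin with the word "Vulnerable".

```shell
#!/bin/sh
# list_vulnerable [DIR]  (hypothetical helper)
# Print "name - status" for every entry whose status starts with "Vulnerable".
list_vulnerable() {
    dir=${1:-/sys/devices/system/cpu/vulnerabilities}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        status=$(cat "$f")
        case "$status" in
            Vulnerable*) echo "${f##*/} - $status" ;;
        esac
    done
}

list_vulnerable
```

Entries that read "Mitigation: ..." or "Not affected" are skipped, so an empty result means nothing is flagged as plainly vulnerable.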

Next scary thing is more old hardware.
https://forum.proxmox.com/threads/no-sas2008-after-upgrade.129499/
Thanks
 
There is a newer opt-in kernel available.
Linux 6.2.16-4-bpo11-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-4~bpo11+1 (2023-07-07T15:05Z)
Seems to be working fine so far.
Thanks
 
Yeah, we got around to backporting the current kernel from Proxmox VE 8 to the Bullseye-based releases and uploaded it to pvetest yesterday.
 
