We have an ASUS RS700A-E9 platform with dual EPYC 7501 CPUs, with Proxmox 6.0 installed on 4 HDDs. We wanted to replace the HDDs with 2x NVMe drives, but we hit a kernel panic.
We tried with both 5.0.21-4-pve and the test kernel 5.3.7-1-pve. Full logs captured from the serial console are attached.
Code:
[ 13.738723] i40e 0000:21:00.0: PCI-Express: Speed 8.0GT/s Width x8
[ 13.751726] i40e 0000:21:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 119 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[ 13.753397] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[ 13.757378] {1}[Hardware Error]: event severity: fatal
[ 13.757378] {1}[Hardware Error]: Error 0, type: fatal
[ 13.757378] {1}[Hardware Error]: fru_text: PcieError
[ 13.757378] {1}[Hardware Error]: section_type: PCIe error
[ 13.757378] {1}[Hardware Error]: port_type: 4, root port
[ 13.757378] {1}[Hardware Error]: version: 0.2
[ 13.757378] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 13.757378] {1}[Hardware Error]: device_id: 0000:40:01.2
[ 13.757378] {1}[Hardware Error]: slot: 238
[ 13.757378] {1}[Hardware Error]: secondary_bus: 0x41
[ 13.757378] {1}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1453
[ 13.757378] {1}[Hardware Error]: class_code: 000406
[ 13.757378] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0010
[ 13.757378] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x04500000
[ 13.757378] {1}[Hardware Error]: aer_uncor_severity: 0x004e2030
[ 13.757378] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 13.757378] Kernel panic - not syncing: Fatal hardware error!
[ 13.757378] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.0.21-4-pve #1
[ 13.757378] Hardware name: ASUSTeK COMPUTER INC. RS700A-E9-RS4/KNPP-D32 Series, BIOS 1301 06/17/2019
[ 13.757378] Call Trace:
[ 13.757378] <IRQ>
[ 13.757378] dump_stack+0x63/0x8a
[ 13.757378] panic+0x101/0x2a7
[ 13.757378] __ghes_panic.cold.32+0x21/0x21
[ 13.757378] ? ghes_irq_func+0x50/0x50
[ 13.757378] ghes_proc+0xe0/0x140
[ 13.757378] ghes_poll_func+0x2c/0x60
[ 13.757378] call_timer_fn+0x30/0x130
[ 13.757378] run_timer_softirq+0x38a/0x420
[ 13.757378] ? ktime_get+0x40/0xa0
[ 13.757378] ? lapic_next_event+0x20/0x30
[ 13.757378] ? clockevents_program_event+0x93/0xf0
[ 13.757378] __do_softirq+0xdc/0x2f3
[ 13.757378] irq_exit+0xc0/0xd0
[ 13.757378] smp_apic_timer_interrupt+0x79/0x140
[ 13.757378] apic_timer_interrupt+0xf/0x20
[ 13.757378] </IRQ>
[ 13.757378] RIP: 0010:cpuidle_enter_state+0xbd/0x450
[ 13.757378] Code: ff e8 17 9d 85 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a d2 8b ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 cf 01 00 00 41 c7 44 24 08 00 00 00 00 48 83 c4 18
[ 13.757378] RSP: 0018:ffffafe8c0217e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 13.757378] RAX: ffff91f00e9621c0 RBX: ffffffff893629c0 RCX: 0000000333c38c26
[ 13.757378] RDX: 0000000333c38c26 RSI: 0000000333c38bfe RDI: 0000000000000000
[ 13.757378] RBP: ffffafe8c0217ea0 R08: ffffffffffc2f714 R09: 0000000000021a80
[ 14.707128] scsi 0:0:0:0: CD-ROM AMI Virtual CDROM0 1.00 PQ: 0 ANSI: 0 CCS
[ 14.714782] scsi 1:0:0:0: Direct-Access AMI Virtual Floppy0 1.00 PQ: 0 ANSI: 0 CCS
[ 13.757378] R10: 00000037e4dac2dc R11: ffff91f00e961044 R12: ffff91f000b3c000
[ 13.757378] R13: 0000000000000002 R14: ffffffff89362a98 R15: ffffffff89362a80
[ 13.757378] cpuidle_enter+0x17/0x20
[ 13.757378] call_cpuidle+0x23/0x40
[ 13.757378] do_idle+0x22c/0x270
[ 13.757378] cpu_startup_entry+0x1d/0x20
[ 13.757378] start_secondary+0x1ab/0x200
[ 13.757378] secondary_startup_64+0xa4/0xb0
[ 13.757378] Kernel Offset: 0x6c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 13.757378] Rebooting in 30 seconds..
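
For anyone reading the log, the `aer_uncor_status`, `aer_uncor_mask` and `aer_uncor_severity` values can be decoded bit by bit. Below is a minimal Python sketch for that, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register. The table is a subset, and the names `AER_UNCOR_BITS` and `decode_aer_uncor` are made up for illustration and are not part of any tool.

```python
# Subset of the PCIe AER Uncorrectable Error Status register bits.
# Assumption: standard bit positions from the PCIe specification;
# bits missing from this table are reported by number only.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return a name for every bit set in a 32-bit AER register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


# aer_uncor_status from the log above
print(decode_aer_uncor(0x00100000))  # ['Unsupported Request Error']
```

With the status value from the log, the only bit set is bit 20, so the root port 0000:40:01.2 appears to be reporting an Unsupported Request error. The same function can be applied to the mask and severity values.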