Crash with Intel 760p NVMe

HighwayStar

New Member
Nov 12, 2019
We have an ASUS RS700A-E9 platform with dual EPYC 7501 CPUs and Proxmox 6.0 installed on 4 HDDs. We wanted to upgrade the HDDs to 2x NVMe drives, but ran into a kernel panic.


Code:
[   13.738723] i40e 0000:21:00.0: PCI-Express: Speed 8.0GT/s Width x8
[   13.751726] i40e 0000:21:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 119 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[   13.753397] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[   13.757378] {1}[Hardware Error]: event severity: fatal
[   13.757378] {1}[Hardware Error]:  Error 0, type: fatal
[   13.757378] {1}[Hardware Error]:  fru_text: PcieError
[   13.757378] {1}[Hardware Error]:   section_type: PCIe error
[   13.757378] {1}[Hardware Error]:   port_type: 4, root port
[   13.757378] {1}[Hardware Error]:   version: 0.2
[   13.757378] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[   13.757378] {1}[Hardware Error]:   device_id: 0000:40:01.2
[   13.757378] {1}[Hardware Error]:   slot: 238
[   13.757378] {1}[Hardware Error]:   secondary_bus: 0x41
[   13.757378] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
[   13.757378] {1}[Hardware Error]:   class_code: 000406
[   13.757378] {1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0010
[   13.757378] {1}[Hardware Error]:   aer_uncor_status: 0x00100000, aer_uncor_mask: 0x04500000
[   13.757378] {1}[Hardware Error]:   aer_uncor_severity: 0x004e2030
[   13.757378] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[   13.757378] Kernel panic - not syncing: Fatal hardware error!
[   13.757378] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.0.21-4-pve #1
[   13.757378] Hardware name: ASUSTeK COMPUTER INC. RS700A-E9-RS4/KNPP-D32 Series, BIOS 1301 06/17/2019
[   13.757378] Call Trace:
[   13.757378]  <IRQ>
[   13.757378]  dump_stack+0x63/0x8a
[   13.757378]  panic+0x101/0x2a7
[   13.757378]  __ghes_panic.cold.32+0x21/0x21
[   13.757378]  ? ghes_irq_func+0x50/0x50
[   13.757378]  ghes_proc+0xe0/0x140
[   13.757378]  ghes_poll_func+0x2c/0x60
[   13.757378]  call_timer_fn+0x30/0x130
[   13.757378]  run_timer_softirq+0x38a/0x420
[   13.757378]  ? ktime_get+0x40/0xa0
[   13.757378]  ? lapic_next_event+0x20/0x30
[   13.757378]  ? clockevents_program_event+0x93/0xf0
[   13.757378]  __do_softirq+0xdc/0x2f3
[   13.757378]  irq_exit+0xc0/0xd0
[   13.757378]  smp_apic_timer_interrupt+0x79/0x140
[   13.757378]  apic_timer_interrupt+0xf/0x20
[   13.757378]  </IRQ>
[   13.757378] RIP: 0010:cpuidle_enter_state+0xbd/0x450
[   13.757378] Code: ff e8 17 9d 85 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a d2 8b ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 cf 01 00 00 41 c7 44 24 08 00 00 00 00 48 83 c4 18
[   13.757378] RSP: 0018:ffffafe8c0217e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[   13.757378] RAX: ffff91f00e9621c0 RBX: ffffffff893629c0 RCX: 0000000333c38c26
[   13.757378] RDX: 0000000333c38c26 RSI: 0000000333c38bfe RDI: 0000000000000000
[   13.757378] RBP: ffffafe8c0217ea0 R08: ffffffffffc2f714 R09: 0000000000021a80
[   14.707128] scsi 0:0:0:0: CD-ROM            AMI      Virtual CDROM0   1.00 PQ: 0 ANSI: 0 CCS
[   14.714782] scsi 1:0:0:0: Direct-Access     AMI      Virtual Floppy0  1.00 PQ: 0 ANSI: 0 CCS
[   13.757378] R10: 00000037e4dac2dc R11: ffff91f00e961044 R12: ffff91f000b3c000
[   13.757378] R13: 0000000000000002 R14: ffffffff89362a98 R15: ffffffff89362a80
[   13.757378]  cpuidle_enter+0x17/0x20
[   13.757378]  call_cpuidle+0x23/0x40
[   13.757378]  do_idle+0x22c/0x270
[   13.757378]  cpu_startup_entry+0x1d/0x20
[   13.757378]  start_secondary+0x1ab/0x200
[   13.757378]  secondary_startup_64+0xa4/0xb0
[   13.757378] Kernel Offset: 0x6c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   13.757378] Rebooting in 30 seconds..

Tried with kernel 5.0.21-4-pve and the test kernel 5.3.7-1-pve. Full logs captured from the serial console are attached.
 

Was it nvme_core.default_ps_max_latency_us=1500 or nvme_core.default_ps_max_latency_us=5500? I'm using 2x 1.2TB NVMe SSDs (INTEL SSDPE2MX012T7), but the details I found on disabling the unsupported lowest power-saving state refer to using 5500 vs 1500 for the value.
 
I first checked the available power-saving states with nvme id-ctrl from the nvme-cli package.
Code:
# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x8086
ssvid     : 0x8086
sn        : EDITED
mn        : INTEL SSDPEKKW010T8                    
fr        : 004C  
rab       : 6
ieee      : 5cd2e4
cmic      : 0
mdts      : 6
cntlid    : 1
ver       : 10300
rtd3r     : 7a120
rtd3e     : 1e8480
oaes      : 0x200
ctratt    : 0
rrls      : 0
oacs      : 0x17
acl       : 4
aerl      : 7
frmw      : 0x14
lpa       : 0xf
elpe      : 255
npss      : 4
avscc     : 0
apsta     : 0x1
wctemp    : 348
cctemp    : 353
mtfa      : 50
hmpre     : 0
hmmin     : 0
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 5
dsto      : 1
fwug      : 0
kas       : 0
hctma     : 0x1
mntmt     : 303
mxtmt     : 348
sanicap   : 0x3
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0x4
vwc       : 0x1
awun      : 0
awupf     : 0
nvscc     : 0
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    : nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.60W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.80W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0450W non-operational enlat:2000 exlat:2000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0040W non-operational enlat:6000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

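For reference, the power state the controller is currently running in, and the APST settings the kernel has programmed, can also be queried with nvme-cli. This is just a sketch: feature IDs 0x02 (Power Management) and 0x0c (Autonomous Power State Transition) are the standard ones, but the human-readable output differs between nvme-cli versions and drives.
Code:
# current power state (Power Management feature, FID 0x02)
nvme get-feature /dev/nvme0 -f 0x02 -H

# APST table as programmed by the kernel (Autonomous Power State Transition, FID 0x0c)
nvme get-feature /dev/nvme0 -f 0x0c -H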
You need to set nvme_core.default_ps_max_latency_us to a value lower than the latency of the state you want to exclude (the kernel compares it against enlat + exlat). nvme_core.default_ps_max_latency_us=0 disables all power-saving states, which should be fine for server use.
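On Proxmox the option goes on the kernel command line. A minimal sketch, assuming a GRUB-booted install (on a ZFS-on-UEFI install that boots via systemd-boot, edit /etc/kernel/cmdline and run pve-efiboot-tool refresh instead); the 5500 here is just an example value:
Code:
# /etc/default/grub -- append the option to the default kernel command line,
# keeping whatever other options are already set there
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=5500"

# regenerate the GRUB config and reboot
update-grub
Going by the table above, 5500 should keep ps0-ps3 usable and only drop ps4 (enlat 6000 / exlat 8000), 1500 should also drop ps3 (enlat/exlat 2000), and 0 turns APST off entirely.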
 
