PVE 6.1-7 Crash (SMP NOPTI) [No answer from staff]

onepamopa · Feb 25, 2020

Hello, about 10-15 minutes ago the whole PVE server went down. There wasn't any unusual CPU or network usage beforehand.

Here's what's in the log:

Feb 25 19:49:40 proxmox kernel: [425423.794455] PGD 0 P4D 0
Feb 25 19:49:40 proxmox kernel: [425423.794458] Oops: 0002 [#1] SMP NOPTI
Feb 25 19:49:40 proxmox kernel: [425423.794460] CPU: 16 PID: 849 Comm: kworker/16:1H Tainted: P O 5.3.18-2-pve #1
Feb 25 19:49:40 proxmox kernel: [425423.794462] Hardware name: System manufacturer System Product Name/PRIME TRX40-PRO, BIOS 0702 12/12/2019
Feb 25 19:49:40 proxmox kernel: [425423.794467] Workqueue: kblockd blk_mq_requeue_work
Feb 25 19:49:40 proxmox kernel: [425423.794471] RIP: 0010:_raw_spin_lock+0x10/0x30
Feb 25 19:49:40 proxmox kernel: [425423.794473] Code: 75 06 48 89 d8 5b 5d c3 e8 dd 27 63 ff eb f3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 02 5d c3 89 c6 e8 a1 13 63 ff 66 90 5d c3 66 66 2e
Feb 25 19:49:40 proxmox kernel: [425423.794476] RSP: 0018:ffffb91601b67e10 EFLAGS: 00010246
Feb 25 19:49:40 proxmox kernel: [425423.794478] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffb91601b67e48
Feb 25 19:49:40 proxmox kernel: [425423.794480] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Feb 25 19:49:40 proxmox kernel: [425423.794481] RBP: ffffb91601b67e10 R08: 0000000000000000 R09: 00646b636f6c626b
Feb 25 19:49:40 proxmox kernel: [425423.794483] R10: 8080808080808080 R11: ffff9c85bd2294c4 R12: ffff9c85a8adbb80
Feb 25 19:49:40 proxmox kernel: [425423.794485] R13: 0000000000000000 R14: ffff9c85a8871d68 R15: 0ffff9c85bd23260
Feb 25 19:49:40 proxmox kernel: [425423.794487] FS: 0000000000000000(0000) GS:ffff9c85bd200000(0000) knlGS:0000000000000000
Feb 25 19:49:40 proxmox kernel: [425423.794489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 19:49:40 proxmox kernel: [425423.794490] CR2: 0000000000000000 CR3: 000000075b7f2000 CR4: 0000000000340ee0
Feb 25 19:49:40 proxmox kernel: [425423.794492] Call Trace:
Feb 25 19:49:40 proxmox kernel: [425423.794495] blk_mq_request_bypass_insert+0x20/0x70
Feb 25 19:49:40 proxmox kernel: [425423.794497] blk_mq_requeue_work+0xa6/0x160
Feb 25 19:49:40 proxmox kernel: [425423.794500] process_one_work+0x20f/0x3d0
Feb 25 19:49:40 proxmox kernel: [425423.794502] worker_thread+0x34/0x400
Feb 25 19:49:40 proxmox kernel: [425423.794504] kthread+0x120/0x140
Feb 25 19:49:40 proxmox kernel: [425423.794505] ? process_one_work+0x3d0/0x3d0
Feb 25 19:49:40 proxmox kernel: [425423.794507] ? __kthread_parkme+0x70/0x70
Feb 25 19:49:40 proxmox kernel: [425423.794509] ret_from_fork+0x22/0x40
Feb 25 19:49:40 proxmox kernel: [425423.794511] Modules linked in: veth tcp_diag inet_diag vfio_pci vfio_virqfd vfio_iommu_type1 vfio ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw xt_mac ipt_REJECT nf_reject_ipv4 xt_mark xt_set xt_physdev xt_addrtype xt_comment xt_multiport xt_conntrack xt_tcpudp ip_set_hash_net ip_set iptable_filter bpfilter bonding softdog openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_usb_audio snd_usbmidi_lib snd_hwdep edac_mce_amd snd_rawmidi kvm_amd kvm snd_seq_device irqbypass mc tcp_bbr snd_pcm snd_timer snd crct10dif_pclmul crc32_pclmul soundcore ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd ccp eeepc_wmi joydev input_leds cryptd k10temp glue_helper asus_wmi sparse_keymap video pcspkr mac_hid wmi_bmof mxm_wmi zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) nfnetlink_log nfnetlink zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
Feb 25 19:49:40 proxmox kernel: [425423.794540] scsi_transport_iscsi nct6775 hwmon_vid sunrpc ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage usbmouse hid_generic usbkbd usbhid hid ahci libahci igb i2c_algo_bit dca i2c_piix4 wmi
Feb 25 19:49:40 proxmox kernel: [425423.794564] CR2: 0000000000000000
Feb 25 19:49:40 proxmox kernel: [425423.794566] ---[ end trace 0d4be7da105ef9bb ]---
Feb 25 19:49:40 proxmox kernel: [425423.794568] RIP: 0010:_raw_spin_lock+0x10/0x30
Feb 25 19:49:40 proxmox kernel: [425423.794569] Code: 75 06 48 89 d8 5b 5d c3 e8 dd 27 63 ff eb f3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 02 5d c3 89 c6 e8 a1 13 63 ff 66 90 5d c3 66 66 2e
Feb 25 19:49:40 proxmox kernel: [425423.794572] RSP: 0018:ffffb91601b67e10 EFLAGS: 00010246
Feb 25 19:49:40 proxmox kernel: [425423.794574] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffb91601b67e48
Feb 25 19:49:40 proxmox kernel: [425423.794575] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Feb 25 19:49:40 proxmox kernel: [425423.794577] RBP: ffffb91601b67e10 R08: 0000000000000000 R09: 00646b636f6c626b
Feb 25 19:49:40 proxmox kernel: [425423.794579] R10: 8080808080808080 R11: ffff9c85bd2294c4 R12: ffff9c85a8adbb80
Feb 25 19:49:40 proxmox kernel: [425423.794580] R13: 0000000000000000 R14: ffff9c85a8871d68 R15: 0ffff9c85bd23260
Feb 25 19:49:40 proxmox kernel: [425423.794582] FS: 0000000000000000(0000) GS:ffff9c85bd200000(0000) knlGS:0000000000000000
Feb 25 19:49:40 proxmox kernel: [425423.794584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 19:49:40 proxmox kernel: [425423.794585] CR2: 0000000000000000 CR3: 000000075b7f2000 CR4: 0000000000340ee0
Feb 25 19:50:10 proxmox kernel: [425453.807926] nvme nvme0: I/O 625 QID 5 timeout, aborting
Feb 25 19:50:10 proxmox kernel: [425453.813494] nvme nvme0: Abort status: 0x0
Feb 25 19:50:40 proxmox kernel: [425484.011717] nvme nvme0: I/O 625 QID 5 timeout, reset controller
Feb 25 19:50:40 proxmox kernel: [425484.351369] nvme nvme0: 7/0/0 default/read/poll queues
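
For anyone triaging a similar trace, here is a minimal Python sketch that pulls the page-fault error code, the faulting symbol, the fault address (CR2), and the reliable call-trace frames out of syslog-formatted oops lines. The sample lines are copied from the log above; the names `parse_oops` and `decode_error_code` are made up for this illustration, and the regexes assume the exact line layout shown in this post.

```python
import re

# Lines copied from the log above (only the fields this sketch reads).
SAMPLE = """\
Feb 25 19:49:40 proxmox kernel: [425423.794458] Oops: 0002 [#1] SMP NOPTI
Feb 25 19:49:40 proxmox kernel: [425423.794471] RIP: 0010:_raw_spin_lock+0x10/0x30
Feb 25 19:49:40 proxmox kernel: [425423.794490] CR2: 0000000000000000 CR3: 000000075b7f2000 CR4: 0000000000340ee0
Feb 25 19:49:40 proxmox kernel: [425423.794492] Call Trace:
Feb 25 19:49:40 proxmox kernel: [425423.794495] blk_mq_request_bypass_insert+0x20/0x70
Feb 25 19:49:40 proxmox kernel: [425423.794497] blk_mq_requeue_work+0xa6/0x160
Feb 25 19:49:40 proxmox kernel: [425423.794500] process_one_work+0x20f/0x3d0
Feb 25 19:49:40 proxmox kernel: [425423.794505] ? process_one_work+0x3d0/0x3d0
Feb 25 19:49:40 proxmox kernel: [425423.794509] ret_from_fork+0x22/0x40
Feb 25 19:49:40 proxmox kernel: [425423.794511] Modules linked in: veth tcp_diag
"""

# Strips the syslog prefix and the "[seconds.micros]" kernel timestamp.
PREFIX = re.compile(r"^.*?kernel: \[\s*\d+\.\d+\]\s*")
# A stack frame looks like "symbol+0xoff/0xsize", optionally led by "? ".
FRAME = re.compile(r"^(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+$")


def decode_error_code(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not mapped
        "write": bool(code & 0x2),         # 1 means the access was a write
        "user_mode": bool(code & 0x4),     # 0 means the fault hit in kernel mode
    }


def parse_oops(text):
    """Collect error code, faulting symbol, CR2, and reliable stack frames."""
    info = {"error_code": None, "rip": None, "cr2": None, "trace": []}
    in_trace = False
    for raw in text.splitlines():
        msg = PREFIX.sub("", raw).strip()
        if in_trace:
            frame = FRAME.match(msg)
            if frame:
                # Frames prefixed with "? " are unverified; skip them.
                if not frame.group(1):
                    info["trace"].append(frame.group(2))
                continue
            in_trace = False  # first non-frame line ends the trace
        if msg.startswith("Oops:") and info["error_code"] is None:
            info["error_code"] = int(msg.split()[1], 16)
        elif msg.startswith("RIP:") and info["rip"] is None:
            symbol = re.search(r":([\w.]+)\+0x", msg)
            if symbol:
                info["rip"] = symbol.group(1)
        elif msg.startswith("CR2:") and info["cr2"] is None:
            info["cr2"] = int(msg.split()[1], 16)
        elif msg == "Call Trace:":
            in_trace = True
    return info


if __name__ == "__main__":
    result = parse_oops(SAMPLE)
    print(result)
    print(decode_error_code(result["error_code"]))
```

Read this way, the log describes a kernel-mode write to an unmapped page at address 0 (RDI, the lock pointer, is also 0), i.e. a NULL pointer dereference inside `_raw_spin_lock` reached from `blk_mq_request_bypass_insert`. That narrows down where the fault happened, not why.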

onepamopa · Feb 28, 2020

Does anyone from staff know what causes this?
The post keeps slipping down; by now it's on the 6th page...

Stoiko Ivanov · Feb 28, 2020

Seemingly not - it's not a very common issue where I (and I guess the same goes for my colleagues) could point a finger and say: that's the definitive cause.

Questions that would help narrow down where the issue might originate:
* Does the issue happen consistently, or did it only happen once?
* Do any particular steps cause the issue to appear?
* Does the issue also happen with a different kernel version?
* What kind of hardware does this issue happen on?
* Does the system stay responsive, or is it completely dead and in need of a reset?
* Check the journal/syslog for entries around the time of the oops. If available, post them (and also more of the oops message, if available).
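
As an illustration of the last point, here is a minimal Python sketch that keeps only the syslog lines stamped within a chosen window around the oops. It assumes the classic syslog timestamp format seen in this thread ("Feb 25 19:49:40"), an English/C locale for month names, and that the whole window falls in one calendar year. The function name `entries_around` and the 45-second window are arbitrary choices for this example.

```python
from datetime import datetime, timedelta

# Three lines taken from the log posted in this thread.
SAMPLE_LINES = [
    "Feb 25 19:49:40 proxmox kernel: [425423.794458] Oops: 0002 [#1] SMP NOPTI",
    "Feb 25 19:50:10 proxmox kernel: [425453.807926] nvme nvme0: I/O 625 QID 5 timeout, aborting",
    "Feb 25 19:50:40 proxmox kernel: [425484.011717] nvme nvme0: I/O 625 QID 5 timeout, reset controller",
]


def entries_around(lines, when, window_seconds=300):
    """Return the lines stamped within +/- window_seconds of `when`."""
    margin = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        try:
            # Classic syslog stamps carry no year, so borrow it from `when`.
            stamp = datetime.strptime(
                f"{when.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped continuation)
        if abs(stamp - when) <= margin:
            selected.append(line)
    return selected


if __name__ == "__main__":
    oops_time = datetime(2020, 2, 25, 19, 49, 40)
    for entry in entries_around(SAMPLE_LINES, oops_time, window_seconds=45):
        print(entry)
```

In practice the input would be the lines of the full syslog file rather than a hard-coded list, and a wider window (several minutes before the oops) is usually more informative.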

Since the stack trace seems to indicate work on block devices (and especially your NVMe):
* Which NVMe model do you have?
* Is there a firmware update available for the NVMe? If yes, please try to install it.
* Are any other firmware updates available for the system?
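
To answer the model and firmware questions, one option is to read the attributes the Linux NVMe driver exposes in sysfs. The following is a rough Python sketch, assuming controllers appear under `/sys/class/nvme/` with `model`, `firmware_rev`, and `serial` files; the function name `nvme_controllers` is made up here. Checking whether a newer firmware exists still means looking at the vendor's site.

```python
from pathlib import Path


def nvme_controllers(sysfs_root="/sys/class/nvme"):
    """Map each NVMe controller name to its model, firmware and serial."""
    root = Path(sysfs_root)
    found = {}
    if not root.is_dir():
        return found  # no NVMe driver loaded, or not a Linux host
    for ctrl in sorted(root.iterdir()):
        attrs = {}
        for name in ("model", "firmware_rev", "serial"):
            attr = ctrl / name
            # Missing attributes are reported as None instead of failing.
            attrs[name] = attr.read_text().strip() if attr.is_file() else None
        found[ctrl.name] = attrs
    return found


if __name__ == "__main__":
    for name, attrs in nvme_controllers().items():
        print(name, attrs)
```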

Last but not least, check the other hardware components. Sometimes a bad RAM module causes this kind of issue.

I hope this helps!

onepamopa · Feb 28, 2020

Well, I've been using PVE for 2 months and this is the first time I've had this issue. Obviously something happened to cause it, which is why I reported it.

Stoiko Ivanov · Feb 28, 2020

onepamopa said:
Obviously something happened to make it so, which is why I reported it.

Thanks for the report and for putting in the effort to make PVE better!

If the issue does not happen more often, it is rather hard to determine its source and to reproduce it.

LnxBil · Feb 29, 2020

onepamopa said:
Obviously something happened to make it so, which is why I reported it.

As every member of any support staff would suggest:
- Update Firmware/BIOS
- Update OS

The hardware is a consumer product and really new, so be patient or use server-grade hardware, which normally runs server operating systems much better.
