Micron P420m SSD IO stall on Proxmox 6

e100

Renowned Member
Nov 6, 2010
I have a p420m SSD that uses the mtip32xx driver in the kernel.
This drive worked perfectly fine in Proxmox 5.x; after upgrading to 6.x, write IO to the disk stalls frequently and can only be recovered with a reboot.
We first experienced the problem within hours of upgrading to 6.x.

The motherboard BIOS and the SSD firmware have the latest updates available.

The SSD is not worn out; it still has 99% of its life remaining:
Code:
[  214.301679] mtip32xx 0000:84:00.0: Write protect progress: 1% (209715 blocks)

Looking at /proc/diskstats, it seems like some IO is queued that never completes:
Code:
252       0 rssda 24635 0 3715072 31937345 4785988 20663738 211254204 319064718 0 10526068 350307656 0 0 0 0
252       1 rssda1 19405 0 3673176 31933499 350649 1557192 19381452 254545508 4 575520 286420084 0 0 0 0
252       2 rssda2 5220 0 41816 3844 4435339 19106546 191872752 64519210 2 10159320 63887572 0 0 0 0
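
For reference, the ninth stat column after the device name in /proc/diskstats is "I/Os currently in progress"; on a healthy device it drains back to zero, while above rssda1 and rssda2 are stuck at 4 and 2. A quick way to watch just that column (a minimal sketch, assuming the device is still named rssda):
Code:
# overall column 12 = "I/Os currently in progress"; re-run (or wrap in watch -n1),
# on a stalled queue the numbers never drop back to zero
awk '$3 ~ /^rssda/ { print $3, "in-flight:", $12 }' /proc/diskstats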

We are using DRBD on top of the SSD, with two volumes: one on rssda1, the other on rssda2.
DRBD tasks eventually hang because write IO cannot be completed:

Code:
[28880.147379] INFO: task drbd_r_drbd2:3473 blocked for more than 120 seconds.
[28880.147391]       Tainted: P           O      5.3.18-2-pve #1
[28880.147395] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28880.147401] drbd_r_drbd2    D    0  3473      2 0x80004004
[28880.147403] Call Trace:
[28880.147414]  __schedule+0x2bb/0x660
[28880.147421]  ? blk_flush_plug_list+0xe2/0x110
[28880.147422]  schedule+0x33/0xa0
[28880.147424]  io_schedule+0x16/0x40
[28880.147441]  _drbd_wait_ee_list_empty+0x91/0xe0 [drbd]
[28880.147447]  ? wait_woken+0x80/0x80
[28880.147451]  conn_wait_active_ee_empty+0x5e/0xc0 [drbd]
[28880.147456]  ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147461]  receive_Barrier+0x16e/0x3d0 [drbd]
[28880.147465]  ? decode_header+0x1c/0x100 [drbd]
[28880.147469]  ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147475]  drbd_receiver+0x28a/0x2e8 [drbd]
[28880.147480]  drbd_thread_setup+0x76/0x130 [drbd]
[28880.147485]  kthread+0x120/0x140
[28880.147490]  ? drbd_destroy_connection+0xe0/0xe0 [drbd]
[28880.147491]  ? __kthread_parkme+0x70/0x70
[28880.147494]  ret_from_fork+0x35/0x40
[28880.147496] INFO: task drbd_r_drbd3:3475 blocked for more than 120 seconds.
[28880.147501]       Tainted: P           O      5.3.18-2-pve #1
[28880.147505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28880.147510] drbd_r_drbd3    D    0  3475      2 0x80004004
[28880.147511] Call Trace:
[28880.147513]  __schedule+0x2bb/0x660
[28880.147514]  schedule+0x33/0xa0
[28880.147515]  io_schedule+0x16/0x40
[28880.147519]  _drbd_wait_ee_list_empty+0x91/0xe0 [drbd]
[28880.147521]  ? wait_woken+0x80/0x80
[28880.147525]  conn_wait_active_ee_empty+0x5e/0xc0 [drbd]
[28880.147529]  ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147534]  receive_Barrier+0x16e/0x3d0 [drbd]
[28880.147538]  ? decode_header+0x1c/0x100 [drbd]
[28880.147543]  ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147548]  drbd_receiver+0x28a/0x2e8 [drbd]
[28880.147553]  drbd_thread_setup+0x76/0x130 [drbd]
[28880.147554]  kthread+0x120/0x140
[28880.147559]  ? drbd_destroy_connection+0xe0/0xe0 [drbd]
[28880.147560]  ? __kthread_parkme+0x70/0x70
[28880.147562]  ret_from_fork+0x35/0x40

The only way I've found to recover is to reboot, but the stall eventually comes back.
I've tried changing the scheduler from mq-deadline to none for the SSD; it still has the same issue.
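
For completeness, the scheduler can be checked and switched per device via sysfs like this (a minimal sketch, assuming the device node is rssda):
Code:
# the active scheduler is shown in brackets
cat /sys/block/rssda/queue/scheduler
echo none > /sys/block/rssda/queue/scheduler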

Any suggestions are appreciated.

pveversion -v:
Code:
root@vm7:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
pve-zsync: 2.0-2
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
Is there any way to tell if this patch is included in the Proxmox kernel?
https://lore.kernel.org/stable/20191006000130.GE25255@sasha-vm/T/

This seems to describe the exact issue I am having:
Since the dispatch context finishes after the write request completion
handling, marking the queue as needing a restart is not seen from
__blk_mq_free_request() and blk_mq_sched_restart() not executed leading
to the dispatch stall under 100% write workloads.

The node having issues is only a DRBD secondary, so it is receiving only writes from DRBD.
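
One rough way to check would be to search the Proxmox kernel packaging repo for a backport of that fix, and the upstream 5.3.y stable log for the function named in the commit message; the repo layout, paths and tag names below are assumptions on my part, not verified:
Code:
# Proxmox side: look for the fix among the packaging repo's patches/changelog
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel
git checkout <tag matching 5.3.18-2-pve>
grep -ri "blk_mq_sched_restart" patches/ debian/changelog

# Upstream side: check whether the fix landed in the 5.3.y stable series
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git log --oneline v5.3..v5.3.18 -- block/blk-mq-sched.c block/blk-mq.c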
 
I am also suffering from the same issue, though with the 5.3.18-3 kernel.

Code:
[ 1935.072149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1935.072205] jbd2/rssda1-8   D    0  5201      2 0x80004000
[ 1935.072210] Call Trace:
[ 1935.072228]  __schedule+0x2bb/0x660
[ 1935.072237]  ? wait_woken+0x80/0x80
[ 1935.072240]  schedule+0x33/0xa0
[ 1935.072249]  jbd2_journal_commit_transaction+0x24f/0x1723
[ 1935.072256]  ? __switch_to_asm+0x34/0x70
[ 1935.072259]  ? __switch_to_asm+0x40/0x70
[ 1935.072263]  ? __switch_to_asm+0x34/0x70
[ 1935.072266]  ? __switch_to_asm+0x40/0x70
[ 1935.072269]  ? __switch_to_asm+0x34/0x70
[ 1935.072273]  ? __switch_to_asm+0x40/0x70
[ 1935.072276]  ? __switch_to_asm+0x34/0x70
[ 1935.072280]  ? __switch_to_asm+0x34/0x70
[ 1935.072283]  ? __switch_to_asm+0x40/0x70
[ 1935.072287]  ? __switch_to_asm+0x40/0x70
[ 1935.072291]  ? wait_woken+0x80/0x80
[ 1935.072297]  ? try_to_del_timer_sync+0x53/0x80
[ 1935.072300]  kjournald2+0xc8/0x270
[ 1935.072304]  ? wait_woken+0x80/0x80
[ 1935.072311]  kthread+0x120/0x140
[ 1935.072313]  ? commit_timeout+0x20/0x20
[ 1935.072317]  ? __kthread_parkme+0x70/0x70
[ 1935.072321]  ret_from_fork+0x35/0x40
[ 1935.072326] INFO: task ext4lazyinit:5203 blocked for more than 604 seconds.
[ 1935.072376]       Tainted: P           O      5.3.18-3-pve #1
[ 1935.072417] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1935.072471] ext4lazyinit    D    0  5203      2 0x80004000
[ 1935.072474] Call Trace:
[ 1935.072479]  __schedule+0x2bb/0x660
[ 1935.072487]  ? blk_mq_flush_plug_list+0x1f7/0x290
[ 1935.072491]  ? __switch_to_asm+0x40/0x70
[ 1935.072494]  schedule+0x33/0xa0
[ 1935.072498]  schedule_timeout+0x205/0x300
[ 1935.072504]  ? blk_flush_plug_list+0xe2/0x110
[ 1935.072508]  io_schedule_timeout+0x1e/0x50
[ 1935.072512]  wait_for_completion_io+0xb7/0x140
[ 1935.072516]  ? wake_up_q+0x80/0x80
[ 1935.072523]  submit_bio_wait+0x61/0x90
[ 1935.072527]  blkdev_issue_zeroout+0x140/0x220
[ 1935.072534]  ext4_init_inode_table+0x177/0x37b
[ 1935.072542]  ext4_lazyinit_thread+0x2c1/0x3a0
[ 1935.072547]  kthread+0x120/0x140
[ 1935.072550]  ? ext4_unregister_li_request+0x70/0x70
[ 1935.072553]  ? __kthread_parkme+0x70/0x70
[ 1935.072557]  ret_from_fork+0x35/0x40

pveversion -v:
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-11 (running version: 6.1-11/f2f18736)
pve-kernel-helper: 6.1-9
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.2
libpve-access-control: 6.0-7
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-1
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-6
pve-cluster: 6.1-8
pve-container: 3.1-4
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.1-1
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
Potential update: I found that there was a critical update to the P420m firmware (via a pointer from Reddit /r/homelab) which fixed the hanging issue in Windows and Ubuntu, but the stall still occurred in Proxmox.

However, I found a potentially newer version of the mtip32xx driver in Torvalds' Linux repo on GitHub (https://github.com/torvalds/linux/blob/master/drivers/block/mtip32xx/mtip32xx.c), which, after compiling against the latest PVE kernel (5.3.18-3-pve), seems to be working for now.
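
Roughly, building the upstream module out-of-tree against the running PVE kernel looks like the sketch below; the header package name, the updates/ directory and the assumption that the upstream source builds unchanged against 5.3 are mine, not verified:
Code:
apt install pve-headers-$(uname -r) build-essential
mkdir mtip32xx && cd mtip32xx
# copy mtip32xx.c and mtip32xx.h from the torvalds/linux tree into this directory
printf 'obj-m += mtip32xx.o\n' > Kbuild
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# install under updates/ (which depmod normally prefers over the in-tree copy),
# rebuild the initramfs and reboot -- the in-tree module cannot be unloaded
# while the disk is in use
mkdir -p /lib/modules/$(uname -r)/updates
cp mtip32xx.ko /lib/modules/$(uname -r)/updates/
depmod -a
update-initramfs -u -k $(uname -r)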

I'm going to run further tests, but it seems like an old version of the driver is included with Proxmox. Since the Linux kernel has kept the module version at 1.3.1 (and has for roughly the past 6 years), it's practically impossible to tell from the version string whether Proxmox ships the latest driver or not.
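
For what it's worth, the only in-band hints are the module's own metadata; the version string stays at 1.3.1 either way, so at best srcversion/vermagic distinguish builds:
Code:
modinfo mtip32xx | grep -iE '^(version|srcversion|vermagic)'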
 
I had already downloaded and updated the firmware to the latest that I could find on Micron's website.

These cards were working flawlessly in Proxmox 5; something changed in the kernel to cause this problem.
I can set up a test system with my P420m cards to test solutions to the issue if anyone has additional suggestions.
 
