I have a p420m SSD that uses the mtip32xx driver in the kernel.
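For reference, the card sits at PCI address 0000:84:00.0 (taken from the dmesg output below); something like this should confirm the driver binding, though the exact lspci output will vary by system:
Code:
root@vm7:~# lspci -nnk -s 84:00.0
# expecting a line like "Kernel driver in use: mtip32xx"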
This drive worked perfectly fine under Proxmox 5.x; after upgrading to 6.x, write IO to the disk stalls frequently and can only be recovered with a reboot.
We first experienced the problem within hours of upgrading to 6.x.
The motherboard BIOS and the SSD firmware have the latest updates available.
The SSD is not worn out; it still has 99% of its life remaining:
Code:
[ 214.301679] mtip32xx 0000:84:00.0: Write protect progress: 1% (209715 blocks)
Looking at /proc/diskstats, it seems like some IO is queued that never completes:
Code:
252 0 rssda 24635 0 3715072 31937345 4785988 20663738 211254204 319064718 0 10526068 350307656 0 0 0 0
252 1 rssda1 19405 0 3673176 31933499 350649 1557192 19381452 254545508 4 575520 286420084 0 0 0 0
252 2 rssda2 5220 0 41816 3844 4435339 19106546 191872752 64519210 2 10159320 63887572 0 0 0 0
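The 9th stat after the device name (field 12 of each line) should be the "I/Os currently in progress" counter; it sits at 4 on rssda1 and 2 on rssda2 and never drains. This is roughly how I'm watching it:
Code:
root@vm7:~# awk '$3 ~ /^rssda/ { print $3, "in-flight:", $12 }' /proc/diskstats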
We are using DRBD on top of the SSD, with two volumes: one on rssda1 and the other on rssda2.
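In case the layout matters, the DRBD side looks roughly like this (peer name, addresses, port, and the rssda1-to-drbd2 mapping are placeholders, not our exact config):
Code:
resource drbd2 {
    device    /dev/drbd2;
    disk      /dev/rssda1;
    meta-disk internal;
    on vm7  { address 192.0.2.11:7789; }
    on vm8  { address 192.0.2.12:7789; }
}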
DRBD tasks eventually hang because write IO cannot be completed:
Code:
[28880.147379] INFO: task drbd_r_drbd2:3473 blocked for more than 120 seconds.
[28880.147391] Tainted: P O 5.3.18-2-pve #1
[28880.147395] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28880.147401] drbd_r_drbd2 D 0 3473 2 0x80004004
[28880.147403] Call Trace:
[28880.147414] __schedule+0x2bb/0x660
[28880.147421] ? blk_flush_plug_list+0xe2/0x110
[28880.147422] schedule+0x33/0xa0
[28880.147424] io_schedule+0x16/0x40
[28880.147441] _drbd_wait_ee_list_empty+0x91/0xe0 [drbd]
[28880.147447] ? wait_woken+0x80/0x80
[28880.147451] conn_wait_active_ee_empty+0x5e/0xc0 [drbd]
[28880.147456] ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147461] receive_Barrier+0x16e/0x3d0 [drbd]
[28880.147465] ? decode_header+0x1c/0x100 [drbd]
[28880.147469] ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147475] drbd_receiver+0x28a/0x2e8 [drbd]
[28880.147480] drbd_thread_setup+0x76/0x130 [drbd]
[28880.147485] kthread+0x120/0x140
[28880.147490] ? drbd_destroy_connection+0xe0/0xe0 [drbd]
[28880.147491] ? __kthread_parkme+0x70/0x70
[28880.147494] ret_from_fork+0x35/0x40
[28880.147496] INFO: task drbd_r_drbd3:3475 blocked for more than 120 seconds.
[28880.147501] Tainted: P O 5.3.18-2-pve #1
[28880.147505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28880.147510] drbd_r_drbd3 D 0 3475 2 0x80004004
[28880.147511] Call Trace:
[28880.147513] __schedule+0x2bb/0x660
[28880.147514] schedule+0x33/0xa0
[28880.147515] io_schedule+0x16/0x40
[28880.147519] _drbd_wait_ee_list_empty+0x91/0xe0 [drbd]
[28880.147521] ? wait_woken+0x80/0x80
[28880.147525] conn_wait_active_ee_empty+0x5e/0xc0 [drbd]
[28880.147529] ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147534] receive_Barrier+0x16e/0x3d0 [drbd]
[28880.147538] ? decode_header+0x1c/0x100 [drbd]
[28880.147543] ? drbd_bump_write_ordering+0x240/0x240 [drbd]
[28880.147548] drbd_receiver+0x28a/0x2e8 [drbd]
[28880.147553] drbd_thread_setup+0x76/0x130 [drbd]
[28880.147554] kthread+0x120/0x140
[28880.147559] ? drbd_destroy_connection+0xe0/0xe0 [drbd]
[28880.147560] ? __kthread_parkme+0x70/0x70
[28880.147562] ret_from_fork+0x35/0x40
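The traces above come from the hung-task detector; while a stall is in progress, the same blocked-task state can also be dumped on demand (assuming sysrq is enabled):
Code:
root@vm7:~# echo w > /proc/sysrq-trigger
root@vm7:~# dmesg | tail -n 100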
The only way I've found to recover is to reboot, but the problem eventually comes back.
I've tried changing the I/O scheduler for the SSD from mq-deadline to none, but the issue remains.
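For reference, this is roughly how I switched it (the cat shows the active scheduler in brackets):
Code:
root@vm7:~# cat /sys/block/rssda/queue/scheduler
root@vm7:~# echo none > /sys/block/rssda/queue/scheduler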
Any suggestions are appreciated.
pveversion -v:
Code:
root@vm7:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
pve-zsync: 2.0-2
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1