I/O error /dev/sda pve tainted IO

boilami

Member
Sep 15, 2019
Hi,

My PVE hangs from time to time; I would say it happens about once a month.
It happened tonight, and the only solution I found that works is a reboot of the server :(
2020-02-10 04_43_22-PVEstatus.png
From the error message it seems like I need to change one or more drives,
but after running tests on them, they seem healthy.
I would like to know what could cause the IO hang and how I can solve it.

Thanks for your help.

Description: PowerEdge R610
BIOS Version: 6.6.0
Lifecycle Controller Firmware: 1.7.5.4
RAID card: LSI SAS2008-IT
4x 146GB drives, PVE installed in ZFS RAID10

pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)

In the attachment you can see what the screen looked like before rebooting.
2020-02-09 08_50_13-ZFS IO Error KERNEL.png
--------------------------------
print_req_error: I/O error, dev sda, sector XXXXXXXXX flags 701
INFO: task z_wr_iss:XXX blocked for more than 120 seconds.
Tainted: P IO 5.0.15-1-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
--------------------------------
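For context (a sketch, not a fix): the 120-second threshold in that message is the kernel's hung-task watchdog, and the sysctl the message mentions can be inspected directly.

```shell
# Read the current hung-task warning threshold (seconds); 120 is the default.
sysctl kernel.hung_task_timeout_secs
# Setting it to 0 only silences the warning, as the kernel message says;
# it does not address the underlying stalled IO:
# echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```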


Looking at the kernel logs after reboot :
--------------------------------
root@pve:~# cat /var/log/kern.log.1 | grep sda
Feb 4 15:51:17 pve kernel: [1996460.343206] sd 2:0:0:0: [sda] tag#1247 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 4 15:51:17 pve kernel: [1996460.343211] sd 2:0:0:0: [sda] tag#1247 CDB: Write(10) 2a 00 00 f3 00 78 00 01 30 00
Feb 4 15:51:17 pve kernel: [1996460.343213] print_req_error: I/O error, dev sda, sector 15925368 flags 701
Feb 6 03:43:02 pve kernel: [2125558.892570] sd 2:0:0:0: [sda] tag#2887 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 6 03:43:02 pve kernel: [2125558.892575] sd 2:0:0:0: [sda] tag#2887 CDB: Write(10) 2a 00 08 49 8f 98 00 00 40 00
Feb 6 03:43:02 pve kernel: [2125558.892577] print_req_error: I/O error, dev sda, sector 139038616 flags 701
Feb 7 09:55:06 pve kernel: [2234278.018079] sd 2:0:0:0: [sda] tag#623 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 09:55:06 pve kernel: [2234278.018087] sd 2:0:0:0: [sda] tag#623 CDB: Write(10) 2a 00 0a 9c 59 00 00 00 30 00
Feb 7 09:55:06 pve kernel: [2234278.018089] print_req_error: I/O error, dev sda, sector 178018560 flags 701
Feb 7 11:28:01 pve kernel: [2239853.255288] sd 2:0:0:0: [sda] tag#727 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 11:28:01 pve kernel: [2239853.255337] sd 2:0:0:0: [sda] tag#727 CDB: Write(10) 2a 00 02 09 d2 40 00 01 00 00
Feb 7 11:28:01 pve kernel: [2239853.255356] print_req_error: I/O error, dev sda, sector 34198080 flags 701
--------------------------------
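As a cross-check (a quick sketch using only the log lines above): in a Write(10) CDB, bytes 2-5 hold the target LBA big-endian and bytes 7-8 the transfer length in blocks, so the sector in each print_req_error line can be recovered from the CDB bytes.

```shell
# Write(10) CDB from the first failure: 2a 00 00 f3 00 78 00 01 30 00
# Bytes 2-5 (00 f3 00 78) are the LBA, big-endian:
printf '%d\n' 0x00f30078   # 15925368, matching "sector 15925368" above
# Bytes 7-8 (01 30) are the transfer length in 512-byte blocks:
printf '%d\n' 0x0130       # 304 blocks = 155648 bytes, matching the zio size later
```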

root@pve:/etc# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:05:24 with 0 errors on Sun Nov 10 00:29:26 2019
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            scsi-35000c5000c6c5c87-part3  ONLINE       0     0     0
            scsi-35000c5000ad66403-part3  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            scsi-35000c5000eeb5dcb        ONLINE       0     0     0
            scsi-35000c5000ad664b3        ONLINE       0     0     0

errors: No known data errors

#fdisk -l
Disk /dev/sda: 136.8 GiB, 146815733760 bytes, 286749480 sectors
Disk model: ST9146802SS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 35B15105-5C87-46D5-9078-ABFED8B227AD

Device        Start       End   Sectors   Size Type
/dev/sda1        34      2047      2014  1007K BIOS boot
/dev/sda2      2048   1050623   1048576   512M EFI System
/dev/sda3   1050624 286749446 285698823 136.2G Solaris /usr & Apple ZFS


root@pve:~# smartctl -l selftest /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background long   Completed        -     50761              - [-   -    -]
# 2  Background short  Completed        -     50759              - [-   -    -]
# 3  Background short  Completed        -     46771              - [-   -    -]
# 4  Background short  Completed        -     40208              - [-   -    -]
# 5  Background short  Completed        -     40181              - [-   -    -]
# 6  Background long   Completed        -         0              - [-   -    -]
# 7  Background short  Completed        -         0              - [-   -    -]

Long (extended) Self-test duration: 2070 seconds [34.5 minutes]
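If in doubt, a fresh self-test can be kicked off by hand (standard smartctl usage; adapt the device node to the drive being checked):

```shell
# Start a long (extended) self-test in the background; ~35 minutes on these drives
smartctl -t long /dev/sda
# Poll the self-test log until the new entry shows up as Completed
smartctl -l selftest /dev/sda
# On SAS drives, the error counter log is also worth a look
smartctl -l error /dev/sda
```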

root@pve:/etc# cat /var/log/kern.log.1
Feb 3 09:28:31 pve kernel: [1887098.503308] perf: interrupt took too long (17941 > 17835), lowering kernel.perf_event_max_sample_rate to 11000
Feb 4 15:51:17 pve kernel: [1996460.343190] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 4 15:51:17 pve kernel: [1996460.343206] sd 2:0:0:0: [sda] tag#1247 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 4 15:51:17 pve kernel: [1996460.343211] sd 2:0:0:0: [sda] tag#1247 CDB: Write(10) 2a 00 00 f3 00 78 00 01 30 00
Feb 4 15:51:17 pve kernel: [1996460.343213] print_req_error: I/O error, dev sda, sector 15925368 flags 701
Feb 4 15:51:17 pve kernel: [1996460.343258] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=7615868928 size=155648 flags=40080c80
Feb 6 03:02:47 pve kernel: [2123144.590392] device tap103i0 entered promiscuous mode
Feb 6 03:43:02 pve kernel: [2125558.892553] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 6 03:43:02 pve kernel: [2125558.892570] sd 2:0:0:0: [sda] tag#2887 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 6 03:43:02 pve kernel: [2125558.892575] sd 2:0:0:0: [sda] tag#2887 CDB: Write(10) 2a 00 08 49 8f 98 00 00 40 00
Feb 6 03:43:02 pve kernel: [2125558.892577] print_req_error: I/O error, dev sda, sector 139038616 flags 701
Feb 6 03:43:02 pve kernel: [2125558.892623] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=70649851904 size=32768 flags=40080c80
Feb 7 09:55:06 pve kernel: [2234278.018031] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 7 09:55:06 pve kernel: [2234278.018079] sd 2:0:0:0: [sda] tag#623 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 09:55:06 pve kernel: [2234278.018087] sd 2:0:0:0: [sda] tag#623 CDB: Write(10) 2a 00 0a 9c 59 00 00 00 30 00
Feb 7 09:55:06 pve kernel: [2234278.018089] print_req_error: I/O error, dev sda, sector 178018560 flags 701
Feb 7 09:55:06 pve kernel: [2234278.018137] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=90607583232 size=24576 flags=40080c80
Feb 7 11:28:01 pve kernel: [2239853.255228] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 7 11:28:01 pve kernel: [2239853.255288] sd 2:0:0:0: [sda] tag#727 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 11:28:01 pve kernel: [2239853.255337] sd 2:0:0:0: [sda] tag#727 CDB: Write(10) 2a 00 02 09 d2 40 00 01 00 00
Feb 7 11:28:01 pve kernel: [2239853.255356] print_req_error: I/O error, dev sda, sector 34198080 flags 701
Feb 7 11:28:01 pve kernel: [2239853.255391] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=16971497472 size=131072 flags=40080c80


 
4x 146GB drives, PVE installed in ZFS RAID10
scan: scrub repaired 0B in 0 days 00:05:24 with 0 errors on Sun Nov 10 00:29:26 2019


By default, ZFS does a monthly scrub on the second Sunday of every month:
Code:
cat /etc/cron.d/zfsutils-linux
which could explain the "hangs" you experience monthly.
Higher IO is normal then, as a scrub needs to verify all data on the disks.
But one should not need to reboot the system afterwards.
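The second-Sunday logic in that file is just a day-of-month range plus a weekday guard; it can be sanity-checked by hand (a sketch of the same test outside cron syntax):

```shell
# The shipped entry runs on days 8-14 of the month at 00:24 and keeps only
# Sunday via the guard [ $(date +\%w) -eq 0 ]  (%w prints 0-6, 0 = Sunday):
dom=$(date +%d)
dow=$(date +%w)
if [ "$dow" -eq 0 ] && [ "$dom" -ge 8 ] && [ "$dom" -le 14 ]; then
    echo "today is the second Sunday: the scrub would fire at 00:24"
else
    echo "not the second Sunday"
fi
```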

The IO errors on the block devices look definitely suspicious, though.
RAID card: LSI SAS2008-IT

How did you configure the RAID card? Is it in HBA/pass-through mode?
 
I didn't know about the monthly scrub. Given the output, it seems like it is not occurring like it should, since the last one happened in November 2019 :(

I will check the cronjob, though.


Yes! It's in HBA pass-through mode.
 
Given the output, it seems like it is not occurring like it should, since the last one happened in November 2019

Oh, true. Can you see if starting a manual scrub works (maybe start it outside office hours, if possible):
zpool scrub rpool
 
I found the cronjob; it does seem to match the timing of my IO errors.
I ran a scrub manually and it didn't find any errors: "errors: No known data errors"

root@pve:~# cat /etc/cron.d/zfsutils-linux
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Scrub the second Sunday of every month.
24 0 8-14 * * root [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ] && /usr/lib/zfs-linux/scrub
root@pve:~#

This is the default configuration of the cron job; does it seem correct to you?
It looks like the cron job has been executed since the system hang, but the zpool status doesn't get updated with the date of the last scrub.

Do you know how to make sure the drives are not faulty (based on the IO error sector flags)?

Otherwise I think I will increase the scrubbing frequency to see if the pattern repeats.

Thanks for your quick support.
Have a good night.

Sebastien
 
This is the default configuration of the cron job; does it seem correct to you?

yes.

It looks like the cron job has been executed since the system hang, but the zpool status doesn't get updated with the date of the last scrub.
Strange, it really should get updated after the scrub finishes.

Do you know how to make sure the drives are not faulty then (based on the IO error sector flags)

A bit hard to tell. The IO errors indicate that not everything is alright, but ZFS reporting no errors shows that there wasn't any bit rot going on.
It could also be "just" a driver/software issue, but that's rather unlikely. What drive models are in there?
If the SMART data of the disks is OK, I'd continue to monitor the situation and the scrubs. Maybe see that you have a spare drive ready (you could even insert it already and tell ZFS that it is a spare).
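Attaching a hot spare is a one-liner (a sketch; the by-id path here is a placeholder for whatever the new disk enumerates as):

```shell
# Placeholder device id; substitute the real /dev/disk/by-id path of the spare.
zpool add rpool spare /dev/disk/by-id/scsi-3EXAMPLE0000000000
zpool status rpool    # the disk now appears under a "spares" section
# If a data disk later faults, swap it out for the spare:
# zpool replace rpool scsi-35000c5000c6c5c87-part3 scsi-3EXAMPLE0000000000
```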
 
I will do some testing with the cronjob to make sure it reports like it's supposed to, increasing the scrubbing frequency for the testing phase.

The disks are 4x Seagate ST9146802SS.

Maybe I will reinstall the PVE OS on a mirrored ext4 RAID on 2 new drives and keep the storage in the ZFS pool.
Let me know if you think the setup would be more resilient to failure this way.
Thanks again for your support.
 
Hi,
After some testing, I switched the scrubbing job to a local crontab ("crontab -e"):
24 00 * * 0 /sbin/zpool scrub rpool

Thanks for your support.
 
The system hung again a couple of days ago.
From the output below, I will change 1 disk (underlined).
I also ordered 2 SSDs to migrate the root filesystem onto (ZFS RAID1).

root@pve:/var/log# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 4 days 03:37:27 with 0 errors on Sun Feb 16 00:30:28 2020
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            scsi-35000c5000c6c5c87-part3  ONLINE       0     0     0
            scsi-35000c5000ad66403-part3  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            scsi-35000c5000eeb5dcb        ONLINE       0     0     0
            scsi-35000c5000ad664b3        ONLINE       0     0     0

errors: No known data errors

root@pve:/etc# cat /var/log/kern.log.1
Feb 3 09:28:31 pve kernel: [1887098.503308] perf: interrupt took too long (17941 > 17835), lowering kernel.perf_event_max_sample_rate to 11000
Feb 4 15:51:17 pve kernel: [1996460.343190] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 4 15:51:17 pve kernel: [1996460.343206] sd 2:0:0:0: [sda] tag#1247 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 4 15:51:17 pve kernel: [1996460.343211] sd 2:0:0:0: [sda] tag#1247 CDB: Write(10) 2a 00 00 f3 00 78 00 01 30 00
Feb 4 15:51:17 pve kernel: [1996460.343213] print_req_error: I/O error, dev sda, sector 15925368 flags 701
Feb 4 15:51:17 pve kernel: [1996460.343258] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=7615868928 size=155648 flags=40080c80
....
Feb 6 03:43:02 pve kernel: [2125558.892553] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 6 03:43:02 pve kernel: [2125558.892570] sd 2:0:0:0: [sda] tag#2887 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 6 03:43:02 pve kernel: [2125558.892575] sd 2:0:0:0: [sda] tag#2887 CDB: Write(10) 2a 00 08 49 8f 98 00 00 40 00
Feb 6 03:43:02 pve kernel: [2125558.892577] print_req_error: I/O error, dev sda, sector 139038616 flags 701
Feb 6 03:43:02 pve kernel: [2125558.892623] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=70649851904 size=32768 flags=40080c80
Feb 7 09:55:06 pve kernel: [2234278.018031] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 7 09:55:06 pve kernel: [2234278.018079] sd 2:0:0:0: [sda] tag#623 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 09:55:06 pve kernel: [2234278.018087] sd 2:0:0:0: [sda] tag#623 CDB: Write(10) 2a 00 0a 9c 59 00 00 00 30 00
Feb 7 09:55:06 pve kernel: [2234278.018089] print_req_error: I/O error, dev sda, sector 178018560 flags 701
Feb 7 09:55:06 pve kernel: [2234278.018137] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=90607583232 size=24576 flags=40080c80
Feb 7 11:28:01 pve kernel: [2239853.255228] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 7 11:28:01 pve kernel: [2239853.255288] sd 2:0:0:0: [sda] tag#727 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 7 11:28:01 pve kernel: [2239853.255337] sd 2:0:0:0: [sda] tag#727 CDB: Write(10) 2a 00 02 09 d2 40 00 01 00 00
Feb 7 11:28:01 pve kernel: [2239853.255356] print_req_error: I/O error, dev sda, sector 34198080 flags 701
Feb 7 11:28:01 pve kernel: [2239853.255391] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=16971497472 size=131072 flags=40080c80
.....
Feb 16 18:20:17 pve kernel: [466357.104385] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 16 18:20:17 pve kernel: [466357.104403] sd 2:0:0:0: [sda] tag#685 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 16 18:20:17 pve kernel: [466357.104408] sd 2:0:0:0: [sda] tag#685 CDB: Write(10) 2a 00 04 31 c3 68 00 00 40 00
Feb 16 18:20:17 pve kernel: [466357.104410] print_req_error: I/O error, dev sda, sector 70370152 flags 701
Feb 16 18:20:17 pve kernel: [466357.104456] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=35491598336 size=32768 flags=180880
Feb 18 03:02:38 pve kernel: [584092.433663] mpt2sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Feb 18 03:02:38 pve kernel: [584092.433726] sd 2:0:0:0: [sda] tag#1534 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Feb 18 03:02:38 pve kernel: [584092.433731] sd 2:0:0:0: [sda] tag#1534 CDB: Write(10) 2a 00 03 a6 7d 88 00 02 80 00
Feb 18 03:02:38 pve kernel: [584092.433734] print_req_error: I/O error, dev sda, sector 61242760 flags 701
Feb 18 03:02:38 pve kernel: [584092.433781] zio pool=rpool vdev=/dev/disk/by-id/scsi-35000c5000c6c5c87-part3 error=5 type=2 offset=30818373632 size=327680 flags=40080c80
 
The system hung again a couple of days ago.

Hey, did you ever track down the issue you were experiencing?

I am seeing very similar warnings for an 8TB Ultrastar on a Proxmox node, not running ZFS, but *is* under heavy write load (backfilling a Ceph OSD):
--> Bold added to the line that brought me here.

Code:
[152363.867593] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.884353] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.900328] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.916330] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.932329] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.952355] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152363.952361] scsi_io_completion_action: 16 callbacks suppressed
[152363.952369] sd 4:0:2:0: [sdd] tag#203 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[152363.952373] sd 4:0:2:0: [sdd] tag#203 CDB: Write(10) 2a 00 08 9a 02 10 00 00 0c 00
[152363.952375] print_req_error: 16 callbacks suppressed
[152363.952377] blk_update_request: I/O error, dev sdd, sector 1154486400 op 0x1:(WRITE) flags 0x8800 phys_seg 6 prio class 0
[152364.328617] sd 4:0:2:0: device_block, handle(0x000a)
[152365.329011] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152365.329272] sd 4:0:2:0: Power-on or device reset occurred
[152368.233548] libceph: osd27 down
[152372.892254] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.892262] sd 4:0:2:0: [sdd] tag#204 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[152372.892268] sd 4:0:2:0: [sdd] tag#204 CDB: Read(10) 28 00 74 70 24 f0 00 00 01 00
[152372.892271] blk_update_request: I/O error, dev sdd, sector 15628052352 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[152372.902001] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.920227] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.936209] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.952213] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.972239] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.992217] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152372.992231] sd 4:0:2:0: [sdd] tag#213 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[152372.992235] sd 4:0:2:0: [sdd] tag#213 CDB: Read(10) 28 00 74 70 24 f0 00 00 01 00
[152372.992238] blk_update_request: I/O error, dev sdd, sector 15628052352 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[152373.001420] Buffer I/O error on dev dm-4, logical block 1953506288, async page read
[152373.017279] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.017290] sd 4:0:2:0: [sdd] tag#214 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[152373.017294] sd 4:0:2:0: [sdd] tag#214 CDB: Read(10) 28 00 74 70 24 f0 00 00 01 00
[152373.017297] blk_update_request: I/O error, dev sdd, sector 15628052352 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[152373.027186] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.040215] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.056223] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.072222] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.088238] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.104226] mpt2sas_cm0: log_info(0x31120100): originator(PL), code(0x12), sub_code(0x0100)
[152373.104240] sd 4:0:2:0: [sdd] tag#234 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[152373.104244] sd 4:0:2:0: [sdd] tag#234 CDB: Read(10) 28 00 74 70 24 f0 00 00 01 00
[152373.104247] blk_update_request: I/O error, dev sdd, sector 15628052352 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[152373.113489] Buffer I/O error on dev dm-4, logical block 1953506288, async page read
[152373.578540] sd 4:0:2:0: device_block, handle(0x000a)
[152374.328930] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152374.714673] sd 4:0:2:0: Power-on or device reset occurred
[152374.728151] sd 4:0:2:0: [sdd] tag#229 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[152374.728155] sd 4:0:2:0: [sdd] tag#229 Sense Key : Not Ready [current] [descriptor]
[152374.728158] sd 4:0:2:0: [sdd] tag#229 Add. Sense: Logical unit not ready, notify (enable spinup) required
[152374.728161] sd 4:0:2:0: [sdd] tag#229 CDB: Read(10) 28 00 00 00 01 00 00 00 04 00
[152374.728164] blk_update_request: I/O error, dev sdd, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[152374.737988] sd 4:0:2:0: [sdd] tag#232 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[152374.737991] sd 4:0:2:0: [sdd] tag#232 Sense Key : Not Ready [current] [descriptor]
[152374.737993] sd 4:0:2:0: [sdd] tag#232 Add. Sense: Logical unit not ready, notify (enable spinup) required
[152374.737996] sd 4:0:2:0: [sdd] tag#232 CDB: Read(10) 28 00 00 00 01 00 00 00 01 00
[152374.737998] blk_update_request: I/O error, dev sdd, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[152374.747187] Buffer I/O error on dev dm-4, logical block 0, async page read
[152383.828453] sd 4:0:2:0: device_block, handle(0x000a)
[152384.578866] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152384.947208] sd 4:0:2:0: Power-on or device reset occurred
[152384.959994] sd 4:0:2:0: [sdd] tag#749 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[152384.959997] sd 4:0:2:0: [sdd] tag#749 Sense Key : Not Ready [current] [descriptor]
[152384.959999] sd 4:0:2:0: [sdd] tag#749 Add. Sense: Logical unit not ready, notify (enable spinup) required
[152384.960002] sd 4:0:2:0: [sdd] tag#749 CDB: Read(10) 28 00 00 00 01 00 00 00 04 00
[152384.960005] blk_update_request: I/O error, dev sdd, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[152384.969573] sd 4:0:2:0: [sdd] tag#268 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[152384.969575] sd 4:0:2:0: [sdd] tag#268 Sense Key : Not Ready [current] [descriptor]
[152384.969578] sd 4:0:2:0: [sdd] tag#268 Add. Sense: Logical unit not ready, notify (enable spinup) required
[152384.969580] sd 4:0:2:0: [sdd] tag#268 CDB: Read(10) 28 00 00 00 01 00 00 00 01 00
[152384.969583] blk_update_request: I/O error, dev sdd, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[152384.978570] Buffer I/O error on dev dm-4, logical block 0, async page read
[152393.828368] sd 4:0:2:0: device_block, handle(0x000a)
[152394.578781] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152397.578328] sd 4:0:2:0: device_block, handle(0x000a)
[152398.328717] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152408.078251] sd 4:0:2:0: device_block, handle(0x000a)
[152409.078637] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152418.078154] sd 4:0:2:0: device_block, handle(0x000a)
[152418.828555] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152421.578117] sd 4:0:2:0: device_block, handle(0x000a)
[152423.077757] sd 4:0:2:0: device_unblock and setting to running, handle(0x000a)
[152423.079199] sd 4:0:2:0: [sdd] Synchronizing SCSI cache
[152423.079251] sd 4:0:2:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[152423.111935] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5000cca25414d969)
[152423.111937] mpt2sas_cm0: removing handle(0x000a), sas_addr(0x5000cca25414d969)
[152423.111939] mpt2sas_cm0: enclosure logical id(0x500605b002c8c85a), slot(0)


I am also running an LSI RAID card in HBA mode, and mpt2sas/mpt3sas appears to be involved, as was the case for the OP.

Based on this:
https://github.com/torvalds/linux/b...5230c/drivers/scsi/libsas/sas_scsi_host.c#L63

I suspect the SAS command queue is reaching a limit, but I do not know how to prevent reaching it, or how to increase it.
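What is visible at least is the per-device queue depth in sysfs (a sketch; `sdd` matches the device from my logs above, and whether lowering the depth actually avoids these resets is a guess on my part):

```shell
# Current number of commands the SCSI layer will keep outstanding for the device
cat /sys/block/sdd/device/queue_depth
# Lowering it reduces pressure on the controller's command queue
# (root required; persisting across reboots needs a udev rule)
echo 64 > /sys/block/sdd/device/queue_depth
```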
 
