Hi,
Always after some days/weeks my pools encounter IO errors and I'm not sure whats causing them. It looks like heavy drive activity is causing it, because most of the time this happens if I try to scrub a pool or backup/snapshot VMs.
All the drives are connected directly to my mainboards onboard SATA controller (Intel C612 chipset). The board is a Supermicro X10SRM-F with 6 SATA-Ports + 4 SATA over one sff-8087. All SATA ports are set in BIOS to work in AHCI Mode.
I've got 2 zfs pools and 1 mdraid array. The first pool is called "VMpool7" and is a raidz1 pool consisting of four Intel S3710 and one S3700 SSDs and is used to store my VMs. The second pool is called "VMpool4" and is just a single Seagate ST4000DM004 HDD used as state storage to store the RAM dumps when I snapshot.
Proxmox was installed to two Intel S3700 SSDs I mirrored with mdraid and ontop of that LUKS and LVM.
The two proxmox SSDs are connected to the sff-8087 port, all other 6 drives to the normal 6 SATA ports.
Yesterday it happed again and VMpool7 got an write error. VMpool4 was fine but I as soon as I started a scrub of VMpool7 my VMpool4 also got IO errors.
I looked at the syslog and found these lines:
"sdg" was the SSD with a write error. "sdi" and "sdh" are the same HDD. Sometimes that HDD is switching between sdh and sdi if it encounters IO errors.
All temperatures on the board are fine (no sensor ever over 44 degree C). Power also shouldn'T be a problem (900W PSU).
These are the SMART values for sdg:
And for the HDD (sdh):
Yesterday when the IO erros occured max RAM usage was 82% and max CPU usage was 74%. Averaged IO delay is most of the time around 0.03% and yesterday maximum was 3.6%.
All 3 PCIe Slots are populated (passthrough Intel I350-T4, passthrough Geforce GT710 and a Intel NIC EXPI9301CTBLK) but directly connected to the CPU.
BIOS and SSD firmware are flashed to the latest version.
Any idea what could cause these IO errors?
Maybe some bad BIOS options or something? Feel free to ask if more infos like screenshots of the BIOS or logs are needed.
Always after some days/weeks my pools encounter IO errors and I'm not sure whats causing them. It looks like heavy drive activity is causing it, because most of the time this happens if I try to scrub a pool or backup/snapshot VMs.
All the drives are connected directly to my mainboards onboard SATA controller (Intel C612 chipset). The board is a Supermicro X10SRM-F with 6 SATA-Ports + 4 SATA over one sff-8087. All SATA ports are set in BIOS to work in AHCI Mode.
I've got 2 zfs pools and 1 mdraid array. The first pool is called "VMpool7" and is a raidz1 pool consisting of four Intel S3710 and one S3700 SSDs and is used to store my VMs. The second pool is called "VMpool4" and is just a single Seagate ST4000DM004 HDD used as state storage to store the RAM dumps when I snapshot.
Proxmox was installed to two Intel S3700 SSDs I mirrored with mdraid and ontop of that LUKS and LVM.
The two proxmox SSDs are connected to the sff-8087 port, all other 6 drives to the normal 6 SATA ports.
Yesterday it happed again and VMpool7 got an write error. VMpool4 was fine but I as soon as I started a scrub of VMpool7 my VMpool4 also got IO errors.
zpool status
returned this:
Code:
pool: VMpool4
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://zfsonlinux.org/msg/ZFS-8000-HC
scan: scrub repaired 0B in 0 days 00:51:29 with 0 errors on Sun Dec 13 18:26:03 2020
config:
NAME STATE READ WRITE CKSUM
VMpool4 ONLINE 0 0 0
ata-ST4000DM004-2CV104_WFN0N27Z ONLINE 4 0 0
errors: List of errors unavailable: pool I/O is currently suspended
errors: 8 data errors, use '-v' for a list
pool: VMpool7
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: scrub in progress since Tue Dec 15 22:59:16 2020
529G scanned at 753M/s, 378G issued at 539M/s, 529G total
0B repaired, 71.53% done, 0 days 00:04:46 to go
config:
NAME STATE READ WRITE CKSUM
VMpool7 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-INTEL_SSDSC2BA200G4_BTHV636208L0200MGN ONLINE 0 0 0
ata-INTEL_SSDSC2BA200G4_BTHV511505AE200MGN ONLINE 0 0 0
ata-INTEL_SSDSC2BA200G4_BTHV636208K6200MGN ONLINE 0 0 0
ata-INTEL_SSDSC2BA200G3_BTTV334505YT200GGN ONLINE 0 0 0
ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN ONLINE 0 1 0
errors: No known data errors
I looked at the syslog and found these lines:
Code:
...
Dec 15 22:04:32 Hypervisor kernel: [272610.650340] ata9.00: exception Emask 0x10 SAct 0x200000 SErr 0x4090000 action 0xe frozen
Dec 15 22:04:32 Hypervisor kernel: [272610.650407] ata9.00: irq_stat 0x00400040, connection status changed
Dec 15 22:04:32 Hypervisor kernel: [272610.650449] ata9: SError: { PHYRdyChg 10B8B DevExch }
Dec 15 22:04:32 Hypervisor kernel: [272610.650486] ata9.00: failed command: WRITE FPDMA QUEUED
Dec 15 22:04:32 Hypervisor kernel: [272610.650584] ata9.00: cmd 61/08:a8:60:f0:db/00:00:0c:00:00/40 tag 21 ncq dma 4096 out
Dec 15 22:04:32 Hypervisor kernel: [272610.650584] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Dec 15 22:04:32 Hypervisor kernel: [272610.650683] ata9.00: status: { DRDY }
Dec 15 22:04:32 Hypervisor kernel: [272610.650713] ata9: hard resetting link
Dec 15 22:04:32 Hypervisor kernel: [272611.306336] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:38 Hypervisor kernel: [272616.768637] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272621.834424] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272621.835252] ata9.00: configured for UDMA/133
Dec 15 22:04:43 Hypervisor kernel: [272621.835277] sd 8:0:0:0: [sdg] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 15 22:04:43 Hypervisor kernel: [272621.835282] sd 8:0:0:0: [sdg] tag#21 Sense Key : Illegal Request [current]
Dec 15 22:04:43 Hypervisor kernel: [272621.835288] sd 8:0:0:0: [sdg] tag#21 Add. Sense: Unaligned write command
Dec 15 22:04:43 Hypervisor kernel: [272621.835294] sd 8:0:0:0: [sdg] tag#21 CDB: Write(10) 2a 00 0c db f0 60 00 00 08 00
Dec 15 22:04:43 Hypervisor kernel: [272621.835300] blk_update_request: I/O error, dev sdg, sector 215740512 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Dec 15 22:04:43 Hypervisor kernel: [272621.835385] zio pool=VMpool7 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN-part1 error=5 type=2 offset=110458093568 size=4096 flags=180880
Dec 15 22:04:43 Hypervisor kernel: [272621.835403] ata9: EH complete
Dec 15 22:04:43 Hypervisor kernel: [272621.836327] ata9.00: Enabling discard_zeroes_data
Dec 15 22:04:43 Hypervisor kernel: [272622.152844] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272622.152854] ata10.00: disabled
Dec 15 22:04:43 Hypervisor kernel: [272622.152887] ata10.00: detaching (SCSI 9:0:0:0)
Dec 15 22:04:43 Hypervisor kernel: [272622.158464] sd 9:0:0:0: [sdi] Synchronizing SCSI cache
Dec 15 22:04:43 Hypervisor kernel: [272622.158524] sd 9:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 15 22:04:43 Hypervisor kernel: [272622.158526] sd 9:0:0:0: [sdi] Stopping disk
Dec 15 22:04:43 Hypervisor kernel: [272622.158538] sd 9:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 15 22:04:44 Hypervisor zed: eid=191 class=io pool_guid=0x7C02A172364ADEC5 vdev_path=/dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN-part1
...
Dec 15 22:05:04 Hypervisor kernel: [272643.530543] ata10: link is slow to respond, please be patient (ready=0)
Dec 15 22:05:07 Hypervisor kernel: [272645.738569] ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 15 22:05:07 Hypervisor kernel: [272645.776103] ata10.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
Dec 15 22:05:07 Hypervisor kernel: [272645.776109] ata10.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 32), AA
Dec 15 22:05:07 Hypervisor kernel: [272645.836101] ata10.00: configured for UDMA/133
Dec 15 22:05:07 Hypervisor kernel: [272645.836392] scsi 9:0:0:0: Direct-Access ATA ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5
Dec 15 22:05:07 Hypervisor kernel: [272645.836868] sd 9:0:0:0: Attached scsi generic sg7 type 0
Dec 15 22:05:07 Hypervisor kernel: [272645.836966] sd 9:0:0:0: [sdh] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
Dec 15 22:05:07 Hypervisor kernel: [272645.836972] sd 9:0:0:0: [sdh] 4096-byte physical blocks
Dec 15 22:05:07 Hypervisor kernel: [272645.837038] sd 9:0:0:0: [sdh] Write Protect is off
Dec 15 22:05:07 Hypervisor kernel: [272645.837045] sd 9:0:0:0: [sdh] Mode Sense: 00 3a 00 00
Dec 15 22:05:07 Hypervisor kernel: [272645.837096] sd 9:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 15 22:05:07 Hypervisor kernel: [272645.932075] sdh: sdh1 sdh9
Dec 15 22:05:07 Hypervisor kernel: [272645.933329] sd 9:0:0:0: [sdh] Attached SCSI disk
...
Dec 15 22:26:13 Hypervisor smartd[3575]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 69 to 70
Dec 15 22:26:13 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], reconnected ATA device
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 82
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 99 to 98
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 82
...
"sdg" was the SSD with a write error. "sdi" and "sdh" are the same HDD. Sometimes that HDD is switching between sdh and sdi if it encounters IO errors.
All temperatures on the board are fine (no sensor ever over 44 degree C). Power also shouldn'T be a problem (900W PSU).
These are the SMART values for sdg:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9931
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 289
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 115
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 5030 (32 794)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Temperature_Case 0x0022 072 071 000 Old_age Always - 28 (Min/Max 28/33)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 115
194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 28
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 316612
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 471
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 30
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 594951
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
234 Thermal_Throttle 0x0032 100 100 000 Old_age Always - 0/0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 316612
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 144459
243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 989394
And for the HDD (sdh):
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 077 054 006 Pre-fail Always - 46227264
3 Spin_Up_Time 0x0003 098 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 098 098 020 Old_age Always - 2459
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 085 060 045 Pre-fail Always - 293778150
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 8368 (29 163 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 807
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2 2 3
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 058 040 Old_age Always - 31 (Min/Max 30/34)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 148
193 Load_Cycle_Count 0x0032 095 095 000 Old_age Always - 10722
194 Temperature_Celsius 0x0022 031 042 000 Old_age Always - 31 (0 16 0 0 0)
195 Hardware_ECC_Recovered 0x001a 077 064 000 Old_age Always - 46227264
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 6912h+02m+48.217s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 63013893475
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 102768225659
Yesterday when the IO erros occured max RAM usage was 82% and max CPU usage was 74%. Averaged IO delay is most of the time around 0.03% and yesterday maximum was 3.6%.
All 3 PCIe Slots are populated (passthrough Intel I350-T4, passthrough Geforce GT710 and a Intel NIC EXPI9301CTBLK) but directly connected to the CPU.
BIOS and SSD firmware are flashed to the latest version.
Any idea what could cause these IO errors?
Maybe some bad BIOS options or something? Feel free to ask if more infos like screenshots of the BIOS or logs are needed.
Last edited: