ZFS IO errors

Dunuin

Hi,

Every few days or weeks my pools encounter IO errors and I'm not sure what's causing them. Heavy drive activity looks like the trigger, because most of the time this happens when I scrub a pool or backup/snapshot VMs.

All the drives are connected directly to my mainboard's onboard SATA controller (Intel C612 chipset). The board is a Supermicro X10SRM-F with 6 SATA ports plus 4 SATA ports over one SFF-8087 connector. All SATA ports are set to AHCI mode in the BIOS.
I've got 2 ZFS pools and 1 mdraid array. The first pool, "VMpool7", is a raidz1 pool consisting of four Intel S3710 SSDs and one S3700 SSD and stores my VMs. The second pool, "VMpool4", is just a single Seagate ST4000DM004 HDD used as state storage for the RAM dumps written when I snapshot.
Proxmox was installed to two Intel S3700 SSDs mirrored with mdraid, with LUKS and LVM on top of that.
The two Proxmox SSDs are connected to the SFF-8087 port, the other 6 drives to the normal 6 SATA ports.

Yesterday it happened again and VMpool7 got a write error. VMpool4 was fine, but as soon as I started a scrub of VMpool7, VMpool4 also got IO errors.
zpool status returned this:
Code:
  pool: VMpool4
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub repaired 0B in 0 days 00:51:29 with 0 errors on Sun Dec 13 18:26:03 2020
config:

        NAME                               STATE     READ WRITE CKSUM
        VMpool4                            ONLINE       0     0     0
          ata-ST4000DM004-2CV104_WFN0N27Z  ONLINE       4     0     0
errors: List of errors unavailable: pool I/O is currently suspended

errors: 8 data errors, use '-v' for a list

  pool: VMpool7
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Tue Dec 15 22:59:16 2020
        529G scanned at 753M/s, 378G issued at 539M/s, 529G total
        0B repaired, 71.53% done, 0 days 00:04:46 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        VMpool7                                         ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-INTEL_SSDSC2BA200G4_BTHV636208L0200MGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2BA200G4_BTHV511505AE200MGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2BA200G4_BTHV636208K6200MGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2BA200G3_BTTV334505YT200GGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN  ONLINE       0     1     0
errors: No known data errors

I looked at the syslog and found these lines:
Code:
...
Dec 15 22:04:32 Hypervisor kernel: [272610.650340] ata9.00: exception Emask 0x10 SAct 0x200000 SErr 0x4090000 action 0xe frozen
Dec 15 22:04:32 Hypervisor kernel: [272610.650407] ata9.00: irq_stat 0x00400040, connection status changed
Dec 15 22:04:32 Hypervisor kernel: [272610.650449] ata9: SError: { PHYRdyChg 10B8B DevExch }
Dec 15 22:04:32 Hypervisor kernel: [272610.650486] ata9.00: failed command: WRITE FPDMA QUEUED
Dec 15 22:04:32 Hypervisor kernel: [272610.650584] ata9.00: cmd 61/08:a8:60:f0:db/00:00:0c:00:00/40 tag 21 ncq dma 4096 out
Dec 15 22:04:32 Hypervisor kernel: [272610.650584]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Dec 15 22:04:32 Hypervisor kernel: [272610.650683] ata9.00: status: { DRDY }
Dec 15 22:04:32 Hypervisor kernel: [272610.650713] ata9: hard resetting link
Dec 15 22:04:32 Hypervisor kernel: [272611.306336] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:38 Hypervisor kernel: [272616.768637] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272621.834424] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272621.835252] ata9.00: configured for UDMA/133
Dec 15 22:04:43 Hypervisor kernel: [272621.835277] sd 8:0:0:0: [sdg] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 15 22:04:43 Hypervisor kernel: [272621.835282] sd 8:0:0:0: [sdg] tag#21 Sense Key : Illegal Request [current]
Dec 15 22:04:43 Hypervisor kernel: [272621.835288] sd 8:0:0:0: [sdg] tag#21 Add. Sense: Unaligned write command
Dec 15 22:04:43 Hypervisor kernel: [272621.835294] sd 8:0:0:0: [sdg] tag#21 CDB: Write(10) 2a 00 0c db f0 60 00 00 08 00
Dec 15 22:04:43 Hypervisor kernel: [272621.835300] blk_update_request: I/O error, dev sdg, sector 215740512 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Dec 15 22:04:43 Hypervisor kernel: [272621.835385] zio pool=VMpool7 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN-part1 error=5 type=2 offset=110458093568 size=4096 flags=180880
Dec 15 22:04:43 Hypervisor kernel: [272621.835403] ata9: EH complete
Dec 15 22:04:43 Hypervisor kernel: [272621.836327] ata9.00: Enabling discard_zeroes_data
Dec 15 22:04:43 Hypervisor kernel: [272622.152844] ata10: SATA link down (SStatus 0 SControl 300)
Dec 15 22:04:43 Hypervisor kernel: [272622.152854] ata10.00: disabled
Dec 15 22:04:43 Hypervisor kernel: [272622.152887] ata10.00: detaching (SCSI 9:0:0:0)
Dec 15 22:04:43 Hypervisor kernel: [272622.158464] sd 9:0:0:0: [sdi] Synchronizing SCSI cache
Dec 15 22:04:43 Hypervisor kernel: [272622.158524] sd 9:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 15 22:04:43 Hypervisor kernel: [272622.158526] sd 9:0:0:0: [sdi] Stopping disk
Dec 15 22:04:43 Hypervisor kernel: [272622.158538] sd 9:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 15 22:04:44 Hypervisor zed: eid=191 class=io pool_guid=0x7C02A172364ADEC5 vdev_path=/dev/disk/by-id/ata-INTEL_SSDSC2BA200G4_BTHV737402JJ200MGN-part1       
...       
Dec 15 22:05:04 Hypervisor kernel: [272643.530543] ata10: link is slow to respond, please be patient (ready=0)
Dec 15 22:05:07 Hypervisor kernel: [272645.738569] ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 15 22:05:07 Hypervisor kernel: [272645.776103] ata10.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
Dec 15 22:05:07 Hypervisor kernel: [272645.776109] ata10.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 32), AA
Dec 15 22:05:07 Hypervisor kernel: [272645.836101] ata10.00: configured for UDMA/133
Dec 15 22:05:07 Hypervisor kernel: [272645.836392] scsi 9:0:0:0: Direct-Access     ATA      ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5
Dec 15 22:05:07 Hypervisor kernel: [272645.836868] sd 9:0:0:0: Attached scsi generic sg7 type 0
Dec 15 22:05:07 Hypervisor kernel: [272645.836966] sd 9:0:0:0: [sdh] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
Dec 15 22:05:07 Hypervisor kernel: [272645.836972] sd 9:0:0:0: [sdh] 4096-byte physical blocks
Dec 15 22:05:07 Hypervisor kernel: [272645.837038] sd 9:0:0:0: [sdh] Write Protect is off
Dec 15 22:05:07 Hypervisor kernel: [272645.837045] sd 9:0:0:0: [sdh] Mode Sense: 00 3a 00 00
Dec 15 22:05:07 Hypervisor kernel: [272645.837096] sd 9:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 15 22:05:07 Hypervisor kernel: [272645.932075]  sdh: sdh1 sdh9
Dec 15 22:05:07 Hypervisor kernel: [272645.933329] sd 9:0:0:0: [sdh] Attached SCSI disk
...       
Dec 15 22:26:13 Hypervisor smartd[3575]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 69 to 70
Dec 15 22:26:13 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], reconnected ATA device
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 82
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 99 to 98
Dec 15 22:26:18 Hypervisor smartd[3575]: Device: /dev/sdh [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 82
...

"sdg" was the SSD with a write error. "sdi" and "sdh" are the same HDD. Sometimes that HDD is switching between sdh and sdi if it encounters IO errors.

All temperatures on the board are fine (no sensor ever goes over 44 °C). Power also shouldn't be a problem (900 W PSU).

These are the SMART values for sdg:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9931
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       289
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       115
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       5030 (32 794)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0022   072   071   000    Old_age   Always       -       28 (Min/Max 28/33)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       115
194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       28
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       316612
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       471
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       30
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       594951
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       316612
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       144459
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       989394

And for the HDD (sdh):
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   054   006    Pre-fail  Always       -       46227264
  3 Spin_Up_Time            0x0003   098   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2459
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   045    Pre-fail  Always       -       293778150
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8368 (29 163 0)
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       807
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2 2 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   058   040    Old_age   Always       -       31 (Min/Max 30/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       148
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10722
194 Temperature_Celsius     0x0022   031   042   000    Old_age   Always       -       31 (0 16 0 0 0)
195 Hardware_ECC_Recovered  0x001a   077   064   000    Old_age   Always       -       46227264
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6912h+02m+48.217s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       63013893475
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       102768225659
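
Both tables come from smartmontools; assuming the drives are still sdg and sdh, this should reproduce them:
Code:
# print the SMART attribute tables for the SSD and the HDD
smartctl -A /dev/sdg
smartctl -A /dev/sdh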

Yesterday, when the IO errors occurred, max RAM usage was 82% and max CPU usage was 74%. Average IO delay is around 0.03% most of the time; yesterday's maximum was 3.6%.
All 3 PCIe slots are populated (a passed-through Intel I350-T4, a passed-through GeForce GT710 and an Intel EXPI9301CTBLK NIC), but they are connected directly to the CPU.

BIOS and SSD firmware are flashed to the latest version.

Any idea what could cause these IO errors?
Maybe some bad BIOS options or something? Feel free to ask if more info such as BIOS screenshots or logs is needed.
 
Please post the output of pveversion -v.
Are all 6 disks connected to the onboard controller? Try updating the firmware.
 
Code:
root@Hypervisor:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

All 5 SSDs and the single HDD of the two ZFS pools are connected directly to the onboard SATA ports (of the C612 chipset) running in AHCI mode.
BIOS and SSD firmware should be the latest versions; I updated them just a few months ago.
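
To double-check, the firmware version the SSDs are actually running can be read with smartctl (device name assumed from the logs above):
Code:
# show model and firmware revision of the affected SSD
smartctl -i /dev/sdg | grep -iE 'model|firmware'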
 
Hmm, so this happens reliably under load? Does it affect all drives or just a subset?

Possible reasons that could be checked:
  • faulty cables
  • faulty controller
  • controller getting too warm
 
Also, note that the Seagate BarraCuda ST4000DM004 is not really what one would call a reliable HDD. Cannot comment on the SSDs...
 
I already changed a SATA cable once but that didn't fix it. The controller should be part of the chipset, and that never exceeds 44 °C. Checking the controller isn't that easy because there is no free PCIe slot and I don't have a spare HBA lying around.

I'm not sure if it affects all the drives. My monitoring only covers the last 2 months, and in these 2 months it looks like the HDD had 4 incidents and only one of the SSDs had 3.

I will try to move both of these drives to the SFF-8087 port, where two SATA cables are free.

Also, note that the Seagate BarraCuda ST4000DM004 is not really what one would call a reliable HDD. Cannot comment on the SSDs...
Yeah, that drive is really crap. I will never buy an SMR drive again. I've seen average response times of over 30 seconds, and judging by the reviews that isn't unusual.
 
I will never buy an SMR drive again
Oh, I did not realize this was an SMR drive. This could very well be the cause of the ZFS problems. I don't have personal experience, but earlier this year WD made quite some news with the hidden SMR drives in their RED line, and apparently one of the problems with them is that the drives can be so slow to report back that the kernel considers them lost and resets the SATA port. From what I remember, this happened quite regularly with the ZFS access pattern compared to an MD raid.
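
One way to test that theory would be to give the drive more time before the kernel gives up on it; a rough sketch, assuming the disk is currently sdh (the default SCSI command timeout is 30 seconds):
Code:
# current command timeout in seconds
cat /sys/block/sdh/device/timeout
# raise it before the next scrub/snapshot, e.g. to 120 seconds (not persistent across reboots)
echo 120 > /sys/block/sdh/device/timeout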
 
Oh, I did not realize this was an SMR drive. This could very well be the cause of the ZFS problems. I don't have personal experience, but earlier this year WD made quite some news with the hidden SMR drives in their RED line, and apparently one of the problems with them is that the drives can be so slow to report back that the kernel considers them lost and resets the SATA port. From what I remember, this happened quite regularly with the ZFS access pattern compared to an MD raid.
I will test that. But the SSD got IO errors first, at around 9 to 10 PM; the IO errors of the HDD only started a few minutes after I began scrubbing the SSD pool. Unless scrubbing somehow writes a lot of data to another pool, I don't see why the HDD should be slowed down.
That HDD is only used as state storage and the snapshot task only runs at 5 AM, so at that moment the HDD should have been idling.

I thought the SATA controller or PCIe bus hardware interrupts might be overloaded.
Is there maybe a way to prioritize the SATA controller on the Proxmox host?
In my Win10 VM I needed to switch the graphics card from line-based interrupts to message-signaled interrupts (MSI) and set the priority to medium, or I got no sound or crackling sound.
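
Not sure if it applies here, but as a first check I could look at whether the AHCI controller already uses MSI and how its interrupts are spread over the CPUs (assuming it shows up as "ahci"):
Code:
# show interrupt counts and the interrupt type (IO-APIC vs. PCI-MSI) of the SATA controller
grep -i ahci /proc/interrupts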
 
I will test that. But the SSD got IO errors first, at around 9 to 10 PM; the IO errors of the HDD only started a few minutes after I began scrubbing the SSD pool. Unless scrubbing somehow writes a lot of data to another pool, I don't see why the HDD should be slowed down.
That HDD is only used as state storage and the snapshot task only runs at 5 AM, so at that moment the HDD should have been idling.

I thought the SATA controller or PCIe bus hardware interrupts might be overloaded.
Is there maybe a way to prioritize the SATA controller on the Proxmox host?
In my Win10 VM I needed to switch the graphics card from line-based interrupts to message-signaled interrupts (MSI) and set the priority to medium, or I got no sound or crackling sound.
Maybe disable the HDD-backed storage in the storage.cfg, export the corresponding pool, and maybe even spin the disk down before trying the scrub on the other pool. Would that be an option, to check if it changes anything?
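
Roughly, and assuming the storage entry for that pool is also called "VMpool4", that could look like this:
Code:
# disable the HDD-backed storage so nothing tries to use it (storage ID assumed to match the pool name)
pvesm set VMpool4 --disable 1
# export the pool and spin the disk down
zpool export VMpool4
hdparm -y /dev/sdh
# then start the scrub on the SSD pool
zpool scrub VMpool7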
 
