I am observing some strange behaviour on one of my PVE systems: every week on Friday at around 02:00, one drive from the ZFS pool switches its state to "faulted". Last week at ~17:00 another drive of that pool also switched to "faulted". A reboot resolves the issue, and a ZFS scrub does not find any errors.
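For reference, this is roughly what I check after such an event (the pool name below is just a placeholder for mine):
Code:
# show pool state and the read/write/checksum error counters
zpool status -v <poolname>
# start a full scrub; it completes without finding any errors every time
zpool scrub <poolname>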
The system is based on:
AMD Ryzen 3700X
X470D4U - ASRock Rack
2x Kingston Server Premier DIMM 32GB, DDR4-3200
3x Samsung 860 EVO 1TB (connected via the onboard SATA ports)
be quiet! 550W Gold PSU
The system gets powered by an APC SMART UPS (load ~25%, one year old).
What I have done so far:
- Replacing every SATA cable (multiple times)
- Switching the onboard SATA ports
- Replacing the motherboard with a new X470D4U (yes, the same model, but from a completely different dealer and with a different production date)
- Updating the BIOS
- Replacing the power supply
- Replacing the "failed" disk with an 870 EVO 1TB (which just moved the issue to the new disk)
I have read that "some Ryzen boards may have issues with their onboard SATA controller", but I replaced the board, so that should be ruled out if the controller was the problem.
I have attached the syslogs of the two events as files.
Does anybody have an idea what PVE could be running on Friday night at ~02:00 that causes such failures?
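To narrow that down, I can enumerate the scheduled jobs on the host; something along these lines should cover the usual systemd and cron locations (standard Debian/PVE paths assumed):
Code:
# list all systemd timers with their last and next trigger times
systemctl list-timers --all
# system-wide crontab plus package-provided cron jobs
cat /etc/crontab
ls /etc/cron.d/ /etc/cron.weekly/
# root's personal crontab, if any
crontab -l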
Here are some of the e-mails I am getting from PVE:
Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
impact: Fault tolerance of the pool may be compromised.
eid: 11
class: statechange
state: FAULTED
host: srvpve1
time: 2021-09-24 02:13:13+0200
vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
vphys: pci-0000:03:00.1-ata-2.0
vguid: 0x9DD2445E8E38CE8F
devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
pool: 0x0B72B08061CEFD09

The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
impact: Fault tolerance of the pool may be compromised.
eid: 113
class: statechange
state: FAULTED
host: srvpve1
time: 2021-09-17 02:20:54+0200
vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
vphys: pci-0000:03:00.1-ata-2.0
vguid: 0x9DD2445E8E38CE8F
devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
pool: 0x0B72B08061CEFD09

The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
impact: Fault tolerance of the pool may be compromised.
eid: 239
class: statechange
state: FAULTED
host: srvpve1
time: 2021-09-17 16:52:53+0200
vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
vphys: pci-0000:03:00.1-ata-1.0
vguid: 0xED11E256B67612B4
devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
pool: 0x0B72B08061CEFD09
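In case it is useful, the vphys entries above can be mapped back to the controller, and the affected disk can be queried for its own error log; something like this (using the PCI address and device path from the e-mails) should work:
Code:
# identify the SATA controller the faulted disks hang off (address taken from vphys)
lspci -s 03:00.1
# SMART health, attributes and ATA error log of the affected disk
smartctl -a /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E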