ZFS pool faults every week on Friday at 02:00

slize26

Member
Aug 5, 2021
I am observing some weird behaviour on one of my PVE systems: every week on Friday at around 02:00, one drive from the ZFS pool switches its state to "faulted". Last week, at around 17:00, another drive of that pool also switched to "faulted". A reboot resolves the issue, and a ZFS scrub does not find any errors.

The system is based on:
AMD Ryzen 3700X
X470D4U - ASRock Rack
2x Kingston Server Premier DIMM 32GB, DDR4-3200
3x Samsung 860 EVO 1TB (connected via the onboard SATA ports)
be quiet! 550W Gold PSU

The system is powered by an APC Smart-UPS (load ~25%, one year old).

What I have done so far:
- Replacing every SATA cable (multiple times)
- Switching the onboard SATA ports
- Replacing the motherboard with a new X470D4U (yes, the same model, but from a completely different dealer and with a different production date)
- Updating the BIOS
- Replacing the power supply
- Replacing the "failed" disk with an 870 EVO 1TB (which just moved the issue to the new disk)

I have read that "some Ryzen boards may have issues with their onboard SATA controller" - but I replaced the board, so that should be fixed if there was an issue with the controller.

I attached the syslogs of the two events as a file.

Does anybody have an idea what PVE could be running on Friday night at around 02:00 that causes such failures?
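
For reference, this is roughly how I pull the log window around one of the events to see what was actually running at the time (just a sketch; adjust the window to the fault time reported in the ZED mails below):
Code:
# Everything logged around the fault window (kernel + services)
journalctl --since "2021-09-24 01:45" --until "2021-09-24 02:30" -o short-precise
# Kernel/ATA messages only, filtered for link resets and failures
journalctl -k --since "2021-09-24 01:45" --until "2021-09-24 02:30" | grep -iE 'ata[0-9]|sd[a-z]|reset|fail'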

Here are some of the e-mails I am getting from PVE:
Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 11
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-24 02:13:13+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
  vphys: pci-0000:03:00.1-ata-2.0
  vguid: 0x9DD2445E8E38CE8F
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
   pool: 0x0B72B08061CEFD09


The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 113
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-17 02:20:54+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
  vphys: pci-0000:03:00.1-ata-2.0
  vguid: 0x9DD2445E8E38CE8F
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
   pool: 0x0B72B08061CEFD09


The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 239
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-17 16:52:53+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
  vphys: pci-0000:03:00.1-ata-1.0
  vguid: 0xED11E256B67612B4
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
   pool: 0x0B72B08061CEFD09
 

Attachments

  • Syslog.txt (18.6 KB)
No clue what happens at 2 am on Friday.
Have you inspected the crontab? Anything in there?
Is it exactly this time or "around this time"?

Asking because I experience a strange issue as well. On my end, the system starts to behave weirdly on the 25th day of uptime. I have not found the source yet. My workaround is rebooting every 21-24 days; then I am not hit by the problem.

See
https://forum.proxmox.com/threads/s...on-node-even-if-no-vdisks-are-affected.70213/
 
Does systemctl list-timers show anything suspect?


Code:
root@srvpve1:~# systemctl list-timers
NEXT                         LEFT          LAST                         PASSED       UNIT                         ACTIVATES
Sun 2021-09-26 16:42:00 CEST 35s left      Sun 2021-09-26 16:41:00 CEST 23s ago      pvesr.timer                  pvesr.service
Sun 2021-09-26 18:41:00 CEST 1h 59min left Sat 2021-09-25 18:41:00 CEST 22h ago      systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2021-09-26 19:36:32 CEST 2h 55min left Sun 2021-09-26 10:49:41 CEST 5h 51min ago apt-daily.timer              apt-daily.service
Mon 2021-09-27 00:00:00 CEST 7h left       Sun 2021-09-26 00:00:00 CEST 16h ago      logrotate.timer              logrotate.service
Mon 2021-09-27 00:00:00 CEST 7h left       Sun 2021-09-26 00:00:00 CEST 16h ago      man-db.timer                 man-db.service
Mon 2021-09-27 00:49:47 CEST 8h left       Mon 2021-09-20 00:05:58 CEST 6 days ago   fstrim.timer                 fstrim.service
Mon 2021-09-27 03:10:31 CEST 10h left      Sun 2021-09-26 03:13:41 CEST 13h ago      pve-daily-update.timer       pve-daily-update.service
Mon 2021-09-27 06:11:50 CEST 13h left      Sun 2021-09-26 06:27:59 CEST 10h ago      apt-daily-upgrade.timer      apt-daily-upgrade.service
Sun 2021-10-03 03:10:28 CEST 6 days left   Sun 2021-09-26 03:10:59 CEST 13h ago      e2scrub_all.timer            e2scrub_all.service

9 timers listed.
Pass --all to see loaded but inactive timers, too.
root@srvpve1:~#

All the timers look fine to me, but maybe I am missing something.
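
To double-check, the actual schedules behind those timers can be dumped as well (a quick sketch using some of the unit names from the list above):
Code:
# Show the OnCalendar/OnBootSec/OnUnitActiveSec schedule of the listed timers
systemctl cat pvesr.timer fstrim.timer logrotate.timer apt-daily.timer pve-daily-update.timer | grep -E '^#|^On'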

The same with crontabs:
Code:
root@srvpve1:~# for user in $(cut -f1 -d: /etc/passwd); do echo $user; crontab -u $user -l; done
root
no crontab for root
daemon
no crontab for daemon
bin
no crontab for bin
sys
no crontab for sys
sync
no crontab for sync
games
no crontab for games
man
no crontab for man
lp
no crontab for lp
mail
no crontab for mail
news
no crontab for news
uucp
no crontab for uucp
proxy
no crontab for proxy
www-data
no crontab for www-data
backup
no crontab for backup
list
no crontab for list
irc
no crontab for irc
gnats
no crontab for gnats
nobody
no crontab for nobody
_apt
no crontab for _apt
_chrony
no crontab for _chrony
messagebus
no crontab for messagebus
_rpc
no crontab for _rpc
systemd-network
no crontab for systemd-network
systemd-resolve
no crontab for systemd-resolve
postfix
no crontab for postfix
tcpdump
no crontab for tcpdump
sshd
no crontab for sshd
statd
no crontab for statd
gluster
no crontab for gluster
ceph
no crontab for ceph
systemd-timesync
no crontab for systemd-timesync
systemd-coredump
no crontab for systemd-coredump
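
For completeness, per-user crontabs do not cover the system-wide entries, so those are worth a look too (a sketch, assuming the standard Debian cron layout that PVE uses):
Code:
# System-wide crontab plus the drop-in directories
cat /etc/crontab
ls -l /etc/cron.d/ /etc/cron.daily/ /etc/cron.weekly/ /etc/cron.monthly/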

@tburger it is around this time, not exactly. Thanks for the link to your issue; that seems similar to my faults. What disks, CPU and mainboard are you using?

Another note: the failed server is about 1 meter from the house's high-voltage connection. Could it be possible that the disks fail due to electromagnetic interference?
 
@slize26
Here you go. I don't think this really helps, but my gear is as follows:
- Opteron 6366HE 16C 1,8 GHz
- Supermicro H8SGL-F
- 128 GB Memory (8*16GB PC3-8500)
- 2x LSI SAS 9211-8i with SAS -> SATA Cables
- Radian RMS-200 NVMe storage
- Disks - happened with a few: WD Black and WD Blue. As of today the issue triggere

For me it started somewhere around the PVE 5 to PVE 6 release.

Your idea about electromagnetic shock/interference is an interesting one. However, I don't think it applies in my situation: since the issue keeps showing up with different disks, I think it can be ruled out. My UPS is right below the system, but it always has been.
I am running out of ideas - hence I am just rebooting every 24 days.

On your side: does the issue always affect the same connector? On my end it does - it is always port 4 on 9211-8i controller 1.

/edit: If I understood you correctly you are using flash storage - is that correct? I think this makes the electromagnetic interference even less likely...
 
From experience, weird hardware errors (that do not have any obvious reason) are often caused by faulty memory or an (undersized) PSU...
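
If memory is a suspect, the kernel's EDAC counters are a cheap first check on an ECC system before buying new sticks (a sketch; it assumes the EDAC modules are loaded, otherwise the sysfs path is simply absent):
Code:
# Corrected (ce_count) and uncorrected (ue_count) ECC error counters per memory controller
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null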
 
Yeah, have been there (faulty PSU)...

On my end it is ECC memory and a very powerful PSU.
I also added extra PSU capacity as I thought the 5V rail was too weak. Sadly, that didn't help at all :/
 
If it's all drives, it could be a controller issue, a DMI link issue, or a PSU issue in my opinion.

I would be surprised if RAM caused that.
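
The vphys entries in the ZED mails above already point at a single controller (pci-0000:03:00.1), so mapping that address back to the hardware is a quick sanity check (a sketch):
Code:
# Identify the controller behind the PCI address from the ZED mails
lspci -s 03:00.1
# List which disks are attached through that controller
ls -l /dev/disk/by-path/ | grep '0000:03:00.1'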
 
Okay, so not a current Ryzen CPU. Then there is no obvious correlation between our systems. :/

I did replace the PSU as well; it is a be quiet! Straight Power 11 Platinum 550W, which should be more than enough power for the system.

To replace the RAM I would have to buy two more sticks of the Kingston Server Premier DIMM 32GB, DDR4-3200, but I don't know if it's really worth it.

For me it's only one drive that fails, but it's always a different one.
 
Then there is no obvious correlation between our systems. :/
Except that both have AMD as the CPU vendor...

I did replace the PSU as well; it is a be quiet! Straight Power 11 Platinum 550W, which should be more than enough power for the system.
Don't be misled by the large numbers here!
You need to look closely at where the output lies.
I have had trouble with an aging PSU where, over time, the 5V rail went bad and all of a sudden strange things started happening. It turned out that my PSU was very powerful on the 12V rail, but because I am running 20 2.5" HDDs, all the power on the 5V rail was exhausted.
I had a 550W PSU as well. Now I am running a 430W PSU, which however has twice the power on the 5V rail.

Before you purchase new memory, I would definitely look into the specs and another PSU!
For me it's only one drive that fails, but it's always a different one.
That's the real difference here. For me it is always the same HBA port that is affected: controller 1, port 4. However, it shows up with different cables, HDDs and HBAs (when I replace the HBA with a spare), so I have no clue what causes this.
My spares are obviously running the same HW/FW, so I guess this is a strange edge case I am running into. Maybe it is even due to my mix of SSDs and HDDs, because that is nothing I can really change...
 
Except that both have AMD as the CPU vendor...
Yes, that's for sure.

Don't be misled by the large numbers here!

You need to look closely at where the output lies.
I have had trouble with an aging PSU where, over time, the 5V rail went bad and all of a sudden strange things started happening. It turned out that my PSU was very powerful on the 12V rail, but because I am running 20 2.5" HDDs, all the power on the 5V rail was exhausted.
I had a 550W PSU as well. Now I am running a 430W PSU, which however has twice the power on the 5V rail.

Before you purchase new memory, I would definitely look into the specs and another PSU!
The server is just running four 3.5" HDDs with 10TB each and three 1TB 2.5" SSDs, all connected with their own SATA connector from the PSU. I get your point, but that's more or less the same number of disks you would find in a desktop system.
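
Since the X470D4U has a BMC, the rail voltages it reports are also easy to glance at (a sketch; sensor names and coverage differ per board, so the 5V rail may or may not show up):
Code:
# Dump all voltage sensors known to the on-board BMC
ipmitool sdr type Voltage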

That's the real difference here. For me it is always the same HBA port that is affected: controller 1, port 4. However, it shows up with different cables, HDDs and HBAs (when I replace the HBA with a spare), so I have no clue what causes this.
My spares are obviously running the same HW/FW, so I guess this is a strange edge case I am running into. Maybe it is even due to my mix of SSDs and HDDs, because that is nothing I can really change...
This week I installed a Fujitsu CP400i SATA/SAS HBA controller flashed to IT mode. I added two SAS-to-SATA cables, and for now the issue is gone (today is Friday, and this night there were no disk errors).
That would indicate that the ASRock Rack X470D4U has some serious issues with its onboard SATA controller. I think it is not the ports themselves but the controller, since the problem exists on multiple ports. And they didn't fix it in a year (the first board had a manufacturing date of October 2020; the current one is from August 2021).

I'll keep an eye on this; maybe it just moved the time at which it happens.
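
To keep an eye on it in a slightly more systematic way, I am just watching the pool state and the recent ZFS events (a sketch):
Code:
# Pool health and per-device read/write/checksum error counters
zpool status -v
# Recent ZED events, newest last
zpool events | tail -n 20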
 
I have a different ASRock X470 motherboard, and I just remembered a funny thing when connecting my two identical HDDs and two identical SSDs: only some configurations of 4 ports out of the 6 available would see all four drives in the BIOS. I suspected a broken port, but each of the 6 would work in some of the possible configurations, usually after swapping the ports of an HDD and an SSD. I know that a few anecdotes don't make a controller issue, but maybe there are rare issues with that SATA controller, maybe in combination with something else like the BIOS version, power supply or case grounding?
 
I never had such issues with my Intel platforms; I first switched to AMD when the Ryzen 3000 chips came out. It could also be an ASRock issue, but I don't know for sure. The problem is that ASRock is the only vendor that offers an IPMI module on an AM4 socket board, so I have to stick with them.
 
