ZFS pool faults every week on Friday at 02:00

slize26

New Member
Aug 5, 2021
14
1
3
25
I am observing some weird behaviour on one of my PVE systems: every week on Friday at around 02:00, one drive in the ZFS pool switches its state to "faulted". Last week at ~17:00, another drive in that pool also switched to "faulted". A reboot resolves the issue, and a ZFS scrub does not find any errors.

The system is based on:
AMD Ryzen 3700X
X470D4U - ASRock Rack
2x Kingston Server Premier DIMM 32GB, DDR4-3200
3x Samsung 860 EVO 1TB (connected via the onboard SATA ports)
be quiet! 550W Gold PSU

The system gets powered by an APC SMART UPS (load ~25%, one year old).

What I have done so far:
- Replacing every SATA cable (multiple times)
- Switching the onboard SATA ports
- Replacing the motherboard with a new X470D4U (yes, the same board, but from a totally different dealer and with a different production date)
- Updating the BIOS
- Replacing the power supply
- Replacing the "failed" disk with an 870 EVO 1TB (which just moved the issue to the new disk)

I have read that "some Ryzen boards may have issues with their onboard SATA controller", but I replaced the board, so that should be fixed if there was an issue with the controller.

I attached the syslogs of the two events as a file.

Does anybody have an idea what PVE could be running on Friday night at ~02:00 that causes such failures?

Here are some of the e-mails I am getting from PVE:
Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 11
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-24 02:13:13+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
  vphys: pci-0000:03:00.1-ata-2.0
  vguid: 0x9DD2445E8E38CE8F
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
   pool: 0x0B72B08061CEFD09


The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 113
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-17 02:20:54+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
  vphys: pci-0000:03:00.1-ata-2.0
  vguid: 0x9DD2445E8E38CE8F
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0NA47041E-part1
   pool: 0x0B72B08061CEFD09


The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.
 impact: Fault tolerance of the pool may be compromised.
    eid: 239
  class: statechange
  state: FAULTED
   host: srvpve1
   time: 2021-09-17 16:52:53+0200
  vpath: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
  vphys: pci-0000:03:00.1-ata-1.0
  vguid: 0xED11E256B67612B4
  devid: ata-Samsung_SSD_860_EVO_1TB_S3Z9NB0N921881X-part1
   pool: 0x0B72B08061CEFD09
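To check how tightly the faults cluster, the `time:` fields from the notifications above can be converted to a weekday and hour. This is only a minimal sketch, assuming GNU `date` (as shipped with Debian/PVE). The helper name `fault_weekday` is made up for illustration, and the UTC offset is stripped so that the wall-clock time from the mail is kept as-is.

```shell
#!/bin/sh
# Hypothetical helper: print weekday and HH:MM for a ZED "time:" value.
# The trailing UTC offset (e.g. +0200) is dropped and the rest is parsed
# as UTC, so the wall-clock time shown in the mail is preserved.
fault_weekday() {
    ts=${1%[+-][0-9][0-9][0-9][0-9]}
    TZ=UTC LC_ALL=C date -d "$ts" '+%a %H:%M'
}

# Timestamps copied from the three notifications above.
for t in "2021-09-24 02:13:13+0200" \
         "2021-09-17 02:20:54+0200" \
         "2021-09-17 16:52:53+0200"; do
    fault_weekday "$t"
done
```

For these three timestamps, every fault falls on a Friday, including the one at ~17:00.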
 

Attachments

  • Syslog.txt
    18.6 KB

tburger

Active Member
Oct 13, 2017
778
90
33
38
No clue what happens at 2am on Friday.
Have you inspected the crontab? Anything in there?
Is it exactly this time or "around this time"?

Asking because I experience a strange issue as well. On my end, the system starts to behave weirdly on the 25th day of uptime. I have not found the source yet. My workaround is rebooting every 21-24 days, so that the problem does not hit me.

See
https://forum.proxmox.com/threads/s...on-node-even-if-no-vdisks-are-affected.70213/
 

slize26

> Does systemctl list-timers show anything suspect?


Code:
root@srvpve1:~# systemctl list-timers
NEXT                         LEFT          LAST                         PASSED       UNIT                         ACTIVATES
Sun 2021-09-26 16:42:00 CEST 35s left      Sun 2021-09-26 16:41:00 CEST 23s ago      pvesr.timer                  pvesr.service
Sun 2021-09-26 18:41:00 CEST 1h 59min left Sat 2021-09-25 18:41:00 CEST 22h ago      systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2021-09-26 19:36:32 CEST 2h 55min left Sun 2021-09-26 10:49:41 CEST 5h 51min ago apt-daily.timer              apt-daily.service
Mon 2021-09-27 00:00:00 CEST 7h left       Sun 2021-09-26 00:00:00 CEST 16h ago      logrotate.timer              logrotate.service
Mon 2021-09-27 00:00:00 CEST 7h left       Sun 2021-09-26 00:00:00 CEST 16h ago      man-db.timer                 man-db.service
Mon 2021-09-27 00:49:47 CEST 8h left       Mon 2021-09-20 00:05:58 CEST 6 days ago   fstrim.timer                 fstrim.service
Mon 2021-09-27 03:10:31 CEST 10h left      Sun 2021-09-26 03:13:41 CEST 13h ago      pve-daily-update.timer       pve-daily-update.service
Mon 2021-09-27 06:11:50 CEST 13h left      Sun 2021-09-26 06:27:59 CEST 10h ago      apt-daily-upgrade.timer      apt-daily-upgrade.service
Sun 2021-10-03 03:10:28 CEST 6 days left   Sun 2021-09-26 03:10:59 CEST 13h ago      e2scrub_all.timer            e2scrub_all.service

9 timers listed.
Pass --all to see loaded but inactive timers, too.
root@srvpve1:~#

All the timers look fine to me, but maybe I am missing something.

The same with crontabs:
Code:
root@srvpve1:~# for user in $(cut -f1 -d: /etc/passwd); do echo $user; crontab -u $user -l; done
root
no crontab for root
daemon
no crontab for daemon
bin
no crontab for bin
sys
no crontab for sys
sync
no crontab for sync
games
no crontab for games
man
no crontab for man
lp
no crontab for lp
mail
no crontab for mail
news
no crontab for news
uucp
no crontab for uucp
proxy
no crontab for proxy
www-data
no crontab for www-data
backup
no crontab for backup
list
no crontab for list
irc
no crontab for irc
gnats
no crontab for gnats
nobody
no crontab for nobody
_apt
no crontab for _apt
_chrony
no crontab for _chrony
messagebus
no crontab for messagebus
_rpc
no crontab for _rpc
systemd-network
no crontab for systemd-network
systemd-resolve
no crontab for systemd-resolve
postfix
no crontab for postfix
tcpdump
no crontab for tcpdump
sshd
no crontab for sshd
statd
no crontab for statd
gluster
no crontab for gluster
ceph
no crontab for ceph
systemd-timesync
no crontab for systemd-timesync
systemd-coredump
no crontab for systemd-coredump
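Note that `crontab -u <user> -l` only lists per-user crontabs. Jobs installed by packages usually live in `/etc/crontab` and the `/etc/cron.*` directories, which the loop above does not cover. Here is a minimal sketch for listing those as well. The function name `list_system_cron` and its optional root-directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list system-wide cron files below a root directory
# (defaults to /). These are not shown by `crontab -u <user> -l`.
list_system_cron() {
    root=${1:-}
    for path in "$root/etc/crontab" "$root"/etc/cron.d/* \
                "$root"/etc/cron.hourly/* "$root"/etc/cron.daily/* \
                "$root"/etc/cron.weekly/* "$root"/etc/cron.monthly/*; do
        # Unmatched globs stay literal, so skip anything that is not a file.
        [ -f "$path" ] && printf '%s\n' "$path"
    done
    return 0
}

list_system_cron
```

The schedules inside the listed files, and anything sitting in `/etc/cron.weekly`, would then be the entries to compare against the Friday 02:00 window.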

@tburger It is around this time, not exactly. Thanks for the link to your issue, it seems similar to my faults. What disks, CPU and mainboard are you using?

Another note: the affected server is about 1 meter from the house's high-voltage connection. Could the disks be failing due to electromagnetic interference?
 

tburger

@slize26
Here you go. I don't think this really helps, but my gear is as follows:
- Opteron 6366HE 16C 1.8 GHz
- Supermicro H8SGL-F
- 128 GB Memory (8*16GB PC3-8500)
- 2x LSI SAS 9211-8i with SAS -> SATA Cables
- Radian RMS-200 NVMe storage
- Disks - happened with a few: WD Black and WD Blue. As of today the issue triggere

For me it started somewhere around the PVE5 to PVE6 release.

Your idea about electromagnetic shock/interference is an interesting one. However, I don't think it applies in my situation. Since the issue persists with different disks, I think this can be ruled out. My UPS is right below the system, but it always has been.
I am running out of ideas, hence I am just rebooting every 24 days.

On your side: is the issue always affecting the same connector? On my end it is. It is always port 4 on 9211-8i controller 1.
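One way to answer that from the notification mails is to tally the `vphys:` field per fault. This is a minimal sketch, assuming the mails have been saved as plain text. The function name `count_faults_by_port` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count ZED fault notifications per physical path.
# Reads saved notification text on stdin and prints "<count> <vphys>".
count_faults_by_port() {
    awk '$1 == "vphys:" { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -k2
}

# The vphys lines from the three mails in the first post.
count_faults_by_port <<'EOF'
  vphys: pci-0000:03:00.1-ata-2.0
  vphys: pci-0000:03:00.1-ata-2.0
  vphys: pci-0000:03:00.1-ata-1.0
EOF
```

For the three mails in the first post, this gives two faults on `ata-2.0` and one on `ata-1.0`, all behind the same PCI device `0000:03:00.1`.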

/edit: If I understood you correctly, you are using flash storage, is that correct? I think this makes electromagnetic interference even less likely...
 

dcsapak

Proxmox Staff Member
Staff member
Feb 1, 2016
6,518
735
133
33
Vienna
From experience, weird hardware errors (that do not have any obvious cause) are often caused by faulty memory or an (undersized) PSU...
 

tburger

Yeah, have been there (faulty PSU)...

On my end it is ECC memory and a very powerful PSU.
I also added extra PSU capacity as I thought the 5V rail was too weak. Sadly, that didn't help at all :/
 

chrcoluk

Member
Oct 7, 2018
113
16
23
42
If it's all drives, it could be a controller issue, a DMI link issue, or a PSU issue, in my opinion.

I would be surprised if RAM caused that.
 

slize26

@tburger Okay, so not a current Ryzen CPU. Then there is no obvious correlation between our systems. :/

I did replace the PSU as well; it is a be quiet! Straight Power 11 Platinum 550W, which should provide more than enough power for the system.

To replace the RAM I would have to buy two more sticks of the Kingston Server Premier DIMM 32GB, DDR4-3200, but I don't know if it's really worth it.

For me it's only one drive that fails, but it's always a different one.
 

tburger

slize26 said:
> Then there is no obvious correlation between our systems. :/

Except that both use AMD as the CPU vendor...

slize26 said:
> I did replace the PSU as well; it is a be quiet! Straight Power 11 Platinum 550W, which should provide more than enough power for the system.

Don't be misled by large numbers here!
You need to look closely at where the output lies.
I have had trouble with an aging PSU where the 5V rail went bad over time and all of a sudden some strange crap happened. It turned out that my PSU was very powerful on the 12V rail, but because I am running 20 2.5" HDDs, all the power of the 5V rail was exhausted.
I had a 550W PSU as well. Now I am running a 430W PSU; however, it has twice the power on the 5V rail.

Before you purchase new memory, I would definitely look into the specs and another PSU!
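The point about the 5V rail can be checked with simple arithmetic: multiply the 5V current per drive by the number of drives and compare the result with the rail's rating on the PSU label. What follows is a rough sketch only. The function name and all current figures are placeholders, not measured or datasheet values, so substitute the numbers printed on your own drives and PSU.

```shell
#!/bin/sh
# Hypothetical helper: estimate total 5 V load and compare with the rail rating.
# usage: rail_5v_budget <drive_count> <amps_per_drive> <rail_rating_amps>
rail_5v_budget() {
    LC_ALL=C awk -v n="$1" -v a="$2" -v r="$3" 'BEGIN {
        load = n * a
        printf "%.1f A of %.1f A (%.0f%%)\n", load, r, 100 * load / r
    }'
}

# Placeholder figures only: 20 drives at an assumed 1.0 A each on an
# assumed 20 A rail.
rail_5v_budget 20 1.0 20
```

A result close to or above 100% would suggest that the rail has little or no headroom, regardless of the PSU's total wattage.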
slize26 said:
> For me it's only one drive that fails, but it's always a different one.

That's the real difference here. For me it is always the same HBA port that is affected: controller 1, port 4. However, it shows up with different cables, HDDs and HBAs (when I replace the HBA with a spare), so I have no clue what causes this.
My spares are obviously running the same HW/FW, so I guess this is a strange edge case that I am running into. Maybe it is even due to my mix of SSDs and HDDs, which is nothing I can really change...
 

slize26

tburger said:
> Except that both use AMD as the CPU vendor...

Yes, that's for sure.

tburger said:
> Don't be misled by large numbers here!
> You need to look closely at where the output lies.
> I have had trouble with an aging PSU where the 5V rail went bad over time and all of a sudden some strange crap happened. It turned out that my PSU was very powerful on the 12V rail, but because I am running 20 2.5" HDDs, all the power of the 5V rail was exhausted.
> I had a 550W PSU as well. Now I am running a 430W PSU; however, it has twice the power on the 5V rail.
>
> Before you purchase new memory, I would definitely look into the specs and another PSU!

The server is just running four 3.5" HDDs with 10TB each and three 1TB 2.5" SSDs, each connected to its own SATA connector from the PSU. I get your point, but that's more or less the same number of disks you would find in a desktop system.

tburger said:
> That's the real difference here. For me it is always the same HBA port that is affected: controller 1, port 4. However, it shows up with different cables, HDDs and HBAs (when I replace the HBA with a spare), so I have no clue what causes this.
> My spares are obviously running the same HW/FW, so I guess this is a strange edge case that I am running into. Maybe it is even due to my mix of SSDs and HDDs, which is nothing I can really change...

This week I installed a Fujitsu CP400i SATA/SAS HBA controller, flashed to IT mode. I added two SAS-to-SATA cables, and as of now the issue is gone (today is Friday and there were no disk errors last night).
That would indicate that the ASRock Rack X470D4U has some serious issues with its onboard SATA controller. I suspect the controller rather than the ports themselves, since the problem occurs on multiple ports. And they didn't fix it within a year (the first board had a manufacturing date of October 2020, the current one is from August 2021).

I'll keep an eye on this; maybe it just moved the time when it happens.
 

avw

Active Member
May 31, 2020
865
140
43
I have a different ASRock X470 motherboard and I just remembered that when connecting my two identical HDDs and two identical SSDs there was this funny thing: only some configurations of 4 ports out of the 6 available would see all four drives in the BIOS. I suspected a broken port but each of the 6 would work in some of the possible configurations, usually by swapping the ports of a HDD and an SSD. I know that a few anecdotes don't make a controller issue, but maybe there are rare issues with that SATA controller, maybe in combination with something else like BIOS version, power supply or case grounding?
 

slize26

I never had such issues with my Intel platforms. I first switched to AMD when the Ryzen 3000 chips came out. It could also be an ASRock issue, but I don't know for sure. The problem is that ASRock is the only vendor that offers an IPMI module on an AM4 board, so I have to stick with them.
 