Backup server eating drives

Zephrant

Member
Sep 12, 2021
I took a working FreeNAS system and reformatted it for Proxmox Backup Server. It contains 36 4 TB Seagate SAS drives and has been in use for almost three years.

After I started using it, it began reporting errors on the drives and failing them out of the zpool. Recently it was failing a drive every day or two. I replaced 6 drives over several weeks and had three more failing as of yesterday. Thinking that the drives had reached the end of their life, I bought 12 new drives and rebuilt the system yesterday, removing 24 drives and installing 12 12 TB WD NAS drives.

The RAIDZ2 zpool with one spare was created just fine.
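
The exact command isn't given in the post. As a rough sketch only, a pool with the layout shown in the status output further down (one RAIDZ2 vdev plus a single hot spare) could be created along these lines. The helper name `zpool_create_cmd` is made up for illustration, and it only prints the command so it can be reviewed before anything is run:

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print, but do not run,
# a `zpool create` command for one RAIDZ2 vdev plus a single hot spare.
# Usage: zpool_create_cmd POOL SPARE DISK [DISK...]
zpool_create_cmd() {
    pool=$1
    spare=$2
    shift 2
    # Refuse obviously too-small input for a double-parity vdev.
    if [ "$#" -lt 3 ]; then
        echo "need at least 3 disks for the raidz2 vdev" >&2
        return 1
    fi
    # "$*" joins the remaining arguments (the vdev members) with spaces.
    echo "zpool create $pool raidz2 $* spare $spare"
}
```

Called as `zpool_create_cmd store1 <spare> <disk1> <disk2> ...`, ideally with stable `/dev/disk/by-id/wwn-...` names rather than `/dev/sdX`, it prints a single `zpool create store1 raidz2 ... spare ...` line.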

Last night I rsynced 40 TB of data onto it.

As of this morning, it thinks there is another failed drive:
Code:
root@spk-prox-b01:~# zpool status
  pool: store1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar 10 00:02:56 2022
        19.1T scanned at 563M/s, 13.3T issued at 392M/s, 33.1T total
        1.15T resilvered, 40.11% done, 14:44:09 to go
config:

        NAME                          STATE     READ WRITE CKSUM
        store1                        DEGRADED     0     0     0
          raidz2-0                    DEGRADED     0     0     0
            wwn-0x5000cca2b0540584    ONLINE       0     0     0
            wwn-0x5000cca2b052ab94    ONLINE       0     0     0
            wwn-0x5000cca2b04cedd4    ONLINE       0     0     0
            wwn-0x5000cca2b053de08    ONLINE       0     0     0
            wwn-0x5000cca2b05399fc    ONLINE       0     0     0
            wwn-0x5000cca2b053ff54    ONLINE       0     0     0
            wwn-0x5000cca2b0540824    ONLINE       0     0     0
            spare-7                   DEGRADED     0     0     0
              wwn-0x5000cca2b0543370  FAULTED      0    38     0  too many errors
              wwn-0x5000cca2b053df58  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca2b053ddfc    ONLINE       0     0     0
            wwn-0x5000cca2b05405c0    ONLINE       0     0     0
            wwn-0x5000cca2b0540548    ONLINE       0     0     0
        spares
          wwn-0x5000cca2b053df58      INUSE     currently in use

errors: No known data errors

Three other identical systems were purchased at the same time, and they are all still running FreeNAS without major issues (one failed drive in the last year). So I'm wondering why Proxmox Backup Server is having so many issues with drives, and what I should do about it.
 
Please check (and maybe post) the SMART status of the drives in question, and also check the system logs for errors. As dietmar said, PBS doesn't do anything special other than writing (potentially lots ;)) of data. If the disks look okay, I'd check cables and/or controller/HBA/backplane next.
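
A minimal sketch of those checks, assuming the pool name `store1` from the output above and that the disks appear under `/dev/disk/by-id/` with the same `wwn-` names. The function names are made up for illustration. `devices_with_errors` only parses `zpool status` text, so it can be tried on saved output first:

```shell
#!/bin/sh
# Print leaf devices from `zpool status` text (read on stdin) that have a
# nonzero READ, WRITE, or CKSUM counter. Only wwn-named disks are considered.
devices_with_errors() {
    awk '$1 ~ /^wwn-/ && NF >= 5 && ($3 + $4 + $5) > 0 { print $1 }'
}

# Hypothetical wrapper: show SMART health and attributes for each flagged
# disk, then recent kernel messages (where HBA, cable, or backplane
# trouble tends to show up as resets or timeouts).
check_pool_disks() {
    pool=$1
    zpool status "$pool" | devices_with_errors | while read -r dev; do
        echo "== $dev =="
        smartctl -H -A "/dev/disk/by-id/$dev"
    done
    journalctl -k -p warning --since "2 days ago"
}
```

Running `check_pool_disks store1` as root would then cover both the SMART side and the kernel log side in one pass. If SMART looks clean but the kernel log shows link resets or I/O timeouts, that points away from the disks themselves.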
 
smartctl was not showing any errors; only zpool status reported them (read and write).

The above errors looked suspicious, so I cleared them and let the drive resilver back into use. Then I wrote 74 TB to the pool and scrubbed it. Still no errors.
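
For anyone repeating this, the clear, resilver, and scrub sequence might look roughly like the sketch below. The function names are made up for illustration, the exact commands used in this thread were not posted, and `zpool wait` assumes OpenZFS 2.0 or later:

```shell
#!/bin/sh
# Extract the error count from the "scan: scrub repaired ... with N errors"
# line of `zpool status` text read on stdin. Prints nothing if no scrub
# has completed yet.
scrub_error_count() {
    awk '/scan: scrub repaired/ {
        for (i = 2; i <= NF; i++)
            if ($i == "errors") { print $(i - 1); exit }
    }'
}

# Hypothetical wrapper: clear one device's error counters, wait for any
# resulting resilver, then scrub the pool and report the scrub error count.
clear_and_scrub() {
    pool=$1
    dev=$2
    zpool clear "$pool" "$dev"        # reset error state on that device
    zpool wait -t resilver "$pool"    # a scrub cannot start mid-resilver
    zpool scrub "$pool"               # full read-and-verify pass
    zpool wait -t scrub "$pool"       # block until the scrub finishes
    zpool status "$pool" | scrub_error_count
}
```

A result of `0` after a scrub over freshly written data is a reasonable sign that the pool is currently healthy, though it does not explain what caused the earlier write errors.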

So I've now gone almost a week without any errors at all, where before I was getting one failed drive every day or two. I'm starting to suspect that something was wrong in the kernel and that it was causing the drives to be failed out of the pool.

I'm going to run tests on it for a few more weeks. If they pass, I'll put it back into service as a secondary backup pool.
 
