ZFS Device Fault - Advice on moving device to new Port

helojunkie

Well-Known Member
Jul 28, 2017
I am running Proxmox 5.2 on a single non-clustered server. This is an HP Z820 with an LSI SAS 3008 controller flashed to IT mode. I am running ZFS across the board. The LSI controller is NOT the controller built onto the motherboard, but a new controller I installed into the server.

My boot devices and primary (rpool) pool are on two Samsung 860 Pro 512GB SSDs connected to LSI SAS ports 0 & 1. I have another pool called spinners with 2 x 8TB HGST Helium drives on SAS ports 2 & 3, and two more Samsung 850 EVO 256GB SSDs on SAS ports 4 & 5 that serve as L2ARC and ZIL for the spinners pool.
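A setup like that is typically built along these lines; this is only a rough sketch with placeholder device ids, assuming a mirrored data vdev, not necessarily the exact commands I used:

Code:
# mirrored data vdev from the two 8TB HGST drives
zpool create spinners mirror /dev/disk/by-id/<hgst-1> /dev/disk/by-id/<hgst-2>
# the two 850 EVOs as mirrored SLOG (ZIL) plus L2ARC cache
zpool add spinners log mirror /dev/disk/by-id/<evo-1-part1> /dev/disk/by-id/<evo-2-part1>
zpool add spinners cache /dev/disk/by-id/<evo-1-part2> /dev/disk/by-id/<evo-2-part2>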

The box is a dual Xeon with 256GB of ECC RAM, of which 8GB has been dedicated to the ZFS primary ARC.
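For reference, capping the ARC like that is typically done with the zfs_arc_max module parameter; a minimal sketch, assuming an 8 GiB limit:

Code:
# /etc/modprobe.d/zfs.conf -- cap the primary ARC at 8 GiB (8 * 1024^3 bytes)
options zfs zfs_arc_max=8589934592
# then rebuild the initramfs so the setting applies at boot:
#   update-initramfs -u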

I have been running this way for over a year with no problems. The box is lightly loaded, running a mix of 12 VMs/CTs, both Windows and Linux.

About two weeks ago ZED started alerting me to pool degradation on my rpool with the following error:

Code:
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

I started doing some research and tried reseating the cables, reseating the drive, etc. I would then clear the errors from the pool, and the problem would seem to go away for a few days before cropping back up. I ran a long SMART test on the SSD at Samsung's request and it showed 13 CRC errors. Samsung said to replace the drive.
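For anyone hitting the same thing, the test and the counter check look roughly like this (/dev/sda is just a placeholder for the SSD):

Code:
# kick off the long (extended) self-test; it runs in the background
smartctl -t long /dev/sda
# once finished, dump everything and look at attribute 199 (UDMA_CRC_Error_Count)
smartctl -a /dev/sda | grep -i -e crc -e 'self-test'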

So last night I pulled the drive (all hot-swap, so this was done live), installed the new drive, ran sgdisk to copy the partition table, and replaced the device in the pool. I thought all was done. This morning at 0130 I received the exact same error on the new drive, so now I know it is not the drive; I suspect a bad cable/port.
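Roughly what that replacement looks like on a Proxmox rpool mirror; device names here are placeholders (/dev/sdb the healthy member, /dev/sda the new drive), and on an rpool you would also re-install the bootloader on the new disk afterwards:

Code:
# copy the partition table from the healthy member to the new drive
sgdisk --replicate=/dev/sda /dev/sdb
# randomize GUIDs so the copy doesn't collide with the source disk
sgdisk --randomize-guids /dev/sda
# swap the faulted device for the new partition and let it resilver
zpool replace rpool <old-wwn-id>-part2 /dev/sda2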

I have a lot of experience with FreeNAS, and I know that with FreeNAS it simply does not matter which port a drive ends up on. In fact, one of the things I show people is that I can shut down my FreeNAS server, swap all the drives around between locations, reboot, and everything comes up just fine.

What I don't know is whether Proxmox with ZFS operates exactly the same way. If I shut down the system, pull the drive, put it on another port, and reboot, will the system simply see the drive, know it was part of my pool, and be OK with it?

I am using device-id for my pools:

Code:
root@proxmox:~# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 1.38G in 0h0m with 0 errors on Wed Jul 18 07:27:42 2018
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5002538e401c3fcf-part2  ONLINE       0     0     0
            wwn-0x5002538d41fb6695-part2  ONLINE       0     0     0
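
Since those are persistent wwn- ids, it's easy to check which physical device node each one currently maps to, e.g.:

Code:
# show which sdX node the pool member resolves to right now
ls -l /dev/disk/by-id/ | grep wwn-0x5002538e401c3fcf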


So I guess my question is this: can I simply shut down the system, swap the drive to a new port, reboot, and expect ZFS to identify the drive correctly regardless of port, or should I approach this in a different manner?
 
ZFS doesn't care about port positions. You 'may' get an error from a stale pool cache file (/etc/zfs/zpool.cache), but I don't think that will happen. If it does, import the pool with

Code:
zpool import -d /dev/disk/by-id/ pool_name

and it's done.
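For a non-root pool such as spinners, the full cycle after moving the disks would look like this (a sketch; rpool itself can only be re-imported from a rescue environment, since it holds the root filesystem):

Code:
zpool export spinners
zpool import -d /dev/disk/by-id/ spinners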
 
Hi,

I suggest trying to replace your SATA cables with new ones. I have seen something like your case before; after I changed the cables, I did not see any more errors.
 
