ZFS Degraded after few days from installation

Vittorio · Nov 17, 2022

Hi all

I'm experiencing a degraded state in my new Proxmox installation on a Beelink U59 with one M.2 Sata 256GB on which I placed Proxmox and a SATA SSD on which I placed all my VM and LXC, that has been formatted in ZFS

Hi have one VM with Home Assistant and some LXCs with Nodered, Adguard and some other stuff related to HA.

This is the result of zpool status:

Code:

  pool: ZFS-DATI
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:04:11 with 0 errors on Sun Nov 13 00:28:12 2022
config:

    NAME                                STATE     READ WRITE CKSUM
    ZFS-DATI                            DEGRADED     0     0     0
      ata-CT1000BX500SSD1_2215E6265352  DEGRADED     0   236     0  too many errors

errors: No known data errors

and this is the result of smartctl -a /dev/sdb:

Code:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.64-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT1000BX500SSD1
Serial Number:    2215E6265352
LU WWN Device Id: 5 00a075 1e6265352
Firmware Version: M6CR054
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 17 12:21:46 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       171
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       18
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       43
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       2
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   045   035   000    Old_age   Always       -       55 (Min/Max 21/65)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       14
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1829525467
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       57172670
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       41091072
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
250 Read_Error_Retry_Rate   0x0032   100   100   000    Old_age   Always       -       0
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       1693661469
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
254 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
223 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log not supported

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

What can I check or do?

Thanks

leesteken · Nov 17, 2022

I think write errors without read/checksum errors either indicate a bad connection (SMART shows 14 UDMA CRC errors) or power failures which prevent the writes from finishing (SMART shows 18 unexpected power losses). Or maybe the NAND flash is failing (SMART shows 43 unused reserve). Since the driver is not old (SMART shows 171 power-on hours), maybe it's just failing (within the warranty period)?

Vittorio · Nov 17, 2022

Thanks @leesteken
This is the first time I use this SSD, but I bought some times ago.
So I will try to got a replacement hoping the "big seller" will do it for me.

Next step is, how can I move everything on this SSD to the new one?
And, since this is the first time I've used ZFS in Proxmox, in my "simple" configuration, do you reccomend using ZFS instead of the standard format?

Thanks again

leesteken · Nov 17, 2022

Vittorio said:
Next step is, how can I move everything on this SSD to the new one?

Reinstall Proxmox and restore VMs from backup is the safest and easiest method.

Vittorio said:
And, since this is the first time I've used ZFS in Proxmox, in my "simple" configuration, do you reccomend using ZFS instead of the standard format?

As you will find on this forum, people recommend enterprise SSDs (not consumer and not pro-sumer or other "pro" naming) with power-loss protection (PLP) for being most durable with the high (small) writes caused by Proxmox and VMs. Whether you want the features and protection of ZFS or the simplicity of LVM is up to you.

leesteken · Nov 17, 2022

Vittorio said:
So I will try to got a replacement hoping the "big seller" will do it for me.

It might just be a poor connection or corrosion on the connectors or recent power losses. I can't guarantee that it will actually be confirmed broken by the seller.

Write errors might also be caused by the drive responding too slowly to ZFS. Search the Proxmox Syslog (or journalctl for more information about the errors (it is an media error or a timeout error?). With LVM you might not have those error, or maybe you do have the same errors with LVM and not be notified about it.

Vittorio · Nov 17, 2022

Thanks @leesteken

I will check the SATA connection in the minipc (that is made by a very tiny cable) and your other suggestions.

Inglebard · Nov 17, 2022

Do you have something in dmesg ?

Vittorio · Nov 17, 2022

@Inglebard just checked

You mean something like that ?

Code:

[  657.877538] sd 1:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  657.877542] sd 1:0:0:0: [sdb] tag#13 Sense Key : Illegal Request [current]
[  657.877544] sd 1:0:0:0: [sdb] tag#13 Add. Sense: Unaligned write command
[  657.877547] sd 1:0:0:0: [sdb] tag#13 CDB: Write(10) 2a 00 3f 7b a6 c0 00 01 00 00
[  657.877548] blk_update_request: I/O error, dev sdb, sector 1065068224 op 0x1:(WRITE) flags 0x700 phys_seg 32 prio class 0
[  657.877561] zio pool=ZFS-DATI vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2215E6265352-part1 error=5 type=2 offset=545313882112 size=131072 flags=40080c80
[  657.877577] sd 1:0:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  657.877579] sd 1:0:0:0: [sdb] tag#25 Sense Key : Illegal Request [current]
[  657.877581] sd 1:0:0:0: [sdb] tag#25 Add. Sense: Unaligned write command
[  657.877582] sd 1:0:0:0: [sdb] tag#25 CDB: Write(10) 2a 00 3f 7b a7 c0 00 00 f0 00
[  657.877583] blk_update_request: I/O error, dev sdb, sector 1065068480 op 0x1:(WRITE) flags 0x700 phys_seg 29 prio class 0
[  657.877589] zio pool=ZFS-DATI vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2215E6265352-part1 error=5 type=2 offset=545314013184 size=122880 flags=40080c80
[  657.877593] ata2: EH complete
[  913.851488] ata2.00: exception Emask 0x10 SAct 0xc00 SErr 0x400100 action 0x6 frozen
[  913.851503] ata2.00: irq_stat 0x08000000, interface fatal error
[  913.851506] ata2: SError: { UnrecovData Handshk }
[  913.851510] ata2.00: failed command: WRITE FPDMA QUEUED
[  913.851512] ata2.00: cmd 61/00:50:e0:58:33/01:00:48:00:00/40 tag 10 ncq dma 131072 out
                        res 40/00:54:e0:58:33/00:00:48:00:00/40 Emask 0x10 (ATA bus error)
[  913.851521] ata2.00: status: { DRDY }
[  913.851524] ata2.00: failed command: WRITE FPDMA QUEUED
[  913.851526] ata2.00: cmd 61/00:58:e0:59:33/01:00:48:00:00/40 tag 11 ncq dma 131072 out
                        res 40/00:54:e0:58:33/00:00:48:00:00/40 Emask 0x10 (ATA bus error)
[  913.851533] ata2.00: status: { DRDY }
[  913.851538] ata2: hard resetting link
[  914.165921] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  914.180143] ata2.00: configured for UDMA/133
[  914.180162] sd 1:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  914.180166] sd 1:0:0:0: [sdb] tag#10 Sense Key : Illegal Request [current]
[  914.180168] sd 1:0:0:0: [sdb] tag#10 Add. Sense: Unaligned write command
[  914.180171] sd 1:0:0:0: [sdb] tag#10 CDB: Write(10) 2a 00 48 33 58 e0 00 01 00 00
[  914.180172] blk_update_request: I/O error, dev sdb, sector 1211324640 op 0x1:(WRITE) flags 0x700 phys_seg 32 prio class 0
[  914.180184] zio pool=ZFS-DATI vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2215E6265352-part1 error=5 type=2 offset=620197167104 size=131072 flags=40080c80
[  914.180198] sd 1:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  914.180200] sd 1:0:0:0: [sdb] tag#11 Sense Key : Illegal Request [current]
[  914.180201] sd 1:0:0:0: [sdb] tag#11 Add. Sense: Unaligned write command
[  914.180203] sd 1:0:0:0: [sdb] tag#11 CDB: Write(10) 2a 00 48 33 59 e0 00 01 00 00
[  914.180204] blk_update_request: I/O error, dev sdb, sector 1211324896 op 0x1:(WRITE) flags 0x700 phys_seg 32 prio class 0
[  914.180209] zio pool=ZFS-DATI vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2215E6265352-part1 error=5 type=2 offset=620197298176 size=131072 flags=40080c80
[  914.180213] ata2: EH complete

Neobin · Nov 17, 2022

Vittorio said:
Device Model: CT1000BX500SSD1

Sidenote: QLC-nand. Not really good for anything; but even worse with ZFS/any CoW-filesystem.

Vittorio · Nov 17, 2022

Thanks @Neobin

any suggestion for a non-so-expensive SSD for home use with Proxmox?
I'm used to do a backup every couple of days of all my VMs and LXCs to a NAS.

Inglebard · Nov 17, 2022

You may have this issue : https://github.com/openzfs/zfs/issues/10094

Neobin · Nov 19, 2022

Vittorio said:
Thanks @Neobin

any suggestion for a non-so-expensive SSD for home use with Proxmox?
I'm used to do a backup every couple of days of all my VMs and LXCs to a NAS.

For ZFS/Ceph(/BTRFS?) it is in any usecase highly recommended to only use enterprise SSDs with PLP (Power Loss Protection).

For a homeserver with other filesystems like EXT4 or XFS for example, a decent consumer/prosumer SSD with TLC-nand from a well-known brand might be sufficient. (Of course, that heavily depends on the workload.)

I personally have good experiences with Samsung SSDs since ages; but do not get the QVOs, because those also use QLC-nand.

Generally I would recommend to not cheap out on such things...

Maybe someone other can give some recommendations for other brands and their models.

@Dunuin gave some recommendations for enterprise SSDs here:
https://forum.proxmox.com/threads/im-unable-to-upload-files-to-my-proxmox-server.114541/#post-509026

Vittorio · Nov 21, 2022

I've decided to replace my QLC Crucial disk with a TLC (I know it's still a consumer disk!) Samsung 870 EVO disk and restore all my VMs and LXCs.
One thing I've never done, is to add or replace a disk in Proxmox.
This time I will not choose ZFS but I will choose....
What I have to choose when adding the disk in Datacenter->Storage ?
LVM orLVM-Thin ?

I 'm using a NAS in NFS to store my Snapshots.

Or is there a way to "move" disk A ZFS to disk 2 LVM?

Thanks

Dunuin · Nov 21, 2022

I would do:
1.) build in the new SSD and keep the old SSD in there
2.) wipe the new SSD destroying all data at YourNodeName -> Disks -> YourNewSSD -> Wipe Disk Button
3.) format that new SSD using LVM-Thin: YourNodeName -> Disks -> LVM-Thin -> Create: thinpool
4.) go to every VM and move the virtual disks from old ZFS storage to new LVM-Thin storage: YourNode -> YourVM -> Hardware -> YourDisk -> Disk Action Button -> Move storage

At Datacenter->Storage you can only add existing LVM/LVM-Thin pools as a storage. It won't create that LVM/LVM-Thin for you.

Vittorio · Nov 21, 2022

Dunuin said:
I would do:
1.) build in the new SSD and keep the old SSD in there
2.) wipe the new SSD destroying all data at YourNodeName -> Disks -> YourNewSSD -> Wipe Disk Button
3.) format that new SSD using LVM-Thin: YourNodeName -> Disks -> LVM-Thin -> Create: thinpool
4.) go to every VM and move the virtual disks from old ZFS storage to new LVM-Thin storage: YourNode -> YourVM -> Hardware -> YourDisk -> Disk Action Button -> Move storage

Thanks @Dunuin

Some little questions...

My actual dis ks configuration is:
M.2 SATA SSD as Proxmox boot disk (256GB)
2.5 SSD formatted in ZFS for VMs and LXCs

Since I have a "minipc" with just one internal slot for SSD, can I use the new disk as external and then put it internally after moving storage?

My curiosity is why LVM-Thin instead of LVM

Actually I already have an LVM-Thin entry (named data) which relates to the boot disk and has 98% free disk

Creating the thinpool will merge with the actual thinpool ?

Moving storage is valid also for LXCs?

Thanks

Dunuin · Nov 21, 2022

Vittorio said:
Thanks @Dunuin

Some little questions...

My actual dis ks configuration is:
M.2 SATA SSD as Proxmox boot disk (256GB)
2.5 SSD formatted in ZFS for VMs and LXCs

Since I have a "minipc" with just one internal slot for SSD, can I use the new disk as external and then put it internally after moving storage?

Yes, probably. But then I would prefer to backup and restore the VMs. You should have an NAS or external disk and a automated backup job for anyway to always have a recent backup.

Vittorio said:
My curiosity is why LVM-Thin instead of LVM

LVM-Thin supports snapshots and thin-provisioning. LVM does not.

Vittorio said:
Actually I already have an LVM-Thin entry (named data) which relates to the boot disk and has 98% free disk

Creating the thinpool will merge with the actual thinpool ?

No, then you got two individual thin pools.

Vittorio said:
Moving storage is valid also for LXCs?

Jup, its then called "Volume Action -> Move Storage".

Vittorio · Nov 21, 2022

Thanks @Dunuin

I've tried the first option, moving storage directly
It was ok for all the LXCs and the only VM I have, except for one LXC, and strangely I get an error like the target storage is full, while is not

Here the complete error log that ends with


TASK ERROR: unable to restore CT 108 - command 'tar xpf - --zstd --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' -C /var/lib/lxc/108/rootfs --skip-old-files --anchored --exclude './dev/*'' failed: exit code 2

But here is the disk space left on DATI:

This is the df -h from the LXC that I'm trying to restore

Code:

root@nodered-ct:~# df -h
Filesystem                  Size  Used Avail Use% Mounted on
ZFS-DATI/subvol-100-disk-0   20G  6.6G   14G  33% /
/dev/mapper/pve-root         59G  4.7G   51G   9% /mnt/shared
none                        492K  4.0K  488K   1% /dev
udev                        7.7G     0  7.7G   0% /dev/net/tun
tmpfs                       7.8G     0  7.8G   0% /dev/shm
tmpfs                       3.1G   68K  3.1G   1% /run
tmpfs                       5.0M     0  5.0M   0% /run/lock
//192.168.1.190/config       31G   16G   16G  51% /mnt/ha-config
//192.168.1.190/share        31G   16G   16G  51% /mnt/ha-share

What can I check?

I've tried both way, moving storage and restoring from a backup and in both cases I get same "no left space on device" message.
The size of the backup is 1.8GB

The error I get if I try to move storage is:

Code:

 Logical volume "vm-100-disk-0" created.
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: 8aa28451-81ba-4341-a49b-8b743f8e8085
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
rsync: [receiver] write failed on "/var/lib/lxc/100/.copy-volume-1/var/log/journal/5298dd9cff034f10b48a127dd201f1da/user-1000@3947c5b6ddc54033a2330f66437f2c34-00000000040a6b8f-0005edbc02c6323c.journal": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(378) [receiver=3.2.3]
rsync: [sender] write error: Broken pipe (32)
  Logical volume "vm-100-disk-0" successfully removed
TASK ERROR: command 'rsync --stats -X -A --numeric-ids -aH --whole-file --sparse --one-file-system '--bwlimit=0' /var/lib/lxc/100/.copy-volume-2/ /var/lib/lxc/100/.copy-volume-1' failed: exit code 11

PS: I succeded in restoring the backup to same original ZFS disk, but no way to restore or move to the new 2.5 SSD LVM disk nor to the LVM M.2 Proxmox boot disk in LVM-Thin

Search

Search

ZFS Degraded after few days from installation

Vittorio

Well-Known Member

leesteken

Distinguished Member

Vittorio

Well-Known Member

leesteken

Distinguished Member

leesteken

Distinguished Member

Vittorio

Well-Known Member

Inglebard

Renowned Member

Vittorio

Well-Known Member

Neobin

Distinguished Member

Vittorio

Well-Known Member

Inglebard

Renowned Member

Neobin

Distinguished Member

Vittorio

Well-Known Member

Dunuin

Distinguished Member

Vittorio

Well-Known Member

Dunuin

Distinguished Member

Vittorio

Well-Known Member

We value your privacy