[SOLVED] How to delete/clear the ZFS cache after a hard drive broke

danman

Member
Jun 5, 2021
Hey

One of my hard drives decided to give up.

Now zfs-import-cache.service can't start anymore because the cache entry for the pool is still there. I couldn't find a proper solution anywhere on the web.

Does anyone have an idea?

Thanks
Dan
 
Hi,
you can try using zpool set cachefile=/etc/zfs/zpool.cache <pool> for an existing pool to regenerate the cache file. In case you don't want that pool to use the cache file, just do zpool set cachefile=none <pool> afterwards
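For example (with a placeholder pool name tank), the sequence could look like this:
Code:
# regenerate the cache file entry for an existing pool
zpool set cachefile=/etc/zfs/zpool.cache tank
# or stop the pool from being tracked in the cache file at all
zpool set cachefile=none tank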
 
Actually, the whole pool is gone as well. It's just the cache that is still there:
Code:
systemctl status zfs-import-cache.service
cannot import 'barras': one or more devices is currently unavailable
 
What is the output of zpool status?
 
Thanks for trying to help.
I think this is getting stranger and stranger. At first I thought the hard drive was failing; it's a Seagate Barracuda 2TB, and I had created a ZFS pool called barras for it.

Yesterday I finally had time to take the drive out and test it. And the drive seems to be fine.

Beforehand, I had also checked zpool status on the Proxmox server: the barras pool was simply gone, and everything else was/is fine. lsblk also showed nothing for the Barracuda. I had already bought 2 new drives and installed them yesterday too: same SATA slot, same cable, and the new drive(s) are there. So there should be no hardware problem on that side either.

I will reconnect the Barracuda tomorrow. Is there any way to get the data back, something like zpool import? I need to do more research on this; I'm not that familiar with ZFS.

Anyway, I think the problem is no longer the cache. I haven't rebooted since, but yesterday I noticed that barras was still enabled under Datacenter -> Storage. So I disabled it, and I think that will solve the cache problem. I will do a restart tomorrow and post the result here.
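For reference, this should also be possible on the CLI via pvesm; a sketch, assuming the storage entry is named barras like the pool:
Code:
# disable the storage entry so PVE stops trying to activate it
pvesm set barras --disable 1
# or remove the storage definition entirely (the pool data is untouched)
pvesm remove barras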
 
So the problem with the cache was indeed this:
but yesterday I noticed that barras was still enabled under Datacenter -> Storage
Disabling or deleting the storage solves this problem.

I reconnected the drive and imported the pool, and everything is back to normal.
There is still a big question mark floating around in my head. It would be great to know what the problem was in the end: why the pool, or even the whole hard drive, was gone.
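For reference, the import after reconnecting the drive was straightforward; a sketch with my pool name:
Code:
# list pools that are available for import
zpool import
# import the pool by name
zpool import barras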
 
Me again =)

I restored an LXC backup that was previously on barras to another pool and then moved it to the wolfs pool. That worked, but it took a while, so I looked at the status of the pools again:

Code:
❯ zpool status -v
  pool: barras
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:08:05 with 0 errors on Sun May  8 00:32:06 2022
config:

        NAME        STATE     READ WRITE CKSUM
        barras      ONLINE       0     0     0
          sdc       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x85c5>:<0x185c5>

  pool: wolfs
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 07:46:25 with 0 errors on Sun May  8 08:10:28 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        wolfs                     ONLINE       0     0     0
          wwn-0x5000c500c931825e  ONLINE       0     0     0
          wwn-0x5000c500c9316572  ONLINE       0     0     0

errors: No known data errors

Before I run any
Code:
zpool scrub
commands, I want to wait for some good advice. I hope you wake up soon ;)
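As a side note from my reading so far (not advice from this thread): error entries like <0x85c5>:<0x185c5> are objects whose path ZFS can no longer resolve, often files that were deleted after the error was logged. On a single-disk pool a scrub can only detect data errors, not repair them; the list usually clears once the affected data is gone and later scrubs come back clean. A sketch:
Code:
# list affected files again, if resolvable
zpool status -v barras
# re-verify all data on the pool
zpool scrub barras
# optionally reset the per-device error counters
zpool clear barras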
 
You might want to check if the disk belonging to barras is still healthy. There also might be something in the syslog from around the time the issues happened.
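A minimal sketch for both checks, assuming the disk still shows up as /dev/sdc:
Code:
# full SMART report for the suspect disk
smartctl -a /dev/sdc
# kernel messages mentioning the disk or its ATA link
journalctl -k | grep -iE 'sdc|ata5'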
 
I bought 2 new drives, so I will be able to check the Barracuda thoroughly again.

Some entries from the logs:
Code:
Jun 04 07:22:35 proxmox-media kernel: ata5.00: exception Emask 0x0 SAct 0xe604008 SErr 0x0 action 0x0
Jun 04 07:22:35 proxmox-media kernel: ata5.00: irq_stat 0x40000008
Jun 04 07:22:35 proxmox-media kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 04 07:22:35 proxmox-media kernel: ata5.00: cmd 60/00:a8:50:7c:4f/02:00:0c:00:00/40 tag 21 ncq dma 262144 in
         res 43/40:00:30:7d:4f/00:02:0c:00:00/00 Emask 0x409 (media error) <F>
Jun 04 07:22:35 proxmox-media kernel: ata5.00: status: { DRDY SENSE ERR }
Jun 04 07:22:35 proxmox-media kernel: ata5.00: error: { UNC }
Jun 04 07:22:35 proxmox-media kernel: ata5.00: configured for UDMA/133
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current]
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#21 CDB: Read(10) 28 00 0c 4f 7c 50 00 02 00 00
Jun 04 07:22:35 proxmox-media kernel: blk_update_request: I/O error, dev sdc, sector 206535984 op 0x0:(READ) flags 0x700 phys_seg 3 prio class 0
Jun 04 07:22:35 proxmox-media kernel: zio pool=barras vdev=/dev/sdc1 error=5 type=1 offset=105745260544 size=262144 flags=40080c80
Jun 04 07:22:35 proxmox-media kernel: ata5: EH complete
Jun 04 07:22:35 proxmox-media zed[2099921]: eid=28 class=io pool='barras' vdev=sdc1 size=262144 offset=105745260544 priority=2 err=5 flags=0x40080c80 delay=217ms
Jun 04 07:22:35 proxmox-media kernel: ata5.00: exception Emask 0x0 SAct 0x900900 SErr 0x0 action 0x0
Jun 04 07:22:35 proxmox-media kernel: ata5.00: irq_stat 0x40000008
Jun 04 07:22:35 proxmox-media kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun 04 07:22:35 proxmox-media kernel: ata5.00: cmd 60/00:40:50:7c:4f/01:00:0c:00:00/40 tag 8 ncq dma 131072 in
         res 43/40:00:30:7d:4f/00:01:0c:00:00/00 Emask 0x409 (media error) <F>
Jun 04 07:22:35 proxmox-media kernel: ata5.00: status: { DRDY SENSE ERR }
Jun 04 07:22:35 proxmox-media kernel: ata5.00: error: { UNC }
Jun 04 07:22:35 proxmox-media kernel: ata5.00: configured for UDMA/133
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#8 Sense Key : Medium Error [current]
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#8 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 04 07:22:35 proxmox-media kernel: sd 4:0:0:0: [sdc] tag#8 CDB: Read(10) 28 00 0c 4f 7c 50 00 01 00 00
Jun 04 07:22:35 proxmox-media kernel: blk_update_request: I/O error, dev sdc, sector 206535984 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 04 07:22:35 proxmox-media kernel: zio pool=barras vdev=/dev/sdc1 error=5 type=1 offset=105745260544 size=131072 flags=188981
Jun 04 07:22:35 proxmox-media kernel: ata5: EH complete
Jun 04 07:22:35 proxmox-media zed[2099968]: eid=29 class=io pool='barras' vdev=sdc1 size=131072 offset=105745260544 priority=0 err=5 flags=0x188981 delay=232ms bookmark=34245:99781:0:290
Jun 04 07:22:35 proxmox-media zed[2099970]: eid=30 class=data pool='barras' priority=0 err=5 flags=0x8981 bookmark=34245:99781:0:290
It doesn't look good, though.

But I also have some other strange problems.
On one VM where I use Seafile, I can't create backups on a specific drive.

The whole Proxmox server is brand new; I bought it about 6 months ago.
I have Proxmox Backup Server running in a VM, with two 8TB (also new) hard drives connected via USB 3.

I do daily Seafile backups on both, and on one of the 8TB drives the VM backup no longer works:
Code:
102: 2022-06-13 03:01:56 INFO:  26% (284.6 GiB of 1.1 TiB) in 10h 8m 8s, read: 2.3 MiB/s, write: 2.3 MiB/s
102: 2022-06-13 03:01:56 ERROR: backup write data failed: command error: protocol canceled
102: 2022-06-13 03:01:56 INFO: aborting backup job
102: 2022-06-13 03:01:56 INFO: resuming VM again
102: 2022-06-13 03:01:56 ERROR: Backup of VM 102 failed - backup write data failed: command error: protocol canceled

The backups to the other 8TB drive work fine. Seafile also complains about "Failed to get fs id list", see https://github.com/haiwen/seafile/issues/2518. I uploaded my files again (approx. 250GB), but after a while it starts again.
Seafile does not run on barras. Seafile runs on the wolfs (Seagate) pool, plus some bits of NVMe for the system. But those drives seem to be fine. It might be "just" a Seafile problem.

Edit:
I created a post over at Seafile too. The last backup actually worked on both disks. Not sure what's going on. Maybe it was just a hiccup.
 
The Seafile problem is solved. It was just a 504 error.

I'm going to test the backups over the next couple of days.
 
Okay, it just happened again.

But this time, it happened on the other backup storage.

Backup logs:
Code:
INFO:  53% (583.6 GiB of 1.1 TiB) in 7h 2m 56s, read: 5.0 MiB/s, write: 4.1 MiB/s
ERROR: backup write data failed: command error: write_data upload error: broken pipe
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 102 failed - backup write data failed: command error: write_data upload error: broken pipe
INFO: Failed at 2022-06-26 07:19:48

dmesg logs:
Code:
[73836.893377] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
[73836.893518] systemd[1]: systemd-journald.service: Killing process 427185 (systemd-journal) with signal SIGKILL.
[73836.893652] systemd[1]: systemd-journald.service: Killing process 294 (systemd-journal) with signal SIGKILL.
[73836.893796] systemd[1]: systemd-journald.service: Killing process 427116 (journal-offline) with signal SIGKILL.
[73860.767621] sd 2:0:0:0: [sda] tag#201 timing out command, waited 180s
[73860.767666] sd 2:0:0:0: [sda] tag#201 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=913s
[73860.767674] sd 2:0:0:0: [sda] tag#201 Sense Key : Aborted Command [current]
[73860.767675] sd 2:0:0:1: [sdb] tag#9 timing out command, waited 180s
[73860.767681] sd 2:0:0:0: [sda] tag#201 Add. Sense: I/O process terminated
[73860.767688] sd 2:0:0:0: [sda] tag#201 CDB: Write(10) 2a 00 02 f9 e0 00 00 00 10 00
[73860.767694] sd 2:0:0:1: [sdb] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=889s
[73860.767693] I/O error, dev sda, sector 49930240 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[73860.767699] sd 2:0:0:1: [sdb] tag#9 Sense Key : Aborted Command [current]
[73860.767702] sd 2:0:0:1: [sdb] tag#9 Add. Sense: I/O process terminated
[73860.767707] sd 2:0:0:1: [sdb] tag#9 CDB: Write(10) 2a 00 62 49 40 00 00 01 68 00
[73860.767709] I/O error, dev sdb, sector 1648967680 op 0x1:(WRITE) flags 0x0 phys_seg 45 prio class 0
[73860.767767] EXT4-fs warning (device dm-3): ext4_end_bio:343: I/O error 10 writing to inode 51555441 starting block 206116608)
[73860.767785] sd 2:0:0:1: [sdb] tag#17 timing out command, waited 180s
[73860.767789] sd 2:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=858s
[73860.767792] sd 2:0:0:1: [sdb] tag#17 Sense Key : Aborted Command [current]
[73860.767794] sd 2:0:0:1: [sdb] tag#17 Add. Sense: I/O process terminated
[73860.767797] sd 2:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 67 81 8f b0 00 00 08 00
[73860.767799] I/O error, dev sdb, sector 1736544176 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[73860.767807] Buffer I/O error on dev dm-3, logical block 217063670, lost async page write
[73860.767887] EXT4-fs error (device dm-3): ext4_check_bdev_write_error:217: comm kworker/u8:4: Error while async write back metadata
[73860.767937] Buffer I/O error on device dm-3, logical block 206116608
[73860.767942] Buffer I/O error on device dm-3, logical block 206116609
[73860.767943] Buffer I/O error on device dm-3, logical block 206116610
[73860.767945] Buffer I/O error on device dm-3, logical block 206116611
[73860.767947] Buffer I/O error on device dm-3, logical block 206116612
[73860.767948] Buffer I/O error on device dm-3, logical block 206116613
[73860.767950] Buffer I/O error on device dm-3, logical block 206116614
[73860.767951] Buffer I/O error on device dm-3, logical block 206116615
[73860.767952] Buffer I/O error on device dm-3, logical block 206116616
[73860.767954] Buffer I/O error on device dm-3, logical block 206116617
[73860.775235] sd 2:0:0:0: [sda] tag#202 timing out command, waited 180s
[73860.775262] sd 2:0:0:0: [sda] tag#202 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=913s
[73860.775267] sd 2:0:0:0: [sda] tag#202 Sense Key : Aborted Command [current]
[73860.775271] sd 2:0:0:0: [sda] tag#202 Add. Sense: I/O process terminated
[73860.775276] sd 2:0:0:0: [sda] tag#202 CDB: Write(10) 2a 00 02 f9 e0 18 00 00 08 00
[73860.775279] I/O error, dev sda, sector 49930264 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[73860.778984] sd 2:0:0:0: [sda] tag#204 timing out command, waited 180s
[73860.779010] sd 2:0:0:0: [sda] tag#204 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=913s
[73860.779033] sd 2:0:0:0: [sda] tag#204 Sense Key : Aborted Command [current]
[73860.779037] sd 2:0:0:0: [sda] tag#204 Add. Sense: I/O process terminated
[73860.779042] sd 2:0:0:0: [sda] tag#204 CDB: Write(10) 2a 00 02 f9 e0 20 00 00 08 00
[73860.779045] I/O error, dev sda, sector 49930272 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[73860.781260] sd 2:0:0:0: [sda] tag#203 timing out command, waited 180s
[73860.781281] sd 2:0:0:0: [sda] tag#203 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=913s
[73860.781285] sd 2:0:0:0: [sda] tag#203 Sense Key : Aborted Command [current]
[73860.781287] sd 2:0:0:0: [sda] tag#203 Add. Sense: I/O process terminated
[73860.781290] sd 2:0:0:0: [sda] tag#203 CDB: Write(10) 2a 00 02 f9 e0 10 00 00 08 00
[73860.781292] I/O error, dev sda, sector 49930256 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[73860.784623] sd 2:0:0:0: [sda] tag#205 timing out command, waited 180s
[73860.784640] sd 2:0:0:0: [sda] tag#205 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=913s
[73860.784642] sd 2:0:0:0: [sda] tag#205 Sense Key : Aborted Command [current]
[73860.784644] sd 2:0:0:0: [sda] tag#205 Add. Sense: I/O process terminated
[73860.784647] sd 2:0:0:0: [sda] tag#205 CDB: Write(10) 2a 00 02 f9 e0 28 00 00 08 00
[73860.784649] I/O error, dev sda, sector 49930280 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[73860.784745] Aborting journal on device dm-2-8.
[73860.784756] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:5748: Journal has aborted
[73860.784762] EXT4-fs error (device dm-2): __ext4_ext_dirty:183: inode #1574110: comm kworker/u8:0: mark_inode_dirty error
[73860.784834] EXT4-fs error (device dm-2) in ext4_mb_clear_bb:6076: Journal has aborted
[73860.784853] EXT4-fs (dm-2): Delayed block allocation failed for inode 1049937 at logical offset 3 with max blocks 2 with error 30
[73860.784865] EXT4-fs (dm-2): Delayed block allocation failed for inode 1574110 at logical offset 0 with max blocks 2 with error 30
[73860.784892] EXT4-fs (dm-2): This should not happen!! Data will be lost

[73860.785549] EXT4-fs (dm-2): This should not happen!! Data will be lost

[73860.785937] EXT4-fs error (device dm-2) in ext4_writepages:2817: Journal has aborted
[73860.786004] EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm kworker/u8:3: Detected aborted journal
[73860.786028] EXT4-fs error (device dm-2) in ext4_writepages:2817: Journal has aborted
[73860.786189] EXT4-fs error (device dm-2): __ext4_ext_dirty:183: inode #1582379: comm dockerd: mark_inode_dirty error
[73860.786256] EXT4-fs error (device dm-2) in ext4_mb_clear_bb:6076: Journal has aborted
[73860.786301] EXT4-fs error (device dm-2) in ext4_writepages:2817: Journal has aborted
[73860.786412] EXT4-fs error (device dm-2): ext4_journal_check_start:83: comm dockerd: Detected aborted journal
[73860.822626] EXT4-fs (dm-2): Remounting filesystem read-only
[73860.822679] EXT4-fs (dm-2): ext4_writepages: jbd2_start: 1024 pages, ino 1048907; err -30
[73860.829903] EXT4-fs (dm-2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1048659, error -30)
[73860.830024] EXT4-fs (dm-2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1050465, error -30)
[73860.830085] EXT4-fs (dm-2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1048743, error -30)
[73860.830137] EXT4-fs (dm-2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1055490, error -30)
[73860.836021] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
[73860.846006] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[73860.846437] systemd[1]: Failed to start Journal Service.
[73860.847652] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[73860.847819] systemd-coredump[427203]: Failed to get COMM: No such process
[73860.848017] systemd[1]: Stopped Journal Service.
[73860.848383] audit: type=1130 audit(1656184788.832:725): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[73860.848390] audit: type=1130 audit(1656184788.832:726): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[73860.848396] audit: type=1131 audit(1656184788.832:727): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[73860.875678] systemd[1]: Starting Journal Service...
[73860.875874] audit: type=1334 audit(1656184788.859:728): prog-id=55 op=LOAD
[73860.875879] audit: type=1334 audit(1656184788.859:729): prog-id=56 op=LOAD
[73860.875882] audit: type=1334 audit(1656184788.859:730): prog-id=57 op=LOAD
[73860.876020] audit: type=1334 audit(1656184788.859:731): prog-id=0 op=UNLOAD
[73860.876022] audit: type=1334 audit(1656184788.859:732): prog-id=0 op=UNLOAD
[73860.876025] audit: type=1334 audit(1656184788.859:733): prog-id=0 op=UNLOAD
[73860.895312] audit: type=1305 audit(1656184788.879:734): op=set audit_enabled=1 old=1 auid=4294967295 ses=4294967295 res=1
[73860.923360] systemd[1]: Started Journal Service.
[73866.395298] kauditd_printk_skb: 15 callbacks suppressed
 
Sounds like the drive is dying. I'd suggest not using it anymore and salvaging the data (e.g. using ddrescue) while you still can.
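A minimal ddrescue sketch (device name /dev/sdX and target paths are placeholders; the image must go to a different, healthy disk):
Code:
# GNU ddrescue is packaged as gddrescue on Debian-based systems
apt install gddrescue
# copy readable data first, retry bad areas up to 3 times;
# the map file lets the rescue resume where it left off
ddrescue -d -r3 /dev/sdX /mnt/healthy/disk.img /mnt/healthy/disk.map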
 
So do you think sdb is failing? That's the wolfs pool, and its status seems to be fine. Or do you think the backup disk is failing, which would mean both of them are failing?

Or sda, which is an NVMe disk used for the system itself (/ only, not for storage).

Code:
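# presumably the output of: nvme smart-log /dev/nvme0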
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 33 C
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 2%
endurance group critical warning summary: 0
data_units_read                         : 115,701,789
data_units_written                      : 71,179,265
host_read_commands                      : 1,209,460,035
host_write_commands                     : 1,659,256,558
controller_busy_time                    : 12,462
power_cycles                            : 46
power_on_hours                          : 9,446
unsafe_shutdowns                        : 16
media_errors                            : 0
num_err_log_entries                     : 22
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

So all hard disks checked with smartcontrol or nvme show nothing unusual.
 
I'm going to mark this as solved; it never happened again since. The main question was answered anyway.
Thanks @fiona
 
The dmesg log did sound rather bad; there were errors for both sdb and sda. I did another quick search, and there was a similar report on the kernel mailing list. Did the above errors also occur during shutdown for you? The kernel developer also suggested replacing the drive, but since it never happened again, maybe it is/was some kernel bug after all. Regularly checking the drive health is a good idea nonetheless ;)
 
This happened only during the backup. Sometimes at 10%, sometimes at 60%, sometimes at 99%. And these are all snapshot-mode backups, so I don't think the drives are the problem. But who knows; I'm definitely keeping an eye on it ;)
 
