Yesterday I found something strange. I'm a new Proxmox user with 48 days of uptime on my test server, running Proxmox 6.0-11.
One of my VMs stopped yesterday, and after that the whole system (UI, CLI, VMs) froze. 1-2 hours later I was able to SSH in, and I found SATA errors in the kernel log.
That's not a big deal on its own, a simple SATA cable could cause something like that, but I don't know what happened with the RAID1 ZFS store.
This is what I saw after the problem:
root@nsx:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0 days 00:02:38 with 0 errors on Sun Dec 8 00:26:40 2019
config:

        NAME                                                                  STATE     READ WRITE CKSUM
        rpool                                                                 ONLINE       0     0     0
          mirror-0                                                            ONLINE       0     0     0
            ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB_TW0XFJWX5508544H4195-part3  ONLINE       0   887     0
            ata-SK_hynix_SH920_2.5_7MM_256GB_EIE8M008910808D6E-part3          ONLINE       0     0     0

errors: No known data errors
I tried "zpool clear" and the command never ended, but the state of the pool got back to normal. After that i was able to reset the freezed vm-s and everthing came back so the ui.
In the kernel logs, all errors affected only the sda (ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB) disk. The SMART status is of course OK on both disks.
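For reference, this is roughly how I confirmed which disk is sda and checked SMART (smartctl is from the smartmontools package; I'm assuming the SK hynix is sdb, since the LiteOn is sda):

ls -l /dev/disk/by-id/ | grep -w sda    # the LITEONIT by-id link points at sda
smartctl -a /dev/sda                    # SMART dump for the LiteOn SSD
smartctl -a /dev/sdb                    # SMART dump for the SK hynix SSD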
Can anybody explain to me why the ZFS pool mirror couldn't handle (turn off, or ignore) the problematic disk after some time, or after n SATA resets? Is there any setting to control the behavior when errors like this occur?
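For what it's worth, these are the knobs I've found so far while searching; I'm honestly not sure any of them covers this exact case, so please correct me (the device name and values below are just examples from my pool):

zpool get failmode rpool
zpool set failmode=continue rpool    # wait | continue | panic; as far as I understand this only applies to pool-wide I/O failure, not to one slow mirror member
zpool offline rpool ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB_TW0XFJWX5508544H4195-part3    # manually kicking out the bad disk, which is what I hoped ZFS would do by itself
systemctl status zfs-zed             # the ZFS event daemon, which as far as I know is supposed to fault/retire devices with too many errors
cat /sys/module/zfs/parameters/zfs_deadman_enabled        # hung-I/O ("deadman") detection tunables,
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms    # which look related, but I haven't touched them
cat /sys/module/zfs/parameters/zfs_deadman_failmode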
My original ZFS RAID1 configuration was:

        NAME                                                                  STATE     READ WRITE CKSUM
        rpool                                                                 ONLINE       0     0     0
          mirror-0                                                            ONLINE       0     0     0
            ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB_TW0XFJWX5508544H4195-part3  ONLINE       0     0     0
            ata-SK_hynix_SH920_2.5_7MM_256GB_EIE8M008910808D6E-part3          ONLINE       0     0     0
Kernel log from when the problem started. SATA error messages like this keep repeating for the sda disk:
Dec 20 12:26:38 nsx kernel: [3441060.611911] sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Dec 20 12:26:38 nsx kernel: [3441060.611914] sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 07 49 77 d0 00 00 38 00
Dec 20 12:26:38 nsx kernel: [3441060.611916] print_req_error: I/O error, dev sda, sector 122255312 flags 701
Dec 20 12:26:38 nsx kernel: [3441060.611946] zio pool=rpool vdev=/dev/disk/by-id/ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB_TW0XFJWX5508544H4195-part3 error
Dec 20 23:35:58 nsx kernel: [3481218.870570] ata1.00: exception Emask 0x0 SAct 0xc000210 SErr 0x0 action 0x6 frozen
Dec 20 23:35:58 nsx kernel: [3481218.870610] ata1.00: failed command: WRITE FPDMA QUEUED
Dec 20 23:35:58 nsx kernel: [3481218.870630] ata1.00: cmd 61/00:20:58:5d:49/01:00:07:00:00/40 tag 4 ncq dma 131072 out
Dec 20 23:35:58 nsx kernel: [3481218.870630] res 40/00:00:06:01:80/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 20 23:35:58 nsx kernel: [3481218.870674] ata1.00: status: { DRDY }
Dec 20 23:35:58 nsx kernel: [3481218.870686] ata1.00: failed command: WRITE FPDMA QUEUED
Dec 20 23:35:58 nsx kernel: [3481218.870705] ata1.00: cmd 61/e8:48:58:5e:49/00:00:07:00:00/40 tag 9 ncq dma 118784 out
Dec 20 23:35:58 nsx kernel: [3481218.870705] res 40/00:00:00:01:80/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 20 23:35:58 nsx kernel: [3481218.870748] ata1.00: status: { DRDY }
Dec 20 23:35:58 nsx kernel: [3481218.870761] ata1.00: failed command: WRITE FPDMA QUEUED
Dec 20 23:35:58 nsx kernel: [3481218.870780] ata1.00: cmd 61/10:d0:10:8e:55/00:00:0e:00:00/40 tag 26 ncq dma 8192 out
Dec 20 23:35:58 nsx kernel: [3481218.870780] res 40/00:00:06:01:80/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 20 23:35:58 nsx kernel: [3481218.870822] ata1.00: status: { DRDY }
Dec 20 23:35:58 nsx kernel: [3481218.870835] ata1.00: failed command: WRITE FPDMA QUEUED
Dec 20 23:35:58 nsx kernel: [3481218.870853] ata1.00: cmd 61/50:d8:e0:3e:62/00:00:13:00:00/40 tag 27 ncq dma 40960 out
Dec 20 23:35:58 nsx kernel: [3481218.870853] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 20 23:35:58 nsx kernel: [3481218.870904] ata1.00: status: { DRDY }
Dec 20 23:35:58 nsx kernel: [3481218.870919] ata1: hard resetting link
Dec 20 23:35:58 nsx kernel: [3481219.185124] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 20 23:35:58 nsx kernel: [3481219.185708] ata1.00: configured for UDMA/133
Dec 20 23:35:58 nsx kernel: [3481219.185722] sd 1:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE