PVE 8.3.2 - SSD crashes on irregular basis - works after reboot

At irregular intervals, the SSD that holds my VMs crashes, so the VMs stop working. I see the following in the GUI.

1737536025184.png

Shutting down the system produces the error above on VM 200, which runs Windows 10, but the hardware does shut down in the end. After starting up again, everything works as usual, sometimes for months, sometimes for weeks. I run regular updates and reboots. Below is the log output from around the time it happened last night.

Does anybody have a clue how to solve or debug this? I can work around it with a shutdown, but that is not a real solution.

Jan 22 03:12:41 pve kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Jan 22 03:12:41 pve kernel: clocksource: 'hpet' wd_nsec: 488221547 wd_now: a64c0bd3 wd_last: a5e16167 mask: ffffffff
Jan 22 03:12:41 pve kernel: clocksource: 'tsc' cs_nsec: 495971079 cs_now: 399d320ffdf3e cs_last: 399d2ebe32b44 mask: ffffffffffffffff
Jan 22 03:12:41 pve kernel: clocksource: Clocksource 'tsc' skewed 7749532 ns (7 ms) over watchdog 'hpet' interval of 488221547 ns (488 ms)
Jan 22 03:12:41 pve kernel: clocksource: 'tsc' is current clocksource.
Jan 22 03:12:41 pve kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jan 22 03:12:41 pve kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jan 22 03:12:41 pve kernel: sched_clock: Marking unstable (564146159078148, -14823090444)<-(564131353404432, -17429249)
Jan 22 03:12:41 pve kernel: clocksource: Checking clocksource tsc synchronization from CPU 10 to CPUs 0-2,8,12-14.
Jan 22 03:12:41 pve kernel: clocksource: Switched to clocksource hpet
Jan 22 03:12:42 pve kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0xc0000 action 0x6 frozen
Jan 22 03:12:42 pve kernel: ata2: SError: { CommWake 10B8B }
Jan 22 03:12:42 pve kernel: ata2.00: failed command: DATA SET MANAGEMENT
Jan 22 03:12:42 pve kernel: ata2.00: cmd 06/01:01:00:00:00/00:00:00:00:00/a0 tag 4 dma 512 out res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 22 03:12:42 pve kernel: ata2.00: status: { DRDY }
Jan 22 03:12:42 pve kernel: ata2: hard resetting link
Jan 22 03:12:48 pve kernel: ata2: link is slow to respond, please be patient (ready=0)
Jan 22 03:12:52 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:12:53 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:12:53 pve kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 22 03:12:58 pve kernel: ata2.00: qc timeout after 5000 msecs (cmd 0xec)
Jan 22 03:12:58 pve kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 22 03:12:58 pve kernel: ata2.00: revalidation failed (errno=-5)
Jan 22 03:12:58 pve kernel: ata2: hard resetting link
Jan 22 03:13:03 pve kernel: ata2: link is slow to respond, please be patient (ready=0)
Jan 22 03:13:08 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:13:08 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:13:08 pve kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 22 03:13:18 pve kernel: ata2.00: qc timeout after 10000 msecs (cmd 0xec)
Jan 22 03:13:18 pve kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 22 03:13:18 pve kernel: ata2.00: revalidation failed (errno=-5)
Jan 22 03:13:18 pve kernel: ata2: limiting SATA link speed to 3.0 Gbps
Jan 22 03:13:18 pve kernel: ata2: hard resetting link
Jan 22 03:13:24 pve kernel: ata2: link is slow to respond, please be patient (ready=0)
Jan 22 03:13:28 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:13:28 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:13:28 pve kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jan 22 03:13:59 pve kernel: ata2.00: qc timeout after 30000 msecs (cmd 0xec)
Jan 22 03:13:59 pve kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 22 03:13:59 pve kernel: ata2.00: revalidation failed (errno=-5)
Jan 22 03:13:59 pve kernel: ata2.00: disable device
Jan 22 03:14:04 pve kernel: ata2: link is slow to respond, please be patient (ready=0)
Jan 22 03:14:09 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:14:09 pve kernel: ata2: found unknown device (class 0)
Jan 22 03:14:09 pve kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jan 22 03:14:09 pve kernel: ata2: EH complete
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=116s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=116s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#18 CDB: Write(10) 2a 00 15 b0 4d a8 00 00 08 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 363875752 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#13 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#14 CDB: Write same(16) 93 08 00 00 00 00 08 7a 2a 80 00 00 00 18 00 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 142224000 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 268445696 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 15 b0 4d e0 00 00 08 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 363875808 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 18 26 21 f0 00 00 18 00
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 18 12 fa 48 00 00 02 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 403896904 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 405152240 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#26 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#15 CDB: Write(10) 2a 00 09 d8 73 48 00 00 08 00
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#26 CDB: Write(10) 2a 00 04 0d 31 30 00 00 30 00
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 20 b4 89 d8 00 00 10 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 67973424 op 0x1:(WRITE) flags 0x8800 phys_seg 6 prio class 0
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 548702680 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 165180232 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 3869801, lost async page write
Jan 22 03:14:09 pve kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 06 07 85 00 00 00 08 00
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 101156096 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 5316177, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 5316178, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 5316179, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 5316180, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 5316181, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 782613, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 782614, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 121640, lost async page write
Jan 22 03:14:09 pve kernel: Buffer I/O error on dev dm-2, logical block 782615, lost async page write
Jan 22 03:14:09 pve pvestatd[1135]: status update time (108.318 seconds)
Jan 22 03:14:14 pve kernel: scsi_io_completion_action: 1005 callbacks suppressed
Jan 22 03:14:14 pve kernel: blk_print_req_error: 1005 callbacks suppressed
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 150812200 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 22 03:14:14 pve kernel: buffer_io_error: 116 callbacks suppressed
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 09 15 88 a8 00 00 08 00
Jan 22 03:14:14 pve kernel: Buffer I/O error on dev dm-2, logical block 2073797, async page read
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#24 CDB: Read(10) 28 00 08 e9 df c0 00 00 08 00
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 149544896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#2 CDB: Read(10) 28 00 0d 7d 12 98 00 00 08 00
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 152406184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: sd 1:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 09 cd 5c c8 00 00 08 00
Jan 22 03:14:14 pve kernel: Buffer I/O error on dev dm-2, logical block 1915384, async page read
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 226300568 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: Buffer I/O error on dev dm-2, logical block 2273045, async page read
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 164453576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: Buffer I/O error on dev dm-2, logical block 11509843, async page read
Jan 22 03:14:14 pve kernel: Buffer I/O error on dev dm-2, logical block 3778969, async page read
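To make the order of events in the log above easier to see, here is a minimal Python sketch that picks out the first failed ATA command, the moment the kernel disables the device, and a count of the later block-layer I/O errors per operation. The function name `summarize`, the regular expressions, and the shortened `SAMPLE` text are my own illustration, not an existing tool; the sample lines are copied from the log above. It assumes the syslog-style timestamp occupies the first 15 characters of each line.

```python
import re
from collections import Counter

# A few lines copied from the log above. In practice the text could be read
# from a saved journal export instead of this hard-coded sample.
SAMPLE = """\
Jan 22 03:12:42 pve kernel: ata2.00: failed command: DATA SET MANAGEMENT
Jan 22 03:13:59 pve kernel: ata2.00: disable device
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 363875752 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jan 22 03:14:09 pve kernel: I/O error, dev sda, sector 142224000 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 22 03:14:14 pve kernel: I/O error, dev sda, sector 150812200 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
"""

FAILED_CMD = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")
DISABLED = re.compile(r"(ata\d+\.\d+): disable device")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector \d+ op 0x[0-9a-f]+:\((\w+)\)")


def summarize(log_text):
    """Return the first failed ATA command, the disable event, and I/O error counts."""
    first_failed = None
    disabled = None
    io_errors = Counter()
    for line in log_text.splitlines():
        timestamp = line[:15]  # e.g. "Jan 22 03:12:42"
        match = FAILED_CMD.search(line)
        if match and first_failed is None:
            first_failed = (timestamp, match.group(1), match.group(2).strip())
        match = DISABLED.search(line)
        if match and disabled is None:
            disabled = (timestamp, match.group(1))
        match = IO_ERROR.search(line)
        if match:
            # Key is (block device, operation), e.g. ("sda", "WRITE").
            io_errors[(match.group(1), match.group(2))] += 1
    return {
        "first_failed_command": first_failed,
        "device_disabled": disabled,
        "io_errors": io_errors,
    }


if __name__ == "__main__":
    result = summarize(SAMPLE)
    print("first failed command:", result["first_failed_command"])
    print("device disabled:     ", result["device_disabled"])
    for (dev, op), count in sorted(result["io_errors"].items()):
        print(f"{dev} {op}: {count}")
```

Run against the full log, this should show whether the sector-level I/O errors appear before or only after the `disable device` line, which may help when judging the possible causes.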
 
Hi,

All of these error messages are generated by the kernel, so they are not really Proxmox-specific. You could feed them to any search engine and would probably get a lot of suggestions. :)
For example:
  • defective drive: check the SMART values and the cabling, have a backup ready, and replace the drive if necessary
  • disable power management for the drive
  • update BIOS, update firmware of RAID controller (if any)
  • update kernel
My guess is the first option, because the I/O errors mention specific sectors.
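For the power management suggestion, a quick way to see what is currently configured is to read the SATA link power management policy that the kernel exposes in sysfs. The sketch below is only an illustration: the function name `read_lpm_policies` is made up, and it assumes the host uses a driver (such as AHCI) that provides `link_power_management_policy` under `/sys/class/scsi_host/`. Which host entry belongs to the affected drive is also an assumption; the `sd 1:0:0:0` prefix in the log suggests `host1`.

```python
from pathlib import Path


def read_lpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Map each SCSI host name to its SATA link power management policy.

    Hosts without a link_power_management_policy file are skipped.
    """
    policies = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return policies
    for host in sorted(root.iterdir()):
        policy_file = host / "link_power_management_policy"
        if policy_file.is_file():
            policies[host.name] = policy_file.read_text().strip()
    return policies


if __name__ == "__main__":
    for host, policy in read_lpm_policies().items():
        # "max_performance" means link power saving is off for that port.
        note = "" if policy == "max_performance" else "  <- link power saving active"
        print(f"{host}: {policy}{note}")
```

If a port reports something other than `max_performance`, writing `max_performance` to that file as root turns link power saving off until the next reboot, which would be one way to test whether the dropouts are related to it.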
 
