[SOLVED] Proxmox web GUI freezes, file system read-only, exception frozen

skTom

New Member
Jan 4, 2024
Hi, I need help.

I reinstalled Proxmox a few days ago, and since then the system has been freezing. Every few hours, the web GUI and all VM/LXC instances stop responding, or the file system becomes read-only.

Code:
Jan 05 03:00:26 proxmox kernel: ata1.00: exception Emask 0x0 SAct 0x1003c SErr 0x0 action 0x6 frozen
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: cmd 61/08:10:00:3e:b4/00:00:11:00:00/40 tag 2 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 05 03:01:27 proxmox kernel: ata1.00: status: { DRDY }
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: cmd 61/08:18:b8:44:b4/00:00:11:00:00/40 tag 3 ncq dma 4096 out
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 05 03:01:27 proxmox kernel: ata1.00: status: { DRDY }
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: cmd 61/08:20:e0:f0:ef/00:00:0d:00:00/40 tag 4 ncq dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 05 03:01:27 proxmox kernel: ata1.00: status: { DRDY }
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: cmd 60/00:28:00:08:20/01:00:00:00:00/40 tag 5 ncq dma 131072 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 05 03:01:27 proxmox kernel: ata1.00: status: { DRDY }
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: cmd 61/10:80:00:b2:28/00:00:07:00:00/40 tag 16 ncq dma 8192 out
         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 05 03:01:27 proxmox kernel: ata1.00: status: { DRDY }
Jan 05 03:01:27 proxmox kernel: ata1: hard resetting link
Jan 05 03:01:27 proxmox kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 05 03:01:27 proxmox kernel: ata1: softreset failed (device not ready)
Jan 05 03:01:27 proxmox kernel: ata1: hard resetting link
Jan 05 03:01:27 proxmox kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 05 03:01:27 proxmox kernel: ata1: softreset failed (device not ready)
Jan 05 03:01:27 proxmox kernel: ata1: hard resetting link
Jan 05 03:01:27 proxmox kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 05 03:01:27 proxmox kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 05 03:01:27 proxmox kernel: ata1: softreset failed (device not ready)
Jan 05 03:01:27 proxmox kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 05 03:01:27 proxmox kernel: ata1: hard resetting link
Jan 05 03:01:27 proxmox kernel: ata1: softreset failed (device not ready)
Jan 05 03:01:27 proxmox kernel: ata1: reset failed, giving up
Jan 05 03:01:27 proxmox kernel: ata1.00: disable device
Jan 05 03:01:27 proxmox kernel: ata1: EH complete
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#2 CDB: Read(10) 28 00 38 ca 37 98 00 00 08 00
Jan 05 03:01:27 proxmox kernel: I/O error, dev sda, sector 952776600 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#24 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=96s
Jan 05 03:01:27 proxmox kernel: sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 11 b4 3e 00 00 00 08 00
Jan 05 03:01:27 proxmox kernel: I/O error, dev sda, sector 297025024 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
Jan 05 03:01:27 proxmox kernel: Buffer I/O error on dev dm-7, logical block 355696, lost async page write


pveversion -v
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

SMART

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.11-7-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Device Model:     INTENSO
Serial Number:    AA000000000000003231
Firmware Version: V0621A0
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jan  6 22:59:24 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x0001) SCT Status supported.


SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       4550
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       81
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       10
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       8998
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       129
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       2
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       41
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5050
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       74
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       78485
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       67568
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       190224

Is the problem with the SSD? I don't know how to interpret the SMART logs above.

Thanks.
 
Hello,

the SMART output looks good to me.

I think the problem is either with your SATA controller or cable, or with the SSD itself, even though SMART reports no errors.
Can you try a different cable or a different SSD to find out which part is faulty?
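
When swapping the cable or the SSD, it may help to compare how often the kernel reports ATA errors before and after each change. The following is a minimal sketch, not an official tool: it tallies the kinds of `ataN` messages seen in the log posted above (exceptions, failed commands, link resets). The names `SAMPLE_LOG`, `ATA_LINE`, and `summarize_ata_events` are made up for illustration, and the sample lines are copied from the log in the first post.

```python
import re
from collections import Counter

# Sample lines copied from the kernel log earlier in this thread.
SAMPLE_LOG = """\
Jan 05 03:00:26 proxmox kernel: ata1.00: exception Emask 0x0 SAct 0x1003c SErr 0x0 action 0x6 frozen
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan 05 03:01:27 proxmox kernel: ata1: hard resetting link
Jan 05 03:01:27 proxmox kernel: ata1: softreset failed (device not ready)
Jan 05 03:01:27 proxmox kernel: ata1: reset failed, giving up
Jan 05 03:01:27 proxmox kernel: I/O error, dev sda, sector 952776600 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
"""

# Matches "kernel: ata1: ..." and "kernel: ata1.00: ..." and captures the port name.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(log_text):
    """Count ATA events per port: (port, event kind) -> occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue  # not an ATA message (e.g. the "I/O error, dev sda" line)
        port, message = match.groups()
        if message.startswith("exception"):
            kind = "exception"
        elif message.startswith("failed command:"):
            kind = "failed command"
        elif "hard resetting link" in message:
            kind = "hard reset"
        elif "softreset failed" in message:
            kind = "softreset failed"
        elif "reset failed, giving up" in message:
            kind = "gave up"
        else:
            continue  # other ATA messages are ignored in this sketch
        counts[(port, kind)] += 1
    return counts


if __name__ == "__main__":
    for (port, kind), n in sorted(summarize_ata_events(SAMPLE_LOG).items()):
        print(f"{port}: {kind} x{n}")
```

To run it on a real system, the kernel log would have to be saved to a text file first and passed to `summarize_ata_events` in place of `SAMPLE_LOG`. If the counts for the same port drop to zero after changing only the cable, the cable is the likely culprit; if they persist, the controller or the drive is more likely.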
 
Hi, I noticed that the SMART results were not changing. I tested the drive and found that it was terribly slow and had a couple of bad sectors. Since I replaced the drive, everything has been fine so far; if the problem does not return within a few days, I will mark this topic as resolved.


EDIT: Marked as solved. The problem has not occurred since I replaced the disk.
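
For anyone hitting the same thing: in the attribute table from the first post, every row reports the same VALUE/WORST/THRESH triple (100/100/050), which fits the observation that the SMART results were not changing. The following is a rough sketch of how that could be checked from saved `smartctl` output. It is a heuristic assumed for illustration, not a smartmontools feature, and the helper names (`parse_attributes`, `normalized_values_look_static`, `unchanged_raw`) are made up. The sample rows are copied from the table above.

```python
# A few rows copied from the smartctl attribute table earlier in this thread.
SAMPLE_TABLE = """\
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       4550
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       78485
"""


def parse_attributes(table_text):
    """Return {attribute id: (name, value, worst, thresh, raw)} for each row."""
    attrs = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = (
            fields[1],       # ATTRIBUTE_NAME
            int(fields[3]),  # VALUE
            int(fields[4]),  # WORST
            int(fields[5]),  # THRESH
            fields[9],       # first token of RAW_VALUE, kept as text
        )
    return attrs


def normalized_values_look_static(attrs):
    """True if every attribute reports the same VALUE/WORST/THRESH triple."""
    triples = {(value, worst, thresh) for _, value, worst, thresh, _ in attrs.values()}
    return len(attrs) > 1 and len(triples) == 1


def unchanged_raw(before, after):
    """IDs whose raw value is identical in two snapshots taken at different times."""
    return sorted(i for i in before if i in after and before[i][4] == after[i][4])


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE_TABLE)
    print("normalized values look static:", normalized_values_look_static(attrs))
```

Identical normalized values on every row do not prove that a drive is bad, but they suggest the PASSED result should not be trusted on its own. Comparing two snapshots with `unchanged_raw` after some hours of uptime and writes shows whether counters such as Power_On_Hours or Total_LBAs_Written move at all.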
 