[SOLVED] File System keeps going read only

TheBR

New Member
Jan 30, 2023
Hi,
I have Proxmox installed, and it seems that every 6 or 7 days the file system goes read-only and I can't work out why.

Looking at the syslog I see the following when it happens - prior to this the system is fine.

Code:
Jan 30 00:39:53 proxmox systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Jan 30 00:39:54 proxmox systemd[1]: fstrim.service: Main process exited, code=exited, status=32/n/a
Jan 30 00:39:54 proxmox fstrim[2136896]: fstrim: /: FITRIM ioctl failed: Bad message
Jan 30 00:39:54 proxmox systemd[1]: fstrim.service: Failed with result 'exit-code'.
Jan 30 00:39:54 proxmox systemd[1]: Failed to start Discard unused blocks on filesystems from /etc/fstab.
Jan 30 00:39:55 proxmox pvestatd[1002]: can't lock file '/var/log/pve/tasks/.active.lock' - can't open file - Read-only file system
Jan 30 00:40:00 proxmox kernel: [623045.863680] EXT4-fs error (device loop0): ext4_journal_check_start:83: comm database: Detected aborted journal
Jan 30 00:40:00 proxmox kernel: [623045.863827] blk_update_request: I/O error, dev loop0, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
Jan 30 00:40:00 proxmox kernel: [623045.863837] Buffer I/O error on dev loop0, logical block 0, lost sync page write
Jan 30 00:40:00 proxmox kernel: [623045.863851] EXT4-fs (loop0): I/O error while writing superblock
Jan 30 00:40:00 proxmox kernel: [623045.863858] EXT4-fs (loop0): Remounting filesystem read-only
Jan 30 00:40:03 proxmox pvescheduler[2136939]: replication: can't lock file '/var/lib/pve-manager/pve-replication-state.lck' - can't open file - Read->
Jan 30 00:40:05 proxmox pvestatd[1002]: can't lock file '/var/log/pve/tasks/.active.lock' - can't open file - Read-only file system
Jan 30 00:40:15 proxmox pvestatd[1002]: can't lock file '/var/log/pve/tasks/.active.lock' - can't open file - Read-only file system
Jan 30 00:40:25 proxmox pvestatd[1002]: can't lock file '/var/log/pve/tasks/.active.lock' - can't open file - Read-only file system

As the system was read-only I ran an fsck on pve-root, which detected issues:

Code:
Running fsck.ext4 -fp /dev/mapper/pve-root

<snip>
JBD2: Invalid checksum recovering data block 524525 in log
JBD2: Invalid checksum recovering data block 524531 in log
JBD2: Invalid checksum recovering data block 524531 in log
JBD2: Invalid checksum recovering data block 524445 in log
JBD2: Invalid checksum recovering data block 524520 in log
JBD2: Invalid checksum recovering data block 524539 in log
JBD2: Invalid checksum recovering data block 524331 in log
JBD2: Invalid checksum recovering data block 524541 in log
JBD2: Invalid checksum recovering data block 4718870 in log
JBD2: Invalid checksum recovering data block 8912949 in log
JBD2: Invalid checksum recovering data block 0 in log
JBD2: Invalid checksum recovering data block 524531 in log
JBD2: Invalid checksum recovering data block 524531 in log
Journal checksum error found in /dev/mapper/pve-root
/dev/mapper/pve-root: Inode 131266, i_blocks is 608, should be 288.  FIXED.
/dev/mapper/pve-root: Inode 134094 extent tree (at level 1) could be shorter.  IGNORED.
/dev/mapper/pve-root: Inode 134094, i_blocks is 672, should be 8.  FIXED.
/dev/mapper/pve-root: Deleted inode 1196137 has zero dtime.  FIXED.
/dev/mapper/pve-root: Inodes that were part of a corrupted orphan linked list found.

/dev/mapper/pve-root: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)

root@proxmox:~# fsck.ext4 -f /dev/mapper/pve-root
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 134094 extent tree (at level 1) could be shorter.  Optimize<y>? yes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 2490382 was part of the orphaned inode list.  FIXED.
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Entry 'pve-replication-state.json' in /var/lib/pve-manager (133707) has deleted/unused inode 134360.  Clear<y>? yes
Entry 'rrd.journal.1675038983.759590' in /var/lib/rrdcached/journal (134284) has deleted/unused inode 134360.  Clear<y>? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 134017 ref count is 1, should be 2.  Fix<y>? yes
Unattached zero-length inode 134094.  Clear<y>? yes
Pass 5: Checking group summary information
Block bitmap differences:  -(649800--649803) -649805 -649825 -(649828--649829) -649840 +(649847--649848) +(649851--649858) +(649863--649868) -(1043968--1044120) -(3622960--3623007) +(3623412--3623426) -(3623488--3623506) +4751570 +(4751597--4751600) +4751641 -(8396832--8397606) -(8397608--8397879) -(10305985--10305989) -10305999 +10306009 -10306014 +10306029 +(10310205--10310208) -(11075712--11075722) -(12025856--12025901) -12025936 -12025947 -(12025949--12026086) -12026088 -(12026112--12026260) -(12026272--12026295) -(12026304--12026322) -(12026336--12026352) -(12026368--12027903) -(12028801--12028927) -(12029591--12058623) -12512126 -12512128 -(12512130--12512131) -(12512154--12512157) -12512164 -(17240347--17240348) -17240350 -17240352 +17240354 -17240373 +(17240420--17240427) +(17240432--17240474) +17240476
Fix<y>? yes
Free blocks count wrong for group #19 (9664, counted=9665).
Fix<y>? yes
Free blocks count wrong for group #31 (5361, counted=5514).
Fix<y>? yes
Free blocks count wrong for group #110 (3571, counted=3666).
Fix ('a' enables 'yes' to all) <y>? yes
Free blocks count wrong for group #145 (32709, counted=32716).
Fix ('a' enables 'yes' to all) <y>? yes
Free blocks count wrong for group #164 (10551, counted=10552).
Fix ('a' enables 'yes' to all) <y>? yes
Free blocks count wrong for group #314 (22634, counted=22677).
Fix ('a' enables 'yes' to all) <y>? yes
Free blocks count wrong for group #338 (32635, counted=32640).
Fix<y>? yes
Free blocks count wrong for group #381 (6537, counted=6546).
Fix<y>? yes
Free blocks count wrong for group #526 (4978, counted=4986).
Fix<y>? yes
Free blocks count wrong (15432177, counted=15344951).
Fix<y>? yes
Inode bitmap differences:  -134360 -1196137
Fix<y>? yes
Free inodes count wrong for group #16 (4570, counted=4572).
Fix<y>? yes
Free inodes count wrong for group #146 (17, counted=18).
Fix<y>? yes
Free inodes count wrong (4229066, counted=4229005).
Fix<y>? yes

/dev/mapper/pve-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/pve-root: ***** REBOOT SYSTEM *****
/dev/mapper/pve-root: 88179/4317184 files (0.2% non-contiguous), 1896137/17241088 blocks
root@proxmox:~# fsck.ext4 -f /dev/mapper/pve-root
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/pve-root: 88179/4317184 files (0.2% non-contiguous), 1896137/17241088 blocks

As you can see, there are lots of errors on the root volume.

However, if I run smartctl on my drive it passes with no issues:

Code:
root@proxmox:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                          7:0    0     4G  0 loop
loop1                          7:1    0    12G  0 loop
sda                            8:0    0 223.6G  0 disk
├─sda1                         8:1    0  1007K  0 part
├─sda2                         8:2    0   512M  0 part
└─sda3                         8:3    0 223.1G  0 part
  ├─pve-swap                 253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0  65.8G  0 lvm  /
  ├─pve-data_tmeta           253:2    0   1.3G  0 lvm
  │ └─pve-data-tpool         253:4    0 130.6G  0 lvm
  │   ├─pve-data             253:5    0 130.6G  1 lvm
  │   └─pve-vm--100--disk--0 253:6    0    16G  0 lvm
  └─pve-data_tdata           253:3    0 130.6G  0 lvm
    └─pve-data-tpool         253:4    0 130.6G  0 lvm
      ├─pve-data             253:5    0 130.6G  1 lvm
      └─pve-vm--100--disk--0 253:6    0    16G  0 lvm
sdb                            8:16   0  14.6T  0 disk /mnt/USBData
mmcblk0                      179:0    0   7.3G  0 disk
mmcblk0boot0                 179:8    0     4M  1 disk
mmcblk0boot1                 179:16   0     4M  1 disk
root@proxmox:~# smartctl -H /dev/sda3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.83-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Finally, after looking around I've seen a few mentions of this being caused by low disk space. While I didn't capture this before running fsck and rebooting, after the system came back up post-fsck I can see the following:

Code:
root@proxmox:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                  7.8G     0  7.8G   0% /dev
tmpfs                 1.6G  2.5M  1.6G   1% /run
/dev/mapper/pve-root   65G  5.7G   56G  10% /
tmpfs                 7.8G   46M  7.7G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
/dev/sdb               15T  6.6T  7.2T  49% /mnt/USBData
/dev/fuse             128M   16K  128M   1% /etc/pve
tmpfs                 1.6G     0  1.6G   0% /run/user/0

So no disk appears to be full here, at least not after a reboot.

Can anyone help me out here? I would suspect a failing drive if not for the SMART tests passing, and the consistent week of working before it fails makes me think something is triggering this.
 
An IO error is a very good indicator that an actual IO error happened. Absent any special software shims, it usually points to the disk.
A SMART error is also a good indicator of some sort of disk problem; however, its absence is not an indication of a fully healthy disk.
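For a deeper look than the overall health flag, the full attribute table, error log and self-test history are usually more telling. A quick sketch, assuming the boot disk is /dev/sda:

Code:
# Full SMART attributes - watch reallocated/pending sectors and CRC errors
smartctl -a /dev/sda
# Device error log and self-test history
smartctl -l error /dev/sda
smartctl -l selftest /dev/sda
# Optionally start a long self-test in the background
smartctl -t long /dev/sda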

If you grep for "Discard" in your logs, you will probably find:
Code:
Jan 30 06:30:55 proxmox7-nvme2 systemd[1]: Started Discard unused blocks once a week.

What could be running this service? It's not in crontab.
If you google the other messages, they should lead you to fstrim. Let's check the services for its presence:
Code:
systemctl |grep -i fst
  fstrim.timer                                                                                     loaded active     waiting   Discard unused blocks once a week

So based on the presented facts we know it's the fstrim operation that is being run once a week.
You have not mentioned the make and model of your disk. Many cheap and obscure consumer disks are not meant to be run 24x7 with a server-type load.
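To line the weekly timer up with the read-only events, and to get the disk make/model, something like this should do (a sketch; the stock fstrim unit names and physical disks are assumed):

Code:
# When did fstrim last fire, and when will it fire next?
systemctl list-timers fstrim.timer
# Full history of the fstrim service, to compare against the read-only timestamps
journalctl -u fstrim.service --no-pager
# Make, model and serial of the physical disks
lsblk -d -o NAME,MODEL,SERIAL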


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 

Hi, and many thanks for the response. The drive is a KINGSTON SUV500240G, but this Proxmox box is only used for pfSense, Pi-hole and a UniFi controller. It's not exactly working its ass off.

Given the evidence so far, would you agree a replacement drive is the best start here?
 
It's not exactly working its ass off.
It's not about quantity, it's about quality. People don't run fstrim jobs on their Windows laptops.
Given the evidence so far, would you agree a replacement drive is the best start here?
I would probably look first at whether there are other reports about fstrim interactions with this disk. Run fstrim manually to confirm, check the firmware, and run the vendor's testing tool if one exists. But I would also weigh that against the ease/cost of simply replacing the disk.
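Roughly, the manual check could look like this (just a sketch; adjust the mount point to your layout):

Code:
# Trim the root filesystem by hand
fstrim -v /
# Then check the kernel log straight away for I/O or FITRIM errors
dmesg -T | tail -n 30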


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Well,
I tried running fstrim manually and I see the following; I can also see that fstrim is set to run once a week:

Code:
root@proxmox:~# fstrim --fstab --verbose
/: 58.5 GiB (62837260288 bytes) trimmed on /dev/pve/root
root@proxmox:~# systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2023-01-30 12:44:14 GMT; 2h 8min ago
    Trigger: Mon 2023-02-06 00:44:54 GMT; 6 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

So that seemed to work - at least for the moment.

I did find a post which suggested that fstrim wasn't supported on Kingston drives, but it was from 2016 IIRC, maybe 2018. I also found another post suggesting a number of ways to check here - https://unix.stackexchange.com/questions/584549/how-do-i-check-if-my-ssd-supports-fstrim

However, mine does seem to support it: hdparm reports TRIM support, and the non-zero DISC-GRAN and DISC-MAX values in the lsblk output below are another indicator.

Code:
root@proxmox:~# hdparm -I /dev/sda3 | grep -i trim
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read data after TRIM

I also ran lsblk --discard:

Code:
root@proxmox:~# lsblk --discard

NAME                         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
loop0                               0        4K       4G         0
loop1                               0        4K       4G         0
sda                                 0        4K       2G         0
├─sda1                           3072        4K       2G         0
├─sda2                              0        4K       2G         0
└─sda3                              0        4K       2G         0
And in fact I found a spec sheet stating that TRIM is supported at the specific firmware level I'm running - https://smarthdd.com/database/KINGSTON-SUV500240G/003056RI/
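(For anyone wanting to double-check the same thing, the firmware the drive reports can be compared against that sheet - a sketch, assuming the boot disk is /dev/sda:)

Code:
# Model and firmware revision as reported by the drive itself
smartctl -i /dev/sda | grep -iE 'device model|firmware version'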

Not sure what to do here. I'm tempted to just get another drive for about £30 and hopefully be done with it (Samsung 870 Evo or Crucial MX500).
 
Final update - new SSD installed and working. When I took the old one out and attached it to a laptop to clone it, it even failed cloning due to an IO error, so it was the drive at fault. So much for SMART, though. Thumbs up to me for having backups of my VMs and containers!
 
Please mark the thread as solved.

Best Regards