Proxmox EXT4 FS Error on R210ii with four SSDs

Feb 26, 2021
I am running Proxmox on a Dell R210 II, which I recently upgraded to four Kingston A400 SSDs in RAID 10 on the H200 RAID controller. After the machine had run for a few hours, it hit an error and remounted the root filesystem read-only. (I've seen some answers that suggest changing the mount options in /etc/fstab to errors=continue, but I'd rather identify and fix the source of the problem.) I ran fsck and was able to fix the problems, but I would like to prevent this from happening again when I am not around to fix it.

Potential Cause 1: Kingston Firmware. I bought the four SSDs off Amazon, and unfortunately the cheapest seller limited purchases to three units, so I had to buy the fourth from a different seller. When I set these up in RAID 10, I noticed that the first three drives are running the same firmware version (SA400S30107) but the fourth drive is running a different one (SA400S30009). See the screenshot here: https://imgur.com/a/Y5TI3tO. I'm not sure whether or how to run a firmware update on them, given that the disks are in a RAID configuration and, to the best of my knowledge, the H200 does not support disk pass-through.

Potential Cause 2: Power Saving Issues. I found a similar problem on Ubuntu (see https://askubuntu.com/a/1113288) that has to do with the power-saving features of NVMe drives. I don't think this is the problem, since it pertains to NVMe drives and my SSDs are SATA; I'm including it for completeness. The link recommends passing the following kernel option via GRUB at boot: nvme_core.default_ps_max_latency_us=6000.

Potential Cause 3: Time Synchronization Issues. Unfortunately, I cannot find the post where I read about this possible cause. I vaguely remember that problems can stem from differences between the hardware clock and the time set by the NTP server. I'm putting this here in case someone has more thoughts on it.
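
To make the firmware mismatch from Potential Cause 1 concrete, here is a minimal Python sketch that flags any drive whose firmware differs from the majority of the array. The drive labels (disk0 to disk3) are hypothetical; the firmware strings are the ones reported above. It is only an illustration of the check, not a tool that reads firmware from the controller.

```python
from collections import Counter

def odd_firmware(firmware_by_drive):
    """Return the drives whose firmware differs from the most common version."""
    counts = Counter(firmware_by_drive.values())
    majority, _ = counts.most_common(1)[0]
    return {drive: fw for drive, fw in firmware_by_drive.items() if fw != majority}

# Drive labels are hypothetical; firmware versions are taken from the post.
drives = {
    "disk0": "SA400S30107",
    "disk1": "SA400S30107",
    "disk2": "SA400S30107",
    "disk3": "SA400S30009",
}
print(odd_firmware(drives))
```

With the versions above, only the fourth drive is reported as the odd one out.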

I would appreciate any advice on identifying and fixing this problem.
 
Hi,

Can you post the syslog/journal/dmesg output from the time it goes into read-only mode? That might shed some light on the cause of the issue.
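
One possible way to collect just the relevant kernel messages is sketched below in Python. It assumes a systemd-based host where journalctl is available; the function names and the pattern list are illustrative choices, not anything Proxmox ships.

```python
import re
import subprocess

# Substrings typical of block-layer and ext4 failures; extend as needed.
PATTERNS = re.compile(
    r"I/O error|EXT4-fs|Aborting journal|Sense Key|FAILED Result"
)

def filter_storage_errors(lines):
    """Keep only the lines that look like storage or filesystem errors."""
    return [line for line in lines if PATTERNS.search(line)]

def kernel_log(since, until):
    """Fetch kernel messages for a time window via journalctl -k."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager", "--since", since, "--until", until],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, filter_storage_errors(kernel_log("today", "now")) would return today's matching kernel messages; narrow the window to the time of the read-only remount once you know it.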

While the firmware or power saving might be a factor, I would not completely rule out the RAID controller itself (firmware, cables, incompatibility).

Regarding "changing the mount options in /etc/fstab to errors=continue": yeah, on a production server I'd never use that option if my data is important.
 
@dcsapak here are the relevant (and abridged, as the rsyslog error repeats for some time) journal logs:

Code:
Mar 05 10:31:34 pve1a kernel: sd 0:1:0:0: [sda] tag#2404 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 05 10:31:34 pve1a kernel: sd 0:1:0:0: [sda] tag#2404 Sense Key : Not Ready [current]
Mar 05 10:31:34 pve1a kernel: sd 0:1:0:0: [sda] tag#2404 Add. Sense: Logical unit not ready, initializing command required
Mar 05 10:31:34 pve1a kernel: sd 0:1:0:0: [sda] tag#2404 CDB: Write(10) 2a 00 04 54 cf 78 00 00 10 00
Mar 05 10:31:34 pve1a kernel: blk_update_request: I/O error, dev sda, sector 72667000 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Mar 05 10:31:34 pve1a kernel: Aborting journal on device dm-1-8.
Mar 05 10:31:34 pve1a kernel: EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
Mar 05 10:31:34 pve1a kernel: EXT4-fs (dm-1): Remounting filesystem read-only
Mar 05 10:31:34 pve1a rsyslogd[655]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: file '8' write error: Read-only file system [v8.1901.0 try https://www.rsyslog.com/e/2027 ]
Mar 05 10:31:34 pve1a rsyslogd[655]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0 try https://www.rsyslog.com/e/2027 ]

And here's what dmesg says:

Code:
[Fri Mar  5 10:31:34 2021] sd 0:1:0:0: [sda] tag#2404 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Mar  5 10:31:34 2021] sd 0:1:0:0: [sda] tag#2404 Sense Key : Not Ready [current]
[Fri Mar  5 10:31:34 2021] sd 0:1:0:0: [sda] tag#2404 Add. Sense: Logical unit not ready, initializing command required
[Fri Mar  5 10:31:34 2021] sd 0:1:0:0: [sda] tag#2404 CDB: Write(10) 2a 00 04 54 cf 78 00 00 10 00
[Fri Mar  5 10:31:34 2021] blk_update_request: I/O error, dev sda, sector 72667000 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[Fri Mar  5 10:31:34 2021] Aborting journal on device dm-1-8.
[Fri Mar  5 10:31:34 2021] EXT4-fs error (device dm-1): ext4_journal_check_start:61: Detected aborted journal
[Fri Mar  5 10:31:34 2021] EXT4-fs (dm-1): Remounting filesystem read-only

Syslog begins losing messages at the point that the root filesystem remounts read only.
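
As a cross-check on the log above, the failed command can be decoded by hand. The sketch below assumes the standard 10-byte SCSI WRITE(10) layout (opcode 0x2a, logical block address in bytes 2 to 5, transfer length in bytes 7 to 8, both big-endian); the function name is my own.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: transfer length in logical blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }

# CDB copied from the kernel log above.
print(decode_write10("2a 00 04 54 cf 78 00 00 10 00"))
# {'lba': 72667000, 'blocks': 16}
```

The decoded LBA, 72667000, is the same sector that blk_update_request reports, so the I/O error and the rejected 16-block write refer to the same request on the RAID volume (sda), which then triggers the ext4 journal abort.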
 
Update: I pulled the array out of the R210 II and used the Kingston SSD Manager (which unfortunately runs only on Windows) to check each drive for (1) firmware updates and (2) errors. The drives are at full health (according to Kingston, at least; if you know of better drive health-test software, please share it with me). The Kingston SSD Manager also reported that no new firmware is available for the drive running firmware version SA400S30009... strange. I left Kingston technical support a voicemail.

As you stated, it might make sense to do a firmware update on the H200 RAID Controller, which is running v7.11.10.00 (2011.06.02). Here's a potentially applicable firmware update that mentions fixing Disk IO errors: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=081kv.

I previously ran Proxmox on two HDDs in RAID 1 using this same controller, so I am skeptical that the H200 is causing the problem. I will report back after reading through this user manual: https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_dell_adapters/poweredge-rc-h200_user's guide_en-us.pdf.

I'm debating whether to pick up an additional A400 running the correct firmware, but I don't think I will right now (out of principle) since we haven't established that the firmware is the cause of this problem, and because I would be rewarding Kingston with more business after receiving what may very well be a substandard product.
 
I was able to borrow another Kingston A400, running firmware version SBFKB1H5, from a coworker. I swapped it in for the drive running version SA400S30009, reconfigured RAID 10, and reinstalled Proxmox. After 72 hours of uptime, there have been no filesystem errors. It seems to have been a firmware problem.
 
