One of the servers is a few years old, and it's conceivable the SSDs are wearing out since it's a build server and gets heavy use. However, we had no issues while it was running PVE7; we upgraded when PVE7 went EOL and have been having issues with it ever since. The drives report they are at 70% lifetime.
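(The wear figure is presumably from SMART. For anyone wanting to check their own drives, something like this shows it; the device paths are just examples and the exact attribute name varies by vendor and interface.)
Code:
# SATA/SAS SSD: look for wear/endurance attributes (names vary by vendor)
smartctl -a /dev/sda | grep -i -E 'wear|endurance|percent'
# NVMe SSD: "Percentage Used" in the SMART/Health log
smartctl -a /dev/nvme0 | grep -i 'percentage used'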
We are getting occasional `lost sync page write` errors on seemingly random volumes, which leans me towards a bad cable or connector.
However, the server has two sets of drives (we recently added more) on four different SAS connectors, so it seems unlikely that one of the old cables failed and one of the new cables is also bad. (We have gotten sync errors on both the old set of drives and the new set.)
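For anyone chasing the same thing: you can map an affected dm-* device back to its LV and the underlying disk to see whether the errors cluster behind one particular cable/port. Roughly like this (dm-173 is just taken from the log below):
Code:
# dm-173 is minor number 173; the /dev/mapper symlink names the LV
ls -l /dev/mapper/ | grep -w 'dm-173'
# show which physical volume(s) back each LV
lvs -a -o lv_name,vg_name,devices
# show disk topology (transport, model, serial) to tie a disk to an HBA port
lsblk -o NAME,KNAME,TYPE,TRAN,MODEL,SERIAL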
The other server is brand new (two months old, all new drives) and also just threw a Buffer I/O error. It's possible we got a lemon, but now that we've seen this on two PVE8 servers I wanted to post here in case others are seeing it too.
Actual drive failure on both servers is unlikely.
I was preparing to replace the cables in the old server until I saw the Buffer I/O error on the new server, which leans me towards a kernel issue.
We use LVM on SSD arrays. I am aware of the discard/RAID/RZAT et al. issues and carefully selected, tested, and verified the drives.
(And the old server ran for years without issue on PVE7.)
(FWIW, we have three other PVE8 servers with similar configurations running without issue, but they are not as heavily used as these two; e.g. one is purely for testing.)
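(For anyone comparing notes, the discard/RZAT verification I mean looks roughly like this; the device path is an example, and SAS drives expose the equivalent via SCSI VPD pages rather than hdparm.)
Code:
# SATA: does the drive advertise TRIM and deterministic/zeroed reads after TRIM?
hdparm -I /dev/sda | grep -i trim
# is discard exposed through the whole LVM stack?
lsblk --discard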
Code:
[230035.115918] Buffer I/O error on dev dm-173, logical block 9318, lost sync page write
[230035.115929] EXT4-fs error (device dm-173): kmmpd:185: comm kmmpd-dm-173: Error writing to MMP block
[230035.116688] Aborting journal on device dm-173-8.
[230245.343432] EXT4-fs error (device dm-173): ext4_journal_check_start:84: comm syslogd: Detected aborted journal
[230245.348910] EXT4-fs (dm-173): Remounting filesystem read-only
Code:
[1041671.640466] Buffer I/O error on dev dm-195, logical block 9255, lost sync page write
[1041671.640474] EXT4-fs error (device dm-195): kmmpd:185: comm kmmpd-dm-195: Error writing to MMP block
[1041671.640910] Aborting journal on device dm-195-8.
[1041815.982074] EXT4-fs error (device dm-195): ext4_journal_check_start:84: comm journal-offline: Detected aborted journal
[1041815.982418] EXT4-fs (dm-195): Remounting filesystem read-only
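The kmmpd / "Error writing to MMP block" lines mean the affected filesystems have ext4 multi-mount protection enabled; if it helps anyone reproduce, that can be confirmed with something like this (device name taken from the log above):
Code:
# check whether the mmp feature is enabled on the filesystem
tune2fs -l /dev/dm-173 | grep -i -E 'features|mmp'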