ext4 failure, can it be caused by cache=writeback?

We tried running one of our Win2k8 VMs with cache=writeback to test its disk performance, since with the default cache=none we get very poor disk performance (io wait is about 20% with only one active VM). The host system had pve-kernel-2.6.32-29-pve. After several minutes of not very heavy write activity (downloading files over a 100M network), the host remounted / as read-only, and dmesg showed the following:


EXT4-fs error (device md0): ext4_ext_search_left: inode #27526386: (comm kvm) ix (4064983) != EXT_FIRST_INDEX (0) (depth 1)!
Aborting journal on device md0-8.
EXT4-fs error (device md0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (md0): Remounting filesystem read-only
EXT4-fs (md0): Remounting filesystem read-only
EXT4-fs (md0): delayed block allocation failed for inode 27526386 at logical offset 4064983 with max blocks 617 with error -5

This should not happen!! Data will be lost
EXT4-fs error (device md0) in ext4_da_writepages: IO failure
EXT4-fs (md0): ext4_da_writepages: jbd2_start: 9223372036854774198 pages, ino 27526386; err -30


Can this problem be caused by cache=writeback?
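For reference, switching the cache mode is a single host-side command; the sketch below is only illustrative (the VM ID 100 and the virtio0 disk name are placeholders, not details from the post above):

# set writeback caching on an existing virtio disk (VM ID and volume name are examples)
qm set 100 --virtio0 local:vm-100-disk-1,cache=writeback
# revert to the Proxmox default
qm set 100 --virtio0 local:vm-100-disk-1,cache=none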
 
Very likely. Since there is a 20% performance difference between cache=none and cache=writeback, I suspect you have a badly behaving IO subsystem (disks, bus, etc.). What kind of storage do you use?
 
There you have your answer. Software RAID1 will effectively reduce your write speed, and presumably your IO, by 50%. The SATA3 drives are not the problem, given that they are attached to at least SATA 2 ports. Whether you use SATA 2 or SATA 3 makes no difference for HDDs.
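If in doubt, the negotiated link speed can be checked directly on the host; a quick sketch, assuming the drive is /dev/sda and smartmontools is installed:

# prints a line like "SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)"
smartctl -i /dev/sda | grep -i sata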
 
Can this problem be caused by cache=writeback?

Writeback should not be a problem.
You can lose data in case of a power failure, but it shouldn't break the filesystem.

Did you mount your ext4 with barrier=0?
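For anyone who wants to check, the effective barrier setting is visible in the mount options; a small sketch, assuming / sits on /dev/md0 as in the log above:

# barrier=1 means write barriers are enabled, barrier=0 (or nobarrier) means they are off
grep ' / ' /proc/mounts
# default options stored in the filesystem superblock
tune2fs -l /dev/md0 | grep -i 'mount options'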
 
There you have your answer. Software RAID1 will effectively reduce your write speed, and presumably your IO, by 50%. The SATA3 drives are not the problem, given that they are attached to at least SATA 2 ports. Whether you use SATA 2 or SATA 3 makes no difference for HDDs.

Only for a broken software raid implementation. Writes *should* be able to go in parallel.
 
I stand corrected on write performance, but for IOPS the performance is reduced by 50%, since every write requires 2 IOPS (one write to each disk in the array).
 
That is a distinction without a difference. A write comes into the storage layer. It issues a write to block N on disk0, then issues the same write to block N on disk1. Both writes need to complete for the logical write to complete, but as they go in parallel, the IOPS should be the same.
 
Whether writes go in parallel or not, every single write still requires 2 IOPS, and since IOPS are limited you are, by force of nature, facing a performance hit.
 
I'm trying to be polite here, but this is nonsense. Any useful metric involving IOPS is related to what an application, remote host, etc., can do. No one cares how many low-level writes are done by the storage subsystem. In a very literal sense there are two writes being done, but this should never be visible to a user. So unless you have a bandwidth limitation at the controller level, PCI Express bus level, etc., this will never be relevant; under any real-world scenario it will never occur. If you put two SATA disks in a ZFS RAID-1 (mirror) and run iozone or some such, you will see the same write IOPS as with a single disk.
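This is easy to test for yourself. A minimal fio run along these lines (the test file path is just an example, and fio has to be installed) reports the client-visible random write IOPS, which is the number that actually matters; run it once against a single disk and once against the md mirror and compare:

# 4k random writes, direct I/O, 30 second run
fio --name=randwrite --filename=/mnt/test/fio.tmp --size=1G \
    --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=30 --time_based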
 
Whether writes go in parallel or not, every single write still requires 2 IOPS, and since IOPS are limited you are, by force of nature, facing a performance hit.

Hi mir
About IOPS, please let me ask a question...
- As I understand it, if the writes go in parallel, then 2 IOPS are two outputs happening in the same second rather than one after the other, so there is no performance loss in each second that elapses.
- Also, since modern processors and RAM work much faster than SATA or SAS devices (3 or 6 Gb/s on the bus), I guess the performance will be the same as with a single HDD (and maybe for reads I can get more speed, if mdadm can read different data in parallel).

If you don't agree, can you give me a more precise and technical explanation?

Best regards
Cesar
 
Either there is a language issue here, or he is completely clueless. If you can cite a *single* source that backs your point of view, feel free. One last time, IOPS is from the point of view of the client, not individual spindles (or whatever). Or, if you try to claim otherwise: since the non-broken RAID1 will do the writes in parallel, it has twice the IOPS of a single disk - either way, it's not a performance hit.
 
dswartz, no offense, but if mdadm can read different data in parallel, the heads of the hard disks will be in different positions most of the time, so for writes we may have a delay until the last head performs the write; from this it obviously follows that there will be fewer IOPS for writes and more IOPS for reads.

I don't know whether mdadm can do multiple reads of disk blocks in parallel. If you have a source of information, please share it here.

Best regards
Cesar
 
No, fstab has only the options "noatime,errors=remount-ro", so mount shows "rw,noatime,errors=remount-ro,barrier=1,data=ordered".
 
Cesar, let me try it this way: if you are really doing random writes, chances are both drives' heads will be out of position and will need to be moved before the requested block(s) can be written. If so, it doesn't really matter whether the two drives' heads are in different locations or not - if they are, the seeks can happen simultaneously (as part of the write request to each disk). Think of it this way: if the drive heads are in the same place, it takes X ms to seek and write, fine. If they are not in the same place, it's as likely that one will take less than X ms to seek and write as it is that it will take longer. I can't speak for md RAID, but with ZFS, reads absolutely ARE load-balanced, and it's typical to see about 2X the random read IOPS compared to random writes. I don't know how to make it any clearer than that :(
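The read side is just as easy to observe: run a random read test against the mirror while watching the member disks in a second terminal; both should show read activity at once. A sketch only - the file path and the device names sda/sdb are assumptions, adjust them to the actual members of md0:

# 4k random reads, direct I/O, 30 second run
fio --name=randread --filename=/mnt/test/fio.tmp --size=1G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=30 --time_based
# per-device utilisation every 5 seconds (sysstat package)
iostat -x sda sdb 5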
 
dswartz, thanks for your answer. I do agree with you; I just asked whether mdadm can do multiple reads in parallel (and I expected a yes-or-no answer with some grounding). Take it easy, it's only a forum.... :)

Best regards
Cesar
 
I have no idea, like I said. This is why I was getting annoyed (sorry). It started off as a spurious assertion that mirrored writes eat half your IOPS and now we're off to a completely unrelated topic :(
 
