WARNING. Possible data loss after write and/or data corruption. MD RAID with writemostly. Kernel 6.14.11-2

glaeken2

Member
Jun 7, 2023
Observed behavior:
Data written to an LVM placed on an MD RAID (type: raid1) with the writemostly flag enabled on one of the partitions IS NOT WRITTEN. After a reboot, all changes to the LVM are REVERTED.

Additional observed behavior:
1. Any LVM snapshot is impossible to remove: removal either fails with various errors, or reports that the snapshot was removed while it is NOT removed.
2. The swap signature written to the pve/swap LV is not recognized when trying to activate swap.
3. Some files may contain random bytes instead of their proper content.
4. Task messages (lower panel) in the Proxmox web UI may contain random characters.
5. The ext4 filesystem may be damaged.
6. No apparent messages in dmesg (a quick triage sketch follows this list).
7. The problem appeared after the upgrade from PVE 8 to 9.
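
For items 5 and 6, a generic triage pass looks roughly like this (a sketch, not commands from the original session; the LV name pve/root is taken from the setup described below):

Code:
# scan the kernel log for filesystem or block-layer errors
dmesg -T | grep -iE 'ext4|i/o error|md[0-9]'
# read-only ext4 check; run this from a rescue/live environment so the
# root filesystem is not mounted read-write while being checked
fsck.ext4 -n /dev/pve/root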

Affected machines: 1
Tried to reproduce: no
Problem solved instantly after disabling writemostly: yes
Missing files after full recovery: 1
Damage detected by dpkg --verify: no
RAM is good: yes
Drives are good: yes; no SMART errors, no ZFS errors (ZFS sits on the same drives), no RAID errors
Any ZFS pools affected: no
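
The drive and pool checks above were along these lines (device names are placeholders for the SATA SSD and the NVMe drive):

Code:
smartctl -a /dev/sda      # SATA SSD: SMART attributes and error log
smartctl -a /dev/nvme0    # NVMe: health and error information
zpool status -v           # ZFS pools sharing the same drives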

Possible connection: https://lkml.org/lkml/2025/9/3/112
Is related?: Inconclusive

Additional info:
Kernel: 6.14.11-2 (2025-09-12T09:46Z)
pveversion: pve-manager/9.0.10
1st drive: sata, ssd (set writemostly)
2nd drive: nvme
MD RAID, raid1 for the system LVM on partition number 3; bitmap: 1/1 pages [4KB], 65536KB chunk; metadata 1.2; boot from RAID, GRUB and the kernel assemble the array on the fly when booting.
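
For reference, the array layout and the writemostly flag can be inspected like this (the array name md0 and member sda3 are assumptions, adjust to your layout):

Code:
cat /proc/mdstat          # write-mostly members are marked with "(W)"
mdadm --detail /dev/md0   # per-device state, bitmap and metadata version
mdadm --examine /dev/sda3 # superblock view of a single member (metadata 1.2)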

First suspicion in the event chain: bad ram
First reaction: reboot with auto ram check (kernel memtest=3)
Second suspicion in the event chain: bad drive
Second reaction: drives check - no errors
Third suspicion in the event chain: BUG in LVM handling
Third reaction: LVM check - no errors / inconclusive / won't check
Fourth suspicion in the event chain: MD RAID
Fourth reaction: initiate an MD check - no change, no errors detected. Disable writemostly -> problem is fixed instantly without a reboot.
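
For reference, an MD check can be kicked off and the write-mostly flag cleared at runtime via sysfs, roughly as below (array name md0 and member device sda3 are assumptions; wait for the check to finish in /proc/mdstat before reading the mismatch counter):

Code:
# start an MD consistency check, then read the mismatch counter once it completes
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt
# clear the write-mostly flag on the SATA member without a reboot
# (the per-device state file accepts "writemostly" to set, "-writemostly" to clear)
echo -writemostly > /sys/block/md0/md/dev-sda3/state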
 
Hi @glaeken2,

Sorry to hear you're running into issues. I wanted to clarify the difference between writemostly and write-behind, as they're often confused:

writemostly only affects read balancing. It tells the RAID layer: "avoid reading from this device unless necessary." Writes remain synchronous by default, so this setting doesn't make writes asynchronous.
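
One way to see the read-balancing effect in practice (device and array names here are assumptions based on the setup described above):

Code:
# read from the array with direct I/O while watching per-disk utilisation;
# with sda3 flagged write-mostly, the reads should land almost entirely
# on the NVMe member
dd if=/dev/md0 of=/dev/null bs=1M count=2048 iflag=direct &
iostat -dx sda nvme0n1 1 5
wait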

write-behind (the --write-behind=N option in mdadm) controls whether a write to a mirror can be acknowledged before all devices are updated. When write-behind is enabled, RAID1 spawns an asynchronous "behind bio" for the trailing device. This is exactly when alloc_behind_master_bio() is called - to create the trailing bio clone.
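
For completeness, write-behind has to be requested explicitly, and it requires a write-intent bitmap; at array creation it looks roughly like this (device names are placeholders):

Code:
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 \
      /dev/nvme0n1p3 --write-mostly /dev/sda3
# mdadm --examine-bitmap /dev/sda3 should then report the write-behind
# setting in its "Write Mode" line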

So, if you're only using writemostly, this particular kernel patch likely isn't related to your issue.

Hope this helps somewhat!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Adding a little context:
1. Screenshot of the damaged text:

[screenshot: wtf1.png]

2. Messages when trying to remove lvm snapshots, and other stuff:

Code:
root@[censored]:~# mkswap /dev/pve/swap
Setting up swapspace version 1, size = 8 GiB (8589930496 bytes)
no label, UUID=[censored]
root@[censored]:~# swapon -d /dev/pve/swap
swapon: /dev/pve/swap: read swap header failed
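
# (Not from the original session - a hedged follow-up check: if the swap
# signature actually reached the disk, blkid should report TYPE="swap" with
# the UUID that mkswap just printed.)
blkid /dev/pve/swap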

--------

# lvremove -f pve/[censored]
Do you really want to remove and DISCARD active logical volume pve/[censored]? [y/n]: y
  /dev/mapper/pve-root-real: open failed: No such file or directory
  Logical volume "[censored]" successfully removed.

--------

root@[censored]:~# lvremove -f pve/[censored]
  Logical volume "[censored]" successfully removed.
root@[censored]:~# lvs
  LV              VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  [censored]      pve -wi-ao----  1.00g
  root            pve owi-aos--- 12.00g
  [censored]      pve swi---s--- 12.05g      root
  [censored]      pve -wi-a----- 12.00g

(still exists)
--------

# cat /etc/apt/apt.conf.d/76pveconf
▒▒▒H▒5▒H▒=▒▒▒▒▒f.▒ATU1▒SH▒▒▒b▒▒▒▒▒u▒rL

--------

cat: /root/[censored]: Structure needs cleaning

So if you see similar errors all at once, it's a major red flag: data may not be getting written at all, and anything that was written may be lost.
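
A rough sanity check for whether writes are actually reaching the disks (a generic sketch; note it only catches loss that survives a cache drop, whereas the revert described above only became visible after a reboot):

Code:
echo "canary $(date)" > /root/write-test
sync
echo 3 > /proc/sys/vm/drop_caches
cat /root/write-test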