VM disks get corrupted on thin ZFS storage

LMC · Feb 19, 2020

Hi,

I have a server running the latest PVE with ZFS RAIDZ storage on SSDs. From time to time, some KVM VMs end up with a read-only filesystem and need to be rebooted and have the filesystem repaired. The problem doesn't look specific to a certain OS, because it has happened on CentOS, Ubuntu, Debian... The zpool looks fine, scrub has never reported any errors and the PVE root filesystem has always been OK. Has anyone seen a similar situation?

The screenshots are from the VM console.

Dan

dcsapak · Feb 20, 2020

is your zfs full?
check 'zfs list' output

LMC · Feb 21, 2020

It doesn't look like it... Also, the VM disks are not full, 50% at most.

NAME     USED  AVAIL  REFER  MOUNTPOINT
vmdata  1.23T  2.28T  31.5G  /vmdata

Stoiko Ivanov · Feb 21, 2020

anything interesting showing up in `dmesg` or the journal of the PVE node - while the guests experience the disk errors?

apoc · Feb 21, 2020

Have you (by any chance) an SAS-Expander between the SSD and the controller?
This sounds similar to my issues when I tried to run ZFS through LSI 9211-8i + HPe SAS-Expander when starting with Proxmox.
Never got it sorted out and since I have eliminated the expander everything is running fine.

LMC · Feb 23, 2020

Stoiko Ivanov said:
anything interesting showing up in `dmesg` or the journal of the PVE node - while the guests experience the disk errors?

Not really... nothing related to a running VM. But I see some things like:

[1191162.119258] vmbr0: port 7(tap1198i0) entered disabled state
[1191256.251036] kvm[11166]: segfault at 2 ip 00005636a1acedd0 sp 00007fb5d4558de0 error 4 in qemu-system-x86_64[5636a18a8000+491000]
[1191256.251044] Code: 00 5b c3 0f 1f 80 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a0 03 00 00 53 48 89 f3 48 8d 35 2e 4f 2d 00 48 83 ec 78 <44> 0f b7 6a 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 4b 26 2d 00 45

I saw 6 segfaults during the current uptime of 19 days. And also:

[73770.332448] EXT4-fs (zd80p1): warning: mounting fs with errors, running e2fsck is recommended
[73770.338172] EXT4-fs (zd80p1): Errors on filesystem, clearing orphan list.
[73770.338173] EXT4-fs (zd80p1): recovery complete
[73770.341540] EXT4-fs (zd80p1): mounted filesystem with ordered data mode. Opts: (null)
[82269.118705] EXT4-fs (zd80p1): warning: mounting fs with errors, running e2fsck is recommended
[82269.125652] EXT4-fs (zd80p1): mounted filesystem with ordered data mode. Opts: (null)
[103124.125435] device tap1193i0 entered promiscuous mode

[1105193.467709] vmbr0: port 7(tap1198i0) entered disabled state
[1108146.299528] device-mapper: table: 253:0: zd160 too small for target: start=2048, len=25163776, dev_size=20971520
[1108146.299530] device-mapper: core: Cannot calculate initial queue limits
[1108146.299565] device-mapper: ioctl: unable to set up device queue for new table.

[1115364.402263] vmbr0: port 14(tap1178i0) entered disabled state
[1115380.107999] zd320: p1
[1115380.116292] zd320: p1
[1115380.599536] zd320: p1 p2
[1115381.548982] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115381.554728] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115399.787433] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115399.793514] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)

I think these latest messages are related to repairs of the VM disks. No other errors...
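
A rough sanity check of the numbers in the device-mapper and EXT4 messages above, as a minimal Python sketch. It assumes that the device-mapper values are counted in 512-byte sectors and that the ext4 block size is the default 4096 bytes (the log does not state the block size). Under those assumptions, both messages describe a partition or filesystem that is larger than the block device it sits on. The log alone does not show what caused the mismatch.

```python
SECTOR = 512        # device-mapper tables count in 512-byte sectors
EXT4_BLOCK = 4096   # ASSUMPTION: default ext4 block size, not shown in the log
GIB = 1024 ** 3


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


# device-mapper: table: 253:0: zd160 too small for target:
#   start=2048, len=25163776, dev_size=20971520
start, length, dev_size = 2048, 25163776, 20971520
needed = gib((start + length) * SECTOR)   # space the mapping wants to cover
present = gib(dev_size * SECTOR)          # space the zvol actually offers

# EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds
#   size of device (78249984 blocks)
fs_blocks, dev_blocks = 78642944, 78249984
fs_size = gib(fs_blocks * EXT4_BLOCK)     # size the superblock claims
dev = gib(dev_blocks * EXT4_BLOCK)        # size of the underlying device

print(f"zd160: mapping needs {needed:.2f} GiB, device has {present:.2f} GiB")
print(f"dm-0:  filesystem claims {fs_size:.2f} GiB, device has {dev:.2f} GiB")
print(f"dm-0:  filesystem is {fs_blocks - dev_blocks} blocks too large")
```

In both cases the device is smaller than what the partition table or superblock expects, which is a different symptom from the orphan-inode and journal-recovery messages on zd80p1.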

tburger said:
Have you (by any chance) an SAS-Expander between the SSD and the controller?
This sounds similar to my issues when I tried to run ZFS through LSI 9211-8i + HPe SAS-Expander when starting with Proxmox.
Never got it sorted out and since I have eliminated the expander everything is running fine.

No, they are connected to the AHCI SATA ports on the motherboard.

apoc · Feb 23, 2020

Then you could try different (shorter) cables.
I was hunting a "flaky HDD" the other day - I noticed that my issues seemed to persist on the same port, even when changing HDDs. After I replaced the cable my issue went away. My system has been stable since then.
Good luck!

/edit: saw the exact same message as you in picture two.

Stoiko Ivanov · Feb 24, 2020

segfaults in kvm and randomly occurring disk errors could indeed point to a hardware issue

* run `memtest` on the host for a long period
* check that all cables are well seated (and maybe try to change them)
* check if the issue persists with a different PSU

I hope this helps!

LnxBil · Feb 29, 2020

tburger said:
Have you (by any chance) an SAS-Expander between the SSD and the controller?
This sounds similar to my issues when I tried to run ZFS through LSI 9211-8i + HPe SAS-Expander when starting with Proxmox.
Never got it sorted out and since I have eliminated the expander everything is running fine.

Thank you! I just stumbled across this and I was going to buy a HPe SAS Expander for myself .. at least until now. Have you put in multiple HBAs?

apoc · Mar 1, 2020

@LnxBil
yes. switched to individual HBAs. Less "efficient" in terms of PCIe-Slot usage but stable. And that is all that counts.

If you are interested in more details of my personal drama you can find a detailed report here:
https://forum.proxmox.com/threads/p...issues-after-adding-5th-mirror-to-pool.37514/

I found an explanation somewhere that SATA isn't very reliable in terms of "multiplexing" and that this has only improved recently.
If you are using SAS-attached disks you should be fine. I however cannot recommend the expanders (even though they are cheap) due to my own experience.

LnxBil · Mar 2, 2020

tburger said:
@LnxBil
yes. switched to individual HBAs. Less "efficient" in terms of PCIe-Slot usage but stable. And that is all that counts.

If you are interested in more details of my personal drama you can find a detailed report here:
https://forum.proxmox.com/threads/p...issues-after-adding-5th-mirror-to-pool.37514/

Thank you!

tburger said:
I found an explanation somewhere that SATA isn't very reliable in terms of "multiplexing" and that this has only improved recently.
If you are using SAS-attached disks you should be fine.

I also read that you should not mix SATA and SAS on the same port or even the same bus.

apoc · Mar 2, 2020

That wasn't the case for me. Due to me being an "advanced consumer" I am using SATA drives.
No SAS at all in place. So I think I was a victim of SATA multiplexing.
Can imagine though that SATA/SAS mix also could mean trouble...

uibmz · Aug 17, 2020

We have similar errors in our environment, but we have two different servers where this kind of behaviour occurs.

Both servers have two ZFS pools with striped mirror vdevs:

root@prodnode1:~# zpool list -v
NAME                                        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool_spinning                              3.62T  1.04T  2.59T        -         -    12%    28%  1.00x    ONLINE  -
  mirror                                    928G   220G   708G        -         -    11%  23.7%      -  ONLINE
    sdm                                        -      -      -        -         -      -      -      -  ONLINE
    sdn                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   266G   662G        -         -    13%  28.7%      -  ONLINE
    sdo                                        -      -      -        -         -      -      -      -  ONLINE
    sdp                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   290G   638G        -         -    13%  31.2%      -  ONLINE
    sdq                                        -      -      -        -         -      -      -      -  ONLINE
    sdr                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   284G   644G        -         -    14%  30.6%      -  ONLINE
    sds                                        -      -      -        -         -      -      -      -  ONLINE
    sdt                                        -      -      -        -         -      -      -      -  ONLINE
pool_ssd                                   2.60T  1.67T   955G        -         -    30%    64%  1.00x    ONLINE  -
  mirror                                    444G   285G   159G        -         -    29%  64.2%      -  ONLINE
    sda                                        -      -      -        -         -      -      -      -  ONLINE
    sdb                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    30%  64.2%      -  ONLINE
    sdc                                        -      -      -        -         -      -      -      -  ONLINE
    sdd                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    30%  64.1%      -  ONLINE
    sde                                        -      -      -        -         -      -      -      -  ONLINE
    sdf                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    31%  64.1%      -  ONLINE
    sdg                                        -      -      -        -         -      -      -      -  ONLINE
    sdh                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    31%  64.2%      -  ONLINE
    sdi                                        -      -      -        -         -      -      -      -  ONLINE
    sdj                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    32%  64.2%      -  ONLINE
    sdk                                        -      -      -        -         -      -      -      -  ONLINE
    sdl                                        -      -      -        -         -      -      -      -  ONLINE

It seems that there has to be above-average I/O load on the pool for the error to occur.
When the error occurs, the VMs switch to a read-only filesystem (see attached picture 1).
We even had VMs whose filesystem was riddled with errors, so restoring from backup was the only option.

Furthermore, we had VMs whose partition table became unreadable; restoring it with testdisk was possible...

The servers in question are both Supermicro machines: one is a SC216BE1C-R920LPB with a X10-DRi-T board and a 9361-8i RAID controller in JBOD mode, and the other is a SC216BE1C-R920LPB with a X11-DPi-NT board and a Broadcom SAS III HBA 9300-8i.

One might say that the 9361-8i is the problem, as it is a RAID controller running in JBOD mode, and if the error occurred only on this node, I would totally agree. But the error happens on both nodes, and the 9300-8i should be a perfectly viable HBA for ZFS...

Both servers have the backplane (BPN-SAS3-216EL1) in common. The drives used are:
- INTEL_SSDSC2KB240G8
- HGST_HTE721010A9E630

turnicus · Apr 3, 2023

Hello. My PVE is on 6.4-15 and I faced the exact same problem today. On some of my VMs, the guest OS reported "contains a filesystem with errors" and "inodes that were part of a corrupted orphan linked list found" while the underlying storage looked OK:
- zpool status on the host reported "no known data errors"
- zfs list reported plenty of free space

I had to run fsck manually within all VMs to fix the issue... How come the virtual disk can get corrupted and the underlying ZFS pool doesn't notice? On all these VMs I am using:
- VirtIO SCSI
- Cache: Default (no cache)
- SSD emulation: ticked (enabled)
- Discard: ticked (enabled)

Thanks for any help!

Dunuin · Apr 3, 2023

turnicus said:
I had to run fsck manually within all VMs to fix the issue... How come the virtual disk can get corrupted and the underlying ZFS pool doesn't notice? On all these VMs I am using:

Did you recently run `zpool scrub YourPool`? By default this is only run once per month, so data might be corrupted without ZFS having seen it yet, as it might have been corrupted after the last scrub. And are you using ECC RAM? ZFS can only detect data that gets corrupted while stored on the disks. Data might still get corrupted in RAM or in the CPU without ZFS being able to notice it.
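
To see how stale the last scrub is, the `scan:` line of `zpool status` can be checked. The following is only a minimal sketch in Python: the sample text is made up for illustration, and the exact wording of the `scan:` line is an assumption that may differ between ZFS versions, so the pattern may need adjusting. The date parsing also assumes English day and month names.

```python
import re
from datetime import datetime


def last_scrub_finished(zpool_status_text):
    """Return when the last scrub finished, or None if no finished scrub is listed.

    ASSUMPTION: the scan line looks like
    'scan: scrub repaired 0B in ... with 0 errors on Sun Mar 12 00:36:35 2023'.
    """
    match = re.search(
        r"scan: scrub repaired .* on (\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})",
        zpool_status_text,
    )
    if match is None:
        return None
    stamp = " ".join(match.group(1).split())  # collapse padding such as 'Feb  9'
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


# Hypothetical `zpool status` excerpt, not taken from this thread.
sample = """  pool: YourPool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:12:34 with 0 errors on Sun Mar 12 00:36:35 2023
"""

finished = last_scrub_finished(sample)
age_days = (datetime(2023, 4, 3) - finished).days
print(f"last scrub finished {finished}, {age_days} days before 2023-04-03")
```

Anything written after that timestamp has not been verified by a scrub yet, although ZFS still checks checksums whenever a block is read.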

fiona · Apr 4, 2023

Hi,

turnicus said:
Hello. My PVE is on 6.4-15 and I faced the exact same problem today.

this version has been end-of-life for 1.5 years now. Please upgrade to a current version: https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0

LnxBil · Apr 7, 2023

Your filesystem also gets corrupted if your pool or dataset quota runs out of space, and this is not detected directly.
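
With thin provisioning, the virtual disks can promise more space to the guests than the pool can hold, so the pool can fill up even while every guest still sees free space. A minimal Python sketch of that check: the USED and AVAIL values are the ones from the `zfs list` output posted earlier in this thread, while the list of virtual disk sizes is purely hypothetical. Treating USED + AVAIL as the capacity is a simplification that ignores snapshots, reservations and RAIDZ overhead.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `zfs list`-style size such as '1.23T' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def overcommit(volsizes, used, avail):
    """Ratio of space promised to guests to the (rough) pool capacity."""
    promised = sum(parse_size(v) for v in volsizes)
    capacity = parse_size(used) + parse_size(avail)
    return promised / capacity


# HYPOTHETICAL virtual disk sizes; the thread does not list them.
volsizes = ["300G"] * 14

ratio = overcommit(volsizes, used="1.23T", avail="2.28T")
print(f"promised / capacity = {ratio:.2f}")
if ratio > 1:
    print("guests could fill the pool before their own disks look full")
```

A ratio above 1 does not mean anything is broken yet. It only means that guest writes can fail once the pool is full, regardless of what the guest's own free-space figures show.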
