VM disks get corrupted on thin ZFS storage

LMC

Hi,

I have a server running the latest PVE with ZFS RAIDZ storage on SSDs. From time to time, some KVM VMs end up with a read-only filesystem and need to be rebooted and have the filesystem repaired. The problem doesn't look specific to a certain OS, because it has happened on CentOS, Ubuntu, Debian... The zpool looks fine, a scrub has never reported any error, and the PVE root filesystem has always been OK. Has anyone seen a similar situation?

The screenshots are from the VM console.

Dan
 

Attachments

  • Screen Shot 2020-02-19 at 23.03.17.png
  • Screen Shot 2020-02-19 at 23.03.42.png
  • Screen Shot 2020-02-19 at 23.09.12.png

Is your ZFS pool full?
Check the output of `zfs list`.
 
It doesn't look like it... Also, the VM disks are not full, at most 50% used.

NAME     USED  AVAIL  REFER  MOUNTPOINT
vmdata  1.23T  2.28T  31.5G  /vmdata
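
For thin-provisioned storage it can also help to look at pool-level capacity, not only dataset usage. A minimal shell sketch, assuming the output of `zpool list -H -o name,capacity` is piped in. The function name `warn_if_full` and the 80% threshold are illustrative assumptions, not Proxmox defaults:

```shell
#!/bin/sh
# Reads lines of "<pool><TAB><capacity>%" (as printed by
# `zpool list -H -o name,capacity`) on stdin and prints a warning for
# every pool at or above the threshold (first argument, default 80).
warn_if_full() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        {
            cap = $2
            sub(/%$/, "", cap)          # strip the trailing percent sign
            if (cap + 0 >= limit + 0)
                printf "WARNING: pool %s is %s%% full\n", $1, cap
        }'
}

# On the PVE host (requires the ZFS tools):
#   zpool list -H -o name,capacity | warn_if_full 80
```

No output means no pool has reached the threshold.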
 
Does anything interesting show up in `dmesg` or the journal of the PVE node while the guests experience the disk errors?
 
Do you (by any chance) have a SAS expander between the SSDs and the controller?
This sounds similar to the issues I had when I tried to run ZFS through an LSI 9211-8i + HPE SAS expander when starting with Proxmox.
I never got it sorted out, and since I eliminated the expander everything has been running fine.
 
> Does anything interesting show up in `dmesg` or the journal of the PVE node while the guests experience the disk errors?

Not really... nothing related to a running VM. But I see some things like:
[1191162.119258] vmbr0: port 7(tap1198i0) entered disabled state
[1191256.251036] kvm[11166]: segfault at 2 ip 00005636a1acedd0 sp 00007fb5d4558de0 error 4 in qemu-system-x86_64[5636a18a8000+491000]
[1191256.251044] Code: 00 5b c3 0f 1f 80 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a0 03 00 00 53 48 89 f3 48 8d 35 2e 4f 2d 00 48 83 ec 78 <44> 0f b7 6a 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 4b 26 2d 00 45
I saw 6 segfaults during the current uptime of 19 days. And also:
[73770.332448] EXT4-fs (zd80p1): warning: mounting fs with errors, running e2fsck is recommended
[73770.338172] EXT4-fs (zd80p1): Errors on filesystem, clearing orphan list.
[73770.338173] EXT4-fs (zd80p1): recovery complete
[73770.341540] EXT4-fs (zd80p1): mounted filesystem with ordered data mode. Opts: (null)
[82269.118705] EXT4-fs (zd80p1): warning: mounting fs with errors, running e2fsck is recommended
[82269.125652] EXT4-fs (zd80p1): mounted filesystem with ordered data mode. Opts: (null)
[103124.125435] device tap1193i0 entered promiscuous mode

[1105193.467709] vmbr0: port 7(tap1198i0) entered disabled state
[1108146.299528] device-mapper: table: 253:0: zd160 too small for target: start=2048, len=25163776, dev_size=20971520
[1108146.299530] device-mapper: core: Cannot calculate initial queue limits
[1108146.299565] device-mapper: ioctl: unable to set up device queue for new table.

[1115364.402263] vmbr0: port 14(tap1178i0) entered disabled state
[1115380.107999] zd320: p1
[1115380.116292] zd320: p1
[1115380.599536] zd320: p1 p2
[1115381.548982] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115381.554728] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115399.787433] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)
[1115399.793514] EXT4-fs (dm-0): bad geometry: block count 78642944 exceeds size of device (78249984 blocks)

I think these latest messages are related to repairs of the VM disks. No other errors...
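
The two size-mismatch messages in the log above can be checked with simple arithmetic. A sketch in shell, assuming 512-byte sectors for the device-mapper message and a 4 KiB ext4 block size for the `bad geometry` message (both are common defaults, but neither is stated in the log):

```shell
#!/bin/sh
GIB=$(( 1024 * 1024 * 1024 ))
MIB=$(( 1024 * 1024 ))

# device-mapper: the table needs start + len sectors; the zvol offers dev_size.
SECTOR=512                      # assumed sector size in bytes
needed=$(( (2048 + 25163776) * SECTOR ))
have=$(( 20971520 * SECTOR ))
echo "dm target needs $(( needed / GIB )) GiB, zd160 provides $(( have / GIB )) GiB"

# ext4: block count in the superblock versus blocks the device really has.
BLOCK=4096                      # assumed ext4 block size in bytes
missing=$(( 78642944 - 78249984 ))
echo "ext4 on dm-0 expects $(( missing * BLOCK / MIB )) MiB more than the device holds"
```

Under those assumptions both messages say the same thing: a partition or filesystem claims more space than the block device underneath it provides.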

> Do you (by any chance) have a SAS expander between the SSDs and the controller?
> This sounds similar to the issues I had when I tried to run ZFS through an LSI 9211-8i + HPE SAS expander when starting with Proxmox.
> I never got it sorted out, and since I eliminated the expander everything has been running fine.

No, they are connected to the AHCI SATA ports on the motherboard.
 
Then you could try different (shorter) cables.
I was hunting a "flaky HDD" the other day and noticed that my issues seemed to persist on the same port, even when changing HDDs. After I replaced the cable, the issue went away. My system has been stable since then.
Good luck!

/edit: saw the exact same message as you in picture two.
 
Segfaults in kvm and randomly occurring disk errors could indeed point to a hardware issue:

* run `memtest` on the host for a long period
* check that all cables are well seated (and maybe try to change them)
* check if the issue persists with a different PSU

I hope this helps!
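
For the cabling check, SATA link problems often show up in the drive's SMART interface CRC error counter. A hedged sketch, assuming smartmontools is installed and that the drive reports an attribute whose name contains `CRC_Error_Count` (commonly ID 199; naming varies by vendor). The helper name `crc_errors` is made up for illustration:

```shell
#!/bin/sh
# Reads `smartctl -A /dev/sdX` output on stdin and prints the raw value
# (10th column) of the interface CRC error attribute, if the drive has one.
crc_errors() {
    awk '$2 ~ /CRC_Error_Count/ { print $10 }'
}

# On the PVE host, for each pool member (device name is an example):
#   smartctl -A /dev/sda | crc_errors
```

A value that keeps increasing between checks is usually taken as a hint at the cable, backplane, or port rather than at the drive itself.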
 
> Do you (by any chance) have a SAS expander between the SSDs and the controller?
> This sounds similar to the issues I had when I tried to run ZFS through an LSI 9211-8i + HPE SAS expander when starting with Proxmox.
> I never got it sorted out, and since I eliminated the expander everything has been running fine.

Thank you! I just stumbled across this; I was going to buy an HPE SAS expander for myself... at least until now. Did you put in multiple HBAs instead?
 
@LnxBil
Yes, I switched to individual HBAs. Less "efficient" in terms of PCIe slot usage, but stable. And that is all that counts.

If you are interested in more details of my personal drama you can find a detailed report here:
https://forum.proxmox.com/threads/p...issues-after-adding-5th-mirror-to-pool.37514/

I found an explanation somewhere that SATA isn't very reliable in terms of "multiplexing" and that this has only improved recently.
If you are using SAS-attached disks you should be fine. However, I cannot recommend the expanders (even though they are cheap), based on my own experience.
 
> @LnxBil
> Yes, I switched to individual HBAs. Less "efficient" in terms of PCIe slot usage, but stable. And that is all that counts.
>
> If you are interested in more details of my personal drama you can find a detailed report here:
> https://forum.proxmox.com/threads/p...issues-after-adding-5th-mirror-to-pool.37514/

Thank you!

> I found an explanation somewhere that SATA isn't very reliable in terms of "multiplexing" and that this has only improved recently.
> If you are using SAS-attached disks you should be fine.

I have also read that you should not mix SATA and SAS on the same port or even the same bus.
 
That wasn't the case for me. Due to me being an "advanced consumer" I am using SATA drives.
No SAS at all in place. So I think I was a victim of SATA multiplexing.
I can imagine, though, that a SATA/SAS mix could also mean trouble...
 
We have similar errors in our environment, and we see this behaviour on two different servers.

Both servers have two ZFS pools with striped mirror vdevs:
root@prodnode1:~# zpool list -v
NAME                                        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool_spinning                              3.62T  1.04T  2.59T        -         -    12%    28%  1.00x    ONLINE  -
  mirror                                    928G   220G   708G        -         -    11%  23.7%      -  ONLINE
    sdm                                        -      -      -        -         -      -      -      -  ONLINE
    sdn                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   266G   662G        -         -    13%  28.7%      -  ONLINE
    sdo                                        -      -      -        -         -      -      -      -  ONLINE
    sdp                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   290G   638G        -         -    13%  31.2%      -  ONLINE
    sdq                                        -      -      -        -         -      -      -      -  ONLINE
    sdr                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    928G   284G   644G        -         -    14%  30.6%      -  ONLINE
    sds                                        -      -      -        -         -      -      -      -  ONLINE
    sdt                                        -      -      -        -         -      -      -      -  ONLINE
pool_ssd                                   2.60T  1.67T   955G        -         -    30%    64%  1.00x    ONLINE  -
  mirror                                    444G   285G   159G        -         -    29%  64.2%      -  ONLINE
    sda                                        -      -      -        -         -      -      -      -  ONLINE
    sdb                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    30%  64.2%      -  ONLINE
    sdc                                        -      -      -        -         -      -      -      -  ONLINE
    sdd                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    30%  64.1%      -  ONLINE
    sde                                        -      -      -        -         -      -      -      -  ONLINE
    sdf                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    31%  64.1%      -  ONLINE
    sdg                                        -      -      -        -         -      -      -      -  ONLINE
    sdh                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    31%  64.2%      -  ONLINE
    sdi                                        -      -      -        -         -      -      -      -  ONLINE
    sdj                                        -      -      -        -         -      -      -      -  ONLINE
  mirror                                    444G   285G   159G        -         -    32%  64.2%      -  ONLINE
    sdk                                        -      -      -        -         -      -      -      -  ONLINE
    sdl                                        -      -      -        -         -      -      -      -  ONLINE

It seems that there has to be above-average I/O load on the pool for the error to occur.
When the error occurs, the VMs switch to a read-only filesystem (see attached picture1).
We even had VMs whose filesystem was riddled with errors, so restoring from backup was the only option.

Furthermore, we had VMs whose partition table became unreadable; restoring it with testdisk was possible...

The servers in question are both Supermicro machines: one is an SC216BE1C-R920LPB with an X10-DRi-T board and a 9361-8i RAID controller in JBOD mode, and the other is an SC216BE1C-R920LPB with an X11-DPi-NT board and a Broadcom SAS III HBA 9300-8i.

One might say that the 9361-8i is the problem, as it is a RAID controller running in JBOD mode, and if the error occurred only on this node, I would totally agree. But the error happens on both nodes, and the 9300-8i should be a perfectly viable HBA for ZFS...

Both servers have the backplane (BPN-SAS3-216EL1) in common. The drives used are:
- INTEL_SSDSC2KB240G8
- HGST_HTE721010A9E630
 

Attachments

  • picture1.png

Hello. My PVE is on 6.4-15 and I faced the exact same problem today. On some of my VMs, the guest OS reported "contains a filesystem with errors" and "Inodes that were part of a corrupted orphan linked list found" while the underlying storage looked OK:
- zpool status on the host reported "No known data errors"
- zfs list reported plenty of free space

I had to run fsck manually within all VMs to fix the issue... How come the virtual disk can get corrupted and the underlying ZFS pool doesn't notice? On all these VMs I am using:
- VirtIO SCSI
- Cache: Default (no cache)
- SSD emulation: ticked (enabled)
- Discard: ticked (enabled)

Thanks for any help!
 
> I had to run fsck manually within all VMs to fix the issue... How come the virtual disk can get corrupted and the underlying ZFS pool doesn't notice? On all these VMs I am using:

Did you recently run `zpool scrub YourPool`? By default a scrub only runs once per month, so data may have been corrupted after the last scrub without ZFS having noticed yet. And are you using ECC RAM? ZFS can only detect corruption of data that is stored on the disks. Data can still get corrupted in RAM or in the CPU without ZFS being able to notice it.
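
A sketch of how the scrub result could be checked from a script, assuming the usual `errors:` summary line in `zpool status` output; `scrub_verdict` is a hypothetical helper name. Keep in mind that a clean scrub only says that the blocks on disk match their ZFS checksums. It says nothing about whether the guest wrote a consistent filesystem into those blocks.

```shell
#!/bin/sh
# Reads `zpool status <pool>` output on stdin and prints one word:
#   clean    the "errors:" line says "No known data errors"
#   errors   the "errors:" line says anything else
#   unknown  no "errors:" line was found
scrub_verdict() {
    awk '/^errors:/ { seen = 1; if ($0 !~ /No known data errors/) bad = 1 }
         END {
             if (!seen)    print "unknown"
             else if (bad) print "errors"
             else          print "clean"
         }'
}

# On the PVE host (YourPool is the placeholder used in the post above):
#   zpool scrub YourPool          # starts a scrub in the background
#   zpool status -v YourPool      # shows progress and any damaged files
#   zpool status YourPool | scrub_verdict
#   dmidecode --type memory | grep -i 'error correction'   # ECC or not
```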
 