[SOLVED] Weird Problem on PVE host - has anyone seen this before?

fortechitsolutions

Hi,

I've got a Proxmox host with a strange problem. I noticed it yesterday, and it has not cleared up after a reboot this morning.

Strictly speaking I have no idea whether this is a Proxmox issue or a Debian/Linux issue per se, but in case someone has seen it before I thought I would post here. Google hits are not helping much so far.

Context: this is a custom-install Proxmox host, originally installed as Proxmox 4 when that was the latest release.
Underneath it is a Debian minimal install with software RAID on MD devices: a pair of SATA drives for bulk storage plus a pair of SSDs. The Proxmox core install sits on the SATA disk RAID, with suitable LVM and RAID config to keep things happy. Then there is the added fun of bcache to accelerate the bulk block storage: the backing device is the mirrored SATA MD device, and the bcache cache device sits on a software RAID of the SSDs.
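For anyone not familiar with that kind of stack, a layout like this gets assembled roughly as below. This is only a sketch: the device names md4 (SATA mirror) and md127 (SSD mirror) are borrowed from the mdstat output further down, and these are not the exact commands from the original install.

Code:
# format the big SATA mirror as the bcache backing device
make-bcache -B /dev/md4
# format the SSD mirror as the bcache cache device
make-bcache -C /dev/md127
# register both so /dev/bcache0 shows up
echo /dev/md4 > /sys/fs/bcache/register
echo /dev/md127 > /sys/fs/bcache/register
# attach the cache set to the backing device by its cset UUID
bcache-super-show /dev/md127 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

On this host, presumably /dev/bcache0 then serves as the LVM physical volume behind the 'data' VG shown below.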

Anyhow, this has all been working smoothly for years. About a month ago I had to re-do the bcache config because a disk failed out of the SSD set, but that was resolved and things were happy after that.

Over the weekend I noticed the performance of the Proxmox host was utterly dreadful. Checking the dmesg logs, I saw it was actively running the roughly-monthly SW RAID parity check task that confirms RAID health.

And in so doing it was throwing tons and tons of messages like this:

Code:
[ 5103.337244] EXT4-fs (loop1): Delayed block allocation failed for inode 1179652 at logical offset 285340 with max blocks 7 with error 117
[ 5103.337366] EXT4-fs (loop1): This should not happen!! Data will be lost

The weird thing is that the errors all relate to the loop1 and loop2 devices, not to any more traditional filesystem device (i.e. md1, md2, sda1, sda2, etc.).
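Worth noting for anyone chasing something similar: on Proxmox, LXC containers stored as raw images on file/directory storage get those images attached through loop devices, so a loopN device can be traced back to a specific container disk. The command below shows the mapping (just the command, not output from my host; the path in the comment is an example of the usual naming):

Code:
# show which backing file each loop device is attached to
losetup --list
# the BACK-FILE column points at something like /var/lib/vz/images/<vmid>/vm-<vmid>-disk-0.raw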

I manually forced the RAID parity check to stop, and performance went back to 'usable', but a low level of those errors persisted.
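For reference, the check can be inspected and cancelled through sysfs; a sketch, assuming md4 is the array being checked (adjust the device name to match your /proc/mdstat):

Code:
# see whether a check/resync is currently running on the array
cat /sys/block/md4/md/sync_action
# cancel it
echo idle > /sys/block/md4/md/sync_action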

Today the host was rebooted to see if that would help the situation. There is no change: the system is usable, VMs are online and operational, but we have a non-trivial background level of these errors.

Otherwise, things currently look generally 'ok' on the host, i.e.:


Code:
MD RAID HEALTH SEEMS OK
=============================================

root@prox4-42:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[0] sda2[2]
      15617024 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb3[0] sda3[2]
      15617024 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sda4[1] sdb4[0]
      2898787776 blocks super 1.2 [2/2] [UU]
      bitmap: 7/22 pages [28KB], 65536KB chunk

md127 : active raid1 sdd1[1] sde1[0]
      117153216 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

DISK SLICES  MOUNTED / LAYOUT HINT
=============================================
root@prox4-42:~# df -h
Filesystem              Size  Used Avail Use% Mounted on
udev                     16G     0   16G   0% /dev
tmpfs                   3.2G  9.1M  3.2G   1% /run
/dev/md0                 15G  9.0G  4.9G  66% /
tmpfs                    16G   46M   16G   1% /dev/shm
tmpfs                   5.0M  4.0K  5.0M   1% /run/lock
tmpfs                    16G     0   16G   0% /sys/fs/cgroup
/dev/sdc1               2.7T  130G  2.6T   5% /3tbbackupinternal
/dev/mapper/data-lvol0  2.5T  1.8T  728G  71% /var/lib/vz
/dev/fuse                30M   24K   30M   1% /etc/pve
tmpfs                   3.2G     0  3.2G   0% /run/user/0
root@prox4-42:~#

LV and VG hint:
====================
root@prox4-42:~# lvs
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 data -wi-ao---- 2.50t

root@prox4-42:~# vgs
  VG   #PV #LV #SN Attr   VSize  VFree
  data   1   1   0 wz--n- <2.70t <204.50g
root@prox4-42:~#


NOTE: a SMART status check with "smartctl --all" tells me all attached drives are healthy / passed.
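That kind of check looks like the following; a sketch only, with the member-disk names assumed from the mdstat output above (substitute your own):

Code:
# print the SMART overall-health verdict for each RAID member disk
for d in /dev/sd[abde]; do
    echo "== $d =="
    smartctl --all "$d" | grep -i 'overall-health'
done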



I've got a few KVM VMs running here and a few LXC containers as well.

The LXC containers, for example:

Code:
root@prox4-42:~# pct list
VMID       Status     Lock         Name
103        stopped                 cacti
105        running                 filestorage
106        stopped                 mrtg
109        running                 unifi-deb-8
110        running                 openvpn-turnkey
root@prox4-42:~#



Nothing is particularly strange, except for this persistent loop-device filesystem error insisting that data-loss doom looms.

I am curious if anyone has ever seen anything like this / has any thoughts / suggestions on what rabbit hole I should run down?

Maybe as a next step I will wait until after hours, stop all the VMs, and see if anything changes.

Other options include:
-- rebuild the server from scratch and restore from PBS backups
-- not sure what else I can do to resolve this
-- I'm not sure I can fix it, but I'm also not sure I can leave it alone.

Lots of fun!

Many thanks if you read this far and even more if you feel like suggesting something!


Thanks!


Tim
 
In case anyone gets here from a Google search by keyword: I think I found the solution to the problem. I waited for a time when I could get remote access and power off all the VMs hosted on this Proxmox box, then powered them back on one at a time while monitoring the system logs. It was quite clear that the strange errors about loop filesystem corruption were specifically related to one LXC container. I simply restored it from backup, using a backup that pre-dated the SSD bcache drive-failure event last month. (The VM is a VPN service host and does not change much on a regular basis.) After that, things are running smoothly with all VMs online, the bad copy powered off, and the good restored copy of that one VM in service instead.
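The restore itself is just a standard container restore; from the CLI it goes roughly like the line below, where the new VMID, archive filename, and storage name are placeholders rather than my exact values:

Code:
# restore the known-good backup onto a fresh VMID, keeping the bad copy around for comparison
pct restore 120 /var/lib/vz/dump/vzdump-lxc-110-<timestamp>.tar.zst --storage local-lvm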

So: if anyone else sees something like this in their logs, it probably (maybe?) means you have an LXC container with hidden/silent filesystem corruption that is not readily apparent. Try powering off all your LXC containers, monitor the logs, and confirm the problem is gone; then power the containers back up one at a time until the problem resumes, and that identifies the problem container for you. A rough outline of the procedure is sketched below. Restore the affected container from backup, continue the power-on tests to make sure you don't have more than one container in this bad state, and after that you are all good. (And once testing is done, remember to delete the LXC container that has the filesystem problems.)
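The isolation steps, roughly (a sketch only; the VMIDs here are the running containers from my pct list above, so adjust to your own, and watch the logs between each start):

Code:
# stop every running container, then confirm the loop errors stop appearing
for id in $(pct list | awk '$2 == "running" {print $1}'); do pct stop "$id"; done
journalctl -kf    # follow kernel messages; Ctrl-C when satisfied
# bring the containers back one at a time, watching for the errors to resume
pct start 105
pct start 109
pct start 110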

And of course you do have good backups, since Proxmox Backup Server is now so easy to use that everyone has it in service with Proxmox, yes? :)

Tim
 
