Hi,
I've got a Proxmox host with a strange problem. I noticed it yesterday, and it has not cleared up after a reboot this morning.
Strictly speaking I have no idea whether this is a Proxmox issue or a Debian/Linux issue per se, but in case someone has seen it before I thought I would post here. Google hits are not helping much so far.
Context: this is a custom-install Proxmox host, originally installed as Proxmox 4 when that was the latest release.
Underneath it is a Debian minimal install with software RAID (MD devices): a pair of SATA drives for bulk storage and a pair of SSDs. The Proxmox core install sits on the SATA RAID, with suitable LVM and RAID config to keep things happy. Then there is the added fun of bcache accelerating the bulk block storage: the backing device is the mirrored SATA MD device, and the bcache cache device sits on a software RAID of the SSDs.
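For reference, the layering can be sanity-checked along these lines; the bcache0 and md4 names below are illustrative rather than copied from this host:
Code:
# block device tree: bcache should show up on top of the SATA md backing device
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT

# runtime state of the cached device (clean / dirty / no cache)
cat /sys/block/bcache0/bcache/state

# superblock details of the backing device (bcache-super-show is in bcache-tools)
bcache-super-show /dev/md4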
Anyhow, this has all been working smoothly for years. About a month ago I had to redo the bcache config because a disk failed out of the SSD set, but that was resolved and things were happy after that.
Over the weekend I noticed that the performance of the Proxmox host was utterly dreadful. Checking the dmesg logs, it was actively running the roughly monthly software RAID parity check that confirms RAID health, and in doing so it was throwing tons and tons of messages like this:
Code:
[ 5103.337244] EXT4-fs (loop1): Delayed block allocation failed for inode 1179652 at logical offset 285340 with max blocks 7 with error 117
[ 5103.337366] EXT4-fs (loop1): This should not happen!! Data will be lost
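As an aside, error 117 looks like it is EUCLEAN ("Structure needs cleaning"), i.e. ext4 thinks the filesystem on that loop device is corrupted. That mapping can be double-checked from the kernel headers (if installed) or with the errno tool from the moreutils package:
Code:
grep EUCLEAN /usr/include/asm-generic/errno.h
errno 117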
The weird thing is that the errors all relate to the loop1 and loop2 devices, not to any more traditional filesystem device (i.e. md1, md2, or sda1, sda2, etc.).
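To see what is actually behind loop1 and loop2, listing the loop devices and their backing files should help; I haven't pasted the output here, so treat this as the general idea rather than results from this host:
Code:
# list every loop device together with the file backing it
losetup -l
# older/alternative form
losetup -a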
I manually forced the RAID parity check to stop, and performance went back to 'usable', but a low level of those errors still persists.
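(For anyone wondering: on Debian the monthly run comes from the mdadm package's cron job calling its checkarray script, and a running check can be cancelled per array via sysfs, along these lines; md4 is just an example array name:)
Code:
# cancel a running check on one array
echo idle > /sys/block/md4/md/sync_action
# confirm it has gone back to idle
cat /sys/block/md4/md/sync_action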
Today the host was rebooted to see if that helped the situation; there is no change. The system is usable and the VMs are online and operational, but we have a non-trivial background level of these errors.
Otherwise, things generally look 'OK' on the host, i.e.:
Code:
MD RAID HEALTH SEEMS OK
=============================================
root@prox4-42:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[0] sda2[2]
      15617024 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb3[0] sda3[2]
      15617024 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sda4[1] sdb4[0]
      2898787776 blocks super 1.2 [2/2] [UU]
      bitmap: 7/22 pages [28KB], 65536KB chunk

md127 : active raid1 sdd1[1] sde1[0]
      117153216 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
DISK SLICES MOUNTED / LAYOUT HINT
=============================================
root@prox4-42:~# df -h
Filesystem              Size  Used  Avail Use% Mounted on
udev                     16G     0    16G   0% /dev
tmpfs                   3.2G  9.1M   3.2G   1% /run
/dev/md0                 15G  9.0G   4.9G  66% /
tmpfs                    16G   46M    16G   1% /dev/shm
tmpfs                   5.0M  4.0K   5.0M   1% /run/lock
tmpfs                    16G     0    16G   0% /sys/fs/cgroup
/dev/sdc1               2.7T  130G   2.6T   5% /3tbbackupinternal
/dev/mapper/data-lvol0  2.5T  1.8T   728G  71% /var/lib/vz
/dev/fuse                30M   24K    30M   1% /etc/pve
tmpfs                   3.2G     0   3.2G   0% /run/user/0
root@prox4-42:~#
LV and VG hint:
====================
root@prox4-42:~# lvs
  LV    VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvol0 data -wi-ao---- 2.50t
root@prox4-42:~# vgs
  VG   #PV #LV #SN Attr   VSize  VFree
  data   1   1   0 wz--n- <2.70t <204.50g
root@prox4-42:~#
NOTE: a status check of all attached drives with 'smartctl --all' tells me they are all healthy / passed.
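(Roughly this per drive; sda/sdb here are just example device names, not a paste from the host:)
Code:
smartctl --all /dev/sda | grep -i 'overall-health'
smartctl --all /dev/sdb | grep -i 'overall-health'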
I've got a few KVM VMs running here and a few LXC containers as well.
The LXC side, for example:
Code:
root@prox4-42:~# pct list
VMID       Status     Lock         Name
103        stopped                 cacti
105        running                 filestorage
106        stopped                 mrtg
109        running                 unifi-deb-8
110        running                 openvpn-turnkey
root@prox4-42:~#
Nothing is particularly strange, except for this persistent loop-device filesystem error drama implying the end of the world looms.
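My working assumption is that loop1 and loop2 are the loop-mounted raw rootfs images of two of the running containers (the container storage is the directory-based /var/lib/vz), so one option I'm considering is stopping them one at a time and running a filesystem check on their volumes, something like this (105 is just one of the VMIDs above, used as an example):
Code:
pct stop 105
pct fsck 105      # runs fsck on the container's volume
pct start 105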
I am curious whether anyone has seen anything like this, or has any thoughts or suggestions on which rabbit hole I should run down?
Maybe as a next step I will wait until after hours, stop all the VMs and containers, and see if anything changes.
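(Probably something along these lines, after eyeballing the lists manually first; just a sketch:)
Code:
# shut down every container, then every VM (VMIDs come from the first column of the list output)
for ct in $(pct list | awk 'NR>1 {print $1}'); do pct shutdown $ct; done
for vm in $(qm list | awk 'NR>1 {print $1}'); do qm shutdown $vm; done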
Other options include:
-- rebuild the server from scratch and restore from PBS backups.
-- not sure what else I can do to resolve this.
-- but I'm not sure I can fix this, and not sure I can leave it alone either.
Lots of fun!
Many thanks if you read this far and even more if you feel like suggesting something!
Thanks!
Tim