Sudden filesystem errors

luison

For a couple of days we've been struggling with unexplained filesystem corruption on our root (LVM /dev/pve/root) and data (LVM /dev/data/home) volumes.
They suddenly started reporting EXT4 errors until the system became unstable.

Code:
Nov 30 00:05:52 e20home kernel: [543969.097160] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:06:52 e20home kernel: [544029.097195] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:07:52 e20home kernel: [544089.097374] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:08:52 e20home kernel: [544149.097218] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:09:52 e20home kernel: [544209.096784] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:10:52 e20home kernel: [544269.097161] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:11:52 e20home kernel: [544329.096723] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:12:52 e20home kernel: [544389.097332] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:13:52 e20home kernel: [544449.096743] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:14:52 e20home kernel: [544509.097187] EXT4-fs error (device dm-28): __ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0
Nov 30 00:17:52 e20home kernel: [544689.097692] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:18:52 e20home kernel: [544749.097133] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:19:52 e20home kernel: [544809.096933] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
Nov 30 00:27:52 e20home kernel: [545289.096616] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: No space for directory leaf checksum. Please run e2fsck -D.
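
To see at a glance which inodes and processes are affected, log lines like the ones above can be tallied. The following is only a rough sketch: the regular expression is modelled on the messages quoted in this thread, and the names `EXT4_LINE`, `tally_ext4_events` and `sample_lines` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the EXT4 messages quoted in this thread (an assumption:
# other kernel versions may word these messages differently).
EXT4_LINE = re.compile(
    r"EXT4-fs (?P<level>error|warning) \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):\d+: inode #(?P<inode>\d+): comm (?P<comm>[^:]+):"
)


def tally_ext4_events(lines):
    """Count EXT4 errors/warnings per (device, inode, comm, level)."""
    counts = Counter()
    for line in lines:
        match = EXT4_LINE.search(line)
        if match:
            key = (match["device"], int(match["inode"]), match["comm"], match["level"])
            counts[key] += 1
    return counts


# A few lines copied from the logs in this thread, plus one unrelated line.
sample_lines = [
    "Nov 30 00:05:52 e20home kernel: [543969.097160] EXT4-fs error (device dm-28): "
    "__ext4_find_entry:1542: inode #807352: comm gunicorn: checksumming directory block 0",
    "Nov 30 00:06:52 e20home kernel: [544029.097195] EXT4-fs warning (device dm-28): "
    "ext4_dirblock_csum_verify:370: inode #807352: comm gunicorn: "
    "No space for directory leaf checksum. Please run e2fsck -D.",
    "Dec  1 17:30:42 e20home kernel: [  932.100043] EXT4-fs error (device dm-28): "
    "__ext4_find_entry:1541: inode #2752514: comm mysqld: checksumming directory block 0",
    "Dec  1 17:32:56 e20home kernel: [ 1065.912091] vmbr1: received packet on bond0 "
    "with own address as source address (addr:34:97:f6:31:82:13, vlan:0)",
]

for (device, inode, comm, level), count in sorted(tally_ext4_events(sample_lines).items()):
    print(f"{device} inode #{inode} ({comm}): {count} {level}(s)")
```

On a real host the lines would come from the syslog file or the kernel journal rather than a hard-coded list; where exactly those logs live depends on the setup.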

From a rescue disk we then tested the disks with badblocks and a couple of other utilities, without errors. The Linux raid behind the LVM reports no errors either.
We deleted the partitions, reformatted and restored from backups, then tested the filesystems without any issues.
The rescue disk is an older PVE install with an older kernel.

After rebooting into the system we keep getting EXT4 errors on the filesystem, the partitions slowly become unstable again, and we can't figure out what else to test. Most containers were blocked as if they were in the middle of a snapshot.

The kernel error again:

Code:
Dec  1 17:30:42 e20home kernel: [  932.100043] EXT4-fs error (device dm-28): __ext4_find_entry:1541: inode #2752514: comm mysqld: checksumming directory block 0
Dec  1 17:32:45 e20home kernel: [ 1054.920811] EXT4-fs error (device dm-28): __ext4_find_entry:1541: inode #2884850: comm gitaly: checksumming directory block 0
Dec  1 17:32:56 e20home kernel: [ 1065.912091] vmbr1: received packet on bond0 with own address as source address (addr:34:97:f6:31:82:13, vlan:0)
Dec  1 17:33:02 e20home kernel: [ 1072.358923] EXT4-fs warning (device dm-28): ext4_dirblock_csum_verify:370: inode #2884850: comm gitaly: No space for directory leaf checksum. Please run e2fsck -D.
Dec  1 17:41:23 e20home kernel: [ 1573.057286] EXT4-fs error (device dm-28): __ext4_find_entry:1541: inode #2884850: comm gitaly: checksumming directory block 0
Dec  1 17:49:03 e20home kernel: [ 2033.474292] EXT4-fs error (device dm-28): __ext4_find_entry:1541: inode #2884850: comm gitaly: checksumming directory block 0

Our system uses NVMe PCIe 4.0 disks on an Asus X570 motherboard. We don't know whether that is related to the issue, or whether a recent kernel update is to blame.
We will try booting the previous kernel to see if that makes a difference, but any help with possible ways of debugging this would be appreciated.

Code:
pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-4
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
You're running an older kernel. PVE 6.2 uses the 5.4 kernel.
Please post the output of df -h and the syslog/journal since the last boot.
 
We are actually investigating how that could have happened. The latest kernel and its initramfs were correctly installed and updated and, as far as we know, no manual selection was made, so we are now completely unsure whether the issues came up while running 5.4.73-1 or 5.3.10-1.

Our rescue disk is a Debian install without PVE, so we could not have generated the pveversion output with it.

We've now updated again and made sure the default boot entry uses the correct kernel (5.4.73). We will have to monitor and stress the disks again to see whether the issue comes back.
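
Since it was unclear which kernel had actually been booted, a small check against the `pveversion -v` output can flag the case where the running kernel is not the newest installed one. This is a minimal sketch, assuming the output format shown earlier in this thread; `kernel_status`, `version_key` and `sample_output` are illustrative names, not part of any Proxmox tooling.

```python
import re

# Both patterns assume the `pveversion -v` layout quoted in this thread.
RUNNING = re.compile(r"running kernel: (\d+(?:\.\d+)*-\d+)-pve")
INSTALLED = re.compile(r"^pve-kernel-(\d+\.\d+\.\d+-\d+)-pve:", re.MULTILINE)


def version_key(version):
    """Turn '5.4.73-1' into (5, 4, 73, 1) so versions compare numerically."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def kernel_status(pveversion_output):
    """Return (running, newest_installed, running_is_newest)."""
    running_match = RUNNING.search(pveversion_output)
    installed = INSTALLED.findall(pveversion_output)
    if running_match is None or not installed:
        raise ValueError("could not find kernel versions in the given output")
    running = running_match.group(1)
    newest = max(installed, key=version_key)
    return running, newest, version_key(running) == version_key(newest)


# Excerpt of the first pveversion output posted in this thread.
sample_output = """\
proxmox-ve: 6.2-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
"""

running, newest, is_newest = kernel_status(sample_output)
print(f"running {running}, newest installed {newest}, up to date: {is_newest}")
```

Running something like this after each reboot, with the real command output in place of the excerpt, would at least document which kernel was active when errors appear.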

The only error while updating was that it could not rm /boot/pve because it did not exist, which we assume is not relevant.

Will post if the errors return. Thanks.

[UPDATE]
We've now confirmed that the issues came up with kernel 5.4.73. Unsure how a previous kernel was running. Trying to reproduce again.
 
We've now confirmed that this seems to have started after updating to and running kernel 5.4.73.

After reconfirming that there are no apparent physical errors on the NVMe units, memory or power supply, our most reasonable conclusion is that this is some form of issue between the Linux raid10 below the LVM and this new kernel.
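
One way to gather more evidence about the raid layer is the md `mismatch_cnt` value that the kernel exposes in sysfs after a `check` run (started by writing `check` to the array's `sync_action` file). The sketch below only reads those counters; the function name `read_mismatch_counts` is made up here, and whether such a check was part of the tests described in this thread is not known. A nonzero value is a hint that the mirror halves differ, not proof of where the corruption comes from.

```python
from pathlib import Path


def read_mismatch_counts(sys_block=Path("/sys/block")):
    """Return {array_name: mismatch_cnt} for every md array found in sysfs.

    The counter is only meaningful after a 'check' pass has run on the array.
    """
    counts = {}
    for path in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        array = path.parts[-3]  # e.g. 'md0' from .../md0/md/mismatch_cnt
        counts[array] = int(path.read_text().strip())
    return counts


def arrays_with_mismatches(counts):
    """Names of arrays whose last check reported differing sectors."""
    return [name for name, count in counts.items() if count > 0]
```

Calling `arrays_with_mismatches(read_mismatch_counts())` on the host lists the arrays worth a closer look. The `sys_block` parameter exists only so the function can be pointed at a test directory.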

After restoring root and home the system seems stable for a couple of hours: no errors in the log, and then suddenly the system becomes unstable. By the time we see the first EXT4 errors in the logs, the partitions are already badly damaged, with tons of lost files and directories showing up in fsck. The filesystems become so unstable that we need to rm and copy rather than rsync to get them working again.

We've opted to move the VGs to another disk so we can destroy the underlying raids and perhaps add separate partitions later with the LVM mirror option. As we understand it, the LVM thin pool cannot be "moved" as such, so we are moving the CT disks to a new thin pool outside of the raids so we can reconfigure everything, in the hope that the cause is as explained. I include some of the requested logs and info.

Code:
df -h

udev                      16G      0   16G   0% /dev
tmpfs                    3,2G   1,8M  3,2G   1% /run
/dev/mapper/pve-root      42G   4,2G   35G  11% /
tmpfs                     16G    37M   16G   1% /dev/shm
tmpfs                    5,0M   4,0K  5,0M   1% /run/lock
tmpfs                     16G      0   16G   0% /sys/fs/cgroup
/dev/mapper/data-home2   570G   304G  238G  57% /home2
/dev/mapper/data-logs     18G    18G     0 100% /logs   -- test being done.
/dev/mapper/data-cache    12G   5,1G  6,1G  46% /cache
/dev/sdc1                1,8T   1,7T   62G  97% /backups
/dev/mapper/data-tmp      15G    41M   14G   1% /tmp
/dev/mapper/data-video   295G   224G   56G  81% /home2/serverDisk1/VIDEO
/dev/sda5                511M   210M  302M  42% /boot/efi
/dev/mapper/data-home    247G   132G  103G  57% /home
/dev/fuse                 30M    36K   30M   1% /etc/pve
tmpfs                    3,2G      0  3,2G   0% /run/user/0

pvesm status
Name                  Type     Status           Total            Used       Available        %
dumps-bak2             dir     active        43086896         4360820        36507684   10.12%
dumps-hdbackup         dir     active      1922728840      1760681668        64355112   91.57%
local                  dir     active        43086896         4360820        36507684   10.12%
thin               lvmthin     active       272629760       185470025        87159734   68.03%
thin-bak           lvmthin     active       272629760         2835349       269794410    1.04%
thin-ssd           lvmthin     active       188743680        33143390       155600289   17.56%
vz-alsur               dir     active       597580336       318605440       248549808   53.32%

pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-4
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 

On further investigation, the filesystem corruption seems to have stopped after returning to kernel 5.4.65.
We had similar issues when running the rescue environment under Debian with kernel 4.19.0-12, which I assume is perhaps the one PVE 5.4.73 is derived from. That would suggest the problem is related to it, although we have no clue how.

We will be changing the raid and partition structure and remain on that kernel for now till we can confirm what is going on.

By the way, what's the best way to change the UEFI boot preference without uninstalling the new kernel?
 
PVE 6 uses the Ubuntu 20.04 kernel. Are you using mdraid?
 
I faced this today as well when trying to run Docker under Ubuntu 20. On the first run it works; after a reboot the VM shows

ext4_dirblock_csum_verify:377: inode #278015: comm containerd: No space for directory leaf checksum

Using an OCFS2 cluster over iSCSI.

Rolled back all hosts to the stable kernel 5.4.73-1-pve.
The issue is happening together with this one: https://forum.proxmox.com/threads/centos-7-crashes-on-pve-5-4-78-1.80939/#post-382585


Question to the Proxmox devs: will the new Proxmox 7 work on the old 5.4.73-1-pve kernel?
 