Broken Filesystem in LXC

Hello everyone!

I have had a strange problem with my LXCs for the last few days and have already spent hours on it over Christmas, but couldn't solve it...

Here is the "history":
Last week (on 21st & 22nd Dec) two of my four cluster hosts crashed and nothing worked on them anymore (only question marks in the web panel, and no bash command worked, not even reboot). I had to reset them. I honestly don't know why they hung or what the problem was, but since it only happened once, that's not my actual problem.

And here is what's about now:
All LXC containers that were running on those hosts during the crash have shown filesystem errors ever since. kern.log looks like this (on the host and inside the LXC):
Code:
kernel: [360205.838168] EXT4-fs (loop3): previous I/O error to superblock detected
kernel: [360205.839314] Buffer I/O error on dev loop3, logical block 0, lost sync page write
kernel: [367213.313281] print_req_error: I/O error, dev loop11, sector 470195848
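
To figure out which container each loop device belongs to, I map it back to its backing raw image roughly like this (loop3 is just the device from the log above):
Code:
# list all loop devices together with their backing files
losetup -l
# or check a single device, e.g. loop3 from the log above
losetup -l /dev/loop3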

Whenever I saw errors like these, I shut down the LXC and tried to repair the filesystem with e2fsck:
Code:
e2fsck -p -c -f -v /mnt/pve/storage/images/71403/vm-71403-disk-1.raw
This worked fine; it found some broken inodes and fixed them:
Code:
recovering journal
Clearing orphaned inode 1180012 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 1179930 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 10444 (uid=0, gid=0, mode=0100600, size=0)
Updating bad block inode.
When I start the LXC again, everything works fine for a few hours, and then kern.log shows filesystem errors for that LXC again... For a few LXCs I have already done this three times and the problems still occur.
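
For reference, the full cycle I go through each time looks roughly like this (CT 71403 is just the example from above; the container has to be stopped so the image isn't mounted while e2fsck runs):
Code:
# stop the container so its raw image is no longer attached via loop
pct stop 71403
# check and repair the ext4 filesystem inside the raw image
e2fsck -p -f -v /mnt/pve/storage/images/71403/vm-71403-disk-1.raw
# start it again and keep watching kern.log for new errors
pct start 71403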

Does anybody have an idea what to do? I really, really don't want to set up all the LXCs again (about 20 are affected)... Restoring backups isn't really an option either, because some data has changed a lot over the last few days...

Moreover, something strange happened yesterday: kern.log showed filesystem errors for a container which has never been on the crashed hosts. After the crash I migrated all containers to a working host that didn't crash, and now an LXC is affected that was already on this working host when the others crashed?!
Could one broken LXC filesystem affect others on the same host/storage? I already checked the ZFS filesystem on the storage, and it didn't show any errors.
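
For completeness, this is roughly how I checked the pool on the storage box (the pool name "tank" is just a placeholder for mine):
Code:
# show pool health and any known data errors (replace "tank" with the actual pool name)
zpool status -v tank
# re-read all data and verify checksums, then check the status again once the scrub finishes
zpool scrub tank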

Thanks for your help!
Andi
 
Last week (on 21st & 22nd Dec) two of my four cluster hosts crashed and nothing worked on them anymore (only question marks in the web panel, and no bash command worked, not even reboot). I had to reset them. I honestly don't know why they hung or what the problem was, but since it only happened once, that's not my actual problem.
I think this is connected to the corrupt filesystems in your CTs.

/mnt/pve/storage/images/71403/vm-71403-disk-1.raw
What is your shared storage? Since you say other LXC images are corrupt too, this looks like the common denominator.
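
Something like this should show how /mnt/pve/storage is defined and whether it is currently online:
Code:
# show the storage definitions, including the one backing /mnt/pve/storage
cat /etc/pve/storage.cfg
# check that all configured storages are reachable and active
pvesm status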

Also, which version do you run (pveversion -v), and are you on the latest updates?
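
To check and update, roughly (assuming a working package repository is configured):
Code:
# show the installed package versions
pveversion -v
# pull the latest updates
apt update && apt dist-upgrade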
 
Yes, I'm on the latest updates:
I am afraid not quite the latest; you can find newer packages in the repositories, and the kernel updates to 4.13.13-2 as well.
Code:
proxmox-ve: 5.1-32 
pve-manager: 5.1-41

So it could possibly be an issue with the network to the fourth node, the NFS server, or the RAM or disks on the fourth node. As an alternative test, you could put some of the containers onto local storage and check whether the failures still occur.
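
A safe way to test that on 5.1 is a backup/restore onto local storage (the VMID 171403 and the storage names "local"/"local-lvm" below are only examples):
Code:
# back up one affected container (stopped, for a consistent image)
vzdump 71403 --storage local --mode stop
# restore it under a new VMID onto local storage as a test copy
# (assuming only one backup archive of this CT exists in the dump directory)
pct restore 171403 /var/lib/vz/dump/vzdump-lxc-71403-*.tar* --storage local-lvm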
 
