Broken Filesystem in LXC

Andi · Dec 27, 2017

Hello everyone!

I have a strange problem with my LXCs the last days and I already spent hours during x-mas, but couldn't solve it...

Here is the "history":
Last week (on 21st & 22nd Dec) two of my four cluster-hosts crashed and nothing worked on it (only question marks in web panel and no bash-command worked, even reboot). I had to reset them. I honestly don't know why they hang and what the problem was, but since it only happened once, that's not my actual problem.

And here is what's about now:
In all LXC-Containers which were running on the hosts during crash I have filesystem errors since the crash. kern.log looks like this (on host and in LXC):

Code:

kernel: [360205.838168] EXT4-fs (loop3): previous I/O error to superblock detected
kernel: [360205.839314] Buffer I/O error on dev loop3, logical block 0, lost sync page write
kernel: [367213.313281] print_req_error: I/O error, dev loop11, sector 470195848

If I saw such of this errors I shut down the LXC and tried to repair the filesystem with e2fsck:

Code:

e2fsck -p -c -f -v /mnt/pve/storage/images/71403/vm-71403-disk-1.raw

This worked fine, it found some broken inodes and fixed it:

Code:

recovering journal
Clearing orphaned inode 1180012 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 1179930 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 10444 (uid=0, gid=0, mode=0100600, size=0)
Updating bad block inode.

When I start the LXC now everything works fine for a few hours and then the kern.log shows FS-errors for the LXC again... For a few LXCs I already did this three times and the problems still occur.

Is there anybody with an idea what to do? I really really don't what to re-setup all LXCs (about 20 are affected)... restoring backups is even not really a option, because some data changed a lot the last days...

Moreover there happened something strange yesterday: kern.log showed filesystem errors for a container which has never been on the crashed hosts. After the crash I migrated all containers to a working host which didn't crashed and now a LXC is effected which was on this working host during the crash of the other ones?!
Could one broken LXC-filesystem affect others on the same host/storage? I already tested the ZFS-filesystem on the storage, this hadn't any errors.

Thanks for your help!
Andi

Alwin · Dec 28, 2017

Andi said:
Last week (on 21st & 22nd Dec) two of my four cluster-hosts crashed and nothing worked on it (only question marks in web panel and no bash-command worked, even reboot). I had to reset them. I honestly don't know why they hang and what the problem was, but since it only happened once, that's not my actual problem.

I think, that this is connected to your corrupt filesystems in the CTs.

Andi said:
/mnt/pve/storage/images/71403/vm-71403-disk-1.raw

What is your shared storage? As you said other lxc images are corrupt too, this looks like the common denominator.

Also, which version (pveversion -v) do you run and are you on the lastest updates?

Andi · Dec 28, 2017

Alwin said:
What is your shared storage?

On one of my four cluster hosts there is a NFS server, which is based on a ZFS raid.

Alwin said:
Also, which version (pveversion -v) do you run and are you on the lastest updates?

Yes, I'm on the latest updates:

Code:

proxmox-ve: 5.1-30 (running kernel: 4.13.8-3-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)

Alwin · Dec 28, 2017

Andi said:
Yes, I'm on the latest updates:

I am afraid, not quite the latest, you can find new packages in the repositories, also the kernel updates to 4.13.13-2.

Code:

proxmox-ve: 5.1-32 
pve-manager: 5.1-41

So it could possibly be some issue with the network to the fourth node, NFS server, RAM, disks on the fourth node. As an alternative test, you could put some of the containers onto local storage to check if the failure still persists.

Search

Search

Broken Filesystem in LXC

Andi

Member

Alwin

Proxmox Retired Staff

Andi

Member

Alwin

Proxmox Retired Staff