Hello everyone!
I've been having a strange problem with my LXCs for the last few days and have already spent hours on it over Christmas, but couldn't solve it...
Here is the "history":
Last week (on 21st & 22nd Dec) two of my four cluster hosts crashed and nothing worked on them (only question marks in the web panel, and no bash command worked, not even reboot). I had to reset them. I honestly don't know why they hung or what the problem was, but since it only happened once, that's not my actual problem.
And here is what it's about now:
All LXC containers that were running on those hosts during the crash have had filesystem errors ever since. kern.log looks like this (on the host and in the LXC):
Code:
kernel: [360205.838168] EXT4-fs (loop3): previous I/O error to superblock detected
kernel: [360205.839314] Buffer I/O error on dev loop3, logical block 0, lost sync page write
kernel: [367213.313281] print_req_error: I/O error, dev loop11, sector 470195848
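A pattern like the following can be used to spot new occurrences of these errors (a sketch; the patterns are taken from the log lines above, and the log path may differ per distro):

```shell
# Sketch: list EXT4/I/O errors for loop devices in kern.log
# (same patterns as the log excerpt above; path may differ per distro)
grep -E 'EXT4-fs \(loop[0-9]+\)|I/O error, dev loop' /var/log/kern.log
```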
Whenever I saw errors like these, I shut down the LXC and tried to repair the filesystem with e2fsck:
Code:
e2fsck -p -c -f -v /mnt/pve/storage/images/71403/vm-71403-disk-1.raw
This worked fine; it found some broken inodes and fixed them:
Code:
recovering journal
Clearing orphaned inode 1180012 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 1179930 (uid=0, gid=0, mode=0100666, size=0)
Clearing orphaned inode 10444 (uid=0, gid=0, mode=0100600, size=0)
Updating bad block inode.
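The check-and-restart cycle above can be scripted with a look at e2fsck's exit code, so a container is only started again when the filesystem actually came back clean. This is only a sketch: the `pct` calls assume Proxmox, and CTID 71403 with its image path are just the example values from above.

```shell
#!/bin/sh
# Sketch: fsck one container's raw ext4 image, restart only if it is clean.
# CTID and image path are the example values from the post -- adjust as needed.
CTID=71403
DISK="/mnt/pve/storage/images/${CTID}/vm-${CTID}-disk-1.raw"

pct stop "$CTID"                 # Proxmox: make sure the image is not in use
e2fsck -f -p -v "$DISK"
status=$?

if [ "$status" -le 1 ]; then     # 0 = clean, 1 = errors were corrected
    pct start "$CTID"
else                             # >= 2: fsck could not fully repair it
    echo "e2fsck exited with $status -- filesystem still broken" >&2
fi
```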
When I start the LXC again, everything works fine for a few hours, and then kern.log shows FS errors for that LXC again... For a few LXCs I have already done this three times, and the problems still occur.
Is there anybody with an idea what to do? I really don't want to re-setup all LXCs (about 20 are affected)... Restoring backups is not really an option either, because some data has changed a lot over the last few days...
Moreover, something strange happened yesterday: kern.log showed filesystem errors for a container which has never been on the crashed hosts. After the crash I migrated all containers to a working host which didn't crash, and now an LXC is affected which was on that working host during the crash of the others?!
Could one broken LXC filesystem affect others on the same host/storage? I already checked the ZFS filesystem on the storage; it didn't have any errors.
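For reference, one way to verify the underlying ZFS storage is a scrub plus a status check (a sketch; "storage" is an assumed pool name and may differ on your setup):

```shell
# Sketch: verify the underlying pool; "storage" is an assumed pool name.
zpool scrub storage       # re-reads every block and verifies checksums
zpool status -v storage   # check READ/WRITE/CKSUM counters and scrub result
```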
Thanks for your help!
Andi