After upgrading our 3-node Ceph cluster to the latest release (the whole process went perfectly fine), one of my last actions was to migrate 15 CTs back to their original server. I migrated them using "batch migrate with 4 parallel tasks".
Everything on those servers was 100% OK.
pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)
This is where the problems began…
After the migration, the target system seemed to be frozen as far as "pct" was concerned: there was no way to run "pct list", "pct status" or anything similar.
The GUI started to show a question mark for all CTs on that server and for the server itself…
I was still able to access the server over SSH, and the underlying Ceph system seemed to be working perfectly.
Beyond that there was no way to interact with the system; I only managed to kill a couple of LXC startup processes before I decided to reboot.
Before the reboot, I could see these logs:
Code:
Feb 08 11:38:59 pve2 sudo[86302]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Feb 08 11:39:01 pve2 audit[86555]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86555 comm="(>
Feb 08 11:39:01 pve2 kernel: audit: type=1400 audit(1675852741.751:290): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/sys>
Feb 08 11:39:01 pve2 sudo[86211]: pam_unix(sudo:session): session closed for user root
Feb 08 11:39:03 pve2 audit[86606]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-418_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86606 comm="(>
Feb 08 11:39:03 pve2 kernel: audit: type=1400 audit(1675852743.611:291): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-418_</var/lib/lxc>" name="/run/sys>
Feb 08 11:40:01 pve2 pmxcfs[2454]: [status] notice: received log
Feb 08 11:42:01 pve2 pmxcfs[2454]: [status] notice: received log
And after the reboot, these:
Code:
Feb 08 12:04:57 pve2 audit[12478]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:57 pve2 audit[12478]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:57 pve2 kernel: rbd: rbd6: breaking header lock owned by client62239310
Feb 08 12:04:57 pve2 audit[12526]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12509]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 kernel: rbd: rbd6: breaking object map lock owned by client62239310
Feb 08 12:04:58 pve2 audit[12530]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 kernel: rbd: rbd6: capacity 10737418240 features 0x3d
Feb 08 12:04:58 pve2 kernel: EXT4-fs warning (device rbd6): ext4_multi_mount_protect:326: MMP interval 42 higher than expected, please wait.
Feb 08 12:04:58 pve2 audit[12534]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12540]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12571]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=12571 comm="(>
Feb 08 12:05:01 pve2 pmxcfs[2589]: [status] notice: received log
Feb 08 12:05:09 pve2 audit[13015]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=13015 comm="(>
Feb 08 12:05:09 pve2 kernel: kauditd_printk_skb: 26 callbacks suppressed
Feb 08 12:05:09 pve2 kernel: audit: type=1400 audit(1675854309.326:88): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/syst>
Feb 08 12:05:11 pve2 audit[12177]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/proc/sys/" pid=12177 comm="(un-parts)" fla>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.162:89): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/proc/sys/>
Feb 08 12:05:11 pve2 audit[13326]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cgroup/freezer/" pid=13326 comm="(s>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.182:90): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cg>
Feb 08 12:05:11 pve2 audit[13328]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cgroup/net_cls,net_prio/" pid=13328>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.194:91): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cg
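For anyone who wants to tally these AppArmor denials per container instead of reading them line by line, here is a minimal sketch in Python. It only assumes the key="value" layout visible in the lines above; the function name `summarize_denials` and the `SAMPLE` list are my own, and the sample lines are copied from the first log excerpt, truncation included.

```python
import re
from collections import Counter

# Matches key="value" pairs as they appear in the audit lines.
# A truncated field with no closing quote (e.g. comm="(> ) is simply not matched.
PAIR = re.compile(r'(\w+)="([^"]*)"')

def summarize_denials(lines):
    """Count AppArmor DENIED events per (profile, operation, info)."""
    counts = Counter()
    for line in lines:
        # Only look at the audit[pid] records; the "kernel: audit: type=1400"
        # lines repeat the same event and would double the counts.
        if " audit[" not in line:
            continue
        fields = dict(PAIR.findall(line))
        if fields.get("apparmor") != "DENIED":
            continue
        key = (
            fields.get("profile", "?"),
            fields.get("operation", "?"),
            fields.get("info", "?"),
        )
        counts[key] += 1
    return counts

# Three lines taken from the log excerpt above (still truncated by the pager).
SAMPLE = [
    'Feb 08 11:39:01 pve2 audit[86555]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86555 comm="(>',
    'Feb 08 11:39:01 pve2 kernel: audit: type=1400 audit(1675852741.751:290): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/sys>',
    'Feb 08 11:39:03 pve2 audit[86606]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-418_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86606 comm="(>',
]

if __name__ == "__main__":
    for (profile, operation, info), n in summarize_denials(SAMPLE).items():
        print(n, profile, operation, info)
```

The same function can be fed the full journal output saved to a file, one line per list entry.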
I'm not completely sure whether there is anything I should do to avoid this in future upgrades?
Things got back on track on their own after the reboot, but I really don't like that, since I was forced to hard-reset the server because of the stuck LXC CTs.
Everything is back on track now, and the root cause seems to have been the parallel tasks, or something locking the migration process for the LXC CTs.
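If the parallel tasks really were the trigger (which is only my guess), one thing I could try next time is migrating the CTs strictly one at a time from the command line. Below is a minimal sketch in Python, not something I have run against this cluster. It assumes `pct migrate <vmid> <target>` with restart mode enabled, it must be run on the node that currently hosts the CTs, and the helper names are hypothetical. The CT IDs and the target node are simply the ones that show up in the logs above.

```python
import subprocess

def migrate_commands(ctids, target):
    """Build one `pct migrate` command per container, in the given order."""
    # "--restart 1" requests restart-mode migration for running containers.
    return [["pct", "migrate", str(ctid), target, "--restart", "1"] for ctid in ctids]

def migrate_serially(ctids, target, dry_run=True):
    """Run the migrations one after another, stopping at the first failure."""
    done = []
    for cmd in migrate_commands(ctids, target):
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if pct exits non-zero,
            # so the remaining containers are left untouched.
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done

if __name__ == "__main__":
    # Dry run by default: prints the commands without touching any container.
    migrate_serially([305, 306, 418], "pve2")
```

With `dry_run=False` each migration has to finish before the next one starts, which avoids several containers starting up on the target at the same moment.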
Any feedback on this would be much appreciated.