LXC containers frozen after bulk migration

DynFi User

After upgrading our 3-node Ceph cluster to the latest release (the whole process went perfectly), one of the last steps was to migrate 15 CTs back to their original server. I migrated them using "batch migrate with 4 parallel tasks".

Everything on those servers was 100% OK.

pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)

This is where the problem began…

After the migration, "pct" on the target system appeared frozen: "pct list", "pct status" and similar commands never returned.
The GUI started to show a question mark for all CTs on that server, and for the server itself…

I was still able to access the server over SSH, and the underlying Ceph system seemed to be working perfectly.

There was no way to interact with the containers. I only managed to kill a couple of LXC startup processes before I decided to reboot…

I could see these logs:

Code:
Feb 08 11:38:59 pve2 sudo[86302]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Feb 08 11:39:01 pve2 audit[86555]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86555 comm="(>
Feb 08 11:39:01 pve2 kernel: audit: type=1400 audit(1675852741.751:290): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/sys>
Feb 08 11:39:01 pve2 sudo[86211]: pam_unix(sudo:session): session closed for user root
Feb 08 11:39:03 pve2 audit[86606]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-418_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=86606 comm="(>
Feb 08 11:39:03 pve2 kernel: audit: type=1400 audit(1675852743.611:291): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-418_</var/lib/lxc>" name="/run/sys>
Feb 08 11:40:01 pve2 pmxcfs[2454]: [status] notice: received log
Feb 08 11:42:01 pve2 pmxcfs[2454]: [status] notice: received log


And after reboot, these ones:

Code:
Feb 08 12:04:57 pve2 audit[12478]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:57 pve2 audit[12478]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:57 pve2 kernel: rbd: rbd6: breaking header lock owned by client62239310
Feb 08 12:04:57 pve2 audit[12526]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12509]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 kernel: rbd: rbd6: breaking object map lock owned by client62239310
Feb 08 12:04:58 pve2 audit[12530]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 kernel: rbd: rbd6: capacity 10737418240 features 0x3d
Feb 08 12:04:58 pve2 kernel: EXT4-fs warning (device rbd6): ext4_multi_mount_protect:326: MMP interval 42 higher than expected, please wait.
Feb 08 12:04:58 pve2 audit[12534]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12540]: AVC apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-306_</var/lib/lxc>//&:lxc-306_<-var-lib-lxc>:unconfined" pid=1>
Feb 08 12:04:58 pve2 audit[12571]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=12571 comm="(>
Feb 08 12:05:01 pve2 pmxcfs[2589]: [status] notice: received log
Feb 08 12:05:09 pve2 audit[13015]: AVC apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=13015 comm="(>
Feb 08 12:05:09 pve2 kernel: kauditd_printk_skb: 26 callbacks suppressed
Feb 08 12:05:09 pve2 kernel: audit: type=1400 audit(1675854309.326:88): apparmor="DENIED" operation="mount" info="failed perms check" error=-13 profile="lxc-306_</var/lib/lxc>" name="/run/syst>
Feb 08 12:05:11 pve2 audit[12177]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/proc/sys/" pid=12177 comm="(un-parts)" fla>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.162:89): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/proc/sys/>
Feb 08 12:05:11 pve2 audit[13326]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cgroup/freezer/" pid=13326 comm="(s>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.182:90): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cg>
Feb 08 12:05:11 pve2 audit[13328]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cgroup/net_cls,net_prio/" pid=13328>
Feb 08 12:05:11 pve2 kernel: audit: type=1400 audit(1675854311.194:91): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-305_</var/lib/lxc>" name="/sys/fs/cg


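For anyone sifting through similar logs, here is a minimal Python sketch that tallies the AppArmor denials per container. It assumes the journal lines look like the ones above (an `apparmor="DENIED"` entry followed by `operation`, `info` and a `profile="lxc-<CTID>_..."` field). The function name `tally_denials` is made up for illustration. The `kernel: audit:` lines are skipped because they repeat the `audit[...]` entries.

```python
import re
from collections import Counter

# Matches the DENIED entries as they appear in the logs above.
# Assumption: the fields always come in the order operation, info, profile.
DENIED_RE = re.compile(
    r'apparmor="DENIED" operation="(?P<op>[^"]+)" info="(?P<info>[^"]+)"'
    r'.*?profile="lxc-(?P<ctid>\d+)_'
)


def tally_denials(lines):
    """Count AppArmor denials per (CT ID, operation, info) tuple."""
    counts = Counter()
    for line in lines:
        # The kernel echoes each audit record; skip the echo to avoid
        # counting the same denial twice.
        if " kernel: audit:" in line:
            continue
        match = DENIED_RE.search(line)
        if match:
            key = (match.group("ctid"), match.group("op"), match.group("info"))
            counts[key] += 1
    return counts
```

Passing it the lines of a saved journal excerpt (for example `tally_denials(open("journal.txt"))`, where `journal.txt` is a hypothetical file name) gives a quick view of which CTs are hitting which denial.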

I'm not sure whether there is anything I should do to avoid this in future upgrades.

Things got back on track on their own after the reboot, but I really don't like that, since I was forced to hard reset the server because of the stuck LXC CTs.

Everything is running again, and the root cause seems to have been the parallel tasks, or something locking the migration process for the LXC CTs.
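If parallelism is indeed the trigger (which is only a guess at this point), one conservative alternative is to migrate the CTs one at a time with `pct migrate <vmid> <target> --restart`. Below is a minimal Python sketch of that idea. The helper names are hypothetical, the CT IDs and target node in the usage note are only illustrative, and the default is a dry run that just prints the commands.

```python
import subprocess


def build_migrate_commands(ct_ids, target_node):
    """Build one 'pct migrate' command per container (restart mode)."""
    return [
        ["pct", "migrate", str(ct_id), target_node, "--restart"]
        for ct_id in ct_ids
    ]


def migrate_sequentially(ct_ids, target_node, dry_run=True):
    """Run the migrations strictly one after another.

    With dry_run=True (the default) the commands are only printed.
    With dry_run=False each command must succeed before the next one
    starts; a failure raises CalledProcessError and stops the loop.
    """
    commands = build_migrate_commands(ct_ids, target_node)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

For example, `migrate_sequentially([305, 306, 418], "pve2")` run on the source node would print the three commands without executing anything. This does not address the underlying hang; it only removes the parallel tasks from the picture.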

Any feedback on this would be much appreciated.
 
As additional information, I can add the following:

  • LXC seems slow on this server (which is not normal, since it runs on powerful NVMe SSDs and has 384 GB of mostly unused RAM).
  • While upgrading some packages inside one of the LXC CTs, I received the following messages:
    • Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
      Failed to open connection to "system" message bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
      Error: Timeout was reached

The CT runs Ubuntu 20.04.5 LTS.
 
