LXC Unprivileged Container Isolation

silverstone

Renowned Member
Apr 28, 2018
I am trying to understand a bit better the Security Architecture of using LXC Unprivileged Containers.

I am familiar with Virtual Machines (KVM) and also Podman Containers (similar to Docker), but relatively recently I've been deploying quite a few LXC Unprivileged Containers. I can definitely see the Advantages in terms of:
  • Much lower Disk Space Usage
  • Much quicker System Updates
  • Somewhat lower CPU Usage
  • Somewhat lower RAM Usage
  • No Issues related to IOMMU Groups & PCIe Passthrough (e.g. can share a GPU with multiple Containers or passthrough a PCIe Device even though it's within a "Shared" IOMMU Group with the SATA/NIC Controllers)

However I am now wondering about the Security Implications.

LXC Unprivileged Containers are generally regarded as safe, because, well, they are unprivileged :) .

One Point I am considering, while forgetting many others for sure, is about mounted LXC Shares and File Permissions.

That a compromised unprivileged LXC Container could gain root Access on the Proxmox VE Host seems quite unlikely to me right now.

However, a compromised unprivileged LXC Container gaining root Access to another unprivileged LXC Container seems somewhat easier to me, particularly because by default all LXC Containers run as the same User.

a. Mounted Files / Folders

On my System, it seems that all LXC Containers have this ownership:
  • User: 100000
  • Group: 100000 (lxc_shares)
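This can be checked on the Host with something like the following (the Path is just an Example for a ZFS-backed Container Subvolume):

Code:
# On the Proxmox VE Host: show the numeric Ownership of a Container's Root Filesystem
ls -ldn /rpool/data/subvol-<containerid>-disk-0
# With the default Mapping this typically reports UID/GID 100000 100000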

Doesn't this pose a particular Risk? Can we be sure that a User "escaping" what is essentially a chroot cannot "jump" into another Container?

I am not sure about the practicalities (particularly if they are read-only and/or different Mountpoints), but IIRC Programs like rsync by default do NOT follow Symlinks, and this was one of the Reasons why. Couldn't something similar occur in this Scenario?

b. Shared /dev/net Mountpoint
This is required for e.g. running Podman Rootless within a LXC Unprivileged Container.

Since all Containers share the /dev/net Mountpoint, couldn't a compromised unprivileged LXC Container escape this way and affect the other Containers?

How to solve

As I see it, the fix is relatively trivial. Just use a separate UID/GID Block for each LXC Container and call it a Day. I do this already for Podman within LXC, but in that Case I only remap the podman User within the LXC Container to e.g. a forgejo User on the Host. For the next LXC Container I choose a different Host Username & UID/GID to remap to.

But I guess that the same Concept should be extended to the whole UIDMap / GIDMap from Host -> LXC Container. I am not sure what the practical Limits are, but couldn't we remap an entire Block of 200000 IDs for each Container?

Since subuid / subgid are most likely limited to the Range 0-4.294.967.295, even when assigning 200.000 subuid / subgid for each LXC, you would need to run more than 21.474 LXC Containers (more than twenty thousand !) on a single Host before running out.
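As a quick Sanity Check of that Division (Shell Integer Arithmetic):

Code:
# 2^32 possible IDs divided into Blocks of 200000 IDs per Container
echo $(( 4294967296 / 200000 ))
# -> 21474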

Why is this not the default though? Wouldn't this achieve better Security by further isolating the Containers from each other?
 
By default all unprivileged containers map to the same host range (typically 100000:65536). The actual isolation relies on multiple kernel layers, not just UIDs:

1. Mount namespaces: each container has its own filesystem view. A process inside container A literally cannot see container B's mount tree.
2. PID namespaces: processes in one container can't signal or ptrace processes in another.
3. AppArmor/seccomp profiles: restrict syscalls that could be used for escape.
4. cgroup v2 device controller: restricts which devices a container can access.

So the scenario you're worried about requires first breaking out of the mount namespace (a kernel-level escape), and then having matching UIDs to access another container's files. The shared UIDs make the second step trivial if the first one succeeds; that's the problem.
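You can see the shared range directly on the host; a rough check (assuming the default 100000 offset):

# list processes whose effective UID falls into the unprivileged container range
ps -eo uid,pid,comm --no-headers | awk '$1 >= 100000'
# root in every container shows up as host UID 100000, no matter which container it belongs to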

Regarding symlinks and bind mounts: Symlinks inside a container resolve within that container's mount namespace, so a symlink to /etc/shadow stays within the container. However, if two containers have bind mounts to the same host directory (shared storage), then bingo, both containers' processes run as the same host UIDs and can freely read/write each other's files in that shared path. This is by design for shared storage, but it means a compromised container with a shared mount can tamper with another container's data without any namespace escape at all.
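To make that concrete, a hypothetical setup where two containers bind-mount the same host directory (paths and CT IDs made up):

# /etc/pve/lxc/100.conf
mp0: /tank/shared,mp=/mnt/shared

# /etc/pve/lxc/101.conf
mp0: /tank/shared,mp=/mnt/shared

With the default map, UID 1000 inside CT 100 and UID 1000 inside CT 101 are both host UID 101000, so files written by one are fully accessible to the other.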

The device node /dev/net/tun is a kernel interface for creating TUN/TAP interfaces. The actual network interfaces created through it are scoped to the creating process's network namespace. So container A creating a tun0 device doesn't give it access to container B's network.
The risk here is more subtle: if there's a kernel vulnerability in the TUN/TAP driver itself, having the device node available increases the attack surface. But under normal operation, the network namespace isolation holds.
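For context, the typical way that device node gets exposed looks roughly like this in the container config (c 10:200 is the TUN/TAP device's major:minor number):

lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net dev/net none bind,create=dir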

Your solution and the math seem correct. The reasons Proxmox doesn't default to this are probably:

1. Shared storage becomes painful: if container A is 100000:65536 and container B is 200000:65536, they can't share a bind-mounted directory without ACLs, idmapped mounts (kernel 5.12+), or an intermediate permission scheme (see the ACL sketch after this list).
2. Migration and backup complexity: custom ID maps must travel with the container. Restoring a backup to a different host requires the same subuid/subgid configuration.
3. Historical simplicity: the LXC/Proxmox developers considered namespace isolation sufficient for the default case. The shared UID range is a weakness in defense-in-depth, not a failure of the primary isolation.
4. Proxmox's target audience: many users run trusted workloads where inter-container isolation is less critical than host protection.
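The ACL workaround from point 1 could look roughly like this (host UIDs and the path are hypothetical, and the filesystem needs ACL support):

# grant the root UIDs of two differently-mapped containers access to a shared directory;
# the d: entries are default ACLs so newly created files inherit the permissions
setfacl -R -m u:100000:rwX -m d:u:100000:rwX /tank/shared
setfacl -R -m u:300000:rwX -m d:u:300000:rwX /tank/shared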

You can configure this per container in /etc/pve/lxc/<CTID>.conf:

# Container 100: UIDs 100000-299999
lxc.idmap: u 0 100000 200000
lxc.idmap: g 0 100000 200000

# Container 101: UIDs 300000-499999
lxc.idmap: u 0 300000 200000
lxc.idmap: g 0 300000 200000

And the corresponding entries in /etc/subuid and /etc/subgid on the host:

root:100000:200000
root:300000:200000

With 200,000 IDs per container and a 32-bit UID space, you get room for about 21,474 containers, as you calculated; that's more than enough.

Separate UID ranges could be a good solution for security-conscious deployments. The main tradeoff is operational complexity with shared storage, which idmapped mounts (available in modern kernels) can probably solve.
my 50 cents.
 
Thanks for your in-depth Explanation :) .

Maybe to add yet another Attack Surface related to Mountpoints: what about the Case of a shared GPU via one or more of the following
Code:
dev0: /dev/dri/card0,mode=0660
dev1: /dev/dri/renderD128,gid=992,mode=0666
lxc.mount.entry: /dev/net dev/net none bind,create=dir
lxc.mount.entry: /dev/dri/by-path dev/dri/by-path none rbind,create=dir 0 0
lxc.mount.entry: /dev/kfd dev/kfd none bind,create=file 0 0
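For Reference, the Ownership & Permissions of those Device Nodes can be inspected on the Host with something like:

Code:
# On the Host: numeric Ownership & Mode of the GPU Device Nodes
ls -ln /dev/dri /dev/kfd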

Right now my UID Mapping looks like the following; only the podman User inside the LXC Container is remapped on the Host, everything else is as default:
Code:
# System UID/GID inside Container (UID/GID < 1000)
lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000


# Remap UID & GID <1000> inside Container (<podman> User/Group) to UID 1002 on the Host
lxc.idmap: u 1000 1002 1
lxc.idmap: g 1000 1002 1

# Remaining UID/GID inside Container
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535
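For Reference, this Mapping also needs the corresponding Entries in /etc/subuid and /etc/subgid on the Host, so that root is allowed to use the remapped ID (1002) in addition to the default Range:

Code:
root:100000:65536
root:1002:1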

This is basically what I did, following this Tutorial.

Remapping only UID 1000 (inside the Container) is quite easy: I just need to chown -R <hostuid>:<hostuid> /rpool/data/subvol-<containerid>-disk-0/home/podman/. However, if I remap the entire Range, it surely gets more complicated, since I would need to chown every File/Folder based on all the Ownership in the Container's /etc/passwd (User ID) and /etc/group (Group ID), and they are not limited to only a specific Path.

It probably can be automated by a small Script that loops over these two Files inside the Container, applies a Numeric Offset, and uses chown --from (untested):

Code:
#!/bin/bash

# Get Container ID
CONTAINER_ID="$1"

# Offset ID
# Might need to use a better algorithm if Container ID is Huge
OFFSET_ID=$(echo "${CONTAINER_ID}*200000" | bc)

# Get List of UIDs inside Container
mapfile -t CONTAINER_UIDS < <(awk -F: '{print $3}' "/rpool/data/subvol-${CONTAINER_ID}-disk-0/etc/passwd")

# Get List of GIDs inside Container
mapfile -t CONTAINER_GIDS < <(awk -F: '{print $3}' "/rpool/data/subvol-${CONTAINER_ID}-disk-0/etc/group")

# Loop over Containers UIDs
for CONTAINER_UID in "${CONTAINER_UIDS[@]}"
do
    # Get current Host UID
    OLD_HOST_UID=$(echo "100000 + ${CONTAINER_UID}" | bc)

    # Compute new Host UID
    NEW_HOST_UID=$(echo "${OLD_HOST_UID} + ${OFFSET_ID}" | bc)

    # Change only UID
    chown --from="${OLD_HOST_UID}" -Rc "${NEW_HOST_UID}" "/rpool/data/subvol-${CONTAINER_ID}-disk-0/"
done

# Loop over Containers GIDs
for CONTAINER_GID in "${CONTAINER_GIDS[@]}"
do
    # Get current Host GID
    OLD_HOST_GID=$(echo "100000 + ${CONTAINER_GID}" | bc)
    
    # Compute new Host GID
    NEW_HOST_GID=$(echo "${OLD_HOST_GID} + ${OFFSET_ID}" | bc)

    # Change only GID
    chown --from=:"${OLD_HOST_GID}" -Rc :"${NEW_HOST_GID}" "/rpool/data/subvol-${CONTAINER_ID}-disk-0/"
done
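Usage would then be something like the following (Script Name is just an Example); the Container has to be stopped first, and afterwards the lxc.idmap Entries in /etc/pve/lxc/<containerid>.conf as well as /etc/subuid / /etc/subgid need to be adjusted to the same Offset:

Code:
pct stop <containerid>
./remap-subvol.sh <containerid>
# adjust lxc.idmap, /etc/subuid and /etc/subgid to the new Offset before starting again
pct start <containerid>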