I am trying to better understand the Security Architecture of LXC Unprivileged Containers.
I am familiar with Virtual Machines (KVM) and also Podman Containers (similar to Docker). Relatively recently I've been deploying quite a few LXC Unprivileged Containers, and I can definitely see the Advantages in terms of:
- Much lower Disk Space Usage
- Much quicker System Updates
- Somewhat lower CPU Usage
- Somewhat lower RAM Usage
- No Issues related to IOMMU Groups & PCIe Passthrough (e.g. I can share a GPU with multiple Containers or pass through a PCIe Device even though it's within a "Shared" IOMMU Group with the SATA/NIC Controllers)
However I am now wondering about the Security Implications.
LXC Unprivileged Containers are generally regarded as safe because, well, they are unprivileged.
One Point I am considering, while forgetting many others for sure, is about mounted LXC Shares and File Permissions.
A compromised unprivileged LXC Container gaining root Access on the Proxmox VE Host seems quite unlikely to me right now.
However, a compromised unprivileged LXC Container gaining root Access to another unprivileged LXC Container would seem somewhat easier, particularly because by default all LXC Containers run as the same Host User (i.e. they share the same UID/GID Mapping).
a. Mounted Files / Folders
On my System, it seems that all LXC Containers have this ownership:
- User: 100000
- Group: 100000 (lxc_shares)
Doesn't this pose a particular Risk? Can we be sure that a User "escaping" what is essentially a chroot cannot "jump" into another Container?
I am not sure about the Practicalities (particularly if the Mounts are read-only and/or use different Mountpoints), but IIRC Programs like rsync do NOT follow symlinks by default, and this kind of Risk was one of the Reasons why. Couldn't something similar occur in this Scenario?
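To make the Concern concrete, here is a minimal Python sketch of the Mapping as I understand it. It assumes the usual default for unprivileged Containers (Container IDs 0..65535 mapped to Host IDs 100000..165535); the function name host_id is purely illustrative and not part of any Tool.

```python
# Assumed default idmap for unprivileged Containers (an assumption, check
# your own /etc/subuid and Container config): Container IDs 0..65535 are
# mapped to Host IDs 100000..165535.
DEFAULT_BASE = 100_000
DEFAULT_RANGE = 65_536


def host_id(container_id: int, base: int = DEFAULT_BASE,
            size: int = DEFAULT_RANGE) -> int:
    """Translate an ID inside the Container to the ID seen on the Host."""
    if not 0 <= container_id < size:
        raise ValueError(f"ID {container_id} is outside the mapped range")
    return base + container_id


# Two different Containers that both use the same (default) mapping:
root_in_container_a = host_id(0)
root_in_container_b = host_id(0)

# Both resolve to the same Host ID, so from the Host's point of view
# the Files of both Containers belong to the same Owner.
print(root_in_container_a, root_in_container_b,
      root_in_container_a == root_in_container_b)
```

If this Model is right, plain File Permissions on the Host cannot tell Container A's root apart from Container B's root; only the Namespaces and Mount Isolation keep them separate.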
b. Shared /dev/net Mountpoint
This is required for e.g. running Podman Rootless within an LXC Unprivileged Container.
Since all Containers share the /dev/net Mountpoint, couldn't a compromised unprivileged LXC Container escape this Way and affect the other Containers?
How to solve
As I see it, the fix is relatively trivial: just use a separate UID/GID Block for each LXC Container and call it a Day. I do this already for Podman within LXC, but in that Case I only remap the podman User within the LXC Container to e.g. a forgejo User on the Host. For the next LXC Container I choose a different Host Username & UID/GID to remap to.
But I guess that the same Concept should be extended to the whole UIDMap / GIDMap from Host -> LXC Container. I am not sure what the practical Limits are, but couldn't we remap an entire Block of 200000 IDs for each Container?
Since subuid / subgid Values are most likely limited to the 32-bit Range 0-4,294,967,295, even when assigning 200,000 subuid / subgid Values to each LXC Container, you would need to run more than 21,474 LXC Containers (more than twenty thousand!) on a single Host before running out.
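A rough Python sketch of the Idea, to check the Arithmetic and show what per-Container Blocks could look like. The Block Size of 200000 is the Number from above; the starting Offset of 100000, the 65536 IDs mapped inside each Container, and all function names are my own Assumptions for Illustration. The generated Lines follow the lxc.idmap and /etc/subuid Syntax as I understand it, so treat them as a Draft to verify, not as a tested Configuration.

```python
UID_SPACE = 2**32        # 32-bit ID space: 0..4294967295 (upper bound only)
BLOCK_SIZE = 200_000     # per-Container block size proposed in the text
FIRST_BASE = 100_000     # assumption: start where the current default starts


def block_for(index: int, block_size: int = BLOCK_SIZE,
              first_base: int = FIRST_BASE) -> tuple[int, int]:
    """Return (host_base, size) of the disjoint block for Container #index."""
    base = first_base + index * block_size
    if index < 0 or base + block_size > UID_SPACE:
        raise ValueError(f"no block available for index {index}")
    return base, block_size


def idmap_lines(index: int, ids_in_container: int = 65_536) -> list[str]:
    """Draft Container config lines mapping IDs 0..N-1 into the block."""
    base, size = block_for(index)
    count = min(ids_in_container, size)
    return [f"lxc.idmap: u 0 {base} {count}",
            f"lxc.idmap: g 0 {base} {count}"]


def subid_line(index: int, user: str = "root") -> str:
    """Draft /etc/subuid (or /etc/subgid) entry delegating the block."""
    base, size = block_for(index)
    return f"{user}:{base}:{size}"


def max_containers(block_size: int = BLOCK_SIZE, first_base: int = 0) -> int:
    """How many whole blocks fit into the 32-bit ID space."""
    return (UID_SPACE - first_base) // block_size


for i in range(3):
    print(subid_line(i), idmap_lines(i))
print("blocks that fit:", max_containers())
```

With these Assumptions the first three Containers would get the Host Ranges starting at 100000, 300000 and 500000, so no Host ID is shared between them, and 21474 whole Blocks fit into the 32-bit Space.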
Why is this not the default though? Wouldn't this achieve better Security by further isolating the Containers from each other?