Unexpected 'nobody' filesystem privileges in LXC container

gstrong

New Member
Sep 24, 2023
Hey everyone, I'm having filesystem permission issues with LXC in Proxmox VE
8.3.3, which I know has been discussed to death. I've read a lot of the threads,
tried a bunch of the suggested solutions, and I'm still facing the same issue.

Scenario

I have a Proxmox VE 8.3.3 machine, and I've largely been utilizing VMs, but LXC
containers present a lot of benefits in terms of resource efficiency and I'd
like to convert my services over. I'm starting with a container running CoreDNS
with an Alpine 3.20 base, but this is a problem that affects all LXC containers
regardless of the workload. Following the principle of least privilege, I am
defaulting to unprivileged containers; I have no need for privileged containers.
I'm building the images with the following commands:

Code:
sudo distrobuilder build-dir <image-config> <output>
sudo distrobuilder pack-lxc <image-config> <output>

I have a CIFS storage pool configured on my Proxmox node that holds the
container images, as well as a disk on the machine that's configured as lvm-thin
storage. The latter contains the container disk volume.

I am creating the container with:

Code:
pct create $_containerId $cifsStorageName:vztmpl/$_imgName --rootfs $_localStorageName:1 --force 1

CoreDNS is installed into the image, and during this installation it creates
a `coredns` user, which has a uid:gid of 100:101. The openrc service is
configured to write, by default, to `/var/log/coredns`.

Container Fundamentals

Let's talk about the core Linux primitives that provide container functionality,
particularly security and user namespace mapping.

User namespace mapping protects a host system from rogue container workloads
by mapping container uid:gids to a non-privileged id space on the host:

Code:
root@rift:~# cat /etc/sub{uid,gid}
root:100000:65536
root:100000:65536

This means that when I start an unprivileged container with user namespace
mapping, a process running as root has uid:0 from the container's perspective.
On the host, that user is mapped to uid:100000, so files created by uid:0 inside
the container are owned by 100000 in the host filesystem. The same offset applies
to non-zero ids inside the container. In the CoreDNS case, I'd expect to see its
files owned by 100100:100101 on the host.
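
As a rough illustration of that arithmetic, here is a minimal sketch assuming the single range from /etc/subuid above (base 100000, count 65536). The function name container_to_host_id is made up for this example; it is not part of LXC or Proxmox.

```shell
#!/bin/sh
# Translate a container-side id to the host-side id for one idmap range.
# Arguments: container id, host base, range size (hypothetical helper).
container_to_host_id() {
    cid=$1; base=$2; count=$3
    # Ids outside 0..count-1 have no host counterpart in this range.
    if [ "$cid" -lt 0 ] || [ "$cid" -ge "$count" ]; then
        echo "unmapped"
        return 1
    fi
    echo $((base + cid))
}

container_to_host_id 0 100000 65536     # container root
container_to_host_id 100 100000 65536   # coredns uid
container_to_host_id 101 100000 65536   # coredns gid
```

With the values from this post, the three calls print 100000, 100100 and 100101.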

The key point is that even if you're running everything as root inside the
containers, the host is protected: none of the operations are actually
performed as uid:0 on the host, because they all use this non-privileged
range.

Expected Behavior

When I run this container, I'm expecting CoreDNS to log its output to
`/var/log/coredns`. Furthermore, I'm expecting the rest of the filesystem to be
owned by `root, uid:0,gid:0`, and the CoreDNS-specific parts of the filesystem
to be owned by `coredns, uid:100,gid:101`.

When an unprivileged container starts:
* Files owned by root (UID 0) in the container image should appear as root
inside the container.
* On the host, these same files are owned by UID 100000.
* This mapping is automatic and transparent to the container.

Actual Behavior

Instead, the whole filesystem is owned by 'nobody' (uid 65534), which breaks
every permission check.

Code:
/var/log # rc-service coredns start
 * /var/log/coredns: correcting mode
 * checkpath: chmod: Operation not permitted
 * ERROR: coredns failed to start

/var/log # ls -ln
total 4
drwxr-xr-x    2 65534    65534         4096 Feb  8 19:59 coredns

Inspecting the container's root filesystem from the host confirms that the
ownership baked into the image is what I expect it to be:

Code:
root@rift:/# pct mount 102
root@rift:/var/lib/lxc/102/rootfs/var/log# ls -lan
total 12
drwxr-xr-x  3   0   0 4096 Feb  8 14:59 .
drwxr-xr-x 12   0   0 4096 Jan  8 06:04 ..
drwxr-xr-x  2 100 101 4096 Feb  8 14:59 coredns

These are exactly the ids that I expect to be preserved when the container is run.
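
For reference, the kernel displays any host id that falls outside a container's mapped range as the overflow id, which defaults to 65534 and is usually named 'nobody'. The sketch below reverses the mapping under the assumption of a single range (base 100000, count 65536); host_to_container_id is a hypothetical helper, not an LXC or Proxmox command.

```shell
#!/bin/sh
# Translate a host-side id to what a container with one idmap range would see.
# Arguments: host id, host base, range size (hypothetical helper).
host_to_container_id() {
    hid=$1; base=$2; count=$3
    if [ "$hid" -ge "$base" ] && [ "$hid" -lt $((base + count)) ]; then
        echo $((hid - base))
    else
        # Unmapped ids show up as the kernel overflow id (default 65534).
        echo 65534
    fi
}

host_to_container_id 100 100000 65536      # unshifted on-disk coredns uid
host_to_container_id 100100 100000 65536   # shifted coredns uid
```

Under that assumption, an on-disk uid of 100 appears as 65534 inside the container, while 100100 appears as 100, which matches the two listings above.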

Suggested Solution

Edit the container definition to apply an explicit idmap:

Code:
lxc.idmap: u 0 100000 65536
lxc.idmap: g 0 100000 65536

Not only does this not work, but it is also a completely impractical solution.
What if I have 50 containers? Now I have to know the id range ahead of time
for each container, make sure that they don't overlap, and manage the space in
every single container definition. This doesn't work at scale and is highly
error-prone for something that ought to be handled by the runtime transparently.

Questions

* Most importantly, I need to fundamentally understand *why* this is happening.
* What is considered to be the best practice for managing this? Again, I find
the manual id space management suggestion to be totally unsatisfying; there
must be a broader reason why this is happening, because it *should* be handled
transparently, across all containers.
 
If you run "pct create" without explicitly setting "unprivileged", the resulting container will be privileged. Are you switching it to unprivileged after creation? That doesn't work, because you'd also need to convert the on-disk ownership (as you found out).
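
One way to check which mode a container was created in is to look for the `unprivileged: 1` line in its config. The sketch below assumes the usual config location /etc/pve/lxc/<id>.conf; the helper name is_unprivileged and the sample config contents are made up for illustration, and snapshot sections in the config are not handled.

```shell
#!/bin/sh
# Succeed if a Proxmox container config marks the container as unprivileged.
# Argument: path to the config, e.g. /etc/pve/lxc/102.conf (hypothetical helper).
is_unprivileged() {
    grep -Eq '^unprivileged:[[:space:]]*1[[:space:]]*$' "$1"
}

# Demo against a throwaway file so the sketch runs without a Proxmox host.
demo=$(mktemp)
printf 'arch: amd64\nhostname: coredns\nunprivileged: 1\n' > "$demo"
if is_unprivileged "$demo"; then
    echo "unprivileged"
else
    echo "privileged"
fi
rm -f "$demo"

# At creation time the mode is chosen with the --unprivileged option, e.g.:
#   pct create <vmid> <template> --rootfs <storage>:1 --unprivileged 1
```

On a real node the function would be pointed at the container's own config file instead of the demo file.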