Bootdisk size 0 B on LXC

dmemsm

New Member
Jul 18, 2024
Hello, on one of our nodes we ran into a problem with several LXC containers. Proxmox shows a bootdisk size of 0 B for them:
[screenshot: Proxmox GUI showing Bootdisk size 0 B for the container]
We have not been able to spot any pattern that causes this problem. We can't stop such containers, since shutdown and stop tasks for them never complete. We also tried to kill one container by killing all processes associated with it. That stopped the container, but now we can't do anything else with it, not even destroy it, because destroying the CT fails with a ZFS error: "cannot destroy 'rpool/data/subvol-855-disk-0': dataset is busy". We use ZFS as the storage for container disks on this node. Countless attempts to figure out which process is using this dataset, so we could kill it, have brought no success. What can cause such a problem and how can it be solved?
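For anyone wanting to check the same thing: below is a minimal sketch of one way to look for processes holding the dataset's mountpoint open. It assumes the dataset is mounted at its default location under /rpool/data and that fuser (from psmisc) is installed; a reference held inside the kernel may still not show up here.

```python
#!/usr/bin/env python3
"""Minimal sketch: list processes keeping a ZFS dataset's mountpoint busy."""
import subprocess
import sys

# Assumption: default mountpoint for the subvol mentioned in the error message.
MOUNTPOINT = "/rpool/data/subvol-855-disk-0"

def main() -> None:
    # `fuser -vm` prints every process that has files open on the filesystem
    # backing MOUNTPOINT; a D-state process listed here is the likely culprit.
    result = subprocess.run(
        ["fuser", "-vm", MOUNTPOINT],
        capture_output=True, text=True,
    )
    # fuser writes its table to stderr and returns 1 when nothing is found.
    output = (result.stdout + result.stderr).strip()
    print(output or f"No userspace process is holding {MOUNTPOINT} open.")
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```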
 
A couple of days ago we got another such container; there are already about five of them, and we have no idea what to do with them or how to avoid this problem in the future. Does anyone have any ideas on how to destroy the broken containers and find out what is causing this bug?
 
In an attempt to solve the problem, I found a zfs destroy process for this dataset that was launched on July 12 (more than a month ago) and has still not completed. Trying to terminate it with kill does not help. Would it be safe to try kill -9, or are there other ways? I suspect this hung process may be the cause of the 0 B disks in the containers, but even if not, such a process can hardly be considered normal and something needs to be done about it.
[screenshot: the hung zfs destroy process]
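For reference, a minimal sketch of how D-state processes can be listed on the host by walking /proc (no assumptions beyond standard procfs); a hung zfs destroy like the one above should show up in this list:

```python
#!/usr/bin/env python3
"""Minimal sketch: find processes stuck in D (uninterruptible sleep) state."""
from pathlib import Path

def dstate_processes():
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            # The field after "(comm)" in /proc/<pid>/stat is the state letter.
            state = (proc / "stat").read_text().rsplit(")", 1)[1].split()[0]
            if state != "D":
                continue
            cmdline = (proc / "cmdline").read_bytes().replace(b"\0", b" ").decode().strip()
            yield int(proc.name), cmdline or "[kernel thread]"
        except (OSError, IndexError):
            continue  # process exited while we were reading it

if __name__ == "__main__":
    for pid, cmd in dstate_processes():
        print(f"{pid:>7}  {cmd}")
```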
 
Processes in D state are uninterruptible and therefore unkillable (but a reboot would remove them). I don't have enough experience with this to really help you, sorry.
 
Oh, yes, thanks for the information, I had missed that detail. We will plan to reboot this node in the near future. I hope this helps in some way, but to be honest, I doubt it.

Such containers have also started to appear more often: the first one about a month ago, 1-2 more last week and 2-3 this week. We were also able to determine exactly when the problem occurs: when trying to shut down or restart the container. We have not seen zero-size disks on containers that had no shutdown/reboot tasks. But shutting down or rebooting a container doesn't mean the problem will definitely occur, because on the same node there are also successful reboots of containers with, at first glance, the same configuration. So we can't even predict which container the problem will hit next time; it looks completely random. If anyone has any ideas about what can be done, I will be very happy to hear them, because I have absolutely no idea how to solve this.
 
Hello,


we’d like to share an update: after a lot of digging we finally managed to find the immediate cause and a workaround for this issue.


The containers that got stuck with bootdisk size 0B and could not be stopped/destroyed all had a broken /etc/machine-id inside the CT rootfs. Instead of the expected 0 bytes (empty file) or 33 bytes (valid UUID + newline), we found many cases where the file was 14 bytes long with the literal text:


uninitialized


When such a CT was stopped or rebooted, systemd inside would try to run systemd-machine-id-setup --commit, which went into D state (uninterruptible I/O sleep) and never returned. That single stuck process kept the ZFS dataset busy forever, which explains why stop/destroy/snapshot all hung and why the Proxmox GUI showed the rootfs as “0 B”.
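As an illustration, here is a minimal sketch that scans mounted CT rootfs datasets for this broken machine-id. It assumes the default ZFS layout with subvols mounted under /rpool/data (as in the error message above); adjust the glob for other storages:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag containers with the broken 14-byte machine-id."""
import glob

# Assumption: ZFS-backed CT rootfs subvols mounted under /rpool/data.
for rootfs in sorted(glob.glob("/rpool/data/subvol-*-disk-0")):
    path = f"{rootfs}/etc/machine-id"
    try:
        data = open(path, "rb").read()
    except OSError:
        continue  # dataset not mounted or file missing
    if data.strip() == b"uninitialized":
        print(f"BROKEN  {path} ({len(data)} bytes)")
    else:
        print(f"ok      {path} ({len(data)} bytes)")
```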


What we did


  • We confirmed by tracing processes that the D-state always belonged to systemd-machine-id-setup --commit.
  • Fixing /etc/machine-id (truncate to 0 bytes or, better, regenerate a valid 33-byte ID) immediately prevents the bug.
  • We added a hookscript in Proxmox (pre-start) that checks the container rootfs before start and rewrites /etc/machine-id if it’s 14 bytes, replacing it with a valid ID. It also links /var/lib/dbus/machine-id back to /etc/machine-id. (A sketch of such a hookscript follows this list.)
  • Since then, containers start/stop cleanly, and no new “0 B” cases appear.
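
For illustration, a minimal sketch of such a pre-start hookscript (not our exact script). It assumes the CT rootfs is a ZFS subvol mounted at /rpool/data/subvol-<vmid>-disk-0 and that the script is stored as a snippet and registered with pct set <vmid> --hookscript local:snippets/fix-machine-id.py:

```python
#!/usr/bin/env python3
"""Minimal sketch of a Proxmox pre-start hookscript that repairs a broken
/etc/machine-id before the container starts.

Proxmox calls hookscripts with two arguments: the vmid and the phase.
"""
import os
import sys
import uuid

# Assumption: default mountpoint for ZFS-backed CT rootfs; adjust as needed.
ROOTFS_TEMPLATE = "/rpool/data/subvol-{vmid}-disk-0"

def fix_machine_id(rootfs: str) -> None:
    mid = os.path.join(rootfs, "etc/machine-id")
    try:
        data = open(mid, "rb").read()
    except OSError:
        return
    # Only rewrite the known-bad 14-byte "uninitialized\n" file.
    if data.strip() != b"uninitialized":
        return
    with open(mid, "w") as fh:
        fh.write(uuid.uuid4().hex + "\n")  # 32 hex chars + newline = 33 bytes
    # Point the D-Bus copy at /etc/machine-id so the two cannot diverge.
    dbus_dir = os.path.join(rootfs, "var/lib/dbus")
    if os.path.isdir(dbus_dir):
        dbus_id = os.path.join(dbus_dir, "machine-id")
        if os.path.lexists(dbus_id):
            os.remove(dbus_id)
        os.symlink("/etc/machine-id", dbus_id)

if __name__ == "__main__":
    vmid, phase = sys.argv[1], sys.argv[2]
    if phase == "pre-start":
        fix_machine_id(ROOTFS_TEMPLATE.format(vmid=vmid))
    sys.exit(0)
```

The sketch deliberately rewrites only the known-bad "uninitialized" value, so containers with a valid or intentionally empty machine-id are left untouched.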

Open question


We still don’t know why this happens in the first place. Something (maybe related to Ubuntu 24.04 template + LXC + nesting/Docker) sometimes writes uninitialized\n into /etc/machine-id. We cannot reproduce it reliably, only observe it randomly after reboots/stops.

So at this point:
  • Cause: broken 14-byte machine-id leads to D-state systemd process.
  • Workaround: fix machine-id via hookscript before start.
  • Unknown: what exactly causes /etc/machine-id to become “uninitialized” inside these CTs.

If anyone from Proxmox team or the community has ideas about the root cause (template bug? systemd quirk? LXC integration?), we’d be very interested.


Hope this helps others who hit the same “0 B disk” + “dataset is busy” problem.
 