[SOLVED] Alpine LXC containers don't shut down cleanly, init hangs? (It's a ZFS bug with zfs_txg_timeout; backport possible?)

deviantintegral

New Member
Dec 22, 2023
I figured this out while writing the post, so here's the story!

I noticed that LXC containers running Alpine don't shut down properly. All of the processes in the container terminate, but the init process won't exit and the shutdown task times out. Eventually the process exits and everything cleans up, but it extends host shutdown times by quite a bit.

I can replicate this using a fresh container from alpine-3.22-default_20250617_amd64.tar.xz:

Code:
arch: amd64
cores: 1
features: nesting=1
hostname: alpinetest
memory: 512
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=BC:24:11:22:F2:F9,ip=dhcp,ip6=dhcp,type=veth
ostype: alpine
rootfs: uberphoenix-vmdata:subvol-104-disk-0,size=8G
swap: 512
unprivileged: 1

When I run shutdown, the init process eventually shows up in the Ds state (D is "uninterruptible sleep", s marks a session leader):

Code:
100000    210569  0.0  0.0   1624   384 ?        Ds   20:23   0:00 /sbin/init
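If you'd rather scan for stuck processes than eyeball ps output, here is a minimal sketch that reads the state field from /proc/<pid>/stat. The function names are my own for illustration. Note that /proc only reports the single letter (D); the trailing s in "Ds" is a flag that ps adds.

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state D."""
    found = []
    for stat_path in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat_state(stat_path.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() on the host during a container shutdown should list the hung init if it is blocked the same way as above.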

Here's the strace output. At first I didn't see anything obviously wrong.

Code:
write(2, "\rThe system is going down NOW!\n", 31) = 31
kill(-1, SIGTERM)                       = -1 ESRCH (No such process)
write(2, "\rSent SIGTERM to all processes\n", 31) = 31
sync(


)                                  = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7fff5b3d9570) = 0
kill(-1, SIGKILL)                       = -1 ESRCH (No such process)
write(2, "\rSent SIGKILL to all processes\n", 31) = 31
sync(

Oh wait... is that hanging on sync? I then ran `sync` from the command line, and it also stalled.
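To put a number on the stall, something like the sketch below can time `sync` without blocking your shell forever. The helper name and the timeout value are just assumptions for illustration. It deliberately does not kill the child on timeout, since a process stuck in uninterruptible sleep won't react to SIGKILL until the syscall returns anyway.

```python
import subprocess
import time


def timed_run(cmd: list[str], timeout_s: float) -> tuple[bool, float]:
    """Run cmd and return (finished_in_time, seconds_waited)."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        # Leave the child running: killing it would not help while it
        # sits in uninterruptible sleep, and waiting for it would block us.
        finished = False
    return finished, time.monotonic() - start
```

For example, timed_run(["sync"], 5.0) coming back with finished == False would match what I saw on the command line.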

A few months ago, I set zfs_txg_timeout to 60 to reduce SSD wear. That led me to this Reddit post, which explains the issue exactly. The upstream bug is #14290.

It looks like it will be fixed in an upcoming ZFS 2.4 release. For now, I'll reduce my timeout to 10 or 20 seconds, which should help combine more writes without hitting this issue. Does anyone know what the policy is for backporting bug fixes like this to Proxmox's packages?
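For anyone wanting to check or change the value, here is a rough sketch. It assumes the usual ZFS-on-Linux sysfs location for module parameters and a modprobe.d options line for persistence; the function names are made up for this post, and writing the parameter needs root.

```python
from pathlib import Path

# Assumed standard location of the live module parameter on Linux.
TXG_TIMEOUT_PARAM = Path("/sys/module/zfs/parameters/zfs_txg_timeout")


def read_txg_timeout(param_path: Path = TXG_TIMEOUT_PARAM) -> int:
    """Return the current zfs_txg_timeout in seconds."""
    return int(param_path.read_text().strip())


def write_txg_timeout(seconds: int, param_path: Path = TXG_TIMEOUT_PARAM) -> None:
    """Set zfs_txg_timeout on the running system (requires root)."""
    if seconds < 1:
        raise ValueError("timeout must be at least 1 second")
    param_path.write_text(f"{seconds}\n")


def modprobe_options_line(seconds: int) -> str:
    """Build the line to place in a modprobe.d file so the value survives reboots."""
    if seconds < 1:
        raise ValueError("timeout must be at least 1 second")
    return f"options zfs zfs_txg_timeout={seconds}"
```

A change made through sysfs should apply right away but is lost on reboot. The line from modprobe_options_line(10) would typically go into a file such as /etc/modprobe.d/zfs.conf, and if your root filesystem is on ZFS you will probably need to refresh the initramfs afterwards.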