HA Cluster - Debian LXC - Docker not starting on 1 Host

cr4sh0verride

New Member
Nov 20, 2024
I have an interesting issue that I haven't been able to get to the bottom of.

I have a cluster of 3 nodes and have configured a Debian LXC container running Docker.
When I set it up, I tested failover across all 3 nodes and it worked fine on each.

Now, though, there is one node it refuses to start on, and the error relates to starting the Docker network bridge.
I haven't changed any part of that node's configuration since I tested it working, and all the nodes are set up identically in terms of config.

When I start it up with Docker debug logging enabled, this is where the error occurs (full log below).
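For anyone trying to reproduce the capture: one way to get output like the log below is to stop the service and run the daemon in the foreground with debug logging (a rough sketch assuming the standard docker-ce systemd units; not necessarily exactly how it was captured here):

systemctl stop docker.service docker.socket
dockerd --debug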

Given that nothing has changed, the issue has to be with that node and somewhere in Proxmox, because I can fail the container back to the other 2 nodes and it continues to start fine.
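For reference, failing the container back is just an HA migration, roughly like the following (ct:101 and pve2 are placeholders for the actual HA resource ID and target node, not the real values):

ha-manager migrate ct:101 pve2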

Any help from anyone who has experienced the same would be great.

WARN[2024-11-21T09:26:52.285760907+11:00] Could not find endpoint count key docker/network/v1.0/endpoint_count/*redacted*/ for network bridge while listing: Key not found in store
INFO[2024-11-21T09:26:52.285956061+11:00] stopping healthcheck following graceful shutdown module=libcontainerd
INFO[2024-11-21T09:26:52.285969031+11:00] stopping event stream following graceful shutdown error="context canceled" module=libcontainerd namespace=moby
INFO[2024-11-21T09:26:52.285972116+11:00] stopping event stream following graceful shutdown error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2024-11-21T09:26:52.286152415+11:00] received signal signal=terminated
DEBU[2024-11-21T09:26:52.286204735+11:00] sd notification notified=false state="STOPPING=1"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1c pc=0x5b583c1e65d5]

goroutine 1 [running, locked to thread]:
github.com/docker/docker/libnetwork.(*endpointCnt).EndpointCnt(0xc000b073b0?)
/root/build-deb/engine/libnetwork/endpoint_cnt.go:102 +0x35
github.com/docker/docker/libnetwork.(*Network).delete(0x5b583c1d7700?, 0x0, 0x0)
/root/build-deb/engine/libnetwork/network.go:988 +0x28a
github.com/docker/docker/libnetwork.(*Network).Delete(0xc000029d40, {0x0, 0x0, 0x8?})
/root/build-deb/engine/libnetwork/network.go:952 +0x6b
github.com/docker/docker/daemon.configureNetworking(0xc0003bf5e0, 0xc000330688)
/root/build-deb/engine/daemon/daemon_unix.go:876 +0x1e5
github.com/docker/docker/daemon.(*Daemon).initNetworkController(0xc000138288, 0xc000330688, 0xc0008bf2f0)
/root/build-deb/engine/daemon/daemon_unix.go:850 +0x131
github.com/docker/docker/daemon.(*Daemon).restore(0xc000138288, 0xc000330688)
/root/build-deb/engine/daemon/daemon.go:581 +0x67b
github.com/docker/docker/daemon.NewDaemon({0x5b583d87e998, 0xc00001d810}, 0xc000149608, 0xc00046d6b0, 0xc000053560)
/root/build-deb/engine/daemon/daemon.go:1246 +0x393a
main.(*DaemonCli).start(0xc0004bf640, 0xc0004acf00)
/root/build-deb/engine/cmd/dockerd/daemon.go:260 +0xe09
main.runDaemon(...)
/root/build-deb/engine/cmd/dockerd/docker_unix.go:13
main.newDaemonCommand.func1(0xc0004c2500?, {0xc0004a6eb0?, 0x7?, 0x5b583cc3f438?})
/root/build-deb/engine/cmd/dockerd/docker.go:37 +0x94
github.com/spf13/cobra.(*Command).execute(0xc000545b08, {0xc000052070, 0x1, 0x1})
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:985 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0xc000545b08)
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:1041
main.main()
/root/build-deb/engine/cmd/dockerd/docker.go:106 +0x17b
 
Well, as I was collecting data for this post, it's now broken on all 3 nodes.

I restored from a backup on the primary node, though, and it's working again there.
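For completeness, one way to do that restore from the CLI is roughly the following (the VMID, archive name, and storage are placeholders, not the real ones; the restore wasn't necessarily done this exact way):

pct restore 101 /var/lib/vz/dump/vzdump-lxc-101-2024_11_20-00_00_00.tar.zst --storage local-lvm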

So I now assume something in the migration is causing the issue.
 
I fully removed Docker and all its components and reinstalled.

It's now working again across all the nodes.

I have a feeling it had something to do with replication being turned on.
Maybe it's a bug in replication for LXC containers.
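If anyone wants to check the same thing on their node, the configured replication jobs and their last run status can be listed with pvesr (a read-only check):

pvesr status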

The issue was that containerd.sock was missing, so the daemon wouldn't start properly.
Something in the config files got corrupted, because a remove/reinstall of Docker without removing the config folders would produce the same issue.
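In case it helps anyone else, the checks and the clean reinstall went roughly like this (a sketch assuming the docker-ce packages from Docker's apt repository; the purge and rm wipe all container data and config, so back up anything you need first):

# check whether containerd is running and its socket exists
systemctl status containerd
ls -l /run/containerd/containerd.sock

# full removal, including config and state, then reinstall per Docker's Debian install docs
systemctl stop docker containerd
apt-get purge -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
rm -rf /var/lib/docker /var/lib/containerd /etc/docker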
 
