HA Cluster - Debian LXC - Docker not starting on 1 Host

cr4sh0verride

New Member
Nov 20, 2024
I have an interesting issue that I haven't been able to get to the bottom of.

I have a cluster of 3 nodes and have configured a Debian LXC container running Docker.
When I set it up, I tested failover across all 3 nodes and it worked fine on each.

Now, though, there is one node it refuses to start on, and the error relates to starting the Docker network bridge.
I haven't changed any part of that node's configuration since I tested it working, and all the nodes are set up identically in terms of config.

When I start it up with Docker debug logging enabled, this is where the error occurs (full log below).
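For anyone trying to reproduce the capture: one way to get output like the log below is to stop the service and run the daemon in the foreground with debug logging (a rough sketch assuming the standard docker-ce systemd units; not necessarily exactly how it was captured here):

systemctl stop docker.service docker.socket
dockerd --debug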

Given that nothing has changed, the issue has to be with that node and somewhere in Proxmox, because I can fail the container back to the other 2 nodes and it continues to start fine.
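For reference, failing the container back is just an HA migration, roughly like the following (ct:101 and pve2 are placeholders for the actual HA resource ID and target node, not the real values):

ha-manager migrate ct:101 pve2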

Any help from anyone who has experienced the same would be great.

WARN[2024-11-21T09:26:52.285760907+11:00] Could not find endpoint count key docker/network/v1.0/endpoint_count/*redacted*/ for network bridge while listing: Key not found in store
INFO[2024-11-21T09:26:52.285956061+11:00] stopping healthcheck following graceful shutdown module=libcontainerd
INFO[2024-11-21T09:26:52.285969031+11:00] stopping event stream following graceful shutdown error="context canceled" module=libcontainerd namespace=moby
INFO[2024-11-21T09:26:52.285972116+11:00] stopping event stream following graceful shutdown error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2024-11-21T09:26:52.286152415+11:00] received signal signal=terminated
DEBU[2024-11-21T09:26:52.286204735+11:00] sd notification notified=false state="STOPPING=1"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1c pc=0x5b583c1e65d5]

goroutine 1 [running, locked to thread]:
github.com/docker/docker/libnetwork.(*endpointCnt).EndpointCnt(0xc000b073b0?)
/root/build-deb/engine/libnetwork/endpoint_cnt.go:102 +0x35
github.com/docker/docker/libnetwork.(*Network).delete(0x5b583c1d7700?, 0x0, 0x0)
/root/build-deb/engine/libnetwork/network.go:988 +0x28a
github.com/docker/docker/libnetwork.(*Network).Delete(0xc000029d40, {0x0, 0x0, 0x8?})
/root/build-deb/engine/libnetwork/network.go:952 +0x6b
github.com/docker/docker/daemon.configureNetworking(0xc0003bf5e0, 0xc000330688)
/root/build-deb/engine/daemon/daemon_unix.go:876 +0x1e5
github.com/docker/docker/daemon.(*Daemon).initNetworkController(0xc000138288, 0xc000330688, 0xc0008bf2f0)
/root/build-deb/engine/daemon/daemon_unix.go:850 +0x131
github.com/docker/docker/daemon.(*Daemon).restore(0xc000138288, 0xc000330688)
/root/build-deb/engine/daemon/daemon.go:581 +0x67b
github.com/docker/docker/daemon.NewDaemon({0x5b583d87e998, 0xc00001d810}, 0xc000149608, 0xc00046d6b0, 0xc000053560)
/root/build-deb/engine/daemon/daemon.go:1246 +0x393a
main.(*DaemonCli).start(0xc0004bf640, 0xc0004acf00)
/root/build-deb/engine/cmd/dockerd/daemon.go:260 +0xe09
main.runDaemon(...)
/root/build-deb/engine/cmd/dockerd/docker_unix.go:13
main.newDaemonCommand.func1(0xc0004c2500?, {0xc0004a6eb0?, 0x7?, 0x5b583cc3f438?})
/root/build-deb/engine/cmd/dockerd/docker.go:37 +0x94
github.com/spf13/cobra.(*Command).execute(0xc000545b08, {0xc000052070, 0x1, 0x1})
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:985 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0xc000545b08)
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
/root/build-deb/engine/vendor/github.com/spf13/cobra/command.go:1041
main.main()
/root/build-deb/engine/cmd/dockerd/docker.go:106 +0x17b
 
Well, as I was collecting data for this post, it's now broken on all 3 nodes.

I restored from a backup on the primary node, though, and it's working again there.
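For completeness, one way to do that restore from the CLI is roughly the following (the VMID, archive name, and storage are placeholders, not the real ones; the restore wasn't necessarily done this exact way):

pct restore 101 /var/lib/vz/dump/vzdump-lxc-101-2024_11_20-00_00_00.tar.zst --storage local-lvm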

So I now assume something in the migration is causing the issue.
 
I fully removed Docker and all its components and reinstalled.

It's now working again across all the nodes.

I have a feeling it had something to do with replication being turned on.
Maybe it's a bug in replication for LXC containers.
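If anyone wants to check the same thing on their node, the configured replication jobs and their last run status can be listed with pvesr (a read-only check):

pvesr status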

The issue was that containerd.sock was missing, so the daemon wouldn't start properly.
Something in the config files got corrupted, because a remove/reinstall of Docker without removing the config folders would produce the same issue.
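In case it helps anyone else, the checks and the clean reinstall went roughly like this (a sketch assuming the docker-ce packages from Docker's apt repository; the purge and rm wipe all container data and config, so back up anything you need first):

# check whether containerd is running and its socket exists
systemctl status containerd
ls -l /run/containerd/containerd.sock

# full removal, including config and state, then reinstall per Docker's Debian install docs
systemctl stop docker containerd
apt-get purge -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
rm -rf /var/lib/docker /var/lib/containerd /etc/docker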
 
