systemd gets killed in a running container every couple days, requires host restart

wpowiertowski · Feb 21, 2019

Hi, I'm experiencing a problem with one of the containers on my system. The container runs fine for a couple of days, then systemd suddenly receives a SIGRTMIN+3 signal and the container is left in a zombie state with only /sbin/init running. Stopping and restarting the container does not help; only a restart of the host system seems to clear the state. All other containers on the system are working fine, so this is isolated to this one instance.

Running processes when the container gets into this state:

Code:

root@LXC-TimescaleDB:~# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  45816  4416 ?        Ss   19:20   0:00 /sbin/init
root        23  0.0  0.0   6268  2508 ?        Ss   19:20   0:00 /bin/bash
root        39  0.0  0.0  15984  1840 ?        R+   19:24   0:00 ps aux

syslog (I don't have a GUI in the container, so I'm not sure what the last line is about):

Code:

root@LXC-TimescaleDB:~# cat /var/log/syslog
Feb 20 00:00:01 LXC-TimescaleDB systemd[1]: Started Rotate log files.
Feb 20 00:05:01 LXC-TimescaleDB CRON[17232]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 00:08:01 LXC-TimescaleDB CRON[17241]: (root) CMD (   test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond)
Feb 20 00:15:01 LXC-TimescaleDB CRON[17259]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 00:25:01 LXC-TimescaleDB CRON[17283]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 00:35:01 LXC-TimescaleDB CRON[17307]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 00:45:01 LXC-TimescaleDB CRON[17331]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 00:54:01 LXC-TimescaleDB CRON[17353]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 20 00:55:01 LXC-TimescaleDB CRON[17358]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 01:05:01 LXC-TimescaleDB CRON[17383]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 01:15:01 LXC-TimescaleDB CRON[17407]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 20 01:21:07 LXC-TimescaleDB systemd[1]: Received SIGRTMIN+3.
Feb 20 01:21:07 LXC-TimescaleDB systemd[1]: Stopped target Graphical Interface.
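
To find every occurrence of this event in a longer log, a grep for systemd's signal message should be enough. This is only a rough sketch: `find_rt_signals` is a made-up helper name, and it assumes the plain-text syslog format shown above with /var/log/syslog as the default path.

```shell
# Hypothetical helper: list the lines where systemd (PID 1) logged a real-time signal.
# Assumes the plain-text syslog format shown above; the default path is an assumption.
find_rt_signals() {
    grep -E 'systemd\[1\]: Received SIGRTMIN\+[0-9]+' "${1:-/var/log/syslog}"
}
```

Calling `find_rt_signals /var/log/syslog` prints the matching lines, and the timestamps can then be compared against whatever was scheduled on the host at that moment.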

host setup:

Code:

root@bear:~# pveversion --verbose
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-44
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

t.lamprecht · Feb 28, 2019

Hi!
What is happening in the PVE host syslog/journal during that time? It seems like it gets an external halt, but the "fails to start afterwards if not rebooted" part sounds really strange...

What distro and daemons are running in the CT? From the name I assume TimescaleDB.
The CT config would also be nice to see.
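
For pulling the relevant window out of the host's syslog, something like the following could work. It is a rough sketch, not a tested procedure: `log_window` is a hypothetical helper, and it assumes traditional syslog timestamps ("Feb 20 01:21:07 host ...") like those in the container log above. For context, systemd(1) documents SIGRTMIN+3 as the signal that starts halt.target, which fits the external-halt reading.

```shell
# Hypothetical helper: print syslog lines for one day between two HH:MM:SS times.
# Usage: log_window FILE MONTH DAY FROM TO
# Assumes traditional syslog timestamps ("Feb 20 01:21:07 host ..."), as in the log above.
log_window() {
    awk -v mon="$2" -v day="$3" -v from="$4" -v to="$5" \
        '$1 == mon && $2 + 0 == day + 0 && $3 >= from && $3 <= to' "$1"
}
```

For example, `log_window /var/log/syslog Feb 20 01:15:00 01:30:00` on the host would show what ran around the time the container received the signal. On a host with a persistent journal, `journalctl --since "2019-02-20 01:15" --until "2019-02-20 01:30"` covers the same ground.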

wpowiertowski · Mar 1, 2019

It seems to be happening during the backup procedure, which is configured to stop containers. However, out of 6 containers running on the system, only this one doesn't come back up. All containers are built using Ubuntu 18.10 templates.

I'll get you the list of running daemons over the weekend, but it's essentially a bare container with PostgreSQL running. The container config is:

Code:

# cat /etc/pve/lxc/102.conf
arch: amd64
cores: 2
hostname: LXC-TimescaleDB
memory: 4096
net0: name=eth0,bridge=vmbr0,gw=10.0.1.1,hwaddr=EA:61:79:75:9E:1A,ip=10.0.1.74/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: tank:subvol-102-disk-0,size=100G
swap: 512
unprivileged: 1

mailinglists · Mar 1, 2019

Does dmesg -T show anything around that time?
Does manual stop/shutdown and start work?

wpowiertowski · Mar 9, 2019

Sorry for the late reply - I didn't have time to look into the issue lately.
So to start with, these are the processes that are normally running on the system:

Code:

root@LXC-TimescaleDB:~# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  46860  6604 ?        Ss   09:17   0:00 /sbin/init
root        50  0.0  0.2  54488 11024 ?        Ss   09:17   0:00 /lib/systemd/systemd-journald
root        69  0.0  0.0  21844  2388 ?        Ss   09:17   0:00 /lib/systemd/systemd-udevd
systemd+    85  0.0  0.0  39632  3760 ?        Ss   09:17   0:00 /lib/systemd/systemd-networkd
systemd+   109  0.0  0.1  54264  5012 ?        Ss   09:17   0:00 /lib/systemd/systemd-resolved
root       140  0.0  0.3  42700 14416 ?        Ss   09:17   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
syslog     142  0.0  0.0 170936  2960 ?        Ssl  09:17   0:00 /usr/sbin/rsyslogd -n
root       144  0.0  0.0  11796  1700 ?        Ss   09:17   0:00 /usr/sbin/cron -f
avahi      145  0.0  0.0  18708  2884 ?        Ss   09:17   0:00 avahi-daemon: running [LXC-TimescaleDB.local]
root       146  0.0  0.1  38148  4332 ?        Ss   09:17   0:00 /lib/systemd/systemd-logind
message+   147  0.0  0.0  21524  3084 ?        Ss   09:17   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root       148  0.0  0.1 245808  4516 ?        Ssl  09:17   0:00 /usr/lib/accountsservice/accounts-daemon
avahi      149  0.0  0.0  18504   320 ?        S    09:17   0:00 avahi-daemon: chroot helper
postgres   161  0.0  0.5 264336 22928 ?        S    09:17   0:00 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main -c config_file=/etc/postgresql/10/main/postgresql.conf
postgres   164  0.0  0.0 264336  3920 ?        Ss   09:17   0:00 postgres: 10/main: checkpointer process   
postgres   165  0.0  0.1 264336  5176 ?        Ss   09:17   0:00 postgres: 10/main: writer process   
postgres   166  0.0  0.2 264336  9284 ?        Ss   09:17   0:00 postgres: 10/main: wal writer process   
postgres   167  0.0  0.1 264764  6100 ?        Ss   09:17   0:00 postgres: 10/main: autovacuum launcher process   
postgres   168  0.0  0.1 120308  5484 ?        Ss   09:17   0:00 postgres: 10/main: stats collector process 
postgres   169  0.0  0.1 264636  6104 ?        Ss   09:17   0:00 postgres: 10/main: bgworker: TimescaleDB Background Worker Launcher   
postgres   170  0.0  0.1 264636  6104 ?        Ss   09:17   0:00 postgres: 10/main: bgworker: logical replication launcher   
postgres   172  0.0  0.3 266100 13216 ?        Ss   09:17   0:00 postgres: 10/main: bgworker: TimescaleDB Background Worker Scheduler   
root       173  0.0  0.3 121048 16704 ?        Ssl  09:17   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root       179  0.0  0.0   2592  1240 pts/1    Ss+  09:17   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux
root       180  0.0  0.0   2592  1232 pts/0    Ss+  09:17   0:00 /sbin/agetty -o -p -- \u --noclear --keep-baud tty1 115200,38400,9600 linux
root       346  0.0  0.0  40632  2584 ?        Ss   09:17   0:00 /usr/lib/postfix/sbin/master -w
postfix    347  0.0  0.0  40972  3636 ?        S    09:17   0:00 pickup -l -t unix -u -c
postfix    348  0.0  0.0  41020  3748 ?        S    09:17   0:00 qmgr -l -t unix -u

Running "pct stop <id>" and "pct start <id>" works normally. I've also changed the backup mode from stop to snapshot, and that seems to be working fine: the container no longer ends up in the zombie state.
