Journal feature prevents LX-Container from (re)starting. Conflict with rbd-mirror.

tpuetz · Mar 19, 2020

Hello everybody,

TL;DR: We are currently struggling with the journaling feature enabled on LX containers. On the one hand, it is mandatory for a one-way rbd-mirror sync into a pool from the master to the slave cluster; on the other hand, it prevents containers on the master from (re)starting after the disaster scenario is over.

System data:

Number of clusters: 2
Proxmox instances per cluster: 4
Ceph version: Nautilus (14)
Proxmox version: 6.1 with latest updates
OSD versions: All 14.2.6

Scenario:

From our master cluster we configured a one-way rbd-mirror into a dedicated pool on the slave cluster (following the official Proxmox documentation). For images that should be synced, the features "exclusive-lock" and "journaling" are mandatory. The sync runs without problems.

When we tested the disaster scenario at the end of the week, promoting the slave pool and starting the images as LXC instances succeeded without any problems. However, the promotion had to be run with the --force flag, since our scenario assumes that the master site simply breaks down and no demotion is possible beforehand.
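As a rough sketch, the forced promotion on the slave side looks like the following. The pool name is a placeholder rather than our real one, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: prints the commands instead of executing them.
# Set DRY_RUN=0 to actually run them on a node of the slave cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder pool name (assumption), replace with the mirrored pool.
POOL="${POOL:-mirror-pool}"

promote_slave_pool() {
    # --force is needed because the master cannot be demoted first.
    run rbd mirror pool promote --force "$POOL"
    # Show the mirroring state of the pool and its images afterwards.
    run rbd mirror pool status "$POOL" --verbose
}

promote_slave_pool
```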

When we then reversed the disaster scenario, i.e. stopped the slave LXC instances and simply wanted to put the master cluster back into operation, we noticed that the journaling feature prevents an LX container from (re)starting. We have since reproduced this several times independently of the rbd-mirror scenario.

Only the journaling feature matters here. If you disable it, the container starts without any problems. If you enable it again, a stopped container can no longer be started.
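The toggle test can be sketched roughly as follows. The pool name, image name and container ID are placeholders, not the ones from our cluster, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: prints the commands instead of executing them.
# Set DRY_RUN=0 to actually run them on a Proxmox/Ceph node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder names (assumptions), replace with your own values.
POOL="${POOL:-rbd-pool}"
IMAGE="${IMAGE:-vm-100-disk-0}"
CTID="${CTID:-100}"

toggle_test() {
    # Without journaling the container starts in our tests.
    run rbd feature disable "$POOL/$IMAGE" journaling
    run pct start "$CTID"
    run pct stop "$CTID"
    # With journaling enabled again, the start fails in our tests.
    run rbd feature enable "$POOL/$IMAGE" journaling
    run pct start "$CTID"
}

toggle_test
```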

The first log from "pct start" shows:

Code:

root@proxmoxsm34:~# pct start 112233445
Job for pve-container@112233445.service failed because the control process exited with error code.
See "systemctl status pve-container@112233445.service" and "journalctl -xe" for details.
command 'systemctl start pve-container@112233445' failed: exit code 1

Then:

Code:

root@proxmoxsm34:~# journalctl -u pve-container@112233445
-- Logs begin at Sat 2020-01-25 12:03:35 CET, end at Mon 2020-01-27 09:57:04 CET. --
Jan 25 12:12:55 proxmoxsm34 systemd[1]: Starting PVE LXC Container: 112233445...
Jan 25 12:12:57 proxmoxsm34 systemd[1]: Started PVE LXC Container: 112233445.
Jan 25 12:18:24 proxmoxsm34 systemd[1]: pve-container@112233445.service: Succeeded.
Jan 27 09:36:11 proxmoxsm34 systemd[1]: Starting PVE LXC Container: 112233445...
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: lxccontainer.c: wait_on_daemonized_start: 865 No such file or directory - Failed to receive the container state
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 329 The container failed to start
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 332 To get more details, run the container in foreground mode
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority options
Jan 27 09:36:12 proxmoxsm34 systemd[1]: pve-container@112233445.service: Control process exited, code=exited, status=1/FAILURE
Jan 27 09:36:12 proxmoxsm34 systemd[1]: pve-container@112233445.service: Failed with result 'exit-code'.
Jan 27 09:36:12 proxmoxsm34 systemd[1]: Failed to start PVE LXC Container: 112233445.

Then we ran "lxc-start -F" to get a more precise error message:

Code:

root@proxmoxsm34:~# lxc-start -F -f /etc/pve/local/lxc/112233445.conf --name test-tp  --logpriority TRACE
lxc-start: test-tp: utils.c: safe_mount: 1212 No such file or directory - Failed to mount "/dev/pts/10" onto "/dev/console"
lxc-start: test-tp: conf.c: lxc_setup_dev_console: 1774 Failed to mount "/dev/pts/10" on "/dev/console"
lxc-start: test-tp: conf.c: lxc_setup: 3683 Failed to setup console
lxc-start: test-tp: start.c: do_start: 1338 Failed to setup container "test-tp"
lxc-start: test-tp: sync.c: __sync_wait: 62 An error occurred in another process (expected sequence number 5)
lxc-start: test-tp: start.c: lxc_abort: 1133 Function not implemented - Failed to send SIGKILL to 1983476
lxc-start: test-tp: start.c: __lxc_start: 2080 Failed to spawn container "test-tp"
lxc-start: test-tp: tools/lxc_start.c: main: 329 The container failed to start
lxc-start: test-tp: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority options

Where exactly is the flaw in our reasoning? If we did not make a mistake, please explain why the journaling feature is mandatory for rbd-mirror and why it causes a problem like this. Is our use case wrong? What exactly does journaling do? Why does it prevent containers from (re)starting and thus, in our humble opinion, endanger the disaster scenario we have devised?

Thanks a lot in advance!

Alwin · Mar 20, 2020

Can you please post the following:

pveversion -v of your two clusters
the ceph.conf of both
a pct config <vmid>
the output of a debug start of a container

https://pve.proxmox.com/pve-docs/chapter-pct.html#_obtaining_debugging_logs
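A minimal sketch for collecting all of this in one go; the container ID is a placeholder, and /etc/pve/ceph.conf assumes Ceph was set up through Proxmox. The lxc-start line follows the debugging section linked above. By default the script only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Dry-run by default: prints the commands instead of executing them.
# Set DRY_RUN=0 to actually run them on a Proxmox node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder container ID (assumption), replace with the failing one.
CTID="${CTID:-100}"

collect_debug_info() {
    run pveversion -v
    # Assumed location of ceph.conf on a Proxmox-managed Ceph setup.
    run cat /etc/pve/ceph.conf
    run pct config "$CTID"
    # Foreground start with debug logging written to a file.
    run lxc-start -n "$CTID" -F -l DEBUG -o "/tmp/lxc-$CTID.log"
}

collect_debug_info
```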

tpuetz · Apr 14, 2020

Hi there,

I am so sorry for the late reply. There was a lot to do over the last few weeks due to the Corona situation.

Here are the outputs:

Cluster 1 (Master in the rbd-mirror scenario):

Code:

[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.0.27/24
         fsid = 633c704a-42b6-4fdc-b9f3-a2f99e0a7763
         mon_allow_pool_delete = true
         mon_host = 192.168.0.27 192.168.0.32 192.168.0.33 192.168.0.34
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.0.27/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mds.proxmoxsm27]
         host = proxmoxsm27
         mds standby for name = pve

[mds.proxmoxsm32]
         host = proxmoxsm32
         mds_standby_for_name = pve

[mds.proxmoxsm34]
         host = proxmoxsm34
         mds_standby_for_name = pve

[mds.proxmoxsm33]
         host = proxmoxsm33
         mds_standby_for_name = pve

[mon.proxmoxsm32]
         host = proxmoxsm32
         mon_addr = 192.168.0.32:6789

[mon.proxmoxsm27]
         host = proxmoxsm27
         mon_addr = 192.168.0.27:6789

[mon.proxmoxsm33]
         host = proxmoxsm33
         mon_addr = 192.168.0.33:6789

[mon.proxmoxsm34]
         host = proxmoxsm34
         mon_addr = 192.168.0.34:6789

Cluster 2 (Slave in the rbd-mirror scenario):

Code:

[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.0.26/24
         fsid = 87651fea-6a98-4d37-9ffb-fdb8b343eef0
         mon_allow_pool_delete = true
         mon_host = 192.168.0.26 192.168.0.35 192.168.0.36 192.168.0.37
         osd_journal_size = 5120
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.0.26/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
         keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mds.proxmoxsm36]
         host = proxmoxsm36
         mds_standby_for_name = pve

[mds.proxmoxsm26]
         host = proxmoxsm26
         mds standby for name = pve

[mds.proxmoxsm37]
         host = proxmoxsm37
         mds_standby_for_name = pve

[mds.proxmoxsm35]
         host = proxmoxsm35
         mds_standby_for_name = pve

[mon.proxmoxsm35]
         host = proxmoxsm35
         mon_addr = 192.168.0.35:6789

[mon.proxmoxsm37]
         host = proxmoxsm37
         mon_addr = 192.168.0.37:6789

[mon.proxmoxsm26]
         host = proxmoxsm26
         mon_addr = 192.168.0.26:6789

[mon.proxmoxsm36]
         host = proxmoxsm36
         mon_addr = 192.168.0.36:6789

Unfortunately, we don't have the container with the ID 112233445 anymore, but all our configs are pretty much the same.

Container config as an example:

Code:

lxc.arch = amd64
lxc.include = /usr/share/lxc/config/debian.common.conf
lxc.apparmor.profile = generated
lxc.apparmor.raw = deny mount -> /proc/,
lxc.apparmor.raw = deny mount -> /sys/,
lxc.monitor.unshare = 1
lxc.tty.max = 2
lxc.environment = TERM=linux
lxc.uts.name = infixhubpr.macd.com
lxc.cgroup.memory.limit_in_bytes = 4294967296
lxc.cgroup.memory.memsw.limit_in_bytes = 8589934592
lxc.cgroup.cpu.shares = 1024
lxc.rootfs.path = /var/lib/lxc/17794/rootfs
lxc.net.0.type = veth
lxc.net.0.veth.pair = veth17794i0
lxc.net.0.hwaddr = 76:F8:9F:69:F9:67
lxc.net.0.name = eth0
lxc.cgroup.cpuset.cpus = 9,12,23,36
