Hello everybody,
TL;DR: We are struggling with the RBD journaling feature enabled on the images of our LXC containers. On the one hand, it is mandatory for one-way rbd-mirror sync from the master cluster into a pool on the slave cluster; on the other hand, it prevents containers on the master from (re)starting once the disaster scenario is over.
System data:
Number of clusters: 2
Proxmox instances per cluster: 4
Ceph version: Nautilus (14)
Proxmox version: 6.1 with latest updates
OSD versions: All 14.2.6
Scenario:
From our master cluster we configured a one-way rbd-mirror into a specially created pool on the slave cluster (according to the official Proxmox documentation!). For images that are to be synced, the features "exclusive-lock" and "journaling" are mandatory. The sync runs without problems.
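For readers following along, enabling the two features on an image can be sketched roughly as below. The pool name `mirrorpool` and image name `vm-100-disk-0` are hypothetical placeholders, not names from our setup, and the `run` helper only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: enable the image features that journal-based rbd-mirror needs.
# POOL and IMAGE are hypothetical placeholders, not names from this thread.
POOL="${POOL:-mirrorpool}"
IMAGE="${IMAGE:-vm-100-disk-0}"

# With DRY_RUN=1 (the default) commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_mirror_features() {
    # journaling depends on exclusive-lock, so that one is enabled first
    # (rbd reports an error if a feature is already enabled)
    run rbd feature enable "$POOL/$IMAGE" exclusive-lock
    run rbd feature enable "$POOL/$IMAGE" journaling
    # print the image details, including its feature list, for verification
    run rbd info "$POOL/$IMAGE"
}

enable_mirror_features
```

Setting DRY_RUN=0 executes the commands on a node with access to the cluster instead of printing them.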
When we tested the disaster scenario at the end of the week, promoting the slave pool and spawning the images as LXC instances succeeded without any problems. However, we had to promote with the --force flag, since our scenario assumes that the master site simply breaks down and cannot be demoted beforehand.
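A forced promotion of a single image on the slave side looks roughly like the sketch below. Again, `mirrorpool` and `vm-100-disk-0` are hypothetical placeholders, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: force-promote a mirrored image on the surviving (slave) cluster.
# POOL and IMAGE are hypothetical placeholders, not names from this thread.
POOL="${POOL:-mirrorpool}"
IMAGE="${IMAGE:-vm-100-disk-0}"

# With DRY_RUN=1 (the default) commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

force_promote() {
    # --force promotes the image although the old primary was never demoted
    run rbd mirror image promote --force "$POOL/$IMAGE"
    # show the mirroring state of the image afterwards
    run rbd mirror image status "$POOL/$IMAGE"
}

force_promote
```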
When we then reversed the disaster scenario, i.e. stopped the slave LXC instances and simply wanted to put the master cluster back into operation, we noticed that the journaling feature prevents an LX container from (re)starting. We have since reproduced this several times, independently of the rbd-mirror scenario.
Only the journaling feature matters here. If you disable it on the image, the container starts without any problems. If you enable it again, a stopped container can no longer be started.
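The reproduction described above can be sketched as the following sequence. The container ID is the one from the logs in this post; the pool name `mirrorpool` and the image name `vm-112233445-disk-0` are assumptions for illustration only. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: toggle the journaling feature and try to start the container.
# POOL and IMAGE are hypothetical; CTID is the container ID from the logs.
POOL="${POOL:-mirrorpool}"
IMAGE="${IMAGE:-vm-112233445-disk-0}"
CTID="${CTID:-112233445}"

# With DRY_RUN=1 (the default) commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

toggle_journaling_test() {
    # Step 1: without journaling the container is reported to start fine
    run rbd feature disable "$POOL/$IMAGE" journaling
    run pct start "$CTID"
    run pct stop "$CTID"
    # Step 2: with journaling re-enabled, this start is the one reported to fail
    run rbd feature enable "$POOL/$IMAGE" journaling
    run pct start "$CTID"
}

toggle_journaling_test
```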
The first log from "pct start" shows:
Code:
root@proxmoxsm34:~# pct start 112233445
Job for pve-container@112233445.service failed because the control process exited with error code.
See "systemctl status pve-container@112233445.service" and "journalctl -xe" for details.
command 'systemctl start pve-container@112233445' failed: exit code 1
Then the journal of the service:
Code:
root@proxmoxsm34:~# journalctl -u pve-container@112233445
-- Logs begin at Sat 2020-01-25 12:03:35 CET, end at Mon 2020-01-27 09:57:04 CET. --
Jan 25 12:12:55 proxmoxsm34 systemd[1]: Starting PVE LXC Container: 112233445...
Jan 25 12:12:57 proxmoxsm34 systemd[1]: Started PVE LXC Container: 112233445.
Jan 25 12:18:24 proxmoxsm34 systemd[1]: pve-container@112233445.service: Succeeded.
Jan 27 09:36:11 proxmoxsm34 systemd[1]: Starting PVE LXC Container: 112233445...
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: lxccontainer.c: wait_on_daemonized_start: 865 No such file or directory - Failed to receive the container state
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 329 The container failed to start
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 332 To get more details, run the container in foreground mode
Jan 27 09:36:12 proxmoxsm34 lxc-start[1873895]: lxc-start: 112233445: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority options
Jan 27 09:36:12 proxmoxsm34 systemd[1]: pve-container@112233445.service: Control process exited, code=exited, status=1/FAILURE
Jan 27 09:36:12 proxmoxsm34 systemd[1]: pve-container@112233445.service: Failed with result 'exit-code'.
Jan 27 09:36:12 proxmoxsm34 systemd[1]: Failed to start PVE LXC Container: 112233445.
Then, running "lxc-start -F" in the foreground for a more precise error message:
Code:
root@proxmoxsm34:~# lxc-start -F -f /etc/pve/local/lxc/112233445.conf --name test-tp --logpriority TRACE
lxc-start: test-tp: utils.c: safe_mount: 1212 No such file or directory - Failed to mount "/dev/pts/10" onto "/dev/console"
lxc-start: test-tp: conf.c: lxc_setup_dev_console: 1774 Failed to mount "/dev/pts/10" on "/dev/console"
lxc-start: test-tp: conf.c: lxc_setup: 3683 Failed to setup console
lxc-start: test-tp: start.c: do_start: 1338 Failed to setup container "test-tp"
lxc-start: test-tp: sync.c: __sync_wait: 62 An error occurred in another process (expected sequence number 5)
lxc-start: test-tp: start.c: lxc_abort: 1133 Function not implemented - Failed to send SIGKILL to 1983476
lxc-start: test-tp: start.c: __lxc_start: 2080 Failed to spawn container "test-tp"
lxc-start: test-tp: tools/lxc_start.c: main: 329 The container failed to start
lxc-start: test-tp: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority options
Where exactly is the flaw in our reasoning? If we did not make a mistake, please explain why the journaling feature is mandatory for rbd-mirror and why it causes a problem like this. Is our use case wrong? What exactly does journaling do? Why does it prevent containers from (re)starting and thus, in our humble opinion, endanger the disaster scenario we have devised?
Thanks a lot in advance!