Hi all,
My setup is a three-node cluster: two mini PCs (each with local NVMe storage and an HDD for VMs) plus a corosync device (RPi 3 B+). Failover in this scenario works, and I've tested it a few times (controlled and uncontrolled). Both nodes have a ZFS storage disk with the same name, and I've scheduled storage replication every day at 2 AM.
What I'm noticing is that the node running the replicated container (it doesn't matter which one) goes to status: unknown (grey question mark on the node, its storage and all of its containers) at exactly the time of replication. If I delete the replication schedule and recreate it, it's OK for about a week or so, then the same problem returns. In the unknown state, everything running in the container is unresponsive. A reboot of the node fixes this until the next run of the schedule, and then the status is unknown again.
The unknown node responds to ping and HTTP requests (UI), but doesn't let the containers running on it be stopped or migrated. It seems to have something to do with the log of node 2, where there is a "Permission denied" for Docker paths? This only shows up on the days of the crashes.
Please help me out here, I'm at my wits' end... If you need any logs, please let me know and I'll gladly post them here. Below is the syslog from around the time of the last failure.
Kind Regards!
Node 1:
Code:
Aug 02 01:58:27 prx1 systemd[1]: Started Checkmk agent (PID 1027/UID 997).
Aug 02 01:58:28 prx1 systemd[1]: check-mk-agent@10094-1027-997.service: Succeeded.
Aug 02 01:58:29 prx1 pvedaemon[3518832]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 01:59:11 prx1 pmxcfs[1175]: [status] notice: received log
Aug 02 01:59:27 prx1 systemd[1]: Started Checkmk agent (PID 1027/UID 997).
Aug 02 01:59:28 prx1 systemd[1]: check-mk-agent@10095-1027-997.service: Succeeded.
Aug 02 01:59:28 prx1 pvedaemon[3518832]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 02:00:00 prx1 sshd[3604710]: Accepted publickey for root from 10.0.0.22 port 41996 ssh2: RSA SHA256:GhnSnKiGL7KJpSSKyuMcLgBWIV1tiCyP83Ik+c2V744
Aug 02 02:00:00 prx1 sshd[3604710]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 02 02:00:00 prx1 systemd[1]: Created slice User Slice of UID 0.
Aug 02 02:00:00 prx1 systemd[1]: Starting User Runtime Directory /run/user/0...
Aug 02 02:00:00 prx1 systemd-logind[904]: New session 252 of user root.
Aug 02 02:00:00 prx1 systemd[1]: Finished User Runtime Directory /run/user/0.
Aug 02 02:00:00 prx1 systemd[1]: Starting User Manager for UID 0...
Aug 02 02:00:00 prx1 systemd[3604713]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Aug 02 02:00:00 prx1 systemd[3604713]: Queued start job for default target Main User Target.
Aug 02 02:00:00 prx1 systemd[3604713]: Created slice User Application Slice.
Aug 02 02:00:00 prx1 systemd[3604713]: Reached target Paths.
Aug 02 02:00:00 prx1 systemd[3604713]: Reached target Timers.
Aug 02 02:00:00 prx1 systemd[3604713]: Listening on GnuPG network certificate management daemon.
Aug 02 02:00:00 prx1 systemd[3604713]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 02 02:00:00 prx1 systemd[3604713]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Aug 02 02:00:00 prx1 systemd[3604713]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Aug 02 02:00:00 prx1 systemd[3604713]: Listening on GnuPG cryptographic agent and passphrase cache.
Aug 02 02:00:00 prx1 systemd[3604713]: Reached target Sockets.
Aug 02 02:00:00 prx1 systemd[3604713]: Reached target Basic System.
Aug 02 02:00:00 prx1 systemd[3604713]: Reached target Main User Target.
Aug 02 02:00:00 prx1 systemd[3604713]: Startup finished in 60ms.
Aug 02 02:00:00 prx1 systemd[1]: Started User Manager for UID 0.
Aug 02 02:00:00 prx1 systemd[1]: Started Session 252 of user root.
Aug 02 02:00:00 prx1 sshd[3604710]: Received disconnect from 10.0.0.22 port 41996:11: disconnected by user
Aug 02 02:00:00 prx1 sshd[3604710]: Disconnected from user root 10.0.0.22 port 41996
Aug 02 02:00:00 prx1 sshd[3604710]: pam_unix(sshd:session): session closed for user root
Aug 02 02:00:00 prx1 systemd[1]: session-252.scope: Succeeded.
Aug 02 02:00:00 prx1 systemd-logind[904]: Session 252 logged out. Waiting for processes to exit.
Aug 02 02:00:00 prx1 systemd-logind[904]: Removed session 252.
Aug 02 02:00:05 prx1 sshd[3604742]: Accepted publickey for root from 10.0.0.22 port 48580 ssh2: RSA SHA256:GhnSnKiGL7KJpSSKyuMcLgBWIV1tiCyP83Ik+c2V744
Aug 02 02:00:05 prx1 sshd[3604742]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 02 02:00:05 prx1 systemd-logind[904]: New session 254 of user root.
Aug 02 02:00:05 prx1 systemd[1]: Started Session 254 of user root.
Aug 02 02:00:06 prx1 sshd[3604742]: Received disconnect from 10.0.0.22 port 48580:11: disconnected by user
Aug 02 02:00:06 prx1 sshd[3604742]: Disconnected from user root 10.0.0.22 port 48580
Aug 02 02:00:06 prx1 sshd[3604742]: pam_unix(sshd:session): session closed for user root
Aug 02 02:00:06 prx1 systemd[1]: session-254.scope: Succeeded.
Aug 02 02:00:06 prx1 systemd-logind[904]: Session 254 logged out. Waiting for processes to exit.
Aug 02 02:00:06 prx1 systemd-logind[904]: Removed session 254.
Aug 02 02:00:06 prx1 sshd[3604749]: Accepted publickey for root from 10.0.0.22 port 48586 ssh2: RSA SHA256:GhnSnKiGL7KJpSSKyuMcLgBWIV1tiCyP83Ik+c2V744
Aug 02 02:00:06 prx1 sshd[3604749]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 02 02:00:06 prx1 systemd-logind[904]: New session 255 of user root.
Aug 02 02:00:06 prx1 systemd[1]: Started Session 255 of user root.
Aug 02 02:00:11 prx1 pmxcfs[1175]: [status] notice: received log
Aug 02 02:00:13 prx1 sshd[3604749]: Received disconnect from 10.0.0.22 port 48586:11: disconnected by user
Aug 02 02:00:13 prx1 sshd[3604749]: Disconnected from user root 10.0.0.22 port 48586
Aug 02 02:00:13 prx1 sshd[3604749]: pam_unix(sshd:session): session closed for user root
Aug 02 02:00:13 prx1 systemd[1]: session-255.scope: Succeeded.
Aug 02 02:00:13 prx1 systemd[1]: session-255.scope: Consumed 2.947s CPU time.
Aug 02 02:00:13 prx1 systemd-logind[904]: Session 255 logged out. Waiting for processes to exit.
Aug 02 02:00:13 prx1 systemd-logind[904]: Removed session 255.
Aug 02 02:00:14 prx1 sshd[3606199]: Accepted publickey for root from 10.0.0.22 port 41018 ssh2: RSA SHA256:GhnSnKiGL7KJpSSKyuMcLgBWIV1tiCyP83Ik+c2V744
Aug 02 02:00:14 prx1 sshd[3606199]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 02 02:00:14 prx1 systemd-logind[904]: New session 256 of user root.
Aug 02 02:00:14 prx1 systemd[1]: Started Session 256 of user root.
Aug 02 02:00:14 prx1 sshd[3606199]: Received disconnect from 10.0.0.22 port 41018:11: disconnected by user
Aug 02 02:00:14 prx1 sshd[3606199]: Disconnected from user root 10.0.0.22 port 41018
Aug 02 02:00:14 prx1 sshd[3606199]: pam_unix(sshd:session): session closed for user root
Aug 02 02:00:14 prx1 systemd[1]: session-256.scope: Succeeded.
Aug 02 02:00:14 prx1 systemd-logind[904]: Session 256 logged out. Waiting for processes to exit.
Aug 02 02:00:14 prx1 systemd-logind[904]: Removed session 256.
Aug 02 02:00:24 prx1 systemd[1]: Stopping User Manager for UID 0...
Aug 02 02:00:24 prx1 systemd[3604713]: Stopped target Main User Target.
Aug 02 02:00:24 prx1 systemd[3604713]: Stopped target Basic System.
Aug 02 02:00:24 prx1 systemd[3604713]: Stopped target Paths.
Aug 02 02:00:24 prx1 systemd[3604713]: Stopped target Sockets.
Aug 02 02:00:24 prx1 systemd[3604713]: Stopped target Timers.
Aug 02 02:00:24 prx1 systemd[3604713]: dirmngr.socket: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Closed GnuPG network certificate management daemon.
Aug 02 02:00:24 prx1 systemd[3604713]: gpg-agent-browser.socket: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 02 02:00:24 prx1 systemd[3604713]: gpg-agent-extra.socket: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Aug 02 02:00:24 prx1 systemd[3604713]: gpg-agent-ssh.socket: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Aug 02 02:00:24 prx1 systemd[3604713]: gpg-agent.socket: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Closed GnuPG cryptographic agent and passphrase cache.
Aug 02 02:00:24 prx1 systemd[3604713]: Removed slice User Application Slice.
Aug 02 02:00:24 prx1 systemd[3604713]: Reached target Shutdown.
Aug 02 02:00:24 prx1 systemd[3604713]: systemd-exit.service: Succeeded.
Aug 02 02:00:24 prx1 systemd[3604713]: Finished Exit the Session.
Aug 02 02:00:24 prx1 systemd[3604713]: Reached target Exit the Session.
Aug 02 02:00:24 prx1 systemd[1]: user@0.service: Succeeded.
Aug 02 02:00:24 prx1 systemd[1]: Stopped User Manager for UID 0.
Aug 02 02:00:24 prx1 systemd[1]: Stopping User Runtime Directory /run/user/0...
Aug 02 02:00:24 prx1 systemd[1]: run-user-0.mount: Succeeded.
Aug 02 02:00:24 prx1 systemd[1]: user-runtime-dir@0.service: Succeeded.
Aug 02 02:00:24 prx1 systemd[1]: Stopped User Runtime Directory /run/user/0.
Aug 02 02:00:24 prx1 systemd[1]: Removed slice User Slice of UID 0.
Aug 02 02:00:24 prx1 systemd[1]: user-0.slice: Consumed 4.294s CPU time.
Aug 02 02:00:27 prx1 systemd[1]: Started Checkmk agent (PID 1027/UID 997).
Aug 02 02:00:28 prx1 systemd[1]: check-mk-agent@10096-1027-997.service: Succeeded.
Aug 02 02:00:29 prx1 pvedaemon[3401359]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 02:01:11 prx1 pmxcfs[1175]: [status] notice: received log
Aug 02 02:01:27 prx1 systemd[1]: Started Checkmk agent (PID 1027/UID 997).
Aug 02 02:01:28 prx1 systemd[1]: check-mk-agent@10097-1027-997.service: Succeeded.
Aug 02 02:01:28 prx1 pvedaemon[3486375]: <root@pam> successful auth for user 'checkmk@pve'
Node 2 was running the container and became unknown:
Code:
Aug 02 01:58:11 prx2 pvedaemon[1655804]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 01:58:29 prx2 pmxcfs[1265]: [status] notice: received log
Aug 02 01:59:09 prx2 systemd[1]: Started Checkmk agent (PID 1117/UID 997).
Aug 02 01:59:11 prx2 systemd[1]: check-mk-agent@10099-1117-997.service: Succeeded.
Aug 02 01:59:11 prx2 pvedaemon[1957699]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 01:59:28 prx2 pmxcfs[1265]: [status] notice: received log
Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open /var/lib/docker/fuse-overlayfs/f7056dd7352e3b6402355325aee9406827ee1a0950a3e492f390d6d738edd9f6/merged: Permission denied
Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open /var/lib/docker/fuse-overlayfs/a14fcdf9ffa4e3008a1cb33aa3b7ef0081b8d965e5523b84e952a995d82f066a/merged: Permission denied
Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open /var/lib/docker/fuse-overlayfs/a381f5194ab0cf7c6b3aa6adc54e0e4ff4e4b47d778b539b1ae18e58a62ef3ff/merged: Permission denied
Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open /var/lib/docker/fuse-overlayfs/8bdc328a458a6928af68862eaee23fd0371e0e0b49c62ab006b584f5767a0519/merged: Permission denied
Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open /var/lib/docker/fuse-overlayfs/ff5247a0cb70a7071cc8369f1eea74e020b3fde40925d2bc619f9874eb0e99bb/merged: Permission denied
Aug 02 02:00:09 prx2 systemd[1]: Started Checkmk agent (PID 1117/UID 997).
Aug 02 02:00:11 prx2 systemd[1]: check-mk-agent@10100-1117-997.service: Succeeded.
Aug 02 02:00:11 prx2 pvedaemon[1957699]: <root@pam> successful auth for user 'checkmk@pve'
Aug 02 02:00:29 prx2 pmxcfs[1265]: [status] notice: received log
Aug 02 02:01:09 prx2 systemd[1]: Started Checkmk agent (PID 1117/UID 997).
Aug 02 02:01:10 prx2 systemd[1]: check-mk-agent@10101-1117-997.service: Succeeded.
Aug 02 02:01:11 prx2 pvedaemon[1957699]: <root@pam> successful auth for user 'checkmk@pve'
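To make the relevant entries easier to spot on other days, here is a minimal Python sketch that filters a syslog excerpt down to the pvescheduler "Permission denied" lines and counts them per timestamp. The function names (`scheduler_errors`, `count_by_timestamp`) and the regular expression are illustrative assumptions, not part of any Proxmox tooling; the sample lines are copied from the Node 2 log above.

```python
import re
from collections import Counter

# Matches syslog lines of the form shown above, e.g.
# "Aug 02 02:00:05 prx2 pvescheduler[1967913]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<proc>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<msg>.*)$"
)


def scheduler_errors(lines, proc="pvescheduler", needle="Permission denied"):
    """Return (timestamp, host, message) for lines from `proc` containing `needle`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("proc") == proc and needle in m.group("msg"):
            hits.append((m.group("ts"), m.group("host"), m.group("msg")))
    return hits


def count_by_timestamp(hits):
    """Count matching lines per timestamp to see whether they line up with the 2 AM job."""
    return Counter(ts for ts, _host, _msg in hits)


# Sample lines taken from the Node 2 log above.
sample = [
    "Aug 02 01:59:28 prx2 pmxcfs[1265]: [status] notice: received log",
    "Aug 02 02:00:05 prx2 pvescheduler[1967913]: failed to open "
    "/var/lib/docker/fuse-overlayfs/"
    "f7056dd7352e3b6402355325aee9406827ee1a0950a3e492f390d6d738edd9f6"
    "/merged: Permission denied",
    "Aug 02 02:00:09 prx2 systemd[1]: Started Checkmk agent (PID 1117/UID 997).",
]

hits = scheduler_errors(sample)
for ts, host, msg in hits:
    print(ts, host, msg)
print(count_by_timestamp(hits))
```

The same functions could be run over a saved syslog excerpt from each node (read with `open(path).readlines()`) to compare days with and without the problem.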
edit: formatting
edit2: correct logs