[SOLVED] status unknown - vgs not responding

RichtigerBot

New Member
Dec 3, 2024
Hi guys, I have a rather strange problem with my current Proxmox configuration.

The status of 2 out of 3 nodes always goes to unknown about 3 minutes after I restart a node; during those 3 minutes the status is online. The node I restarted keeps working fine.
Does anyone know what I have done wrong? I would be very grateful if I could finally solve this problem.

General Information:
- 3 nodes in a cluster (Don't let the names confuse you — node3 is called prox09)
- ceph cluster (storage is a SAN which is connected via multipath)

What I gathered so far:

I tried to run time pvesm status on every node, but I only get a response on the node with the lowest uptime; on the other two the command never returns. The same happens with the vgs command.
The warning "lvm[977]: WARNING: lvmlockd process is not running." looks interesting to me, but I can see lvmlockd starting a few seconds after that message (see the lvm status of node1 below).
I found this message on node1 when node3 first booted up:
May 20 07:38:51 prox01 pvestatd[2136]: status update time (65131.801 seconds)
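In case it is useful, this is roughly how the hang can be confirmed (the 10 s timeout is just an arbitrary value):

Bash:
# Run the status commands with a hard timeout so a hang shows up
# instead of blocking the shell indefinitely.
timeout 10 pvesm status || echo "pvesm status did not return within 10s"
timeout 10 vgs || echo "vgs did not return within 10s"

# Look for stuck LVM / pvestatd processes (STAT "D" = uninterruptible sleep)
ps -eo pid,stat,etime,cmd | grep -E 'pvestatd|vgs|lvmlockd' | grep -v grep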

In this example I rebooted node3.

On node3:
Bash:
# time pvesm status
  Skipping global lock: lockspace is starting
  Skipping global lock: lockspace is starting
Name             Type     Status           Total            Used       Available        %
ceph              rbd     active      1855337117        89429661      1765907456    4.82%
local             dir     active        44867864         5582292        36973996   12.44%
local-lvm     lvmthin     active        68513792               0        68513792    0.00%

real    0m1.425s
user    0m1.189s
sys    0m0.201s

Bash:
# vgs
  Skipping global lock: lockspace is starting
  VG                                        #PV #LV #SN Attr   VSize   VFree 
  ceph-2a1fdede-aebc-470a-a3fa-c4577ecbbf56   1   1   0 wz--n-  <1.82t     0 
  pve                                         1   3   0 wz--n- 135.12g 16.00g

Bash:
# dlm_tool status
cluster nodeid 3 quorate 1 ring seq 203 203
daemon now 3656 fence_pid 0
node 1 M add 27 rem 0 fail 0 fence 0 at 0 0
node 2 M add 27 rem 0 fail 0 fence 0 at 0 0
node 3 M add 25 rem 0 fail 0 fence 0 at 0 0

You can find the log file of node3 in the attachments.

On node1:
Bash:
# multipath -ll
mpath0 (3600c0ff000fcbe3d64d6eb6701000000) dm-5 DellEMC,ME5
size=1.8T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 12:0:0:0 sdc 8:32 active ready running

Bash:
# ceph status
  cluster:
    id:     2a211c88-f574-472b-b29a-0a1c4f8549bc
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum prox01,prox02,prox09 (age 24m)
    mgr: prox01(active, since 10d)
    osd: 3 osds: 3 up (since 24m), 3 in (since 22h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 24.91k objects, 90 GiB
    usage:   255 GiB used, 5.2 TiB / 5.5 TiB avail
    pgs:     33 active+clean
 
  io:
    client:   0 B/s rd, 29 KiB/s wr, 0 op/s rd, 5 op/s wr

Bash:
# systemctl status lvm*
● lvmlockd.service - LVM lock daemon
     Loaded: loaded (/lib/systemd/system/lvmlockd.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-05-09 10:05:22 CEST; 1 week 3 days ago
       Docs: man:lvmlockd(8)
   Main PID: 2649 (lvmlockd)
      Tasks: 4 (limit: 154476)
     Memory: 3.0M
        CPU: 53.971s
     CGroup: /system.slice/lvmlockd.service
             └─2649 /sbin/lvmlockd --foreground

May 09 10:05:02 prox01 systemd[1]: Starting lvmlockd.service - LVM lock daemon...
May 09 10:05:22 prox01 lvmlockd[2649]: [D] creating /run/lvm/lvmlockd.socket
May 09 10:05:22 prox01 lvmlockd[2649]: 1746777922 lvmlockd started
May 09 10:05:22 prox01 systemd[1]: Started lvmlockd.service - LVM lock daemon.

● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
     Loaded: loaded (/lib/systemd/system/lvm2-monitor.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2025-05-09 10:04:51 CEST; 1 week 3 days ago
       Docs: man:dmeventd(8)
             man:lvcreate(8)
             man:lvchange(8)
             man:vgchange(8)
   Main PID: 977 (code=exited, status=0/SUCCESS)
        CPU: 16ms

May 09 10:04:50 prox01 lvm[977]:   WARNING: lvmlockd process is not running.
May 09 10:04:50 prox01 lvm[977]:   Reading without shared global lock.
May 09 10:04:50 prox01 lvm[977]:   5 logical volume(s) in volume group "pve" monitored
May 09 10:04:51 prox01 systemd[1]: Finished lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using d>
Notice: journal has been rotated since unit was started, output may be incomplete.

● lvmlocks.service - LVM locking start and stop
     Loaded: loaded (/lib/systemd/system/lvmlocks.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2025-05-09 10:05:23 CEST; 1 week 3 days ago
       Docs: man:lvmlockd(8)
   Main PID: 2652 (code=exited, status=0/SUCCESS)
        CPU: 13ms

May 09 10:05:22 prox01 systemd[1]: Starting lvmlocks.service - LVM locking start and stop...
May 09 10:05:23 prox01 systemd[1]: Finished lvmlocks.service - LVM locking start and stop.

● lvm2-lvmpolld.socket - LVM2 poll daemon socket
     Loaded: loaded (/lib/systemd/system/lvm2-lvmpolld.socket; enabled; preset: enabled)
     Active: active (listening) since Fri 2025-05-09 10:04:50 CEST; 1 week 3 days ago
   Triggers: ● lvm2-lvmpolld.service
       Docs: man:lvmpolld(8)
     Listen: /run/lvm/lvmpolld.socket (Stream)
     CGroup: /system.slice/lvm2-lvmpolld.socket

Notice: journal has been rotated since unit was started, output may be incomplete.

Bash:
# dlm_tool status
cluster nodeid 1 quorate 1 ring seq 203 203
daemon now 945540 fence_pid 0
node 1 M add 17 rem 0 fail 0 fence 0 at 0 0
node 2 M add 876508 rem 536761 fail 0 fence 0 at 0 0
node 3 M add 941902 rem 941646 fail 0 fence 0 at 0 0

If you require further information, I will be happy to provide you with more.
 

Attachments

> The status of 2 out of 3 nodes always goes to unknown, about 3 minutes after restarting a node. In these 3 minutes the status is online.

That sounds like network issues to me.
Could you please generate a new log as follows? The current one does not give enough information.

Bash:
journalctl -S 2024-05-16 -u corosync > $(date -I)_journal.txt
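If you can get onto the nodes while it happens, the knet link state can also be checked directly; a quick sketch (output wording may differ between corosync versions):

Bash:
# Local node id and up/down state of each knet link to the other nodes
corosync-cfgtool -s

# Cluster membership and quorum as PVE sees it
pvecm status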
 
Here you go. I changed 2024 to 2025. Hope that's correct.

Code:
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [CFG   ] Node 3 was shut down by sysadmin
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[2]: 1 2
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync left[1]: 3
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.b4) was formed. Members left: 3
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[2]: 1 2
May 19 09:55:11 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 09:55:12 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: host: 3 link: 0 is down
May 19 09:55:12 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 19 09:55:12 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 has no active links
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[3]: 1 2 3
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync joined[1]: 3
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.b9) was formed. Members joined: 3
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[3]: 1 2 3
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 10:06:38 tuwza7y-prox01 corosync[1954]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [CFG   ] Node 2 was shut down by sysadmin
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[2]: 1 3
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync left[1]: 2
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.bd) was formed. Members left: 2
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[2]: 1 3
May 19 13:24:15 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 13:24:16 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: host: 2 link: 0 is down
May 19 13:24:16 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 13:24:16 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 2 has no active links
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [KNET  ] rx: host: 2 link: 0 is up
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[3]: 1 2 3
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync joined[1]: 2
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.c2) was formed. Members joined: 2
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[3]: 1 2 3
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 13:33:13 tuwza7y-prox01 corosync[1954]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 20 07:38:45 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: host: 3 link: 0 is down
May 20 07:38:45 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 20 07:38:45 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 has no active links
May 20 07:38:46 tuwza7y-prox01 corosync[1954]:   [TOTEM ] Token has not been received in 2737 ms
May 20 07:38:47 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[2]: 1 2
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync left[1]: 3
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.c6) was formed. Members left: 3
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [TOTEM ] Failed to receive the leave message. failed: 3
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[2]: 1 2
May 20 07:38:51 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [KNET  ] rx: host: 3 link: 0 is up
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync members[3]: 1 2 3
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [QUORUM] Sync joined[1]: 3
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [TOTEM ] A new membership (1.cb) was formed. Members joined: 3
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [QUORUM] Members[3]: 1 2 3
May 20 07:43:07 tuwza7y-prox01 corosync[1954]:   [MAIN  ] Completed service synchronization, ready to provide service.

If you need anything else, please let me know.
 

Attachments

> Here you go. I changed 2024 to 2025. Hope that's correct.

Yes, thank you.

OK, so you can see that the links of nodes 2 and 3 are flapping / losing connection from time to time.
Therefore I'd recommend checking why this is the case:
  • Check your switch (if possible) for errors.
  • Check the cables and swap them out.
  • Use another NIC/port if possible (a rough NIC-side check is sketched below).
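A minimal sketch, assuming the corosync link runs over an interface named eno1 (adjust to your actual NIC name):

Bash:
# Per-interface RX/TX error and drop counters
ip -s link show eno1

# Driver/firmware level counters (CRC errors, drops, etc.)
ethtool -S eno1 | grep -iE 'err|drop|crc'

# Negotiated speed/duplex and whether a link is detected
ethtool eno1 | grep -E 'Speed|Duplex|Link detected'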
 
The servers are blade servers in a PowerEdge M1000e; the switch is an MXL blade switch.
Unfortunately, I can't access the switch right now, but I will check tomorrow whether I can find any related logs.

The link flapping comes from me rebooting the nodes to check for any abnormal behavior. This setup is currently under testing; we plan to go into production in the near future.

Here is the corosync boot log of node3:
Bash:
May 19 09:55:11 tuwza7y-prox09 systemd[1]: Stopping corosync.service - Corosync Cluster Engine...
May 19 09:55:11 tuwza7y-prox09 corosync-cfgtool[757980]: Shutting down corosync
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [MAIN  ] Node was shut down by a signal
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Unloading all Corosync service engines.
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [QB    ] withdrawing server sockets
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [CFG   ] Node 3 was shut down by sysadmin
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [QB    ] withdrawing server sockets
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync configuration map access
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [QB    ] withdrawing server sockets
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync configuration service
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [QB    ] withdrawing server sockets
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [QB    ] withdrawing server sockets
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync profile loading service
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
May 19 09:55:11 tuwza7y-prox09 corosync[2968956]:   [SERV  ] Service engine unloaded: corosync watchdog service
May 19 09:55:12 tuwza7y-prox09 corosync[2968956]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
May 19 09:55:12 tuwza7y-prox09 corosync[2968956]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
May 19 09:55:12 tuwza7y-prox09 corosync[2968956]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
May 19 09:55:12 tuwza7y-prox09 corosync[2968956]:   [MAIN  ] Corosync Cluster Engine exiting normally
May 19 09:55:12 tuwza7y-prox09 systemd[1]: corosync.service: Deactivated successfully.
May 19 09:55:12 tuwza7y-prox09 systemd[1]: Stopped corosync.service - Corosync Cluster Engine.
May 19 09:55:12 tuwza7y-prox09 systemd[1]: corosync.service: Consumed 4h 6min 22.167s CPU time.
-- Boot a29b2923505a437ba8884c2d84fc0b93 --
May 19 10:06:35 tuwza7y-prox09 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Corosync Cluster Engine  starting up
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [TOTEM ] Initializing transport (Kronosnet).
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [TOTEM ] totemknet initialized
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [KNET  ] pmtud: MTU manually set to: 0
May 19 10:06:35 tuwza7y-prox09 corosync[1595]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QB    ] server name: cmap
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync configuration service [1]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QB    ] server name: cfg
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QB    ] server name: cpg
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [WD    ] Watchdog not enabled by configuration
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [WD    ] resource load_15min missing a recovery key.
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [WD    ] resource memory_used missing a recovery key.
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [WD    ] no resources configured.
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QUORUM] Using quorum provider corosync_votequorum
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QB    ] server name: votequorum
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QB    ] server name: quorum
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [TOTEM ] Configuring link 0
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [TOTEM ] Configured link number 0: local addr: 10.162.81.159, port=5405
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 has no active links
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync members[1]: 3
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync joined[1]: 3
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [TOTEM ] A new membership (3.b5) was formed. Members joined: 3
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [QUORUM] Members[1]: 3
May 19 10:06:36 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 10:06:36 tuwza7y-prox09 systemd[1]: Started corosync.service - Corosync Cluster Engine.
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] rx: host: 2 link: 0 is up
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] rx: host: 1 link: 0 is up
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync members[3]: 1 2 3
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync joined[2]: 1 2
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [TOTEM ] A new membership (1.b9) was formed. Members joined: 1 2
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [QUORUM] This node is within the primary component and will provide service.
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [QUORUM] Members[3]: 1 2 3
May 19 10:06:38 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [CFG   ] Node 2 was shut down by sysadmin
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync members[2]: 1 3
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync left[1]: 2
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [TOTEM ] A new membership (1.bd) was formed. Members left: 2
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [QUORUM] Members[2]: 1 3
May 19 13:24:15 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 13:24:16 tuwza7y-prox09 corosync[1595]:   [KNET  ] link: host: 2 link: 0 is down
May 19 13:24:16 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 13:24:16 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 has no active links
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [KNET  ] rx: host: 2 link: 0 is up
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync members[3]: 1 2 3
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [QUORUM] Sync joined[1]: 2
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [TOTEM ] A new membership (1.c2) was formed. Members joined: 2
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [QUORUM] Members[3]: 1 2 3
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 19 13:33:13 tuwza7y-prox09 corosync[1595]:   [KNET  ] pmtud: Global data MTU changed to: 1397
-- Boot cd6193add25a4677990fcdb2540fa860 --
May 20 07:43:03 tuwza7y-prox09 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
May 20 07:43:03 tuwza7y-prox09 corosync[1560]:   [MAIN  ] Corosync Cluster Engine  starting up
May 20 07:43:03 tuwza7y-prox09 corosync[1560]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
May 20 07:43:03 tuwza7y-prox09 corosync[1560]:   [TOTEM ] Initializing transport (Kronosnet).
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [TOTEM ] totemknet initialized
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] pmtud: MTU manually set to: 0
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QB    ] server name: cmap
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync configuration service [1]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QB    ] server name: cfg
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QB    ] server name: cpg
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [WD    ] Watchdog not enabled by configuration
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [WD    ] resource load_15min missing a recovery key.
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [WD    ] resource memory_used missing a recovery key.
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [WD    ] no resources configured.
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QUORUM] Using quorum provider corosync_votequorum
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QB    ] server name: votequorum
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [QB    ] server name: quorum
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [TOTEM ] Configuring link 0
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [TOTEM ] Configured link number 0: local addr: 10.162.81.159, port=5405
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] host: host: 1 has no active links
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] host: host: 1 has no active links
May 20 07:43:04 tuwza7y-prox09 corosync[1560]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
 
Okay, thank you for the information.

Another thing that I just noticed:

> but I can see lvmlockd starting a few seconds after that message.

Why are you using lvmlockd? Proxmox VE does not use lvmlockd, because we implemented our own locking mechanism, and running two different locking mechanisms can break things quite badly.
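As a side note, you can quickly check whether LVM is actually configured to use lvmlockd; a minimal sketch:

Bash:
# Effective lvmlockd setting from lvm.conf (use_lvmlockd=0 means disabled, the PVE default)
lvmconfig global/use_lvmlockd

# Whether the extra lock daemons are enabled at boot (unit names as on this system)
systemctl is-enabled lvmlockd.service dlm.service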
 
I'm not sure why I set it up this way; I thought Ceph needed it.
Can I just disable lvmlockd and reboot the machines one by one? Or do I have to change something in the PVE/Ceph config?
 
No, lvmlockd is not required at all.
I'd recommend removing the lvmlockd package from all systems via apt, rebooting each node, and then checking the status of the cluster again.
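Per node, roughly something like this; a sketch only, and the package and unit names are assumptions based on the usual Debian names, so double-check them locally first:

Bash:
# Stop and disable the extra lock daemons (unit names as seen in your status output;
# dlm.service is an assumption, adjust if it is named differently)
systemctl disable --now lvmlockd.service lvmlocks.service dlm.service

# Remove the packages; lvm2-lockd and dlm-controld are the usual Debian names,
# verify what is actually installed first: dpkg -l | grep -E 'lockd|dlm'
apt remove lvm2-lockd dlm-controld

# Make sure LVM no longer tries to use lvmlockd (should print use_lvmlockd=0;
# if not, set use_lvmlockd = 0 in /etc/lvm/lvm.conf)
lvmconfig global/use_lvmlockd

# Then reboot the node and check the cluster status again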
 
That would explain the issue. I removed the package and rebooted all nodes. I disabled dlm too; I hope that's correct.
I did some testing, and it looks like it fixed the problem.
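In case it helps anyone else, roughly how the fix can be verified on each node (nothing fancy, just the commands that were hanging before plus the pvestatd log):

Bash:
# Both of these should now return within a couple of seconds instead of hanging
time vgs
time pvesm status

# All three nodes should be members and the cluster quorate
pvecm status

# No new "status update time" warnings from pvestatd since the reboot
journalctl -u pvestatd --since "30 min ago" | grep "status update time"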

I feel embarrassed now.
Thank you so much.
 
Yes, please also remove dlm, as it is not required and can cause issues as well.
You're welcome :)
 