status unknown - vgs not responding

RichtigerBot

New Member
Dec 3, 2024
Hi guys, I have a rather strange problem with my current Proxmox configuration.

About 3 minutes after I restart a node, the status of the other 2 of the 3 nodes always goes to unknown; during those first 3 minutes their status is still shown as online. The node I restarted itself keeps working fine.
Does anyone know what I have done wrong? I would be very grateful if I could finally solve this problem.
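
In case it helps, this is a way to watch the node status from the CLI while rebooting. My assumption is that the status shown in the GUI is the same status field that the cluster resources API reports, so this is only a sketch:

Bash:
# Poll the node status every 5 seconds while a node is rebooting.
# Assumption: the "unknown" status in the GUI is the same status field
# that /cluster/resources reports for nodes.
while true; do
    date
    pvesh get /cluster/resources --type node
    sleep 5
done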

General Information:
- 3 nodes in a cluster (don't let the names confuse you: node3 is called prox09)
- Ceph cluster (the storage is a SAN that is connected via multipath)

What I gathered so far:

I tried running time pvesm status on every node, but I only get a response on the server with the lowest uptime. On the other two, the command just hangs and never returns. The same happens with the vgs command.
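
In case it is useful, something like this can check all three nodes at once (the host names are from my cluster, and it assumes the usual passwordless root ssh between cluster nodes; just a sketch):

Bash:
# Check whether pvesm status and vgs return within 15 seconds on each node.
# Host names are from my cluster; adjust as needed.
for host in prox01 prox02 prox09; do
    echo "== $host =="
    ssh root@$host "timeout 15 pvesm status > /dev/null && echo 'pvesm status: ok' || echo 'pvesm status: hung or failed'"
    ssh root@$host "timeout 15 vgs > /dev/null && echo 'vgs: ok' || echo 'vgs: hung or failed'"
done
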
The warning "lvm[977]: WARNING: lvmlockd process is not running." looks pretty interesting to me, but I can see lvmlockd starting just a few seconds after that message (see the LVM status on node1 below).
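
To double-check the start-up order of the two units, the journal of the current boot should be enough (just a sketch):

Bash:
# Show lvm2-monitor and lvmlockd messages for the current boot with
# precise timestamps, to see which of the two comes up first.
journalctl -b -u lvm2-monitor.service -u lvmlockd.service -o short-precise --no-pager
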
I found this message on node1 when node3 first came back up (65131 seconds is roughly 18 hours, so pvestatd on node1 had apparently been stuck for that long):
May 20 07:38:51 prox01 pvestatd[2136]: status update time (65131.801 seconds)
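
To see whether the other nodes log the same thing, a search like this works (again just journalctl plus grep; host names are from my cluster):

Bash:
# Show the last few "status update time" warnings from pvestatd on every node.
for host in prox01 prox02 prox09; do
    echo "== $host =="
    ssh root@$host "journalctl -b -u pvestatd | grep 'status update time' | tail -n 5"
done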

In this example I rebooted node3.

Node3:
Bash:
# time pvesm status
  Skipping global lock: lockspace is starting
  Skipping global lock: lockspace is starting
Name             Type     Status           Total            Used       Available        %
ceph              rbd     active      1855337117        89429661      1765907456    4.82%
local             dir     active        44867864         5582292        36973996   12.44%
local-lvm     lvmthin     active        68513792               0        68513792    0.00%

real    0m1.425s
user    0m1.189s
sys    0m0.201s

Bash:
# vgs
  Skipping global lock: lockspace is starting
  VG                                        #PV #LV #SN Attr   VSize   VFree 
  ceph-2a1fdede-aebc-470a-a3fa-c4577ecbbf56   1   1   0 wz--n-  <1.82t     0 
  pve                                         1   3   0 wz--n- 135.12g 16.00g

Bash:
# dlm_tool status
cluster nodeid 3 quorate 1 ring seq 203 203
daemon now 3656 fence_pid 0
node 1 M add 27 rem 0 fail 0 fence 0 at 0 0
node 2 M add 27 rem 0 fail 0 fence 0 at 0 0
node 3 M add 25 rem 0 fail 0 fence 0 at 0 0

You can find the log file of node3 in the attachments.

Node1:
Bash:
# multipath -ll
mpath0 (3600c0ff000fcbe3d64d6eb6701000000) dm-5 DellEMC,ME5
size=1.8T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 12:0:0:0 sdc 8:32 active ready running

Bash:
# ceph status
  cluster:
    id:     2a211c88-f574-472b-b29a-0a1c4f8549bc
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum prox01,prox02,prox09 (age 24m)
    mgr: prox01(active, since 10d)
    osd: 3 osds: 3 up (since 24m), 3 in (since 22h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 24.91k objects, 90 GiB
    usage:   255 GiB used, 5.2 TiB / 5.5 TiB avail
    pgs:     33 active+clean
 
  io:
    client:   0 B/s rd, 29 KiB/s wr, 0 op/s rd, 5 op/s wr

Bash:
# systemctl status lvm*
● lvmlockd.service - LVM lock daemon
     Loaded: loaded (/lib/systemd/system/lvmlockd.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-05-09 10:05:22 CEST; 1 week 3 days ago
       Docs: man:lvmlockd(8)
   Main PID: 2649 (lvmlockd)
      Tasks: 4 (limit: 154476)
     Memory: 3.0M
        CPU: 53.971s
     CGroup: /system.slice/lvmlockd.service
             └─2649 /sbin/lvmlockd --foreground

May 09 10:05:02 prox01 systemd[1]: Starting lvmlockd.service - LVM lock daemon...
May 09 10:05:22 prox01 lvmlockd[2649]: [D] creating /run/lvm/lvmlockd.socket
May 09 10:05:22 prox01 lvmlockd[2649]: 1746777922 lvmlockd started
May 09 10:05:22 prox01 systemd[1]: Started lvmlockd.service - LVM lock daemon.

● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
     Loaded: loaded (/lib/systemd/system/lvm2-monitor.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2025-05-09 10:04:51 CEST; 1 week 3 days ago
       Docs: man:dmeventd(8)
             man:lvcreate(8)
             man:lvchange(8)
             man:vgchange(8)
   Main PID: 977 (code=exited, status=0/SUCCESS)
        CPU: 16ms

May 09 10:04:50 prox01 lvm[977]:   WARNING: lvmlockd process is not running.
May 09 10:04:50 prox01 lvm[977]:   Reading without shared global lock.
May 09 10:04:50 prox01 lvm[977]:   5 logical volume(s) in volume group "pve" monitored
May 09 10:04:51 prox01 systemd[1]: Finished lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using d>
Notice: journal has been rotated since unit was started, output may be incomplete.

● lvmlocks.service - LVM locking start and stop
     Loaded: loaded (/lib/systemd/system/lvmlocks.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2025-05-09 10:05:23 CEST; 1 week 3 days ago
       Docs: man:lvmlockd(8)
   Main PID: 2652 (code=exited, status=0/SUCCESS)
        CPU: 13ms

May 09 10:05:22 prox01 systemd[1]: Starting lvmlocks.service - LVM locking start and stop...
May 09 10:05:23 prox01 systemd[1]: Finished lvmlocks.service - LVM locking start and stop.

● lvm2-lvmpolld.socket - LVM2 poll daemon socket
     Loaded: loaded (/lib/systemd/system/lvm2-lvmpolld.socket; enabled; preset: enabled)
     Active: active (listening) since Fri 2025-05-09 10:04:50 CEST; 1 week 3 days ago
   Triggers: ● lvm2-lvmpolld.service
       Docs: man:lvmpolld(8)
     Listen: /run/lvm/lvmpolld.socket (Stream)
     CGroup: /system.slice/lvm2-lvmpolld.socket

Notice: journal has been rotated since unit was started, output may be incomplete.

Bash:
# dlm_tool status
cluster nodeid 1 quorate 1 ring seq 203 203
daemon now 945540 fence_pid 0
node 1 M add 17 rem 0 fail 0 fence 0 at 0 0
node 2 M add 876508 rem 536761 fail 0 fence 0 at 0 0
node 3 M add 941902 rem 941646 fail 0 fence 0 at 0 0

If you require further information, I will be happy to provide it.
 

Attachments