Hi guys, I have a rather strange problem with my current Proxmox configuration.
The status of 2 of the 3 nodes always changes to unknown about 3 minutes after I restart a node; during those first 3 minutes the status is still shown as online. The node I restarted itself keeps working fine.
Does anyone know what I have done wrong? I would be very grateful if I could finally solve the problem.
General Information:
- 3 nodes in a cluster (don't let the names confuse you: node3 is called prox09)
- Ceph cluster (the storage is a SAN, connected via multipath)
What I have gathered so far:
I tried running
time pvesm status
on every node, but I only get a response on the server with the lowest uptime; on the other two the command never completes. The same happens with the vgs command.
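For reference, a quick way to repeat the timed check on all three nodes is a small loop like this (just a sketch; it assumes passwordless root SSH between the nodes, with the hostnames from the Ceph quorum):
Bash:
# Time the storage status query and the LVM scan on each node.
for host in prox01 prox02 prox09; do
    echo "== $host =="
    ssh "root@$host" 'time pvesm status; time vgs'
done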
The warning "lvm[977]: WARNING: lvmlockd process is not running." is quite interesting to me, but I can see lvmlockd starting a few seconds after that message (see the lvm status on node1 below).
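To double-check that timing, the boot journal can be shown for both units side by side (plain journalctl, so nothing here is specific to my setup):
Bash:
# Show the interleaved startup messages of lvm2-monitor and lvmlockd for the current boot.
journalctl -b -u lvm2-monitor.service -u lvmlockd.service --no-pager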
I found this message on node1 when node3 booted up the first time.
May 20 07:38:51 prox01 pvestatd[2136]: status update time (65131.801 seconds)
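65131 seconds is roughly 18 hours, so that status update apparently hung for about that long. For completeness, this is roughly how such lines can be pulled out of node1's journal (plain journalctl and grep; the time window is just an example):
Bash:
# Look for slow pvestatd status updates around the time node3 came back up.
journalctl -u pvestatd --since "2025-05-20 07:00" | grep 'status update time'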
In this example I rebooted node3.
On node3:
Bash:
# time pvesm status
Skipping global lock: lockspace is starting
Skipping global lock: lockspace is starting
Name Type Status Total Used Available %
ceph rbd active 1855337117 89429661 1765907456 4.82%
local dir active 44867864 5582292 36973996 12.44%
local-lvm lvmthin active 68513792 0 68513792 0.00%
real 0m1.425s
user 0m1.189s
sys 0m0.201s
Bash:
# vgs
Skipping global lock: lockspace is starting
VG #PV #LV #SN Attr VSize VFree
ceph-2a1fdede-aebc-470a-a3fa-c4577ecbbf56 1 1 0 wz--n- <1.82t 0
pve 1 3 0 wz--n- 135.12g 16.00g
Bash:
# dlm_tool status
cluster nodeid 3 quorate 1 ring seq 203 203
daemon now 3656 fence_pid 0
node 1 M add 27 rem 0 fail 0 fence 0 at 0 0
node 2 M add 27 rem 0 fail 0 fence 0 at 0 0
node 3 M add 25 rem 0 fail 0 fence 0 at 0 0
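If it helps, I can also post the lock state as seen by dlm and lvmlockd; as far as I know, these standard tools should show it:
Bash:
# List the DLM lockspaces and their members on this node.
dlm_tool ls
# Show lvmlockd's view of its lockspaces and held locks.
lvmlockctl --info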
You can find the log file of node3 in the attachments.
On node1:
Bash:
# multipath -ll
mpath0 (3600c0ff000fcbe3d64d6eb6701000000) dm-5 DellEMC,ME5
size=1.8T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 12:0:0:0 sdc 8:32 active ready running
Bash:
# ceph status
cluster:
id: 2a211c88-f574-472b-b29a-0a1c4f8549bc
health: HEALTH_OK
services:
mon: 3 daemons, quorum prox01,prox02,prox09 (age 24m)
mgr: prox01(active, since 10d)
osd: 3 osds: 3 up (since 24m), 3 in (since 22h)
data:
pools: 2 pools, 33 pgs
objects: 24.91k objects, 90 GiB
usage: 255 GiB used, 5.2 TiB / 5.5 TiB avail
pgs: 33 active+clean
io:
client: 0 B/s rd, 29 KiB/s wr, 0 op/s rd, 5 op/s wr
Bash:
# systemctl status lvm*
● lvmlockd.service - LVM lock daemon
Loaded: loaded (/lib/systemd/system/lvmlockd.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-05-09 10:05:22 CEST; 1 week 3 days ago
Docs: man:lvmlockd(8)
Main PID: 2649 (lvmlockd)
Tasks: 4 (limit: 154476)
Memory: 3.0M
CPU: 53.971s
CGroup: /system.slice/lvmlockd.service
└─2649 /sbin/lvmlockd --foreground
May 09 10:05:02 prox01 systemd[1]: Starting lvmlockd.service - LVM lock daemon...
May 09 10:05:22 prox01 lvmlockd[2649]: [D] creating /run/lvm/lvmlockd.socket
May 09 10:05:22 prox01 lvmlockd[2649]: 1746777922 lvmlockd started
May 09 10:05:22 prox01 systemd[1]: Started lvmlockd.service - LVM lock daemon.
● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
Loaded: loaded (/lib/systemd/system/lvm2-monitor.service; enabled; preset: enabled)
Active: active (exited) since Fri 2025-05-09 10:04:51 CEST; 1 week 3 days ago
Docs: man:dmeventd(8)
man:lvcreate(8)
man:lvchange(8)
man:vgchange(8)
Main PID: 977 (code=exited, status=0/SUCCESS)
CPU: 16ms
May 09 10:04:50 prox01 lvm[977]: WARNING: lvmlockd process is not running.
May 09 10:04:50 prox01 lvm[977]: Reading without shared global lock.
May 09 10:04:50 prox01 lvm[977]: 5 logical volume(s) in volume group "pve" monitored
May 09 10:04:51 prox01 systemd[1]: Finished lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using d>
Notice: journal has been rotated since unit was started, output may be incomplete.
● lvmlocks.service - LVM locking start and stop
Loaded: loaded (/lib/systemd/system/lvmlocks.service; enabled; preset: enabled)
Active: active (exited) since Fri 2025-05-09 10:05:23 CEST; 1 week 3 days ago
Docs: man:lvmlockd(8)
Main PID: 2652 (code=exited, status=0/SUCCESS)
CPU: 13ms
May 09 10:05:22 prox01 systemd[1]: Starting lvmlocks.service - LVM locking start and stop...
May 09 10:05:23 prox01 systemd[1]: Finished lvmlocks.service - LVM locking start and stop.
● lvm2-lvmpolld.socket - LVM2 poll daemon socket
Loaded: loaded (/lib/systemd/system/lvm2-lvmpolld.socket; enabled; preset: enabled)
Active: active (listening) since Fri 2025-05-09 10:04:50 CEST; 1 week 3 days ago
Triggers: ● lvm2-lvmpolld.service
Docs: man:lvmpolld(8)
Listen: /run/lvm/lvmpolld.socket (Stream)
CGroup: /system.slice/lvm2-lvmpolld.socket
Notice: journal has been rotated since unit was started, output may be incomplete.
Bash:
# dlm_tool status
cluster nodeid 1 quorate 1 ring seq 203 203
daemon now 945540 fence_pid 0
node 1 M add 17 rem 0 fail 0 fence 0 at 0 0
node 2 M add 876508 rem 536761 fail 0 fence 0 at 0 0
node 3 M add 941902 rem 941646 fail 0 fence 0 at 0 0
If you require any further information, I will be happy to provide it.