One of my hosts keeps showing as grey with a '?', and backups are not running (the job freezes on an LXC container indefinitely). A reboot fixes the issue for a while, possibly until the next backup job, but I haven't confirmed that. I made some changes to Ceph on x.x.x.212 a couple of days ago, but all other nodes are fine. The cluster and Ceph are healthy except for this one node. (Status and logs below.)
Log of errors:
journalctl -xe
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000ef285663 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd4 (1)192.168.1.212:6819 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000f2a8c138 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000ef285663 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd4 (1)192.168.1.212:6819 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000f2a8c138 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000ef285663 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd4 (1)192.168.1.212:6819 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000f2a8c138 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000ef285663 signature check failed
Sep 08 16:58:31 acemagic-1 kernel: libceph: osd4 (1)192.168.1.212:6819 bad crc/signature
Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000f2a8c138 signature check failed
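To make the log above easier to read at a glance, here is a minimal sketch that tallies the "bad crc/signature" kernel messages per OSD and source address. The function name count_bad_crc and the regular expression are my own, written against the line format shown above; they are not part of any Proxmox or Ceph tooling, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches kernel lines in the format seen in this post, e.g.
#   libceph: osd3 (1)192.168.1.212:6811 bad crc/signature
# Assumption: other kernel versions may format this message differently.
BAD_CRC = re.compile(
    r"libceph: (osd\d+) \(\d+\)(\d+\.\d+\.\d+\.\d+):\d+ bad crc/signature"
)


def count_bad_crc(lines):
    """Count bad crc/signature messages, keyed by (osd, ip)."""
    counts = Counter()
    for line in lines:
        match = BAD_CRC.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample lines taken from the journal output above.
sample = [
    "Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature",
    "Sep 08 16:58:31 acemagic-1 kernel: libceph: read_partial_message 00000000ef285663 signature check failed",
    "Sep 08 16:58:31 acemagic-1 kernel: libceph: osd4 (1)192.168.1.212:6819 bad crc/signature",
    "Sep 08 16:58:31 acemagic-1 kernel: libceph: osd3 (1)192.168.1.212:6811 bad crc/signature",
]

for (osd, ip), n in count_bad_crc(sample).most_common():
    print(f"{osd} at {ip}: {n} bad crc/signature message(s)")
```

To run it against the real journal, the same function can be fed line by line from the output of journalctl -k (kernel messages only), for example by passing sys.stdin instead of the sample list. In the excerpt above, every match points at osd3 and osd4 on 192.168.1.212, which is the node where I changed the Ceph configuration.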
Code:
root@acemagic-1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Sat 2024-09-07 11:11:22 EDT; 1 day 5h ago
Main PID: 1172 (pmxcfs)
Tasks: 8 (limit: 38096)
Memory: 82.7M
CPU: 2min 58.255s
CGroup: /system.slice/pve-cluster.service
└─1172 /usr/bin/pmxcfs
Sep 08 16:51:08 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:24 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:28 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:28 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:42 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:50 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:50 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:51:52 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:54:16 acemagic-1 pmxcfs[1172]: [status] notice: received log
Sep 08 16:54:16 acemagic-1 pmxcfs[1172]: [status] notice: received log
root@acemagic-1:~# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
Active: active (running) since Sat 2024-09-07 11:11:24 EDT; 1 day 5h ago
Main PID: 1407 (pvedaemon)
Tasks: 9 (limit: 38096)
Memory: 207.6M
CPU: 50.662s
CGroup: /system.slice/pvedaemon.service
├─ 1407 pvedaemon
├─531828 "pvedaemon worker"
├─641523 "pvedaemon worker"
├─649924 "pvedaemon worker"
├─689458 "task UPID:acemagic-1:000A8532:00895FB8:66DDCCBA:vzstart:102:root@pam:"
├─689463 lxc-info -n 102 -p
├─689468 lxc-info -n 102 -p
├─689692 "task UPID:acemagic-1:000A861C:0089858D:66DDCD1B:vzstart:110:root@pam:"
└─689738 lxc-info -n 110 -p
Notice: journal has been rotated since unit was started, output may be incomplete.
root@acemagic-1:~# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Sat 2024-09-07 11:11:23 EDT; 1 day 5h ago
Process: 691073 ExecReload=/usr/bin/pvestatd restart (code=exited, status=0/SUCCESS)
Main PID: 1362 (pvestatd)
Tasks: 2 (limit: 38096)
Memory: 157.2M
CPU: 1h 45min 6.245s
CGroup: /system.slice/pvestatd.service
├─ 1362 pvestatd
└─689488 lxc-info -n 102 -p
Notice: journal has been rotated since unit was started, output may be incomplete.
root@acemagic-1:~# pvecm status
Cluster information
-------------------
Name: prmx-cluster-1
Config Version: 15
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Sep 8 16:57:23 2024
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000001
Ring ID: 1.1c2
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.210 (local)
0x00000002 1 192.168.1.211
0x00000003 1 192.168.1.212
0x00000004 1 192.168.1.213
0x00000005 1 192.168.1.214
root@acemagic-1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 15.54070 root default
-5 1.25078 host acemagic-1
1 ssd 0.31929 osd.1 up 1.00000 1.00000
2 ssd 0.93149 osd.2 up 1.00000 1.00000
-7 2.18228 host acemagic-2
5 ssd 1.71649 osd.5 up 1.00000 1.00000
6 ssd 0.46579 osd.6 up 1.00000 1.00000
-3 3.16257 host minif-1
0 nvme 0.36809 osd.0 up 1.00000 1.00000
3 ssd 0.93149 osd.3 up 1.00000 1.00000
4 ssd 1.86299 osd.4 up 1.00000 1.00000
-9 8.94507 host pmox-5700g
7 ssd 3.63869 osd.7 up 1.00000 1.00000
8 ssd 3.63869 osd.8 up 1.00000 1.00000
10 ssd 1.66769 osd.10 up 1.00000 1.00000
root@acemagic-1:~# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 minif-1 68.3G 308G 0 819 0 0 exists,up
1 acemagic-1 51.0G 275G 0 0 0 0 exists,up
2 acemagic-1 203G 749G 1 10.3k 0 0 exists,up
3 minif-1 144G 809G 0 819 0 0 exists,up
4 minif-1 262G 1644G 2 12.0k 291 1164k exists,up
5 acemagic-2 338G 1419G 1 8192 0 0 exists,up
6 acemagic-2 92.5G 384G 0 1638 0 0 exists,up
7 pmox-5700g 261G 3465G 3 71.1k 1 0 exists,up
8 pmox-5700g 204G 3521G 3 15.1k 1 0 exists,up
10 pmox-5700g 114G 1593G 0 5734 0 0 exists,up
root@acemagic-1:~# ceph health detail
HEALTH_OK
root@acemagic-1:~#