A few days ago, a VM hung on my cluster of 4 servers. Three of the servers have 2 SSDs each for Ceph, 6 SSDs in total. At the time of the problem, Proxmox VE 7.1 was installed and Ceph 16.2.7 was running.
Each node has 3 Ethernet interfaces: two onboard 1Gb ports and one 10Gb PCIe card. The 1Gb interfaces carry VM traffic, the Proxmox management interface and cluster communication; the 10Gb interface was dedicated to Ceph. Separate IPv6 subnets were created for the cluster and for Ceph.
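For reference, the Ceph-facing part of the network config on each node looked roughly like the sketch below; the interface name and the addresses are placeholders, not the real values.
Code:
# /etc/network/interfaces (excerpt) - the 10G interface dedicated to Ceph
auto enp1s0f0
iface enp1s0f0 inet6 static
        address fd00:0:0:100::a/64
# ceph.conf then points public_network / cluster_network at fd00:0:0:100::/64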
When the problem hit, I found that the Ceph monitor showed one of the nodes (node B) as down. I rebooted that node, but after it came back nothing had changed. At this point I apparently made a fatal mistake: since all the VMs were down anyway, I rebooted every node that uses Ceph as VM storage and upgraded the system to the then-current version 7.2-7.
After the upgrade, Ceph stopped working completely and no longer responded to ceph -s.
After a series of tests, I found that all nodes had lost the ability to pass IPv6 packets over the 10G interfaces, while IPv4 packets went through without any restriction. The other interfaces did not have this problem. I also tested IPv6 on the 10G interfaces through a Linux bridge and an OVS bridge, with the same result: only IPv4 worked.
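The tests were essentially of this kind (interface name and addresses are again placeholders):
Code:
ip -6 addr show dev enp1s0f0     # address and link state look normal
ping -6 -c 3 fd00:0:0:100::b     # IPv6 peer on the 10G Ceph subnet - no replies
ping -c 3 192.0.2.11             # IPv4 on the same link - works fine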
Today I gave up hope of restoring the IPv6 network on the 10G interfaces. I made a number of changes and attached the IPv6 subnets intended for Ceph to the 1Gb interfaces. After that, Ceph came back up within a few minutes, but in the monitor I found that 3 of the 6 OSDs had not started, and all of the non-working OSDs were still on the old version. I checked SMART on all SSDs backing the failed OSDs: they all have 3-4% wearout (they are a server-grade series) and no SMART errors. A few minutes later another OSD went DOWN, leaving me with two nodes with non-working OSDs.
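The checks were roughly the following (ceph -s was answering again at this point; /dev/sdb stands for each SSD in turn):
Code:
ceph versions      # which daemons are still on the old release
ceph osd tree      # which OSDs are down and on which node
smartctl -a /dev/sdb | grep -iE 'wear|percent|error'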
I tried to figure out what the problem was. Below is the output of the relevant commands on node-A (the picture is the same on the other nodes).
Code:
root@node-A:~# systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Mon 2022-09-05 15:55:38 EEST; 2h 31min ago
    Process: 5938 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
    Process: 5942 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 5942 (code=killed, signal=ABRT)
        CPU: 1min 5.361s

Sep 05 15:55:28 node-A systemd[1]: ceph-osd@0.service: Consumed 1min 5.361s CPU time.
Sep 05 15:55:38 node-A systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 3.
Sep 05 15:55:38 node-A systemd[1]: Stopped Ceph object storage daemon osd.0.
Sep 05 15:55:38 node-A systemd[1]: ceph-osd@0.service: Consumed 1min 5.361s CPU time.
Sep 05 15:55:38 node-A systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Sep 05 15:55:38 node-A systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
Sep 05 15:55:38 node-A systemd[1]: Failed to start Ceph object storage daemon osd.0.
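The systemctl output only shows that the daemon was killed with SIGABRT; the actual assert/backtrace should be in the journal and in the OSD log:
Code:
journalctl -b -u ceph-osd@0.service --no-pager | tail -n 200
less /var/log/ceph/ceph-osd.0.log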
Code:
root@node-A:~# ceph-volume lvm list
====== osd.0 =======

  [block]       /dev/ceph-1690a548-18f3-4ef6-936c-36dca7faa707/osd-block-2f745f4d-3155-4811-89fd-4fd56902fbae

      block device              /dev/ceph-1690a548-18f3-4ef6-936c-36dca7faa707/osd-block-2f745f4d-3155-4811-89fd-4fd56902fbae
      block uuid                9tbjoJ-uckJ-tOKo-1IeH-ELp1-41hm-RAUSwF
      cephx lockbox secret
      cluster fsid              e3a3a2e8-7d85-432c-8ba3-f98f4e54c96e
      cluster name              ceph
      crush device class        ssd
      encrypted                 0
      osd fsid                  2f745f4d-3155-4811-89fd-4fd56902fbae
      osd id                    0
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdb

====== osd.1 =======

  [block]       /dev/ceph-d73181a4-5030-46f8-abda-5f2e4bc3311a/osd-block-09c04b90-ba59-4bd0-b69f-113828c676cc

      block device              /dev/ceph-d73181a4-5030-46f8-abda-5f2e4bc3311a/osd-block-09c04b90-ba59-4bd0-b69f-113828c676cc
      block uuid                rrcKqT-2zDY-gjD6-Tw3f-YCx0-MNeg-XNIw7j
      cephx lockbox secret
      cluster fsid              e3a3a2e8-7d85-432c-8ba3-f98f4e54c96e
      cluster name              ceph
      crush device class        ssd
      encrypted                 0
      osd fsid                  09c04b90-ba59-4bd0-b69f-113828c676cc
      osd id                    1
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdc
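ceph-volume still sees the LVs, so the data itself seems to be in place. A lower-level check that could presumably give more detail on a failed OSD (with the OSD stopped) would be something like:
Code:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
ceph-volume lvm activate --all     # recreate the tmpfs OSD dirs if any are missing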
Code:
root@node-A:/etc/sysctl.d# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   63G     0   63G   0% /dev
tmpfs                  13G  1.4M   13G   1% /run
/dev/mapper/pve-root   59G   11G   45G  19% /
tmpfs                  63G   66M   63G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
/dev/fuse             128M   52K  128M   1% /etc/pve
tmpfs                  63G   24K   63G   1% /var/lib/ceph/osd/ceph-1
tmpfs                  63G   24K   63G   1% /var/lib/ceph/osd/ceph-0
tmpfs                  13G     0   13G   0% /run/user/0
After some time, I found that all the OSDs had disappeared from the Proxmox web interface.
The ceph-crash service also does not work correctly because its keyring is missing:
Code:
auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.keyring: (2) No such file or directory
Code:
root@node-A:/etc/pve/priv# ls -l
total 4
drwx------ 2 root www-data    0 Jan 18  2022 acme
-rw------- 1 root www-data 1679 Sep  5 09:37 authkey.key
-rw------- 1 root www-data 1573 Feb  9  2022 authorized_keys
drwx------ 2 root www-data    0 Jan 19  2022 ceph
-rw------- 1 root www-data  151 Jan 19  2022 ceph.client.admin.keyring
-rw------- 1 root www-data  228 Jan 19  2022 ceph.mon.keyring
-rw------- 1 root www-data 5074 Feb  9  2022 known_hosts
drwx------ 2 root www-data    0 Jan 18  2022 lock
-rw------- 1 root www-data 3243 Jan 18  2022 pve-root-ca.key
-rw------- 1 root www-data    3 Feb  9  2022 pve-root-ca.srl
drwx------ 2 root www-data    0 Feb 10  2022 storage
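If the crash key still exists in the cluster, the local copy could presumably be restored with something like this (untested):
Code:
ceph auth get client.crash -o /etc/pve/priv/ceph.client.crash.keyring
# or, if the key is gone entirely:
ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' -o /etc/pve/priv/ceph.client.crash.keyring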
At the moment I need to make a backup of one of the VMs; all other services have been successfully restored and are working.
Current Ceph status:
Code:
root@node-A:/etc/pve/priv# ceph -s
  cluster:
    id:     e3a3a2e8-7d85-432c-8ba3-f98f4e54c96e
    health: HEALTH_WARN
            2 osds down
            2 hosts (4 osds) down
            Reduced data availability: 129 pgs inactive, 129 pgs stale
            Degraded data redundancy: 188104/282156 objects degraded (66.667%), 129 pgs degraded, 129 pgs undersized
            36 pgs not deep-scrubbed in time
            14 daemons have recently crashed

  services:
    mon: 4 daemons, quorum node-A,node-B,node-C,node-D (age 3h)
    mgr: node-D(active, since 4h), standbys: node-A
    osd: 6 osds: 2 up (since 3h), 4 in (since 3h)

  data:
    pools:   2 pools, 129 pgs
    objects: 94.05k objects, 357 GiB
    usage:   348 GiB used, 129 GiB / 477 GiB avail
    pgs:     100.000% pgs not active
             188104/282156 objects degraded (66.667%)
             129 stale+undersized+degraded+peered
I would be grateful for any help.