Ceph not starting after node failure

moviedo

New Member
Aug 25, 2024
Hi, I have a 3-node PVE cluster running Ceph with 3 OSDs (1 per server) and 2 monitors (pve1, pve3).

My 3rd server (pve3) died, and now Ceph is down. The OSDs on pve1 and pve2 are up, but nothing shows up in the GUI.

I would like to understand why Ceph failed when just 1 node went down, and also to know whether it is possible to restore Ceph and/or recover the VM disks that were stored in it.

"ceph -s" hangs.

root@pve1:/var/lib/ceph/mon# pvecm status
Cluster information
-------------------
Name: cluster-pve
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Sun Aug 25 13:33:27 2024
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.aa
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.30.11 (local)
0x00000003 1 192.168.30.12

root@pve1:/var/lib/ceph/mon# systemctl status ceph-mon.target
● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; preset: enabled)
Active: active since Fri 2024-08-23 16:53:23 CDT; 1 day 20h ago

root@pve1:/var/lib/ceph/mon# systemctl status ceph-osd@0.service
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Sun 2024-08-25 13:28:13 CDT; 1s ago
Process: 629343 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Main PID: 629347 (ceph-osd)
Tasks: 9
Memory: 11.0M
CPU: 33ms
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
└─629347 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

root@pve1:/var/lib/ceph/mon# systemctl status ceph-mgr.target
● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; preset: enabled)
Active: active since Fri 2024-08-23 16:53:23 CDT; 1 day 20h ago

root@pve1:/var/lib/ceph/mon# systemctl status ceph-volume@lvm-0-1d18f9e4-119e-43f3-b8a1-e4bc78ae9966.service
ceph-volume@lvm-0-1d18f9e4-119e-43f3-b8a1-e4bc78ae9966.service - Ceph Volume activation: lvm-0-1d18f9e4-119e-43f3-b8a1-e4bc78ae9966
Loaded: loaded (/lib/systemd/system/ceph-volume@.service; enabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-volume@.service.d
└─ceph-after-pve-cluster.conf
Active: inactive (dead) since Fri 2024-08-23 16:53:24 CDT; 1 day 20h ago
Main PID: 1339 (code=exited, status=0/SUCCESS)
CPU: 217ms

root@pve1:/var/lib/ceph/mon# systemctl status ceph-mgr@pve1.service
× ceph-mgr@pve1.service - Ceph cluster manager daemon
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Sun 2024-08-25 13:12:50 CDT; 17min ago
Duration: 29ms
Process: 624899 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id pve1 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 624899 (code=exited, status=1/FAILURE)
CPU: 29ms
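
(The mgr exiting immediately is most likely just a side effect of the missing monitor quorum rather than a problem of its own; the exact reason should show up in its journal, e.g.:

journalctl -u ceph-mgr@pve1.service -n 50 --no-pager
)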


root@pve1:/var/lib/ceph/mon# ceph-volume lvm activate --all
--> OSD ID 0 FSID 1d18f9e4-119e-43f3-b8a1-e4bc78ae9966 process is active. Skipping activation

root@pve2:~# ceph-volume lvm activate --all
--> OSD ID 1 FSID 848fb3c5-8121-409f-9972-2df69f171074 process is active. Skipping activation

root@pve2:~# systemctl status ceph-osd@1
ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Sun 2024-08-25 13:31:15 CDT; 24s ago
Process: 603630 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
Main PID: 603634 (ceph-osd)
Tasks: 9
Memory: 10.9M
CPU: 44ms
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
└─603634 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
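
(Side note: the OSD daemons showing "active (running)" does not mean they can serve data; without monitor quorum they just wait. Their own view can be checked through the admin socket, e.g. on pve2:

ceph daemon osd.1 status

which will typically report a state like "booting" or "preboot" rather than "active" while there is no monitor quorum.)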


Any help will be highly appreciated!
 
Ceph is reacting exactly as it should.

You have only configured 2 monitors, and it is the monitors that form the quorum for the Ceph cluster. To keep the cluster online, more than 50% of the monitors must be available; in your case only one of the two is still online, so the cluster is down. (Note that the "Quorate: Yes" in your pvecm output refers to the corosync/PVE cluster quorum, which is separate from the Ceph monitor quorum; see the numbers below.)
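
To put numbers on it (with N monitors in the monmap, quorum needs a majority, i.e. floor(N/2) + 1):

2 monitors (pve1, pve3): majority = 2 -> with pve3 dead only 1 of 2 is left, no quorum, cluster I/O blocks
3 monitors (pve1, pve2, pve3): majority = 2 -> any single node can fail and quorum survives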

You should bring pve3 back online and then also configure a monitor on pve2. With three monitors you can then shut down any single server and everything stays online; a sketch of the steps follows below.
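
A rough sketch of what that can look like (assuming a standard PVE-managed Ceph setup). Once pve3 is back and the monitors have quorum again, create the third monitor on pve2:

pveceph mon create

If pve3 turns out to be unrecoverable, the upstream Ceph docs ("Removing Monitors from an Unhealthy Cluster") describe letting the surviving monitor form quorum alone by removing the dead one from its monmap. Very roughly, on pve1, after backing up /var/lib/ceph/mon (and ideally running the extract/inject steps as the ceph user so file ownership stays intact):

systemctl stop ceph-mon@pve1
ceph-mon -i pve1 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --print
monmaptool /tmp/monmap --rm pve3
ceph-mon -i pve1 --inject-monmap /tmp/monmap
systemctl start ceph-mon@pve1

After that, pve1 alone should form quorum, "ceph -s" should answer again, and additional monitors can then be created on pve2 (and later on a rebuilt pve3) with pveceph mon create.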
 
