OSD ghost

JanoStasik

New Member
Sep 23, 2024
Hello,
I am running a 3-node Proxmox cluster on version 8.3.0, with Ceph 19.2.0 installed on each node. I am running 3 monitors and 3 managers, and the health status is OK. Each node has a Samsung 990 Pro NVMe drive dedicated to a Ceph OSD. No matter what I try, and no matter what order I pick, I always end up with the OSD as a ghost.
I click on Ceph > OSD > Create OSD. The system offers the unused Samsung drive, and I don't touch anything else except Create. The task runs OK with no errors, and after that I can see the created OSD on that page, but the overall page shows that I have osd.0 as a ghost.
What am I doing wrong?

PS: Before I started creating the OSDs, I erased the drive on each node with: ceph-volume lvm zap /dev/nvme0n1 --destroy
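
For reference, the same state should also be visible from the CLI with standard Ceph and systemd commands (read-only, nothing Proxmox-specific), in case anyone wants more output:

ceph osd tree                  # is osd.0 in the CRUSH map, and is it up or down?
ceph-volume lvm list           # did ceph-volume actually create the LVs on /dev/nvme0n1?
systemctl status ceph-osd@0    # is the OSD daemon itself running on the node?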

Attached are screenshots:
create osd - how I create it
log file from the successful task
ceph_after_create - the OSD configuration; by default it is blank, nothing there
gohst_osd - the ghost OSD visible on the dashboard
 

I have the exact same issue; I can't get OSDs up at all. I wonder if this is an issue with Squid?

Edit: I just noticed this in the logs:

root@Instalation01:~# systemctl status ceph-osd@1
× ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Tue 2025-03-11 23:13:29 MST; 10s ago
Duration: 827ms
Process: 19663 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
Process: 19668 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 1 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 19668 (code=exited, status=1/FAILURE)
CPU: 97ms

Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 3.
Mar 11 23:13:29 Instalation01 systemd[1]: Stopped ceph-osd@1.service - Ceph object storage daemon osd.1.
Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Start request repeated too quickly.
Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Failed with result 'exit-code'.
Mar 11 23:13:29 Instalation01 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.
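
Since systemd hit the restart limit ("Start request repeated too quickly"), the failed state has to be cleared before another attempt, and the journal should contain the real error. These are standard systemd commands:

systemctl reset-failed ceph-osd@1
journalctl -u ceph-osd@1 --no-pager -n 50    # the actual ceph-osd error from before the throttle kicked in
systemctl start ceph-osd@1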

Also attached is my log file.

Edit again

root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2025-03-11 23:19:38 MST; 2s ago
Process: 21808 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Process: 21819 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 21819 (code=exited, status=1/FAILURE)
CPU: 99ms
root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Tue 2025-03-11 23:19:48 MST; 679ms ago
Process: 22076 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Main PID: 22099 (ceph-osd)
Tasks: 8
Memory: 11.2M
CPU: 97ms
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
└─22099 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Mar 11 23:19:48 Instalation01 systemd[1]: Starting ceph-osd@0.service - Ceph object storage daemon osd.0...
Mar 11 23:19:48 Instalation01 systemd[1]: Started ceph-osd@0.service - Ceph object storage daemon osd.0.
root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2025-03-11 23:19:49 MST; 5s ago
Process: 22076 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Process: 22099 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 22099 (code=exited, status=1/FAILURE)
CPU: 104ms

Just tried this to no avail; it almost seems like it's a permissions error?
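
If it really were permissions, the ownership of the OSD data directory should show it. A quick read-only check with plain coreutils (ceph runs as uid/gid 64045 on Debian-based systems):

ls -ln /var/lib/ceph/osd/ceph-0/       # everything here should be owned by 64045:64045 (ceph:ceph)
ls -l /var/lib/ceph/osd/ceph-0/block   # the block symlink target must be accessible to the ceph user too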
 

Just found this in my logs when looking at /var/log/ceph/ceph-osd.0.log. It seems it doesn't like that I split my cluster network onto IPv6 and my public network onto IPv4. I found the hint on https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/ ; hopefully this helps.


2025-03-11T23:19:59.496-0700 75e39dbea840 0 set uid:gid to 64045:64045 (ceph:ceph)
2025-03-11T23:19:59.496-0700 75e39dbea840 0 ceph version 19.2.0 (3815e3391b18c593539df6fa952c9f45c37ee4d0) squid (stable), process ceph-osd, pid 22244
2025-03-11T23:19:59.496-0700 75e39dbea840 0 pidfile_write: ignore empty --pid-file
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2025-03-11T23:19:59.498-0700 75e39dbea840 0 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block failed: (22) Invalid argument
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) open size 500103643136 (0x7470800000, 466 GiB) block_size 4096 (4 KiB) rotational device, discard not supported
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 kv_onode 0.04 data 0.06
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) open path /var/lib/ceph/osd/ceph-0/block.db
2025-03-11T23:19:59.499-0700 75e39dbea840 0 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block.db failed: (22) Invalid argument
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) open size 50012880896 (0xba5000000, 47 GiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 47 GiB
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2025-03-11T23:19:59.500-0700 75e39dbea840 0 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block failed: (22) Invalid argument
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) open size 500103643136 (0x7470800000, 466 GiB) block_size 4096 (4 KiB) rotational device, discard not supported
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 466 GiB
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) close
2025-03-11T23:19:59.767-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) close
2025-03-11T23:20:00.012-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) close
2025-03-11T23:20:00.262-0700 75e39dbea840 0 starting osd.0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2025-03-11T23:20:00.262-0700 75e39dbea840 -1 unable to find any IPv6 address in networks '192.168.23.1/24' interfaces ''
2025-03-11T23:20:00.262-0700 75e39dbea840 -1 Failed to pick public address.
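
If the address-family split really is the cause, the fix is presumably to make the bind options in /etc/pve/ceph.conf agree with the configured networks. A sketch of what I mean, not a verified config: ms_bind_ipv4 / ms_bind_ipv6 are standard Ceph options, and the fd00:23::/64 prefix is just a placeholder for the real cluster network:

[global]
    ms_bind_ipv4 = true                 # keep IPv4 for the public network
    ms_bind_ipv6 = true                 # allow IPv6 for the cluster network
    public_network = 192.168.23.0/24    # the IPv4 network from the error above
    cluster_network = fd00:23::/64      # placeholder prefix, substitute the real one

After editing, the OSD service would need a restart (systemctl restart ceph-osd@0) for the change to take effect.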