OSD ghost

JanoStasik

New Member
Sep 23, 2024
Hello,
I am running a 3-node Proxmox cluster on version 8.3.0, with Ceph 19.2.0 installed on each node. I am running 3 monitors and 3 managers, and the health status is OK. Each node has a Samsung 990 Pro NVMe drive dedicated to a Ceph OSD. No matter what I try and no matter what order I pick, I always end up with the OSD as a ghost.
I click on Ceph > OSD > Create OSD. The system offers the unused Samsung drive, and I don't touch anything else before clicking Create. The task runs OK with no errors, but afterwards I can't see the created OSD on that page; only the overview page shows that I have osd.0 as a ghost.
What am I doing wrong?
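For reference, the same step can also be run from the node's shell, which sometimes surfaces errors that the GUI task log does not show (a sketch; the device path /dev/nvme0n1 is taken from the zap command below and may differ on other setups):

# CLI equivalent of Ceph > OSD > Create OSD in the Proxmox GUI
pveceph osd create /dev/nvme0n1

# Afterwards, check whether the new OSD actually registered in the CRUSH map
ceph osd tree
ceph osd df tree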

PS: Before I started to create the OSDs, I erased the drive on each node with: ceph-volume lvm zap /dev/nvme0n1 --destroy
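If it helps, the state of the drive after zapping can be double-checked before the next creation attempt (a sketch, assuming the same /dev/nvme0n1 device):

# No leftover OSD volumes should be listed for the device
ceph-volume lvm list

# The drive should show no partitions or LVM children
lsblk /dev/nvme0n1

# Report, without erasing, any remaining filesystem/LVM signatures
wipefs -n /dev/nvme0n1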

Attached are screenshots:
create osd - how I create it
log file from the successful task
ceph_after_create - the OSD configuration page; by default it is blank, nothing there
gohst_osd - the ghost OSD visible on the dashboard
 


I have the exact same issue too; I can't get OSDs up at all. I wonder if this is an issue with Squid?

Edit: I just noticed this in the logs:

root@Instalation01:~# systemctl status ceph-osd@1
× ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: exit-code) since Tue 2025-03-11 23:13:29 MST; 10s ago
Duration: 827ms
Process: 19663 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
Process: 19668 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 1 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 19668 (code=exited, status=1/FAILURE)
CPU: 97ms

Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 3.
Mar 11 23:13:29 Instalation01 systemd[1]: Stopped ceph-osd@1.service - Ceph object storage daemon osd.1.
Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Start request repeated too quickly.
Mar 11 23:13:29 Instalation01 systemd[1]: ceph-osd@1.service: Failed with result 'exit-code'.
Mar 11 23:13:29 Instalation01 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.

Also attached is my log file.
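In case it is useful, this is roughly how to get past the "start request repeated too quickly" loop and see the actual error (a sketch, assuming osd.1 and the default log path):

# Reset the rate-limit state so the unit can be started again
systemctl reset-failed ceph-osd@1.service

# Recent messages from the unit itself
journalctl -u ceph-osd@1 -n 50 --no-pager

# The OSD's own log usually contains the real reason it exits
tail -n 50 /var/log/ceph/ceph-osd.1.log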

Edit again

root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2025-03-11 23:19:38 MST; 2s ago
Process: 21808 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Process: 21819 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 21819 (code=exited, status=1/FAILURE)
CPU: 99ms
root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Tue 2025-03-11 23:19:48 MST; 679ms ago
Process: 22076 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Main PID: 22099 (ceph-osd)
Tasks: 8
Memory: 11.2M
CPU: 97ms
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
└─22099 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

Mar 11 23:19:48 Instalation01 systemd[1]: Starting ceph-osd@0.service - Ceph object storage daemon osd.0...
Mar 11 23:19:48 Instalation01 systemd[1]: Started ceph-osd@0.service - Ceph object storage daemon osd.0.
root@Instalation01:~# systemctl status ceph-osd@0
ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2025-03-11 23:19:49 MST; 5s ago
Process: 22076 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
Process: 22099 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 22099 (code=exited, status=1/FAILURE)
CPU: 104ms

Just tried this to no avail; it almost seems like it's a permissions error?
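For what it's worth, the permissions theory can be checked directly, and running the daemon in the foreground shows the real reason it exits (a sketch; osd.0 and the default data path are assumed):

# The OSD data directory and its contents should be owned by ceph:ceph
ls -ln /var/lib/ceph/osd/ceph-0
# If ownership is wrong, it can be fixed with:
#   chown -R ceph:ceph /var/lib/ceph/osd/ceph-0

# Clear the restart rate-limit and run the OSD in the foreground,
# logging to stderr, to capture the actual startup error
systemctl reset-failed ceph-osd@0.service
/usr/bin/ceph-osd -d --cluster ceph --id 0 --setuser ceph --setgroup ceph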
 


Just found this in my logs at /var/log/ceph/ceph-osd.0.log. It seems like it doesn't like that I split my cluster network onto IPv6 and my public network onto IPv4. Found this on https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/; hopefully this helps (see the config sketch after the log).


2025-03-11T23:19:59.496-0700 75e39dbea840 0 set uid:gid to 64045:64045 (ceph:ceph)
2025-03-11T23:19:59.496-0700 75e39dbea840 0 ceph version 19.2.0 (3815e3391b18c593539df6fa952c9f45c37ee4d0) squid (stable), process ceph-osd, pid 22244
2025-03-11T23:19:59.496-0700 75e39dbea840 0 pidfile_write: ignore empty --pid-file
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2025-03-11T23:19:59.498-0700 75e39dbea840 0 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block failed: (22) Invalid argument
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) open size 500103643136 (0x7470800000, 466 GiB) block_size 4096 (4 KiB) rotational device, discard not supported
2025-03-11T23:19:59.498-0700 75e39dbea840 1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 kv_onode 0.04 data 0.06
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) open path /var/lib/ceph/osd/ceph-0/block.db
2025-03-11T23:19:59.499-0700 75e39dbea840 0 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block.db failed: (22) Invalid argument
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) open size 50012880896 (0xba5000000, 47 GiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2025-03-11T23:19:59.499-0700 75e39dbea840 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 47 GiB
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2025-03-11T23:19:59.500-0700 75e39dbea840 0 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) ioctl(F_SET_FILE_RW_HINT) on /var/lib/ceph/osd/ceph-0/block failed: (22) Invalid argument
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) open size 500103643136 (0x7470800000, 466 GiB) block_size 4096 (4 KiB) rotational device, discard not supported
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 466 GiB
2025-03-11T23:19:59.500-0700 75e39dbea840 1 bdev(0x57456e03f180 /var/lib/ceph/osd/ceph-0/block.db) close
2025-03-11T23:19:59.767-0700 75e39dbea840 1 bdev(0x57456e03f500 /var/lib/ceph/osd/ceph-0/block) close
2025-03-11T23:20:00.012-0700 75e39dbea840 1 bdev(0x57456e03ee00 /var/lib/ceph/osd/ceph-0/block) close
2025-03-11T23:20:00.262-0700 75e39dbea840 0 starting osd.0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2025-03-11T23:20:00.262-0700 75e39dbea840 -1 unable to find any IPv6 address in networks '192.168.23.1/24' interfaces ''
2025-03-11T23:20:00.262-0700 75e39dbea840 -1 Failed to pick public address.
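A minimal sketch of a consistent network configuration for the error above, keeping everything on IPv4 (the subnets below are placeholders, not taken from this cluster):

# /etc/pve/ceph.conf, [global] section - both networks in one address family
[global]
    public_network  = 192.168.23.0/24
    cluster_network = 192.168.24.0/24
    # If IPv6 binding was enabled earlier, the bind options have to match
    # the chosen family as well (these are the Ceph defaults):
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false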
 
Did you find a solution to this problem? I seem to have the same issue, no IPv6.
Make sure your network is reachable between all the nodes: can they ping each other? Are the private and public IPs set in ceph.conf and corosync.conf? It should just work at that point, as long as you didn't do anything to the files required for authentication, such as the bootstrap OSD key, etc. Also, as stated above, IPv6 works; you just need all Ceph IPs on either IPv4 or IPv6, you cannot split nodes between IPv6 and IPv4. Good luck!
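A rough sketch of those checks from a node's shell (the peer IP is a placeholder):

# Can every node reach the others on the Ceph public/cluster networks?
ping -c 3 192.168.23.12

# Are the public and cluster networks set consistently?
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
cat /etc/pve/corosync.conf

# Is the bootstrap-osd keyring still present and registered?
ls -l /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph auth get client.bootstrap-osd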
 
Thank you so much for your quick response. Yes, all 3 nodes can ping each other, and the Ceph config is pointing to the IPv4 addresses.

It all started during an upgrade: node 2 would not boot, so I had to select a previous kernel to get it to boot, then reinstalled the correct kernel (6.8.12), and now it boots just fine, but the OSDs will not start. I removed them, wiped them, and added them again, and now they just show as ghost OSDs. No matter what I do, OSDs on that node will not join.
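For context, the remove / wipe / re-add cycle described here typically looks something like this (a sketch, assuming osd.0 on /dev/nvme0n1; IDs and device differ per setup):

# Take the OSD out, stop it, then remove it from Proxmox/Ceph
ceph osd out 0
systemctl stop ceph-osd@0.service
pveceph osd destroy 0 --cleanup

# Wipe any leftover LVM/BlueStore metadata from the disk
ceph-volume lvm zap /dev/nvme0n1 --destroy

# Re-create the OSD
pveceph osd create /dev/nvme0n1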
 
Update: We took one of the OSDs from the node with the issues and moved it to another node, and it worked on the first try. So it definitely seems like there's something wrong with that node's software. Tomorrow we'll remove it from the cluster, reinstall, and add it back to the cluster.
 
Removing the node from the cluster, reinstalling Proxmox, and then rejoining the cluster fixed the issue. Something must have been corrupted on that node during the upgrade, a minor upgrade at that. Keep that in mind and make sure you have good backups prior to updating.
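For anyone taking the same route, the removal and rejoin is roughly this (a sketch; 'node2' and the IP are placeholders, and the broken node must stay offline until it has been reinstalled):

# On a remaining cluster node, with the broken node shut down:
pvecm delnode node2

# After reinstalling Proxmox on that machine, run on the reinstalled node,
# pointing at an existing cluster member:
pvecm add 192.168.23.11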

Thank you.