Proxmox 6.0 cluster ceph issue

mlevings

New Member
I have a new cluster of 3 nodes. All are identical hardware (Dell R710 with an H200 in IT mode, 2 SSDs, and 4 x 4TB spinning disks). I have Ceph running on all 3 nodes with 4 OSDs per node (spinning disks only). Everything is fine except that all the OSDs on one host go down after a short amount of time. I can select them and tell them to start, and they come back online and the cluster becomes healthy again, but they eventually go down again. I reinstalled the whole cluster from scratch to try to fix it, but I'm having the exact same issue. Any ideas as to what to look for?
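For reference, what I'm doing in the GUI is roughly the CLI equivalent of the following (osd.7 is just a placeholder ID, not necessarily one of the affected OSDs):

# see which OSDs are currently marked down
ceph osd tree | grep -i down

# start the stopped OSD service again on the affected node
systemctl start ceph-osd@7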
 
Is the Ceph process stopping? Is anything showing in the logs?

If you run "service ceph-osd@# status", what does it show? (# being the OSD ID.)
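Something along these lines should show the unit state plus the most recent log entries (7 is only an example ID, and the journalctl call assumes the unit's output is going to the systemd journal):

# state of the per-OSD systemd unit
systemctl status ceph-osd@7

# recent journal entries for that unit
journalctl -u ceph-osd@7 --since "1 hour ago"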
 
Here is what I am getting from that. I see something about vmbr1 in the log, but I don't think I'm losing any packets on that interface. It's a 10Gb port, and I set up a continuous ping for troubleshooting and never lost a packet.


root@ProxmoxH02:~# service ceph-osd@7 status
ceph-osd@7.service - Ceph object storage daemon osd.7
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
           └─ceph-after-pve-cluster.conf
   Active: inactive (dead) since Wed 2019-09-25 13:55:51 CDT; 17h ago
  Process: 1297588 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 7 (code=exited, status=0/SUCCESS)
  Process: 1297597 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 7 --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
 Main PID: 1297597 (code=exited, status=0/SUCCESS)

Sep 25 13:55:49 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:49.583 7f5e564e4700 -1 osd.7 919 set_numa_affinity unable to identify public interface 'vmbr1' numa node: (2) No such file or directory
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.727 7f5e5971c700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.727 7f5e5971c700 -1 osd.7 921 *** Got signal Interrupt ***
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.3c has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.24 has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.2 has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.4 has ref count of 2
Sep 25 13:55:51 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:51.552 7f5e5d1edf80 -1 leaked refs:
Sep 25 13:55:51 ProxmoxH02 ceph-osd[1297597]: dump_weak_refs 0x5584321a4960 weak_refs: 921 = 0x55843833a000 with 4 refs
Sep 25 13:55:51 ProxmoxH02 systemd[1]: ceph-osd@7.service: Succeeded.
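For reference, the continuous ping I mentioned was nothing fancier than something like this against the other nodes' addresses on the 10Gb network (the address below is only a placeholder), and the summary never showed any loss:

# placeholder peer address on the vmbr1 / 10Gb network
ping -i 0.2 10.10.10.3
# check the "packet loss" figure in the summary after stopping it with Ctrl-C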
 
You can ignore the vmbr1 line.

Can you get more lines of the log from the log file in /var/log?
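Assuming the default log locations on the affected node, something like this should give more context around the time osd.7 stopped (the timestamps are just taken from the status output above):

# default per-OSD log file written by Ceph
tail -n 200 /var/log/ceph/ceph-osd.7.log

# or the systemd journal for the unit around the time it went down
journalctl -u ceph-osd@7 --since "2019-09-25 13:50" --until "2019-09-25 14:00"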
 
