Proxmox 6.0 cluster ceph issue

mlevings

New Member
I have a new cluster of 3 nodes. All are identical hardware (Dell R710 with an H200 in IT mode, 2 SSDs, and 4 x 4TB spinning disks). I have Ceph running on all 3 nodes with 4 OSDs per node (spinning disks only). Everything is fine except that all the OSDs on one host go down after a short amount of time. I can select them and tell them to start, and they come back online and the cluster becomes healthy again, but they eventually go down again. I reinstalled the whole cluster from scratch to try to fix it, but I'm having the exact same issue. Any ideas as to what to look for?
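For reference, what I'm doing in the GUI is roughly the CLI equivalent of the following (osd.7 is just a placeholder ID, not necessarily one of the affected OSDs):

# see which OSDs are currently marked down
ceph osd tree | grep -i down

# start the stopped OSD service again on the affected node
systemctl start ceph-osd@7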
 
Is the Ceph process stopping? Is anything showing in the logs?

If you run "service ceph-osd@# status", what does it show? (# being the OSD ID.)
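Something along these lines should show the unit state plus the most recent log entries (7 is only an example ID, and the journalctl call assumes the unit's output is going to the systemd journal):

# state of the per-OSD systemd unit
systemctl status ceph-osd@7

# recent journal entries for that unit
journalctl -u ceph-osd@7 --since "1 hour ago"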
 
Here is what I am getting from that. I see something about vmbr1 in the log, but I don't think I'm losing any packets on that interface. It's a 10Gb port, and I set up a continuous ping for troubleshooting and never lost a packet.


root@ProxmoxH02:~# service ceph-osd@7 status
ceph-osd@7.service - Ceph object storage daemon osd.7
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
           └─ceph-after-pve-cluster.conf
   Active: inactive (dead) since Wed 2019-09-25 13:55:51 CDT; 17h ago
  Process: 1297588 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 7 (code=exited, status=0/SUCCESS)
  Process: 1297597 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 7 --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
 Main PID: 1297597 (code=exited, status=0/SUCCESS)

Sep 25 13:55:49 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:49.583 7f5e564e4700 -1 osd.7 919 set_numa_affinity unable to identify public interface 'vmbr1' numa node: (2) No such file or directory
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.727 7f5e5971c700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.727 7f5e5971c700 -1 osd.7 921 *** Got signal Interrupt ***
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.3c has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.24 has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.2 has ref count of 2
Sep 25 13:55:50 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:50.755 7f5e5971c700 -1 osd.7 921 pgid 1.4 has ref count of 2
Sep 25 13:55:51 ProxmoxH02 ceph-osd[1297597]: 2019-09-25 13:55:51.552 7f5e5d1edf80 -1 leaked refs:
Sep 25 13:55:51 ProxmoxH02 ceph-osd[1297597]: dump_weak_refs 0x5584321a4960 weak_refs: 921 = 0x55843833a000 with 4 refs
Sep 25 13:55:51 ProxmoxH02 systemd[1]: ceph-osd@7.service: Succeeded.
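For reference, the continuous ping I mentioned was nothing fancier than something like this against the other nodes' addresses on the 10Gb network (the address below is only a placeholder), and the summary never showed any loss:

# placeholder peer address on the vmbr1 / 10Gb network
ping -i 0.2 10.10.10.3
# check the "packet loss" figure in the summary after stopping it with Ctrl-C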
 
You can ignore the vmbr1 line.

Can you get more lines of the log from the log file in /var/log?
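Assuming the default log locations on the affected node, something like this should give more context around the time osd.7 stopped (the timestamps are just taken from the status output above):

# default per-OSD log file written by Ceph
tail -n 200 /var/log/ceph/ceph-osd.7.log

# or the systemd journal for the unit around the time it went down
journalctl -u ceph-osd@7 --since "2019-09-25 13:50" --until "2019-09-25 14:00"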
 
