Hi,
I have a 5-node Ceph cluster and one node is acting up.
Right now it's not in production use, still testing.
I installed everything yesterday.
Tested it and everything seemed stable, apart from some random stuttering here and there. I didn't think much of it.
updated packages
set a local time server
created bonds for networking
created cluster
installed Ceph
Everything is up and running.
Today my first node started behaving strangely.
On the management network I have packet loss of about 17-25%. The bond mode is balance-alb. I disabled the bond and tested the adapters individually, but I still get the same loss on both cards. (This is the random stuttering I saw in the console earlier. It always seems to happen when some packets are not getting through.)
I have a Broadcom dual-port 25 Gbit/s adapter installed. One port does not work anymore. I can ping it locally (so my networking seems to know about the adapter), but I cannot reach any other device via ping. The other port works just fine.
Switching the cables makes the other network unavailable, so the adapter itself is certainly causing that problem.
But for the network issues in general, I think that's more of a software problem.
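To put a number on the loss for each port separately (with the bond disabled), something like the following could be used. This is only a sketch: `loss_percent` is a helper name made up here, and the interface name and target address in the usage comment are placeholders, not values from my setup. It assumes the usual iputils `ping` summary line containing "% packet loss".

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print only the
# packet-loss percentage from the summary line (e.g. "20" or "16.6667").
loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (placeholders: replace eno1 and 192.0.2.1 with a real port and a
# host on the management network):
#   ping -c 100 -I eno1 192.0.2.1 | loss_percent
```

Running it once per physical port should show whether both ports lose a similar share of packets or only one of them does.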
To reinstall, I would have to purge Ceph, delete the node from the cluster, reinstall it, rejoin it, and set up Ceph again.
I just don't want to reinstall everything only to find out whether it is the OS or not.
What can I do?
root@pve1:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3