Ceph went haywire after a switch hiccup, and I am trying to figure out what went wrong.
We have 6 Proxmox servers, each with dual 10Gbit network cards for Ceph. All servers are connected to two 10Gbit switches in an active-backup bond.
Each server has 4 SSDs running as Ceph OSDs.
The servers are named Proxmox1-6.
IPs on the 10Gbit network are 10.10.10.11-10.10.10.16 for Proxmox1-6.
The switches are UniFi Switch XG 16s.
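For reference, the bond on each node is defined roughly like this in /etc/network/interfaces (the NIC names are just examples, and each node uses its own 10.10.10.x address):

```
auto bond0
iface bond0 inet static
    address 10.10.10.13/24            # Proxmox3 in this example; each node has its own .1x
    bond-slaves enp65s0f0 enp65s0f1   # the two 10Gbit ports (example NIC names)
    bond-mode active-backup
    bond-miimon 100                   # check link state every 100 ms
    bond-primary enp65s0f0            # preferred port, cabled to switch 1
```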
The problem started when we migrated the switches from one UniFi controller to another.
The switches did not restart and their config did not change, but I assume the config was re-applied, and there was a brief moment without connectivity.
When migrating from one controller to another, the controller IP changes and the same config may get re-pushed from the new controller, but it is still the same config.
I am attaching the Ceph logs from both Proxmox3 and Proxmox6 to give a better perspective.
Tell me if you need any more logs.
The problem started at 08:04.
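In case it helps, I collected the logs roughly like this on each node (the time window is just an example around the incident):

```
# Per-OSD logs from systemd, e.g. OSD 14 on Proxmox3
journalctl -u ceph-osd@14 --since "07:50" --until "10:00" > proxmox3-osd14.log

# The plain log files under /var/log/ceph/ on each node
less /var/log/ceph/ceph-osd.14.log

# The cluster-wide log, taken from a monitor node
less /var/log/ceph/ceph.log
```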
The symptoms were:
- All VMs with disks on Ceph alternated between slow and unresponsive. (Proxmox3 hosts OSDs 14, 8, 3, and 2.)
- Proxmox3 reported OSDs down.
- Ceph tried to rebalance and repair, but it seemed slow/stuck (I was watching it with the commands shown below).
- 3 of the 4 OSDs on Proxmox3 were automatically marked Down and Out.
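For context, this is roughly what I was running to watch the cluster while it was happening (standard status commands, output not included here):

```
ceph -s              # overall health, how many OSDs are up/in, recovery progress
ceph health detail   # which OSDs/PGs are flagged and why
ceph osd tree        # which OSDs are down, and on which host
```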
I started migrating VMs off Proxmox3 so I could reboot it, which took time.
After rebooting Proxmox3, Ceph got back on track, and all OSDs were automatically In and Up again.
Here is the concerning part: many of the VMs showed errors about a corrupt OS and were stuck.
This was not just the usual I/O wait you get while Ceph is unavailable and rebalancing.
I had to reboot several VMs to get them back up.