Ceph went haywire after a switch hiccup

potetpro · Nov 9, 2022

Ceph went haywire after a switch hiccup, and i am trying to figure out what went wrong.

So we have 6 Proxmox servers. All with double 10Gbit network cards for Ceph. All servers are connected to two 10Gbit Switches in Active-Backup mode.

Each of have 4 SSDs running as Ceph OSDs

The servers are named Proxmox1-6
IPs on the 10Gbit network are 10.10.10.11-10.10.10.16 for Proxmox 1-6

The Switches are Unifi Switch XG 16.

The problem started when we migrated the switches from one unifi controller to another.
The Switches did not restart, and they did not change their config, but i guess the config was reloaded, and there was a brief moment without contact.
When migrating from one controller to another the controller-IP is changed, and the same config might get reloaded from the new controller, but its the same config.

I am adding the Ceph logs. From both Proxmox3 and Proxmox6 to get a better perspective.
Tell me if you need any more logs.
The problem started 08:04

The symptoms was:
-All VMs having disks on Ceph were switching between slow and not responding. (Proxmox3 has OSD: 14, 8, 3, and 2.)
-Proxmox3 reported OSDs down.
-Ceph tried to re-balance and repair but seem to be slow/stuck
-3 of 4 OSDs on Proxmox3 was auto-marked as Down and Out.

I started migrating VMs from Proxmox3 to reboot it, this took time.

After the reboot of Proxmox3, Ceph managed to get back on track, and all OSDs where automatically IN and UP.

Here is the concerning part. Many of the VMs has error messages of corrupt OS, and stuck.
And not just the default IOwait when Ceph is unavailable due to re-balancing.
Had to reboot serveral VMs to get them back up.

potetpro · Nov 9, 2022

One of our customers are now reporting corrupt documents on their VM
The corrupt files are not even on Ceph, but on a VM Disk located on Gluster connected to the VM, so this is strange, and not good

aaron · Nov 9, 2022

The big question is, in what state the network was. If the connections between the nodes has been broken / partitioned in a way, where nodes have a different view of which nodes they can talk to, Ceph will react in very unexpected ways.
OSDs might be shown as down for some nodes, while they might be up for others, and such things.
The result can be, that the pools will become either slow or completely IO blocked. The guests will be seeing IO problems as a result.

To get into a similar situation in a small (test)cluster, one can set up a 3 node cluster with a full-mesh network. By either using broadcast or the simple routed variant, you can trigger such behavior by removing the connection between two nodes. As these variants have no method to establish a fallback, the two nodes where you cut the network connection cannot see each other anymore, but they can still communicate with the 3rd node.

potetpro said:
but on a VM Disk located on Gluster connected to the VM

I am not an expert with gluster, but is gluster using the same switches? Could it be that this also affected gluster's synchronization/replication?

potetpro · Nov 9, 2022

aaron said:
The big question is, in what state the network was. If the connections between the nodes has been broken / partitioned in a way, where nodes have a different view of which nodes they can talk to, Ceph will react in very unexpected ways.
OSDs might be shown as down for some nodes, while they might be up for others, and such things.
The result can be, that the pools will become either slow or completely IO blocked. The guests will be seeing IO problems as a result.

To get into a similar situation in a small (test)cluster, one can set up a 3 node cluster with a full-mesh network. By either using broadcast or the simple routed variant, you can trigger such behavior by removing the connection between two nodes. As these variants have no method to establish a fallback, the two nodes where you cut the network connection cannot see each other anymore, but they can still communicate with the 3rd node.

I am not an expert with gluster, but is gluster using the same switches? Could it be that this also affected gluster's synchronization/replication?

The network was in perfect working condition.
Yes Gluster is connected to the same switches.
[2022-11-09 07:59:12.174334 +0000] I [MSGID: 108026] [afr-self-heal-data.c:346:afr_selfheal_data_do] 0-gluster1-replicate-0: performing data selfheal on a0c00a7f-3d0a-441f-82de-2a99a44892be
[2022-11-09 10:11:39.739150 +0000] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-gluster1-replicate-0: Completed data selfheal on a0c00a7f-3d0a-441f-82de-2a99a44892be. sources=[1] sinks=0 2
[2022-11-09 10:11:40.244931 +0000] I [MSGID: 108026] [afr-self-heal-data.c:346:afr_selfheal_data_do] 0-gluster1-replicate-0: performing data selfheal on 8b635352-2367-4671-830d-397591cc1b26
[2022-11-09 10:56:08.651101 +0000] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-gluster1-replicate-0: Completed data selfheal on 8b635352-2367-4671-830d-397591cc1b26. sources=[1] sinks=0 2

This might be affected as well.

I think the panic is over now.
First problems with the corrupt files.
Then after a reboot the disk is gone.
Then after another reboot the disk is back, but the customer cannot save files to the volume
Then after removing read-only attribute from the volume, things are back to normal.

I Hope this is the end of it.

Search

Search

Ceph went haywire after a switch hiccup

potetpro

Member

Attachments

potetpro

Member

aaron

Proxmox Staff Member

potetpro

Member

We value your privacy