Hi everyone
I have 4 nodes running Proxmox VE 7.1-10, each with 5 x SSDs. I have set up Ceph and created monitors, managers and metadata servers on all of the nodes. I know that not all nodes need these, but the plan is to expand the cluster in the future. I've created OSDs for all of the disks, set up CephFS for my ISOs etc., and created a pool for VMs and CTs. All of the above was done with the default Proxmox values. I then installed a test VM and configured HA with shutdown_policy=migrate. Rebooting the node where the VM is running works as expected, i.e. the VM migrates to another node and all is well.
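For reference, the HA policy is just this one line in /etc/pve/datacenter.cfg (everything else in there is still at the defaults):

  ha: shutdown_policy=migrate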
The problem comes in when I reboot 2 of the nodes at the same time. When I do this, the other 2 nodes lock up and eventually reboot. I can understand the reboot, as that's probably the Proxmox fencing mechanism at work (I assume the 2 surviving nodes lose quorum and the watchdog fences them). To rule out a misbehaving watchdog, I've tried both the IPMI watchdog and the default softdog.
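In case it matters, switching to the IPMI watchdog was done roughly like this in /etc/default/pve-ha-manager, followed by a reboot (the exact module depends on the BMC, so treat this as a sketch rather than my literal config):

  WATCHDOG_MODULE=ipmi_watchdog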
Is this expected behaviour? Would changing the Ceph replication sizes (size/min_size on the pools) make any difference here?
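For what it's worth, the VM/CT pool is still at the Proxmox defaults, which as far as I know means size 3 / min_size 2. This is how I'd check and change it; the pool name here is just a placeholder, not necessarily what the pool is actually called:

  ceph osd pool get vm-pool size
  ceph osd pool get vm-pool min_size
  # e.g. to change them:
  ceph osd pool set vm-pool size 3
  ceph osd pool set vm-pool min_size 2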
Some additional context: I have 2 x Cisco Nexus 3064 switches set up with vPC for redundancy. Each of my servers has 1 x quad-port 10G NIC and 1 x dual-port 10G NIC, split up as follows (a trimmed-down config sketch follows the list):
2 ports in an 802.3ad bond across both switches for management and public access to the VMs (with separate VLANs and bridges, of course)
2 ports in an 802.3ad bond across both switches for Ceph, with 2 VLANs to separate the Ceph public and cluster networks
2 ports without any bonding for the Proxmox/Corosync cluster (I assigned an IP to each interface and set them up as link0 and link1 in the cluster config)
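Trimmed-down sketch of the relevant bits, with placeholder interface names and IPs rather than my real ones:

/etc/network/interfaces (one of the nodes):

  auto bond1
  iface bond1 inet manual
          bond-slaves enp65s0f0 enp65s0f1
          bond-mode 802.3ad
          bond-miimon 100
          bond-xmit-hash-policy layer3+4

  # Ceph public and cluster networks as VLANs on top of the bond
  auto bond1.60
  iface bond1.60 inet static
          address 10.60.0.11/24

  auto bond1.70
  iface bond1.70 inet static
          address 10.70.0.11/24

  # Corosync links, unbonded, one subnet per interface
  auto enp1s0f0
  iface enp1s0f0 inet static
          address 10.80.0.11/24

  auto enp1s0f1
  iface enp1s0f1 inet static
          address 10.81.0.11/24

and the matching node entry in /etc/pve/corosync.conf:

  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.80.0.11
    ring1_addr: 10.81.0.11
  }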
Any help or advice would be very much appreciated.