System: an HP SAS disk array is made available from the storage node via vblade (the AoE daemon) over eight 1 Gb ports bonded as bond0. Four Proxmox 3.0 hosts access the array through bonded (bond0) interfaces of four 1 Gb ports per host, using the Linux kernel aoe module. AoE traffic runs through a Netgear GS748T switch. Shared storage is handled using LVM.
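For reference, the bonded setup that caused the trouble looked roughly like the sketch below. This is an illustration only: the NIC names, the shelf/slot numbers, and the /dev/sdb path are placeholders, not copied from the real configuration.

    # /etc/network/interfaces on a Proxmox node (Debian ifenslave bonding, LACP)
    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1 eth2 eth3
        bond-mode 802.3ad
        bond-miimon 100

    # On the storage node, each LUN was exported with vblade over its bond,
    # e.g. as AoE shelf 0, slot 1:
    vblade 0 1 bond0 /dev/sdb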
Problem: if the Netgear power cycles or resets itself, the bond0 interfaces enter some sort of panic mode that causes AoE to declare the disk offline. This hangs all the KVM instances until a reboot (the virtual machines are still 'running', but none of them can access their disks). The cause seems to be that when the switch resets, the link aggregation groups (LAGs) are not active until the unit has fully booted. While that is happening, the bonding driver (ifenslave) detects all kinds of non-aggregated traffic and appears to panic. The aoe module on the Proxmox nodes does not seem to be able to handle an event like this.
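Two quick checks make this failure mode visible while the switch is rebooting (both are standard paths/tools on the Proxmox nodes; the bond name matches the setup sketched above):

    # aggregator/LACP state as seen by the Linux bonding driver;
    # during the switch reboot the slaves drop out of the LAG here
    cat /proc/net/bonding/bond0

    # AoE target state from aoetools; a target stuck "down" here matches
    # the symptom of the KVM guests hanging on their disks
    aoe-stat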
Solution: reconfigure the client servers and the switch to run without bonding. Install and run ggaoed on the storage node instead of vblade, and specify all desired interfaces for ggaoed to use. ggaoed handles multiple interfaces more obviously than vblade (I couldn't determine from the documentation how to specify multiple interfaces with vblade).
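As an illustration of pointing ggaoed at several interfaces, a minimal configuration might look like the sketch below. This is written from memory rather than copied from the ggaoed documentation, so the section and option names should be verified against the ggaoed.conf man page; the interface names, export name, and /dev/sdb path are placeholders.

    # /etc/ggaoed.conf (sketch; verify option names against ggaoed.conf(5))
    [defaults]
    # serve AoE on every NIC facing the storage switch, no bonding
    interfaces = eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7

    [pve-storage]
    # the SAS array exported as AoE shelf 0, slot 1
    path = /dev/sdb
    shelf = 0
    slot = 1

On the client side, the interfaces the kernel aoe module may use can be limited with its aoe_iflist module parameter, for example:

    modprobe aoe aoe_iflist="eth0 eth1 eth2 eth3"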
Conclusion:
Performance is comparable or better after some ggaoed tuning. Power cycling the switch now simply halts disk activity over AoE during the short downtime; when the switch is back up, disk activity continues with no apparent error.
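For anyone repeating the switch power-cycle test, the recovery can be confirmed from a Proxmox node with the aoetools utilities:

    aoe-discover   # re-probe for AoE targets once the switch is back
    aoe-stat       # the exported device should be listed as "up" again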