Hi
I have recently started using Proxmox and have a cluster running fine.
I have a Huawei server with 6 blades and a big SAN attached to them, and I reloaded the blades with Proxmox.
I set up the SAN as Fibre Channel rather than iSCSI, as we have speed issues with the iSCSI setup we are currently using.
I managed to set up Fibre Channel on the first node by following 2 guides:
https://pve.proxmox.com/wiki/Multipath
https://www.youtube.com/watch?v=aF2QUbmxvcw
These were my steps on the first node:
Install Proxmox, then run a handy script that enables the no-subscription repositories and updates Proxmox, then restart the node.
Then apt update, followed by apt install multipath-tools; I also did apt install grub-efi-amd64, then rebooted.
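Put together, the package prep on the first node was roughly (the repository script is whatever helper you already use for that):
apt update
apt install multipath-tools
apt install grub-efi-amd64
reboot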
Then fdisk -l
found 2 devices (well, 4: each one shows up twice, once per path)
Disk /dev/sdb: 17.28 TiB, 19002344603648 bytes, 37113954304 sectors
Disk model: XSG1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/sdc: 139.93 TiB, 153854788239360 bytes, 300497633280 sectors
Disk model: XSG1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
To get the WWID of a device: /lib/udev/scsi_id -g -u -d /dev/sdc (and the same for /dev/sdb)
Now add the WWIDs to the multipath setup:
multipath -a 36785860100b13fad1aa17fca00000001
multipath -a 36785860100b13fad1aa1950100000002
Then reload the multipath maps with multipath -r
Create the file /etc/multipath.conf (fcpath1 and fcpath2 are names that I chose):
defaults {
    find_multipaths yes
    user_friendly_names yes
}
blacklist {
    devnode "^hd[a-z][0-9]*"
    devnode "^sda$"
}
multipaths {
    multipath {
        wwid "36785860100b13fad1aa17fca00000001"
        alias "fcpath1"
    }
    multipath {
        wwid "36785860100b13fad1aa1950100000002"
        alias "fcpath2"
    }
}
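To make sure the aliases from the new config get picked up after editing the file, reloading multipath should do it, something like:
systemctl restart multipathd.service
multipath -r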
To see if the system is working I run multipath -ll:
fcpath1 (36785860100b13fad1aa17fca00000001) dm-5 HUAWEI,XSG1
size=17T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 11:0:0:1 sdb 8:16 active ghost running
`- 12:0:0:1 sdd 8:48 active ready running
fcpath2 (36785860100b13fad1aa1950100000002) dm-6 HUAWEI,XSG1
size=140T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 11:0:0:2 sdc 8:32 active ready running
`- 12:0:0:2 sde 8:64 active ghost running
Now I run
pvcreate /dev/mapper/fcpath1
pvcreate /dev/mapper/fcpath2
and then
vgcreate fastsas /dev/mapper/fcpath1
vgcreate slowsas /dev/mapper/fcpath2
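At this point pvs and vgs on the first node should show the new physical volumes and volume groups sitting on top of the multipath devices:
pvs    # expect /dev/mapper/fcpath1 and /dev/mapper/fcpath2 listed as PVs
vgs    # expect fastsas (~17T) and slowsas (~140T)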
Then I went into the web GUI and added the node to my cluster.
Then I went to Datacenter > Storage, clicked Add and selected LVM. For the ID I used FCfast, selected the volume group fastsas, selected the node and ticked Shared. I did the same with the other volume group but called it FCslow.
All works well: I can live-migrate VMs to it, the VMs run fine, they can ping out and see the internet, all is good. Pinging them from outside computers works fine too, no packet loss.
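For reference, that GUI step just writes entries into the cluster-wide /etc/pve/storage.cfg, so with the names above the result should look roughly like this (node names here are placeholders for my real ones):
lvm: FCfast
        vgname fastsas
        content images,rootdir
        shared 1
        nodes node1

lvm: FCslow
        vgname slowsas
        content images,rootdir
        shared 1
        nodes node1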
Now I have 4 more nodes that are also connected to the same SAN. I installed Proxmox on 3 of them and followed the same steps as on the first one, except I stopped right before pvcreate and vgcreate, because as I understand it that would destroy the data on the devices? Am I wrong?
I then added them to the cluster, went to Datacenter > Storage and just added the new node names to the FCfast and FCslow LVM storages.
The LVMs come up under storage on those nodes and I can migrate offline VMs to them no problem, but with a live VM the migration completes and then fails: the VM does not recover and stays offline. I need to start it manually, no biggie.
But now I have a weird problem. VMs running on the 2nd, 3rd and 4th nodes can ping outwards with no packet loss, BUT pinging them from outside computers I get either no reply or massive packet loss of 15-25%.
As I have seen on this post: https://forum.proxmox.com/threads/fiber-channel-setup.35225/
"- you must manually configure the SAN Backed LVM storage, from each proxmox node. There is not any 'automatic propigation' of shared storage from one Proxmox node to other nodes, for any (!) of the shared storage types, as far as I am aware (ie, iSCSI, etc)
ie, you basically
- setup on first proxmox node
- then go to second node,
- only step you won't have to repeat, is the creation of the LVM on the SAN Disk volume. But you do need to 'add' the storage into the proxmox node. Flag it as type = shared with the check-box option. Then it comes online. You must rinse and repeat this config process on all nodes who need access to the shared storage."
the "you must manually configure the SAN Backed LVM storage, from each proxmox node." which i believe i did with multipath?
but i dont understand:
"you do need to 'add' the storage into the proxmox node. Flag it as type = shared with the check-box option. Then it comes online. You must rinse and repeat this config process on all nodes who need access to the shared storage."
You can't add it to the node itself; you have to do it via Datacenter, which is cluster wide.
Now my theory is that the Fibre Channel setup was not completed on the 2nd, 3rd and 4th nodes, and that they are accessing FCfast and FCslow over the network via node 1, and that's causing the issues... well, I don't know, I'm spitballing.
Should I run pvcreate and vgcreate on the 2nd, 3rd and 4th nodes as well? Does this just create a physical volume and volume group on the node without touching the data on the SAN?
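As a sanity check I guess I can already look at this read-only on nodes 2-4 (these commands only read the LVM metadata, which lives on the SAN LUNs themselves, so they shouldn't touch anything):
pvs    # should already list /dev/mapper/fcpath1 and /dev/mapper/fcpath2 as PVs
vgs    # should already show fastsas and slowsas if the nodes really see the SAN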
Running multipath -ll gives the same results on all 4 nodes.
I can ping nodes 1, 2, 3 and 4 plus the VMs on node 1 and the VMs on node 2 (nothing on 3 and 4 yet) at the same time, and only the VMs on node 2 drop packets, while the nodes themselves stay stable and the VMs on node 1 are also stable, so it's not the nodes' network interfaces. (A VM on node 3 could ping the internet fine but had 100% packet loss when pinged from another device, so I migrated it to node 1.) VMs work fine if I migrate them to node 1 or to the other nodes in the cluster where they originally were.
All 4 nodes use the same network interface; it's basically a virtual interface that the Huawei chassis makes available to all blades.
All 4 nodes have identical interface configs (except the IPs of course), hosts files and resolv.conf files.
All nodes are on the same management VLAN 210 (it has a gateway and DNS, but outside comms are blocked on the firewall).
All nodes are on the same Proxmox cluster VLAN 220.
All nodes are on the same iSCSI VLAN 210.
All nodes are also on the same VLAN 10 as the servers they host, just in case.
All servers and devices used for the ping tests are either on VLAN 10 or have full access to VLAN 10 via firewall rules.
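If it helps, a quick way I can double-check that the configs really are identical is to diff each node's config against node 1 (node names are placeholders):
for n in node2 node3 node4; do
    echo "== $n =="
    ssh root@$n cat /etc/network/interfaces | diff /etc/network/interfaces - && echo "matches node 1"
done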
Any insights will be appreciated.