Quick version: NetworkManager gets confused because interface assignment is not deterministic. Sometimes my VLAN interfaces do not connect because the parent interface binds to the wrong device, since the VF has no MAC assigned before the VM boots.
Proxmox VE 8.2.7
Working ConnectX3 SRIOV
VFs for VMs assigned from Resource Mapping pool.
Problematic Guest:
MXlinux KDE, uses NetworkManager
1x virtio NIC with static MAC
1x PCIe passthrough ConnectX3 VF
Problem:
1) The VFs from the pool do not have an assigned MAC when the VM boots, so the mlx4_core and/or mlx4_en driver assigns a random MAC during boot (this shows up in the dmesg log).
2) The interface for my virtio NIC is bound to its MAC and always works.
3) The interface for my trunk VF cannot be bound to a MAC, since the MAC is random and only assigned during boot.
4) Unless an interface is explicitly assigned to a MAC, NetworkManager tries to use the same interface for everything and tries to bind to both the VF and the virtio device, which causes the VLANs to fail about 50% of the time depending on the order of binding.
5) My VLANs work fine when the trunk VF happens to be enumerated before the virtio NIC, so that NetworkManager binds the parent interface to the right device. In other words, the VLAN ends up on whichever device the parent binds to first (or last?), so it is a 50% chance on each boot.
*) I do not want to pull VFs out of the pool to do host-side MAC assignment and then manually assign individual VFs to individual VMs; that defeats the whole point of using a pool.
Q's:
1- Are there any guest-side driver options (like the host-side "options mlx4_core num_vfs= ...") that can force the ConnectX driver to assign a specific MAC instead of random at boot time, before NetworkManager gets confused?
2- If not, then any ideas how to fix this problem? I've tried a few things, but without a MAC to bind an interface to, NetworkManager tries to bind it to EVERY device even if the device's MAC is already bound to another interface (like my virtio NIC).
Right now there is a 50/50 chance on every boot that my VLANs will not work, because the parent interface gets bound to both devices and the wrong one happens to be bound first. The ordering appears to be non-deterministic.
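To make the "no MAC to bind to" problem concrete, here is a rough sketch of how the two devices could be told apart inside the guest without a MAC, by reading which kernel driver sysfs reports for each interface. This is only an illustration, not a tested fix. The function names are made up, and I am assuming the VF's netdev reports mlx4_core and the virtio NIC reports virtio_net under /sys/class/net/<iface>/device/driver; I have not confirmed that on this guest. If NetworkManager can match a profile on driver or PCI path instead of MAC, this is the kind of identifier I would want to use.

```python
import os
from pathlib import Path


def driver_by_interface(sysfs_net="/sys/class/net"):
    """Map each interface name to the kernel driver bound to its device.

    Interfaces without a backing device (lo, VLANs, bridges) are skipped,
    because they have no device/driver symlink in sysfs.
    """
    result = {}
    for iface in sorted(Path(sysfs_net).iterdir()):
        driver_link = iface / "device" / "driver"
        if driver_link.is_symlink():
            # The symlink target ends in the driver name, e.g. ".../drivers/virtio_net"
            result[iface.name] = os.path.basename(os.readlink(driver_link))
    return result


def interfaces_for_driver(driver, sysfs_net="/sys/class/net"):
    """Return the interface names whose device is bound to the given driver."""
    return [name for name, drv in driver_by_interface(sysfs_net).items() if drv == driver]


if __name__ == "__main__":
    # Assumption: run inside the Linux guest, where /sys/class/net exists.
    for name, drv in driver_by_interface().items():
        print(f"{name}: {drv}")
```

Running it in the guest prints one "interface: driver" line per physical or virtual-function NIC. The driver name stays the same across boots even when the VF's MAC changes.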
Would appreciate your thoughts on the matter, thanks