Strange network problem between VMs

b416

I have a very strange problem on my local network and my PVE.

Brand new server, 10Gb ixgbe network card connected via SFP+ to a 10Gb switch. A couple of VMs and some other computers on my local network. The new server is replacing an older machine, and the VMs were copied over to it. The setup is a replica of what I had before: very simple, fixed IP and one vmbr shared by all the VMs.

So, what works as expected (and used to work before):
- network between laptop and host
- network between laptop and VMs
- network between host and VMs
- internet access (everyone)

Not working (and used to work with an almost identical setup):
- network between VMs

The firewall is disabled on the VMs and enabled on the host (the host is reachable from the internet; the VMs are reachable via a reverse proxy on my firewall, OPNsense on a separate machine).

The behavior is very strange; it's not that I have no network at all between the VMs. They can ping each other and I can ssh from one to another, but as soon as files are involved, the network link falls apart. For example, I open a web UI and get a timeout error. I start scp and it begins, but then stalls. I can mount an SMB or NFS share without errors, but as soon as I list a directory with files in it, same thing... stalled... timeout.
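(A quick check, not from the original post: the pattern "ping and ssh work, but anything bulky stalls" can be narrowed down by testing whether full-size packets pass between the VMs at all. The IP below is the TrueNAS VM from later in the thread; the sizes assume a standard 1500-byte MTU.)

Code:
# Run inside one VM, targeting the other VM
ping -c 3 192.168.10.214                   # small packets - what plain ping/ssh handshakes use

# Near-full-size packets with "don't fragment" set - bulk transfers need these to get through
# 1472 bytes of payload + 8 bytes ICMP header + 20 bytes IP header = a 1500-byte packet
ping -c 3 -M do -s 1472 192.168.10.214

If the small ping succeeds but the large one times out, the problem is in how full-size frames are handled on the path (bridge, NIC, MTU), not in basic reachability.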

It took me a while to figure out what's working and what's not. I really don't have a clue where to begin searching...

Any help?
 
Please show your network configuration and the VM configs of two VMs (output of qm config <VMID>)
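(For reference, and not part of the original reply: on the PVE host, that information would come from commands along these lines; the VM IDs are the ones that appear in the configs posted further down.)

Code:
cat /etc/network/interfaces   # host network configuration
qm config 214                 # the sharing VM ("filesrv")
qm config 213                 # the client VM ("plex")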
 
Very strange indeed - how exactly did you "copy" your VMs? What storage setup are you using (e.g. LVM, ZFS, Ceph, ...)?
 
Please show your network configuration and the VM configs of two VMs (output of qm config <VMID>)

/etc/network/interfaces:
Code:
auto lo
iface lo inet loopback

auto enp1s0
iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.210/24
    gateway 192.168.10.1
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0

iface vmbr0 inet6 static
    address 2a01:e0a:31d:ee01:192:168:10:210/64
    gateway 2a01:e0a:31d:ee01::1
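
(An aside that is not in the thread: with a plain bridge like vmbr0 above, the runtime state of the bridge and its ports can be checked on the host, which confirms that the physical NIC and the VM tap interfaces are attached and that their MTUs match.)

Code:
# On the PVE host, assuming the configuration above is active
ip -d link show vmbr0    # bridge details: MTU, STP state, etc.
bridge link show         # ports attached to the bridge (enp1s0 and the tap<VMID>i0 interfaces)
ip addr show vmbr0       # the IPv4/IPv6 addresses from the config above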

The sharing VM (TrueNAS with passthrough disks):
Code:
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
ide2: none,media=cdrom
memory: 16384
meta: creation-qemu=7.2.0,ctime=1678933901
name: filesrv
net0: virtio=42:A3:97:AF:43:58,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-214-disk-0,iothread=1,size=20G
scsi1: /dev/disk/by-id/ata-Samsung_SSD_860_QVO_4TB_S4CXNF0M500245W,size=3907018584K
scsi2: /dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNG0NC05907V,size=3907018584K
scsi3: /dev/disk/by-id/ata-Samsung_SSD_860_QVO_4TB_S4CXNF0M424128D,size=3907018584K
scsihw: virtio-scsi-single
smbios1: uuid=65856fb3-e4b4-482a-a014-121b7bf3df2e
sockets: 1
vmgenid: 270c94c7-ed4d-49c2-bf6f-61b842586b93
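
(Not something the poster showed, just for context: passthrough entries like scsi1-scsi3 above are normally attached by their stable /dev/disk/by-id/ path, roughly like this.)

Code:
# Hypothetical example of how such a passthrough disk gets attached to the VM
ls -l /dev/disk/by-id/ | grep ata-Samsung
qm set 214 -scsi1 /dev/disk/by-id/ata-Samsung_SSD_860_QVO_4TB_S4CXNF0M500245W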

The client VM that gets stuck:
Code:
agent: 1,fstrim_cloned_disks=1
boot: dcn
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
memory: 8192
name: plex
net0: virtio=7E:62:D2:DE:F8:64,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: data:vm-213-disk-0,discard=on,size=34G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=f9d8816f-e3f9-4392-a825-8df090b3a503
sockets: 1
startup: order=5
vmgenid: b81dd9ae-9ff6-48a1-ae75-8a5456d23569
 
Very strange indeed - how exactly did you "copy" your VMs? What storage setup are you using (e.g. LVM, ZFS, Ceph, ...)?

I have a cluster, so I migrated the VMs to another node and back after the installation.
The VMs' system disks are on local ZFS; the shared files are on passthrough disks managed by TrueNAS.
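
(A hedged reconstruction, not quoted from the poster: migrating a VM with a local ZFS disk to another cluster node and back is typically done like this; the node names are taken from the cluster info further down, and the exact flags are an assumption.)

Code:
# Move a VM to another node and back, keeping its local disk
qm migrate 213 pxgra --online --with-local-disks
# ... install the new server and join it to the cluster ...
# then, from the other node:
qm migrate 213 pxbsy --online --with-local-disks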
 
In the client logs I have:

Code:
2023-03-16 04:09:51 plex kernel:[    4.561201] CIFS: Attempting to mount \\192.168.10.214\plex
2023-03-16 04:13:32 plex kernel:[  225.334347] CIFS: VFS: \\192.168.10.214 has not responded in 180 seconds. Reconnecting...
2023-03-16 04:17:26 plex kernel:[  459.062252] CIFS: VFS: \\192.168.10.214 has not responded in 180 seconds. Reconnecting...
2023-03-16 04:20:32 plex kernel:[  645.432359] CIFS: VFS: \\192.168.10.214 has not responded in 180 seconds. Reconnecting...
2023-03-16 04:20:32 plex kernel:[  645.436686] CIFS: reconnect tcon failed rc = -11
2023-03-16 04:23:39 plex kernel:[  831.799938] CIFS: VFS: \\192.168.10.214 has not responded in 180 seconds. Reconnecting...
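
(Not in the original post: a way to check the SMB service itself from the client, independently of the stuck kernel CIFS mount. The user name and share are placeholders.)

Code:
# Run on the plex VM
smbclient -L //192.168.10.214 -U someuser              # list the shares; a stall here points at the network path
smbclient //192.168.10.214/plex -U someuser -c 'ls'    # try an actual directory listing over SMB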
 
Another example:

I have a VM called devbox with Ubuntu MATE. From that VM I can't access the TrueNAS web UI; it just times out. Same thing with the web interface of another VM.

I can access all web interfaces from my laptop. Same local network for everyone, 192.168.10.0/24, no VLAN, nothing fancy of course.
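
(Another check that is not in the thread: a plain HTTP request with short timeouts from the affected VM makes the "never connects" vs. "connects but then stalls" distinction visible. The scheme and port of the TrueNAS UI are assumptions.)

Code:
# Run from the devbox VM against the TrueNAS web UI
curl -kv --connect-timeout 5 --max-time 15 https://192.168.10.214/ -o /dev/null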
 
It looks like the same behavior that I faced last month when I started using Proxmox.
Structure:
Proxmox host:
- Ubuntu VM1 - K8s control plane
- Ubuntu VM2 - K8s worker node
- Ubuntu LXC container with Jenkins

Everything works normally when the server has recently been booted.
But after a few hours:
- network between laptop and Ubuntu VM1 does not work, connection refused; I am not able to work with it (screenshot attached)
- network between laptop and the Proxmox host, Ubuntu VM2 and the Jenkins LXC works
- the network inside Proxmox only works sporadically: VM2 -> VM1 works, but VM1 -> VM2 does not, or Jenkins LXC -> VM1 does not

After disabling and enabling the network interface on VM1, everything works correctly for a few hours, then the issue repeats.
Firewalls are disabled, restarting the pve* services did not help, and the logs are empty; the only thing present in the logs is the network on VM1 disconnecting and reconnecting.
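
(Editorial aside, not from the post: the "disable and enable the network interface" workaround inside the VM usually amounts to something like the following; the interface name is a placeholder.)

Code:
# Inside the affected VM (VM1), as root
ip link set dev ens18 down
ip link set dev ens18 up
# or, on a netplan-based Ubuntu guest, re-apply the network configuration:
netplan apply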
What information can I provide here to help resolve this issue?
 

Attachment: Screenshot 2023-03-15 at 11.50.12 PM.png
Found something... if I use IPv6, everything works as expected!

I have IPv4 and IPv6 on all my machines. The IPv4 addresses are assigned by DHCP, with static leases for the servers. The IPv6 addresses are assigned by radvd, so they can change between reboots... I will look into that and set up fixed IPv6 addresses or DHCPv6 with static assignments.
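(Not posted by the author, but a static IPv6 setup inside a Debian/Ubuntu guest would look much like the host's vmbr0 stanza above; the interface name and address are placeholders within the same /64.)

Code:
# Hypothetical /etc/network/interfaces stanza inside a guest
auto ens18
iface ens18 inet dhcp

iface ens18 inet6 static
    address 2a01:e0a:31d:ee01::213/64
    gateway 2a01:e0a:31d:ee01::1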

But what I still don't understand is the reason for this behaviour. There is something kind of fishy in the IPv4 cluster configuration, but it used to work before the install of the new server.

The fishy part:

I have 2 nodes at OVH: public IPv4 (plus more public IPs for the VMs), an IPv6 /64, everything smooth...
I wanted to join the cluster from my local server, but this one is behind my home firewall (OPNsense) and I only have one public IP, with no way to get more since it's a simple home internet plan. On the firewall, I forward ports 22, 8006 and UDP 5405-5412 to the Proxmox node. I can join the cluster, move VMs between nodes, etc. Everything works as expected on the home node's web UI.
But if I access the cluster web UI on an external node, the home node gets a communication failure when I try to access it. To correct that, I found a little trick: in /etc/hosts I put my home public IPv4 and public DNS name, and then everything works!
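
(The actual entry is not quoted in the post, but the described trick amounts to an /etc/hosts line like the following; the DNS name is a placeholder and the IP is redacted as in the rest of the thread.)

Code:
82.xx.xx.xx   pxbsy.home.example   pxbsy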

pvecm status:
Code:
Cluster information
-------------------
Name:             JDJ
Config Version:   18
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Mar 16 11:25:26 2023
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.1c2f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 2001:41d0:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx%32621
0x00000002          1 2001:41d0:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx%32621
0x00000003          1 2a01:e0a:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx%32621 (local)

/etc/pve/.members:
Code:
{
"nodename": "pxbsy",
"version": 5,
"cluster": { "name": "JDJ", "version": 18, "nodes": 3, "quorate": 1 },
"nodelist": {
  "pxgra": { "id": 1, "online": 1, "ip": "37.xx.xx.xx"},
  "pxsbg": { "id": 2, "online": 1, "ip": "54.xx.xx.xx"},
  "pxbsy": { "id": 3, "online": 1, "ip": "82.xx.xx.xx"}
  }
}

But all this used to work before...
 
