[SOLVED] Lost Connections with VMs

M.R

Member
Aug 23, 2019
Hi, I'm new to the forum. I have been working with Proxmox for about 3 or 4 years so far without problems.

For a few months now I have had trouble with 2 virtual servers running on a host with Proxmox 5.4-13. This host was upgraded from Proxmox 3 to version 4 and finally to version 5, following the instructions in the "How To" on the Proxmox Wiki.

Now, the problem is this: both running servers lose their network connection for anywhere from a few seconds to a few minutes at random times. I mean, I'm working... the connection drops, no ping, no VNC console... then after a few minutes everything returns to normal. Sometimes both servers go down at the same time, and sometimes 1 server is down while the other still has a connection. As I said above, it's random.

In my first test, to rule out a physical problem, I added a new patch cord and configured a new bridge, then moved the servers from vmbr0 to vmbr1, but the problem persists.

Then I created a new virtual server, put it online, and launched a continuous ping to the newly created server, to the management IP of the Proxmox host, and to the failing servers. I let the pings run over the whole weekend. When I checked the log, the pings to the host and to the new server never lost a packet, but the pings to the 2 failing servers lost packets with "unreachable destination", "timeout connection", or both.
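
For anyone repeating this kind of long-running ping test, here is a minimal sketch of how the results could be summarized afterwards. It assumes the ping log has already been turned into (timestamp, reply received) pairs per target; the `outage_windows` name and the sample data are hypothetical and not taken from the original logs.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[datetime, bool]  # (time of probe, True if a reply arrived)
Window = Tuple[datetime, datetime, int]  # (first failure, last failure, failed probes)


def outage_windows(samples: Iterable[Sample]) -> List[Window]:
    """Group consecutive failed probes of one target into outage windows."""
    windows: List[Window] = []
    start: Optional[datetime] = None
    last: Optional[datetime] = None
    count = 0
    for when, ok in sorted(samples):
        if not ok:
            if start is None:
                # First failed probe after a run of successes opens a window.
                start = when
                count = 0
            last = when
            count += 1
        elif start is not None:
            # A successful probe closes the currently open window.
            windows.append((start, last, count))
            start = None
    if start is not None:
        # The log ended while the target was still unreachable.
        windows.append((start, last, count))
    return windows


if __name__ == "__main__":
    # Made-up sample data: one probe per second, two short outages.
    t0 = datetime(2019, 8, 24, 10, 0, 0)
    results = [True, False, False, True, False]
    samples = [(t0 + timedelta(seconds=i), ok) for i, ok in enumerate(results)]
    for first, last_fail, failed in outage_windows(samples):
        print(f"{first} .. {last_fail}: {failed} failed probe(s)")
```

Running it once per target and comparing the windows makes it easy to see whether the two failing VMs drop out together or independently, and gives exact times to look up in the host logs.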

I have already applied all available updates to the Proxmox host too.

I have been thinking about backing up the VMs, reinstalling Proxmox from the latest version, and then restoring the VMs, but this may take too long and these servers are in production.

I would appreciate any suggestions that help me avoid this last option.

Thanks in advance.
 
Check the logs of the systems for the times when they are not reachable - I could imagine that there is an issue with your cluster traffic (just a wild initial guess)

I hope this helps!
 
Hi, thanks for your reply Stoiko. I have no cluster; it's just 1 host with 3 VMs, and 2 of those VMs lose the network connection. Those 2 VMs come from an initial setup of Proxmox 3, and they were working well; the failures started after I upgraded to version 5. I think something was broken during the upgrades. It's also strange that I can't even connect to the console via VNC when the connectivity is lost.
Can you tell me where I can find the logs of the network traffic?

Thanks
 
The cluster seemed fine; I saw no indication of inter-cluster communication issues at that time. I looked at the continuous corosync quorum tool output and everything was just fine.
And only some VMs had this issue. Others were available.
Also, no VM ever recovered without being shut down and started again.
 
Please post the configs of the VMs (some things have changed since PVE 3.x - maybe the issue is there).

Else - please post the output of `dmesg` and `journalctl --since '-1 day'` - maybe there is a hint there.
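
If the time of an outage is known, the journal can also be narrowed to just that window instead of a whole day. The following is a minimal sketch, not an official Proxmox tool: `journal_cmd` and the 120 second margin are hypothetical choices, while `--since`, `--until` and `--no-pager` are standard `journalctl` options.

```python
import shlex
from datetime import datetime, timedelta
from typing import List


def journal_cmd(outage_start: datetime, outage_end: datetime,
                margin_s: int = 120) -> List[str]:
    """Build a journalctl command covering an outage plus a margin on each side."""
    fmt = "%Y-%m-%d %H:%M:%S"  # absolute timestamp format accepted by journalctl
    since = (outage_start - timedelta(seconds=margin_s)).strftime(fmt)
    until = (outage_end + timedelta(seconds=margin_s)).strftime(fmt)
    return ["journalctl", "--no-pager", "--since", since, "--until", until]


if __name__ == "__main__":
    # Hypothetical outage window; replace with the times seen in the ping log.
    cmd = journal_cmd(datetime(2019, 8, 29, 14, 30, 0),
                      datetime(2019, 8, 29, 14, 33, 0))
    # Print a copy-and-paste ready command line for the Proxmox host shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The sketch only prints the command, so it can be reviewed before being run on the host.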
 
Please open a new thread with a separate issue - this helps tremendously to not get confused by which reply belongs to which question - thanks!
 
Hi, I have attached the files you asked for. Thanks
 

Attachments

  • Journactl.log (268.9 KB)
  • Documents.zip (26.8 KB)
Hmm, nothing too obvious in the logs.
Since it seems that you're running Windows guests (with virtio NICs) - have you upgraded to the latest stable virtio drivers? (this sometimes causes issues) - else - if possible in your environment - you could try changing the NIC to e1000 and see if the problem persists (though that comes at a cost of performance)

On an unrelated note - please consider upgrading your BIOS - it's still a version from 2013 - and especially with the Spectre/Meltdown/etc vulnerabilities it is quite important to keep the BIOS/Firmware up to date!

I hope this helps!
 
When the problem started, the first thing I did was change the virtual NIC from Intel E1000 to VirtIO with the latest drivers available.
Thanks for the advice about the BIOS, I will do the upgrade ASAP.
I have no more ideas about what is going on with the VMs, and I'm running out of time, so I think I'll reinstall with the latest version of Proxmox as a final test.
If anyone has another idea, let me know.
Thanks
 
Short of using tcpdump to see where the packets get lost, or testing whether the issue remains with a different guest (Linux) - not really
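
As a rough sketch of the tcpdump idea: capturing the same traffic on each hop between the wire and the guest shows at which point the packets disappear. The interface names and the guest IP below are assumptions for illustration only (`eno1` as the physical NIC, `tap101i0` as the tap device of a hypothetical VM 101); only `vmbr1` comes from this thread. `capture_cmds` is a hypothetical helper that just builds the commands.

```python
import shlex
from typing import Dict, List


def capture_cmds(guest_ip: str, interfaces: List[str]) -> Dict[str, List[str]]:
    """Build one tcpdump command per interface, filtered to ping and ARP
    traffic of a single guest."""
    # ARP is included because "destination unreachable" on the same subnet
    # often means address resolution failed before any ICMP was sent.
    bpf = f"(icmp or arp) and host {guest_ip}"
    # -n disables name resolution, -i selects the capture interface.
    return {ifc: ["tcpdump", "-n", "-i", ifc, bpf] for ifc in interfaces}


if __name__ == "__main__":
    # All values here are placeholders; adjust them to the real host.
    cmds = capture_cmds("192.0.2.10", ["eno1", "vmbr1", "tap101i0"])
    for ifc, cmd in cmds.items():
        print(" ".join(shlex.quote(part) for part in cmd))
```

If echo requests show up on the physical NIC and the bridge but not on the tap device, the loss is on the host side; if they reach the tap device and no reply comes back, the guest is the one not answering.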
 
Hi, I want to say that I have resolved the problem. After reading the system logs, some lines caught my attention...


Aug 29 14:30:03 proxmox smartd[778]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83

Aug 29 14:30:03 proxmox smartd[778]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 38


This message repeats a few times with different sectors... The SMART status of the disk looks good... but...
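
To pull such lines out of a large log, a small filter can help. This is a minimal sketch that assumes the smartd message format shown in the two lines above; `prefailure_changes` is a hypothetical helper, and a changing value alone does not prove that the disk is failing.

```python
import re
from typing import Iterable, List, Tuple

# Matches smartd attribute-change messages in the format seen in this thread.
SMARTD_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+) \[[^\]]+\], "
    r"SMART (?P<kind>Prefailure|Usage) Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)

Change = Tuple[str, int, str, int, int]  # (device, attribute id, name, old, new)


def prefailure_changes(lines: Iterable[str]) -> List[Change]:
    """Return only the changes that smartd reports as Prefailure attributes."""
    changes: List[Change] = []
    for line in lines:
        m = SMARTD_RE.search(line)
        if m and m.group("kind") == "Prefailure":
            changes.append((m.group("device"), int(m.group("id")),
                            m.group("name"), int(m.group("old")),
                            int(m.group("new"))))
    return changes


if __name__ == "__main__":
    log = [
        "Aug 29 14:30:03 proxmox smartd[778]: Device: /dev/sda [SAT], SMART "
        "Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83",
        "Aug 29 14:30:03 proxmox smartd[778]: Device: /dev/sda [SAT], SMART "
        "Usage Attribute: 194 Temperature_Celsius changed from 39 to 38",
    ]
    for change in prefailure_changes(log):
        print(change)
```

Temperature changes and other Usage attributes are skipped, so what remains is the short list of entries worth comparing against the times of the network dropouts.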


This is not necessarily a disk error, but I replaced the disk, installed the new Proxmox 6 version, restored the VMs from the previous day's backup, and everything works fine, with no more network failures.

I think that at some point the VMs were trying to read a faulty sector of the disk, got blocked there, and failed with a timeout.

Well, I just wanted to write this to close the thread, in case anyone finds this resolution helpful.

Thanks a lot
 
Glad you resolved your issue - Please mark the thread as 'SOLVED' (it helps others with similar problems)
Thanks!
 
