Some VMs suddenly freeze

s3bbo

Hi,

TL;DR is at the end.

I have been struggling with this for quite some time and have finally run out of ideas, so I'm turning to the wisdom of the forums...

My Proxmox VE host had been running for over a month without any noticeable issue, and I was very happy about having made the switch from plain Debian + libvirt to Proxmox.

My setup is:
HW: Asus PN51
+ 48 GB RAM
+ 1 TB NVMe
+ 2 TB SSD

I run:
2 LXC containers
8 KVM VMs (7 Debian, 1 Ubuntu Server)

As said, I did not have any problem whatsoever in over a month of constant uptime.


On Friday I needed to make some changes to the general setup of my infrastructure, and as I physically moved the Asus PN51, I shut it down.

After completing the maintenance work I restarted it, and that is when my problems started.

About 5-30 minutes after the reboot, 2 of the VMs suddenly
- stopped doing their work (the service they were supposed to deliver)
- did not respond to ping any more
- were no longer reachable via the console in the Proxmox VE web UI (VNC failed to load)

I tried to stop the VMs via the web UI, but the operation always seemed to time out. So I ran

Code:
qm stop 123

on the affected VMs and could then reboot them. At that point I thought it was a one-off.
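(For reference, a rough escalation path for a VM that no longer reacts to the web UI; just a sketch, 123 is an example VMID and killing the process is a last resort:)

Code:
qm status 123                                  # what does Proxmox think the VM is doing?
qm stop 123                                    # hard stop, roughly "pulling the plug"
# if even 'qm stop' hangs, the QEMU process can be killed directly;
# its PID file lives at /var/run/qemu-server/<vmid>.pid
kill -9 "$(cat /var/run/qemu-server/123.pid)"
qm start 123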

Then it got weird. The Nextcloud VM did this constantly: I could restart it, and after 5-30 minutes it always froze again, with no more network access and no more console.

The funny thing is that the Proxmox Summary for this VM shows Memory and CPU usage (the latter very low, though), but no Disk or Network I/O.

This is now the situation for 5 VMs so far. They work for some time; one even ran over night without problems but froze early this morning.

For 3 of the VMs I have not yet experienced any freeze.

There is nothing weird in the VMs' syslog, nor when I tried yesterday with dmesg -w and just waited for the VM to freeze; the output simply stopped at some point.
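(For what it's worth, a serial console keeps working after the guest's network dies and still shows the last kernel messages; a minimal sketch, assuming the guest has console=ttyS0 on its kernel command line, with VMID 120 as an example:)

Code:
qm set 120 -serial0 socket    # add a virtual serial port to the VM config (takes effect after a VM restart)
qm terminal 120               # attach from the host; guest kernel output appears here if console=ttyS0 is set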


So I started writing this post after I had rebooted the Proxmox host once more, 28 minutes ago. One VM is frozen again :|

In /var/log/syslog of the Proxmox host I see a lot of

Code:
Dec  5 10:40:26 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries

These QMP failures seem to correspond nicely with the freezing of the machines.
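(A quick way to see how these failures line up per VM; a sketch, and the awk field positions assume the default syslog timestamp format shown above:)

Code:
# count QMP failures per VM ID in the host syslog
grep 'qmp command' /var/log/syslog | awk '{n[$7]++} END {for (vm in n) print "VM", vm, n[vm], "failures"}'
# the first failure per VM gives a rough timestamp of when each guest froze
grep 'qmp socket - timeout' /var/log/syslog | awk '!seen[$7]++ {print $1, $2, $3, "VM", $7}'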


TL;DR:

I attached today's syslog. The reboot happened at around 10:30 CET, and since then VM 120 has been frozen again. I hope the others won't freeze too soon.
Maybe someone can spot something weird in there (I anonymized it a little w.r.t. the mail host).

Alternatively, perhaps someone has an idea where else / in which log file I could look.

Thanks for your help, I really appreciate it!

br
Sebastian


Update:

Okay... now it is 3 machines again. One machine I had restored from Wednesday's backup because the MySQL data was corrupted.

It happened some minutes ago... Here is the syslog, and I am attaching the "Summary" screenshots because they look weird:

Code:
Dec  5 12:01:27 newearth pvestatd[1050]: status update time (12.249 seconds)
Dec  5 12:01:36 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:01:39 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:01:39 newearth pvestatd[1050]: status update time (12.264 seconds)
Dec  5 12:01:48 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:01:51 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:01:51 newearth pvestatd[1050]: status update time (12.250 seconds)
Dec  5 12:02:00 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:02:03 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:02:04 newearth pvestatd[1050]: status update time (12.246 seconds)
Dec  5 12:02:13 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:02:16 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - got timeout
Dec  5 12:02:19 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:02:19 newearth pvestatd[1050]: status update time (15.241 seconds)
Dec  5 12:02:20 newearth pvedaemon[1079]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:02:31 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:02:34 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:02:37 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:02:37 newearth pvestatd[1050]: status update time (18.265 seconds)
Dec  5 12:02:42 newearth pvedaemon[1078]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:02:49 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:02:52 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:02:55 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:02:55 newearth pvestatd[1050]: status update time (18.269 seconds)
Dec  5 12:03:05 newearth pvedaemon[1078]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:03:07 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec  5 12:03:10 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec  5 12:03:13 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec  5 12:03:14 newearth pvestatd[1050]: status update time (18.254 seconds)

[Screenshots attached: 140_cpu_2021-12-05_12-23.png and 140_mem_2021-12-05_12-24.png]

The VM still has a "brain" (so it is not brain-dead, but somehow in a coma): CPU and memory usage are still there, but there is no Disk I/O or Network I/O.

A little earlier in the syslog I see these lines; I wonder if they are related. I have no idea how to match the vmbr0 ports to VMs (see the note after the log excerpt).

Code:
Dec  5 11:58:08 newearth systemd-udevd[20747]: Using default interface naming scheme 'v247'.
Dec  5 11:58:08 newearth systemd-udevd[20747]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec  5 11:58:09 newearth kernel: [ 5017.660581] device tap140i0 entered promiscuous mode
Dec  5 11:58:09 newearth systemd-udevd[20746]: Using default interface naming scheme 'v247'.
Dec  5 11:58:09 newearth systemd-udevd[20746]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec  5 11:58:09 newearth systemd-udevd[20746]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec  5 11:58:09 newearth systemd-udevd[20747]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec  5 11:58:09 newearth kernel: [ 5017.689234] fwbr140i0: port 1(fwln140i0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.689239] fwbr140i0: port 1(fwln140i0) entered disabled state
Dec  5 11:58:09 newearth kernel: [ 5017.689298] device fwln140i0 entered promiscuous mode
Dec  5 11:58:09 newearth kernel: [ 5017.689333] fwbr140i0: port 1(fwln140i0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.689334] fwbr140i0: port 1(fwln140i0) entered forwarding state
Dec  5 11:58:09 newearth kernel: [ 5017.693442] vmbr0: port 7(fwpr140p0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.693452] vmbr0: port 7(fwpr140p0) entered disabled state
Dec  5 11:58:09 newearth kernel: [ 5017.693584] device fwpr140p0 entered promiscuous mode
Dec  5 11:58:09 newearth kernel: [ 5017.693624] vmbr0: port 7(fwpr140p0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.693625] vmbr0: port 7(fwpr140p0) entered forwarding state
Dec  5 11:58:09 newearth kernel: [ 5017.696999] fwbr140i0: port 2(tap140i0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.697004] fwbr140i0: port 2(tap140i0) entered disabled state
Dec  5 11:58:09 newearth kernel: [ 5017.697101] fwbr140i0: port 2(tap140i0) entered blocking state
Dec  5 11:58:09 newearth kernel: [ 5017.697102] fwbr140i0: port 2(tap140i0) entered forwarding state
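(Note: the interface names already encode the VM ID: tap<vmid>i<netN> is the guest NIC, fwbr<vmid>i<netN> is that NIC's firewall bridge, and fwln<vmid>i<netN>/fwpr<vmid>p<netN> are the veth pair hooking it into vmbr0, so everything above belongs to VM 140 / net0. A quick way to list the mapping, sketch only:)

Code:
ip -br link show master vmbr0        # ports on vmbr0: one fwpr<vmid>p0 per firewalled VM NIC
ip -br link show master fwbr140i0    # VM 140's firewall bridge: fwln140i0 + tap140i0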
 

Attachments:
  • proxmox_syslog_anonymized.zip (83.1 KB)
Hi,
from the syslog, it seems like you are running the kernel from pve-kernel-5.13.19-1-pve=5.13.19-2. The first part is the ABI version, the latter part is the package version. There was a fix in pve-kernel-5.13.19-1-pve=5.13.19-3 to mitigate similar issues, so please upgrade to at least that one and reboot.

If the issue persists after that, please post the configuration of an affected VM (qm config <ID>).
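(For reference, a quick way to compare the running kernel build with the installed packages; just a sketch, the dpkg pattern is nothing Proxmox-specific:)

Code:
uname -a                          # running kernel: ABI plus build version in the "SMP PVE ..." string
pveversion -v | grep pve-kernel   # installed pve-kernel packages and the meta-package
dpkg -l 'pve-kernel-5.13*'        # package version per kernel ABI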
 
Thanks a lot for your response!
I was already somewhat suspecting a kernel issue. I had updated some days back and maybe did not reboot right away, which is why it only appeared after the reboot on Friday.

I will reboot later when I am back at home and report back whether it helps!

Thanks again!
br
sebastian
 
Okay, so after I thought I'd update the machine remotely from work and reboot it - baaad idea! - I came home.
The Asus did not boot properly any longer.

I had never experienced any problems so far, so I'm a bit lost w.r.t. troubleshooting.

Only today did I find out how to easily boot another kernel.

So now I have booted the older kernel:

Code:
Linux newearth 5.11.22-7-pve #1 SMP PVE 5.11.22-12 (Sun, 07 Nov 2021 21:46:36 +0100) x86_64 GNU/Linux

So far it is working... (well, uptime is ~20 minutes now). But the one VM which in the meantime had completely lost its will to live and did not even boot any more is now booting again! So I'm very optimistic.

This has already cost me way too much time and I'm tired of further debugging right now, so I can't / don't want to reboot into the newest kernel once again just to screenshot the boot error message. I'll have to postpone that to next week or so and will stick with the kernel from two versions back for now.

Thanks again for pointing me to the kernel!

br
sebastian
 
Okay, so after I thought I'd update the machine remotely from work and reboot it - baaad idea! - I came home.
The Asus did not boot properly any longer.
Sorry to hear that.
I had never experienced any problems so far, so I'm a bit lost w.r.t. troubleshooting.

Only today did I find out how to easily boot another kernel.

So now I have booted the older kernel:

Code:
Linux newearth 5.11.22-7-pve #1 SMP PVE 5.11.22-12 (Sun, 07 Nov 2021 21:46:36 +0100) x86_64 GNU/Linux

So far it is working... (well, uptime is ~20 minutes now). But the one VM which in the meantime had completely lost its will to live and did not even boot any more is now booting again! So I'm very optimistic.
But good to hear that.

This has already cost me way too much time and I'm tired of further debugging right now, so I can't / don't want to reboot into the newest kernel once again just to screenshot the boot error message. I'll have to postpone that to next week or so and will stick with the kernel from two versions back for now.
You could also try the new 5.15 kernel and hope that the issues have been fixed in the meantime.

Thanks again for pointing me to the kernel!

br
sebastian
 
I just had the exact same problem. I updated a while ago but forgot to reboot. Then there was a power failure, and afterwards one VM started failing every few hours. I just updated `pve-kernel-5.13.19-2-pve` to `5.13.19-4`. Hope it resolves the issue.
 
I just had the exact same problem. I updated a while ago but forgot to reboot. Then there was a power failure, and afterwards one VM started failing every few hours. I just updated `pve-kernel-5.13.19-2-pve` to `5.13.19-4`. Hope it resolves the issue.
That only made it worse for me.

I can report 16 hours of uptime with

Code:
5.11.22-7-pve #1 SMP PVE 5.11.22-12 (Sun, 07 Nov 2021 21:46:36 +0100) x86_64 GNU/Linux

without any issues - so far.

So maybe I need to find out how to stick with that kernel for a while even though the box runs headless, i.e. how to make this one the default again for reboots (see the sketch at the end of this post)...

Or I could check out the 5.15 kernel that fabian has linked to... sounds interesting. But rebooting my environment takes quite some time each time, as I have VMs with encrypted disk images that need manual care when rebooting...
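(A minimal sketch for pinning a kernel, assuming a reasonably recent proxmox-boot-tool; older setups booting purely via GRUB can achieve the same by setting GRUB_DEFAULT in /etc/default/grub:)

Code:
proxmox-boot-tool kernel list                  # kernels available to boot
proxmox-boot-tool kernel pin 5.11.22-7-pve     # keep booting this one until unpinned
proxmox-boot-tool kernel unpin                 # later: return to the newest installed kernel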
 
Sorry to hear that.

Just want to chime in on the system not booting after updates; the boot issue could possibly be related to an issue being discussed in this topic.
I've noticed the OP also has AMD hardware, and quite a few people (myself included) seem to experience boot issues with the latest production kernel that is being pushed through regular updates. So far all are running AMD hardware. Something to do with initialising the GPU.
The Proxmox team might want to take a look at this as well.
 
I just had the exact same problem. I updated a while ago but forgot to reboot. Then there was a power failure, and afterwards one VM started failing every few hours. I just updated `pve-kernel-5.13.19-2-pve` to `5.13.19-4`. Hope it resolves the issue.
A quick update: VM has been running for 21 hours without issues.
 
Hi,
Just want to chime in on the system not booting after updates; the boot issue could possibly be related to an issue being discussed in this topic.
I've noticed the OP also has AMD hardware, and quite a few people (myself included) seem to experience boot issues with the latest production kernel that is being pushed through regular updates. So far all are running AMD hardware. Something to do with initialising the GPU.
The Proxmox team might want to take a look at this as well.
I did take a look, but unfortunately couldn't find a candidate for a fix to backport. There might be such a fix at some point (if we find one, or coming in via the Ubuntu kernel ours is based on), but can't give a guarantee of course. For now, you can either upgrade to kernel 5.15 or stay at 5.11 (holding 5.13 is not ideal, as you won't get other updates).
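(For the 5.15 route, the opt-in kernel comes as a separate meta-package; a sketch, assuming PVE 7.1 with the standard repositories:)

Code:
apt update
apt install pve-kernel-5.15    # opt-in kernel meta-package, pulls in the newest 5.15 build
reboot                         # the newest installed kernel is then booted by default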
 
