Hi,
TL;DR is at the end.
I have been struggling with this for quite some time now, have finally run out of ideas, and hence turn to the wisdom of the forums...
My Proxmox VE host had been running for over a month without any noticeable issue, and I was very happy with having made the switch from plain Debian + libvirt to Proxmox.
My setup is:
HW: Asus PN51
+48 GB RAM
+1 TB NVMe
+2 TB SSD
I run:
2 LXC containers
8 VMs (7 Debian, 1 Ubuntu Server)
As said, I did not have any problem whatsoever in over a month of constant uptime.
On Friday I needed to make some changes to the general setup of my infrastructure, and as I physically moved the Asus PN51, I shut it down.
After completing the maintenance work I restarted it, and that is when my problems started.
First, some 5-30 minutes after the reboot, 2 of the VMs suddenly:
- stopped their work (the service they were supposed to deliver)
- did not respond to ping any more
- I could not access the console in Proxmox VE web, VNC failed to load
I tried to stop the VMs via the web UI, but the operation always seemed to time out. So I had to
Code:
qm stop 123
the VMs and could then reboot them. I thought that was it.
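(Side note, in case qm stop itself ever hangs too: as far as I know one can kill the QEMU process directly; just a sketch, with 123 again as the placeholder VM ID.)
Code:
# PVE keeps the PID of each running VM's QEMU process here:
cat /var/run/qemu-server/123.pid
# ask the process to terminate; only resort to -9 if it refuses to die
kill $(cat /var/run/qemu-server/123.pid)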
Then it got weird. The Nextcloud VM did this constantly: I could restart it, and after 5-30 minutes it always froze again, with no more network access and no more console.
The funny thing is that the Proxmox summary for this VM shows memory and CPU usage (the latter very low, though) but no disk or network I/O.
This is now the situation for 5 VMs so far. They work for some time; one even worked over night without problems, but froze early this morning.
For 3 of the VMs I have not yet experienced any freeze.
There is nothing weird in the VMs' syslog, not even when I tried yesterday with dmesg -w and just waited for the VM to freeze. It simply stopped at some point.
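One idea I still want to try (just a sketch, with VM 120 as the example): give a VM a serial console, so its last kernel messages can still be watched from the host even when the guest's own disk logging dies with the freeze:
Code:
# attach a serial socket to VM 120 (takes effect on the next VM start)
qm set 120 -serial0 socket
# inside the guest, make the kernel log to ttyS0, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200"
# then run update-grub in the guest and reboot it; afterwards, on the host:
qm terminal 120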
So I started writing this post after I had rebooted the Proxmox host once more, 28 minutes ago. One VM is frozen again :|
In /var/log/syslog of the Proxmox host, I see a lot of
Code:
Dec 5 10:40:26 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
And these QMP failures seem to correspond nicely with the freezing of the machines.
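To see whether the stuck QEMU process itself still answers, I guess one could probe that socket directly (a sketch; socat may need to be installed first):
Code:
# the socket pvestatd times out on lives here on PVE:
ls -l /var/run/qemu-server/120.qmp
# a healthy QEMU sends a JSON greeting as soon as something connects;
# a hung one stays silent (socat may need: apt install socat)
socat - UNIX-CONNECT:/var/run/qemu-server/120.qmp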
TL;DR:
I attached today's syslog. The reboot happened at around 10:30 CET; since then VM 120 is frozen again. I hope the others won't freeze too soon.
Maybe someone can spot something odd in there (I just anonymized it a little with respect to the email host). Or maybe someone has an idea where / in which logfile I could look.
Thanks for your help, I really appreciate it!
br
Sebastian
Update:
Okay... now it is 3 machines again. One machine I had restored from Wednesday's backup, because the MySQL data was corrupted.
It happened some minutes ago... Here is the syslog, and I am attaching the "Summary" screenshots, because it looks weird:
Code:
Dec 5 12:01:27 newearth pvestatd[1050]: status update time (12.249 seconds)
Dec 5 12:01:36 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:01:39 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:01:39 newearth pvestatd[1050]: status update time (12.264 seconds)
Dec 5 12:01:48 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:01:51 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:01:51 newearth pvestatd[1050]: status update time (12.250 seconds)
Dec 5 12:02:00 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:02:03 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:02:04 newearth pvestatd[1050]: status update time (12.246 seconds)
Dec 5 12:02:13 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:02:16 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - got timeout
Dec 5 12:02:19 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:02:19 newearth pvestatd[1050]: status update time (15.241 seconds)
Dec 5 12:02:20 newearth pvedaemon[1079]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:02:31 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:02:34 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:02:37 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:02:37 newearth pvestatd[1050]: status update time (18.265 seconds)
Dec 5 12:02:42 newearth pvedaemon[1078]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:02:49 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:02:52 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:02:55 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:02:55 newearth pvestatd[1050]: status update time (18.269 seconds)
Dec 5 12:03:05 newearth pvedaemon[1078]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:03:07 newearth pvestatd[1050]: VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout after 31 retries
Dec 5 12:03:10 newearth pvestatd[1050]: VM 140 qmp command failed - VM 140 qmp command 'query-proxmox-support' failed - unable to connect to VM 140 qmp socket - timeout after 31 retries
Dec 5 12:03:13 newearth pvestatd[1050]: VM 130 qmp command failed - VM 130 qmp command 'query-proxmox-support' failed - unable to connect to VM 130 qmp socket - timeout after 31 retries
Dec 5 12:03:14 newearth pvestatd[1050]: status update time (18.254 seconds)
The VM still has a "brain" (so it is not brain-dead... but somehow in a coma?), because CPU and memory are still "there".
But there is no disk I/O or network I/O.
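What I would check next on the host (just a sketch; the QEMU processes are simply called kvm on PVE) is whether the process is stuck in uninterruptible disk I/O:
Code:
# "D" in the STAT column (uninterruptible sleep) would mean the
# process is stuck waiting on disk I/O
ps -o pid,stat,wchan:30,cmd -C kvm
# host-side storage errors or hung-task warnings would show up here:
dmesg -T | grep -iE 'nvme|ata|hung task|blocked'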
A little earlier in the syslog I see these lines; I wonder if they are related. I have no idea how to match the vmbr0 ports to VMs:
Code:
Dec 5 11:58:08 newearth systemd-udevd[20747]: Using default interface naming scheme 'v247'.
Dec 5 11:58:08 newearth systemd-udevd[20747]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 5 11:58:09 newearth kernel: [ 5017.660581] device tap140i0 entered promiscuous mode
Dec 5 11:58:09 newearth systemd-udevd[20746]: Using default interface naming scheme 'v247'.
Dec 5 11:58:09 newearth systemd-udevd[20746]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 5 11:58:09 newearth systemd-udevd[20746]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 5 11:58:09 newearth systemd-udevd[20747]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 5 11:58:09 newearth kernel: [ 5017.689234] fwbr140i0: port 1(fwln140i0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.689239] fwbr140i0: port 1(fwln140i0) entered disabled state
Dec 5 11:58:09 newearth kernel: [ 5017.689298] device fwln140i0 entered promiscuous mode
Dec 5 11:58:09 newearth kernel: [ 5017.689333] fwbr140i0: port 1(fwln140i0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.689334] fwbr140i0: port 1(fwln140i0) entered forwarding state
Dec 5 11:58:09 newearth kernel: [ 5017.693442] vmbr0: port 7(fwpr140p0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.693452] vmbr0: port 7(fwpr140p0) entered disabled state
Dec 5 11:58:09 newearth kernel: [ 5017.693584] device fwpr140p0 entered promiscuous mode
Dec 5 11:58:09 newearth kernel: [ 5017.693624] vmbr0: port 7(fwpr140p0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.693625] vmbr0: port 7(fwpr140p0) entered forwarding state
Dec 5 11:58:09 newearth kernel: [ 5017.696999] fwbr140i0: port 2(tap140i0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.697004] fwbr140i0: port 2(tap140i0) entered disabled state
Dec 5 11:58:09 newearth kernel: [ 5017.697101] fwbr140i0: port 2(tap140i0) entered blocking state
Dec 5 11:58:09 newearth kernel: [ 5017.697102] fwbr140i0: port 2(tap140i0) entered forwarding state
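(If I read the Proxmox naming convention right, the VM ID seems to be encoded in the interface names: tap140i0 would be net0 of VM 140, and fwbr140i0 / fwln140i0 / fwpr140p0 its firewall-bridge plumbing, so the lines above should all belong to VM 140. A quick sketch to cross-check:)
Code:
# list all ports attached to vmbr0; the number in each name is the VM ID
ip -br link show master vmbr0
# cross-check against the NIC definition in that VM's config:
grep ^net /etc/pve/qemu-server/140.conf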