VMs suddenly stopped and I have to start them again.

hassoon

Active Member
Jan 27, 2020
I'm having a strange problem. My old IBM x3650 M4 server stopped working completely, so I had to replace the whole server with a working one and insert the old drives into the new server's SAS bays.
Everything seemed to work normally on the new server, but after two weeks of smooth operation the VMs started to halt by themselves.
I've checked the syslog and I can't find anything strange apart from:

kernel: L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Nov 07 08:03:53 awpve kernel: device tap777i0 entered promiscuous mode
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered blocking state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered blocking state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered forwarding state
Nov 07 08:03:54 awpve kernel: device tap777i1 entered promiscuous mode
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered blocking state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered blocking state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered forwarding state
I have changed the network driver from VirtIO to Intel, still the same thing. I've reduced RAM and CPU usage and still the same outcome.
I'm on PVE 8.0.3
and ran apt update && apt upgrade
and still the same issue.
Could this be an IBM firmware issue?
Only the VM stops; the server keeps running without showing any error.
 
One thing I notice right before the sudden halt is the below:


Nov 07 12:30:02 awpve pveproxy[3342]: worker 714147 finished
Nov 07 12:30:02 awpve pveproxy[3342]: starting 1 worker(s)
Nov 07 12:30:02 awpve pveproxy[3342]: worker 811932 started
Nov 07 12:30:08 awpve pvedaemon[607170]: worker exit
Nov 07 12:30:08 awpve pvedaemon[3332]: worker 607170 finished
Nov 07 12:30:08 awpve pvedaemon[3332]: starting 1 worker(s)
Nov 07 12:30:08 awpve pvedaemon[3332]: worker 812175 started
Nov 07 12:30:10 awpve pvedaemon[612950]: <root@pam> successful auth for user 'xxxxxx@pam'
Nov 07 12:30:36 awpve kernel: zd32: p1 p2
Nov 07 12:30:36 awpve kernel: zd0: p1
Nov 07 12:30:36 awpve pvedaemon[812175]: VM 777 qmp command failed - VM 777 qmp command 'guest-ping' failed - got timeout
Nov 07 12:30:36 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 12:30:37 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 12:30:37 awpve qmeventd[2768]: read: Connection reset by peer
Nov 07 12:30:37 awpve pvedaemon[812175]: VM 777 qmp command failed - VM 777 not running
Nov 07 12:30:37 awpve pvestatd[3306]: VM 777 qmp command failed - VM 777 not running
Nov 07 12:30:37 awpve systemd[1]: 777.scope: Deactivated successfully.
Nov 07 12:30:37 awpve systemd[1]: 777.scope: Consumed 19min 39.571s CPU time.
Nov 07 12:30:38 awpve qmeventd[817748]: Starting cleanup for 777
Nov 07 12:30:38 awpve qmeventd[817748]: Finished cleanup for 777
 
Could it be that the technician put the SAS drives in a different order, so extra ZFS commands are needed?
I just did a scrub and it shows no errors.
Do I need to resilver?
 
Hi,
I'm having a strange problem. My old IBM x3650 M4 server stopped working completely, so I had to replace the whole server with a working one and insert the old drives into the new server's SAS bays.
Everything seemed to work normally on the new server, but after two weeks of smooth operation the VMs started to halt by themselves.
I've checked the syslog and I can't find anything strange apart from:

kernel: L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
I'd try turning SMT or EPT off, following the kernel documentation.
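If you want to try that, it looks roughly like this (standard kernel interfaces; double-check against the L1TF document linked above before applying this on a production host):

```shell
# Check the current L1TF mitigation status first:
cat /sys/devices/system/cpu/vulnerabilities/l1tf

# Option 1: disable SMT at runtime (takes effect immediately,
# not persistent across reboots):
echo off > /sys/devices/system/cpu/smt/control

# Option 2: disable EPT for KVM instead (all VMs must be stopped to
# reload the module; alternatively set kvm-intel.ept=0 on the kernel
# command line and reboot):
modprobe -r kvm_intel
modprobe kvm_intel ept=0
```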
Nov 07 08:03:53 awpve kernel: device tap777i0 entered promiscuous mode
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered blocking state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered blocking state
Nov 07 08:03:53 awpve kernel: vmbr0: port 2(tap777i0) entered forwarding state
Nov 07 08:03:54 awpve kernel: device tap777i1 entered promiscuous mode
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered blocking state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered blocking state
Nov 07 08:03:54 awpve kernel: vmbr1: port 2(tap777i1) entered forwarding state
I have changed the network driver from VirtIO to Intel, still the same thing. I've reduced RAM and CPU usage and still the same outcome.
I'm on PVE 8.0.3
and ran apt update && apt upgrade
Always use full-upgrade or dist-upgrade on Proxmox VE systems (see: https://lore.proxmox.com/pve-devel/20240909102050.40220-1-f.ebner@proxmox.com/)
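In practice that means something like:

```shell
apt update
apt full-upgrade   # or: apt dist-upgrade (equivalent for this purpose)
```

A plain apt upgrade can hold back packages whose upgrade needs new dependencies, which on Proxmox VE can leave you with a partially upgraded system.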
and still the same issue.
What kernel are you running now?
Could this be an IBM firmware issue?
Only the VM stops; the server keeps running without showing any error.
Make sure you have latest CPU microcode installed: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
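Once the package from that chapter is installed and the host rebooted, you can verify the microcode was actually applied, for example:

```shell
# Kernel messages about early microcode loading:
journalctl -k | grep -i microcode

# Currently loaded microcode revision as seen by the CPU:
grep -m1 microcode /proc/cpuinfo
```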
one note I can see before sudden halt is the below


Nov 07 12:30:02 awpve pveproxy[3342]: worker 714147 finished
Nov 07 12:30:02 awpve pveproxy[3342]: starting 1 worker(s)
Nov 07 12:30:02 awpve pveproxy[3342]: worker 811932 started
Nov 07 12:30:08 awpve pvedaemon[607170]: worker exit
Nov 07 12:30:08 awpve pvedaemon[3332]: worker 607170 finished
Nov 07 12:30:08 awpve pvedaemon[3332]: starting 1 worker(s)
Nov 07 12:30:08 awpve pvedaemon[3332]: worker 812175 started
Nov 07 12:30:10 awpve pvedaemon[612950]: <root@pam> successful auth for user 'xxxxxx@pam'
Nov 07 12:30:36 awpve kernel: zd32: p1 p2
Nov 07 12:30:36 awpve kernel: zd0: p1
Nov 07 12:30:36 awpve pvedaemon[812175]: VM 777 qmp command failed - VM 777 qmp command 'guest-ping' failed - got timeout
Nov 07 12:30:36 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 12:30:37 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 12:30:37 awpve qmeventd[2768]: read: Connection reset by peer
Nov 07 12:30:37 awpve pvedaemon[812175]: VM 777 qmp command failed - VM 777 not running
Nov 07 12:30:37 awpve pvestatd[3306]: VM 777 qmp command failed - VM 777 not running
Nov 07 12:30:37 awpve systemd[1]: 777.scope: Deactivated successfully.
Nov 07 12:30:37 awpve systemd[1]: 777.scope: Consumed 19min 39.571s CPU time.
Nov 07 12:30:38 awpve qmeventd[817748]: Starting cleanup for 777
Nov 07 12:30:38 awpve qmeventd[817748]: Finished cleanup for 777
That is strange: there is no crash or shutdown shown, but the process seems to be gone (hence the systemd scope is deactivated). Can you install systemd-coredump (apt install systemd-coredump)? That should catch crashes (at least in userspace).
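A sketch of that workflow (coredumpctl options as documented in the systemd manpages):

```shell
# Install the handler so userspace crashes (e.g. a dying QEMU process)
# are captured:
apt install systemd-coredump

# After the next VM halt, list the most recent core dump, if any:
coredumpctl -1

# And show details / a backtrace for it:
coredumpctl info -1
```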
 
Hi,

I'd try turning SMT or EPT off, following the kernel documentation.

Always use full-upgrade or dist-upgrade on Proxmox VE systems (see: https://lore.proxmox.com/pve-devel/20240909102050.40220-1-f.ebner@proxmox.com/)

What kernel are you running now?

Make sure you have latest CPU microcode installed: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu

That is strange: there is no crash or shutdown shown, but the process seems to be gone (hence the systemd scope is deactivated). Can you install systemd-coredump (apt install systemd-coredump)? That should catch crashes (at least in userspace).
Thanks

Just got another halt.

Here is what I saw:

Nov 07 13:28:30 awpve pvedaemon[3332]: worker 718783 finished
Nov 07 13:28:30 awpve pvedaemon[3332]: starting 1 worker(s)
Nov 07 13:28:30 awpve pvedaemon[3332]: worker 3154004 started
Nov 07 13:30:44 awpve pvedaemon[3071133]: <root@pam> successful auth for user 'root@pam'
Nov 07 13:31:18 awpve pvedaemon[2900375]: VM 777 qmp command failed - VM 777 qmp command 'guest-ping' failed - got timeout
Nov 07 13:31:18 awpve kernel: zd32: p1 p2
Nov 07 13:31:18 awpve kernel: zd0: p1
Nov 07 13:31:19 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 13:31:19 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 13:31:19 awpve qmeventd[2768]: read: Connection reset by peer
Nov 07 13:31:19 awpve pvedaemon[3154004]: VM 777 qmp command failed - VM 777 not running
Nov 07 13:31:19 awpve systemd[1]: 777.scope: Deactivated successfully.
Nov 07 13:31:19 awpve systemd[1]: 777.scope: Consumed 37min 15.765s CPU time.
Nov 07 13:31:20 awpve qmeventd[3163562]: Starting cleanup for 777
Nov 07 13:31:20 awpve qmeventd[3163562]: Finished cleanup for 777


kernel is:
Linux awpve 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
 
Thanks

Just got another halt.

Here is what I saw:

Nov 07 13:28:30 awpve pvedaemon[3332]: worker 718783 finished
Nov 07 13:28:30 awpve pvedaemon[3332]: starting 1 worker(s)
Nov 07 13:28:30 awpve pvedaemon[3332]: worker 3154004 started
Nov 07 13:30:44 awpve pvedaemon[3071133]: <root@pam> successful auth for user 'root@pam'
Nov 07 13:31:18 awpve pvedaemon[2900375]: VM 777 qmp command failed - VM 777 qmp command 'guest-ping' failed - got timeout
Nov 07 13:31:18 awpve kernel: zd32: p1 p2
Nov 07 13:31:18 awpve kernel: zd0: p1
Nov 07 13:31:19 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 13:31:19 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 13:31:19 awpve qmeventd[2768]: read: Connection reset by peer
Nov 07 13:31:19 awpve pvedaemon[3154004]: VM 777 qmp command failed - VM 777 not running
Nov 07 13:31:19 awpve systemd[1]: 777.scope: Deactivated successfully.
Nov 07 13:31:19 awpve systemd[1]: 777.scope: Consumed 37min 15.765s CPU time.
Nov 07 13:31:20 awpve qmeventd[3163562]: Starting cleanup for 777
Nov 07 13:31:20 awpve qmeventd[3163562]: Finished cleanup for 777
Did you already install systemd-coredump before this crash? If yes, what does coredumpctl -1 say?

kernel is:
Linux awpve 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
I'd try upgrading to the current 6.8 kernel.
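On Proxmox VE 8 that should be an opt-in package along these lines (the exact package name is an assumption here; check apt search proxmox-kernel on your host):

```shell
apt update
apt install proxmox-kernel-6.8
reboot

# Afterwards, confirm the running kernel:
uname -r
```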
 
Did you already install systemd-coredump before this crash? If yes, what does coredumpctl -1 say?


I'd try upgrading to the current 6.8 kernel.
Thanks!
I've just done the below:

root@awpve:~# apt install intel-microcode
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
iucode-tool
The following NEW packages will be installed:
intel-microcode iucode-tool
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,052 kB of archives.
After this operation, 14.2 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://ftp.tr.debian.org/debian bookworm/main amd64 iucode-tool amd64 2.3.1-3 [56.1 kB]
Get:2 http://ftp.tr.debian.org/debian bookworm/non-free-firmware amd64 intel-microcode amd64 3.20240813.1~deb12u1 [6,996 kB]
Fetched 7,052 kB in 9s (806 kB/s)
Selecting previously unselected package iucode-tool.
(Reading database ... 45161 files and directories currently installed.)
Preparing to unpack .../iucode-tool_2.3.1-3_amd64.deb ...
Unpacking iucode-tool (2.3.1-3) ...
Selecting previously unselected package intel-microcode.
Preparing to unpack .../intel-microcode_3.20240813.1~deb12u1_amd64.deb ...
Unpacking intel-microcode (3.20240813.1~deb12u1) ...
Setting up iucode-tool (2.3.1-3) ...
Setting up intel-microcode (3.20240813.1~deb12u1) ...
intel-microcode: microcode will be updated at next boot
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for initramfs-tools (0.142+deb12u1) ...
update-initramfs: Generating /boot/initrd.img-6.2.16-3-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..#.........]
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
root@awpve:~#
 
What error do you get?
Here is the latest from the syslog:

Nov 07 18:51:16 awpve systemd[4001909]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Nov 07 18:51:16 awpve systemd[4001909]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 07 18:51:16 awpve systemd[4001909]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 07 18:51:16 awpve systemd[4001909]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 07 18:51:16 awpve systemd[4001909]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 07 18:51:16 awpve systemd[4001909]: Removed slice app.slice - User Application Slice.
Nov 07 18:51:16 awpve systemd[4001909]: Reached target shutdown.target - Shutdown.
Nov 07 18:51:16 awpve systemd[4001909]: Finished systemd-exit.service - Exit the Session.
Nov 07 18:51:16 awpve systemd[4001909]: Reached target exit.target - Exit the Session.
Nov 07 18:51:16 awpve systemd[1]: user@0.service: Deactivated successfully.
Nov 07 18:51:16 awpve systemd[1]: Stopped user@0.service - User Manager for UID 0.
Nov 07 18:51:16 awpve systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 07 18:51:16 awpve systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 07 18:51:16 awpve systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 07 18:51:16 awpve systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 07 18:51:16 awpve systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Nov 07 18:51:16 awpve systemd[1]: user-0.slice: Consumed 1.569s CPU time.
Nov 07 19:07:43 awpve kernel: zd32: p1 p2
Nov 07 19:07:43 awpve kernel: zd0: p1
Nov 07 19:07:44 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 19:07:44 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state
Nov 07 19:07:44 awpve qmeventd[2768]: read: Connection reset by peer
Nov 07 19:07:44 awpve systemd[1]: 777.scope: Deactivated successfully.
Nov 07 19:07:44 awpve systemd[1]: 777.scope: Consumed 8min 26.765s CPU time.
Nov 07 19:07:45 awpve qmeventd[4032627]: Starting cleanup for 777
Nov 07 19:07:45 awpve qmeventd[4032627]: Finished cleanup for 777
Nov 07 19:17:01 awpve CRON[4033930]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nov 07 19:17:01 awpve CRON[4033931]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Nov 07 19:17:01 awpve CRON[4033930]: pam_unix(cron:session): session closed for user root

I noticed that once the below is shown:

Nov 07 19:07:43 awpve kernel: zd32: p1 p2
Nov 07 19:07:43 awpve kernel: zd0: p1
Nov 07 19:07:44 awpve kernel: vmbr0: port 2(tap777i0) entered disabled state
Nov 07 19:07:44 awpve kernel: vmbr1: port 2(tap777i1) entered disabled state


I see the VM being halted.
Why is the tap device changing state here? Is this due to a request from the Windows Server 2016 guest?
 
What error do you get?
root@awpve:~# apt dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


root@awpve:~# uname -a
Linux awpve 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux

How do I upgrade the kernel?
 
