Thanks, can you post the config of one of those VMs? (qm config X)
I just want to add the following backup log:
VMID | NAME | STATUS | TIME | SIZE | FILENAME |
103 | FortisNXFilter | OK | 00:04:14 | 3.47GB | /Backups/dump/vzdump-qemu-103-2019_09_15-23_00_02.vma.lzo |
104 | AtlasNXFilter | OK | 00:04:12 | 5.11GB | /Backups/dump/vzdump-qemu-104-2019_09_15-23_04_16.vma.lzo |
105 | replica.atlasict.co.za | OK | 00:05:25 | 7.54GB | /Backups/dump/vzdump-qemu-105-2019_09_15-23_08_28.vma.lzo |
106 | BuildingAccessControl | FAILED | 00:10:12 | got timeout | |
107 | Spiceworks | FAILED | 00:10:08 | got timeout | |
108 | SolarWinds-NCentral | FAILED | 00:10:05 | got timeout | |
109 | PFSense-AtlasICT | FAILED | 00:10:05 | got timeout | |
110 | FortisTS01 | FAILED | 00:10:09 | got timeout | |
111 | FortisDC | FAILED | 00:10:08 | got timeout | |
112 | LigoWaveController | FAILED | 00:10:08 | got timeout | |
113 | AcutusAccounting | FAILED | 00:10:14 | got timeout | |
115 | FortisMan3000 | FAILED | 00:10:12 | got timeout | |
116 | TrisnetWebServer | FAILED | 00:10:15 | got timeout | |
TOTAL | 01:55:27 | 16.12GB |
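As a sketch (not from the original post), one of the failed jobs can be re-run by hand to see whether the timeout reproduces; <backup-storage> is a placeholder for the backup storage ID:
# re-run the backup for a single guest that timed out, watching the output live
vzdump 106 --mode snapshot --compress lzo --storage <backup-storage>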
VM 5355 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.
2019-09-20 12:11:42 ERROR: migration aborted (duration 00:00:03): VM 5373 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:09 h3 pve-ha-lrm[1185494]: VM 10302 qmp command 'query-status' failed - got timeout#012
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command failed - VM 10302 qmp command 'query-status' failed - got timeout
Sep 23 00:01:19 h3 pve-ha-lrm[1188717]: VM 10302 qmp command 'query-status' failed - got timeout#012
Sep 23 10:18:26 h3 pvesr[2013891]: VM 9803 qmp command failed - VM 9803 qmp command 'guest-ping' failed - got timeout
Sep 23 10:18:26 h3 pvesr[2013891]: Qemu Guest Agent is not running - VM 9803 qmp command 'guest-ping' failed - got timeout
Sep 23 09:14:25 h3 pvedaemon[767596]: VM 10904 qmp command failed - VM 10904 qmp command 'guest-ping' failed - got timeout
Sep 23 09:14:26 h3 pveproxy[737451]: 2019-09-23 09:14:26.259301 +0200 error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 2259 during global destruction.
Sep 23 09:14:20 h3 qm[744483]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout
Sep 23 09:14:20 h3 qm[741842]: VM 10904 qmp command failed - VM 10904 qmp command 'change' failed - got timeout
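As a sketch for collecting these messages over a longer period (plain journalctl/grep, not from the original thread):
# pull all QMP timeout messages from the systemd journal of the last week
journalctl --since "7 days ago" | grep "qmp command .* failed - got timeout"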
root@h3:~# pveversion -V
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-6 (running version: 6.0-6/c71f879f)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.2-pve1
ceph-fuse: 14.2.2-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
pve-zsync: 2.0-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
agent: 1
bootdisk: scsi0
cores: 1
cpu: host
ide2: none,media=cdrom
memory: 8192
name: RCP
net0: virtio=5E:02:9B:11:0A:FD,bridge=vmbr0,tag=98
numa: 1
onboot: 1
ostype: win8
scsi0: ZFS-H3-SSD:vm-9801-disk-0,cache=writeback,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=30dd3844-d706-4c5a-aa1d-8554c2f71143
sockets: 2
vga: qxl
vmgenid: 11a00656-5a67-4e02-a421-65067de1f9b8
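As a sketch for probing whether the QMP monitor of a suspect VM still answers (standard qm subcommands; <vmid> stands for the affected VM ID):
# verbose status is answered via QMP; a hang here mirrors the backup/VNC timeouts
qm status <vmid> --verbose
# interactive monitor; "info status" should come back immediately on a healthy guest
qm monitor <vmid>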
Did the VM react to a shutdown command or did you have to stop it hard?
Last Friday this wasn't the case; even our domain controller became unreachable, which caused massive problems.
The VMs I had issues with did react to normal shutdown commands from the GUI or CLI.
Did the VM react to a shutdown command or did you have to stop it hard?
I'm not sure about that. Maybe that's the case, but for me VMs on other nodes didn't experience this bug, and they were running longer. Other nodes are different hardware for me, though. But the node that experienced the problem is the heaviest loaded one for me.
BTW: This phenomenon is not node-dependent. The "Friday" incident happened on VM-6 while the currently stuck VMs are on VM-3. Let me just check if we have some more VMs on other nodes that are currently stuck.
Thank you all for the new information. We are still trying to figure out what is causing this behavior. We'll keep you updated
Did the VM react to a shutdown command or did you have to stop it hard?
On the VMs that created issues for us, we did not have ballooning enabled at all (we don't use ballooning anywhere), so this is not a factor.
So, some additional information:
Every guest that is currently stuck has ballooning enabled. But not every guest with ballooning is stuck, so this does not have to mean anything.
No additional stuck guests were found, and it hit different guests than on Friday.
PS:
This problem scares me. Our main SQL server is one of the stuck guests, and I really don't want any downtime on it during work hours.
I will gladly help if there is something to dig deeper into.
Oh, and: it hits Windows guests as well as Linux guests.
I have to stop it hard. All stuck VMs still work but cannot be controlled from the UI or the shell.
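For reference, the CLI equivalents of the two actions discussed here (assuming <vmid> is the stuck guest):
# graceful shutdown request, same as the GUI shutdown button
qm shutdown <vmid>
# hard stop for a guest that no longer reacts
qm stop <vmid>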
If you shut it down outside of working hours and then start it fresh, it should be fine for some time (as a workaround). Why your DC failed completely I cannot say for sure :/
We have one stuck VM with the guest agent enabled; the others have it disabled. This VM reacted to a shutdown command from the web GUI while stuck, and it reacts to a shutdown command right after a fresh restart.
As far as I understand the situation, certain commands like the backup or VNC get a timeout, but the shutdown command is passed through to the VM.
- Do you have the guest agent installed in the VMs and enabled in the options?
- Do the VMs react to a shutdown command right after a fresh start?
After a clean start, a VM can run for days or weeks until this issue hits it again, which makes it hard to reproduce and debug.
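A quick way to answer both questions from the CLI (a sketch; <vmid> stands for the VM in question):
# is the agent option enabled in the VM config?
qm config <vmid> | grep '^agent'
# does the agent inside the guest answer? (only meaningful if it is installed)
qm agent <vmid> ping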
TASK ERROR: VM quit/powerdown failed
Create a file /root/trace_patterns with the following content:
handle_qmp_command
monitor_qmp_cmd_in_band
monitor_qmp_cmd_out_of_band
qmp_job_cancel
qmp_job_pause
qmp_job_resume
qmp_job_complete
qmp_job_finalize
qmp_job_dismiss
qmp_block_job_cancel
qmp_block_job_pause
qmp_block_job_resume
qmp_block_job_complete
qmp_block_job_finalize
qmp_block_job_dismiss
qmp_block_stream
monitor_protocol_event_queue
monitor_suspend
monitor_protocol_event_handler
monitor_protocol_event_emit
monitor_protocol_event_queue
You can check the current setting with qm config <vmid>; look out for the args parameter. Then add the trace arguments to the affected guests:
qm set <vmid> --args '-trace events=/root/trace_patterns,file=/root/qemu_trace_<vmid>'
replacing <vmid> with the respective VM IDs. Make sure the trace_patterns file is present on all nodes at the same location.
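Once traces have been collected, a sketch of the cleanup (file name as set above; --args is only applied when the VM starts, so removing it also needs a restart):
# the trace output ends up where the args pointed it
ls -lh /root/qemu_trace_<vmid>
# remove the extra arguments again; applied on the next VM start
qm set <vmid> --delete args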