[SOLVED] Certain VMs from a cluster cannot be backed up and managed

gradinaruvasile · Aug 16, 2019

We have a 4-node Proxmox 6 cluster:

Code:

proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-7
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

On one of the nodes we have some VMs that cannot be managed:
- VNC fails with

Code:

VM ID qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

- Migration fails with

Code:

Task started by HA resource agent
2019-08-16 14:56:20 ERROR: migration aborted (duration 00:00:03): VM 118 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

- Backup fails with

Code:

INFO: Starting Backup of VM 118 (qemu)
INFO: Backup started at 2019-08-16 11:33:30
INFO: status = running
INFO: update VM 118: -lock backup
INFO: VM Name: VMNAME
INFO: include disk 'scsi0' 'zesan-lvm:vm-118-disk-0' 25G
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
/dev/sdc: open failed: No medium found
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: The backup started
INFO: creating archive '/mnt/pve/mntpoint/dump/vzdump-qemu-118-2019_08_16-11_33_30.vma.lzo'
ERROR: got timeout
INFO: aborting backup job
ERROR: VM 118 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 118 failed - got timeout
INFO: Failed at 2019-08-16 11:43:36
INFO: Backup ended
INFO: Backup job finished with errors
TASK ERROR: job errors

It seems that shutting the VMs down and starting them again makes them work.
This already happened a few days ago. We have restarted the VMs since then (actually the whole cluster), but now the issue is back on random machines on the same node.
This is breaking automatic backups for us, and possibly HA.

How can I diagnose this?

Edit:

The logs for the pve-ha-lrm service are full of entries like this:

Code:

Aug 16 11:43:38 srv pve-ha-lrm[25059]: VM 118 qmp command 'query-status' failed - got timeout
Aug 16 11:43:38 srv pve-ha-lrm[25059]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - got timeout
Aug 16 11:43:25 srv pve-ha-lrm[25012]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:25 srv pve-ha-lrm[25012]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:15 srv pve-ha-lrm[24963]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:15 srv pve-ha-lrm[24963]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:05 srv pve-ha-lrm[24912]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:43:05 srv pve-ha-lrm[24912]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:55 srv pve-ha-lrm[24851]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:55 srv pve-ha-lrm[24851]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:45 srv pve-ha-lrm[24802]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:45 srv pve-ha-lrm[24802]: VM 118 qmp command failed - VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
Aug 16 11:42:35 srv pve-ha-lrm[24753]: VM 118 qmp command 'query-status' failed - unable to connect to VM 118 qmp socket - timeout after 31 retries
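
One way to get an overview from logs like these is to count QMP failures per VM and per command. The following is only a minimal sketch, not an official Proxmox tool: the function name `qmp_failures_by_vm` is made up for illustration, and it assumes nothing beyond the message format visible in the log lines above.

```shell
# Count QMP failures per VM ID and QMP command in log text read from stdin.
# Handles the two message shapes seen in this thread:
#   VM <id> qmp command '<cmd>' failed - got timeout
#   VM <id> qmp command '<cmd>' failed - unable to connect to VM <id> qmp socket ...
# The wrapper lines ("VM <id> qmp command failed - VM <id> ...") repeat the
# same event, so they are dropped to avoid counting each failure twice.
# Output: one line per pair, formatted as "<vmid> <command> <count>".
# (qmp_failures_by_vm is a hypothetical helper name.)
qmp_failures_by_vm() {
  grep -E "VM [0-9]+ qmp command '[^']+' failed" \
    | grep -v "qmp command failed - " \
    | sed -E "s/.*VM ([0-9]+) qmp command '([^']+)' failed.*/\1 \2/" \
    | sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

For example, `journalctl -u pve-ha-lrm | qmp_failures_by_vm` would print one line per VM and command, assuming the service logs to the journal under that unit name.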

gradinaruvasile · Aug 17, 2019

So it seems this issue does not just go away.
I had one node where all VMs except one exhibited this issue.
Only some QMP commands work. Shutdown and some others work, but not migrate, backup, or even monitor (the console isn't working).
Over time this seems to spread to other VMs. No idea what to look for.
I rebooted all VMs on a node, then migrated them and restarted the node.
Now it is working, but the issue has appeared on another VM on another node (which had been working fine for some time).

asedev · Aug 20, 2019

I am having the same issue running Virtual Environment 6.0-5.

After a VM has been active for some time, I can neither connect to the console nor run the reset command. Only restarting the VM helps.

Any idea what can cause this issue? I have not found anything helpful in the logs yet either.

EvgenyKh · Aug 20, 2019

I have the same problem. The VM is on qcow2.

Console problem:
VM 111 qmp command 'change' failed - got timeout
TASK ERROR: Failed to run vncproxy.

Backup problem:
INFO: starting new backup job: vzdump 111 --storage only2copy --node pm7 --remove 0 --mode snapshot --compress lzo
INFO: Starting Backup of VM 111 (qemu)
INFO: Backup started at 2019-08-20 09:26:50
INFO: status = running
INFO: update VM 111: -lock backup
INFO: VM Name: vda-dev-b24
INFO: include disk 'scsi0' 'local:111/vm-111-disk-1.qcow2' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/only2copy/dump/vzdump-qemu-111-2019_08_20-09_26_50.vma.lzo'
ERROR: unable to connect to VM 111 qmp socket - timeout after 31 retries
INFO: aborting backup job
ERROR: VM 111 qmp command 'backup-cancel' failed - got timeout
ERROR: Backup of VM 111 failed - unable to connect to VM 111 qmp socket - timeout after 31 retries
INFO: Failed at 2019-08-20 09:37:31
INFO: Backup job finished with errors
TASK ERROR: job errors

Pavel Olenev · Aug 21, 2019

Exactly the same thing on 6.0-5:
some calls fail with "qmp command 'query-machines' failed" and backups get a timeout. It seems that only restarting the VMs helps.

aaron · Aug 22, 2019

Can you guys please tell me on what kind of storage those VMs are running? (LVM, ZFS, qcow2 on NFS, ...)?

This might help us to narrow it down and reproduce the problem.

Pavel Olenev · Aug 22, 2019

aaron said:
Can you guys please tell me on what kind of storage those VMs are running? (LVM, ZFS, qcow2 on NFS, ...)?

This might help us to narrow it down and reproduce the problem.

In my case it is a local LVM thin pool.

asedev · Aug 22, 2019

I am running a single drive based on LVM storage.

gradinaruvasile · Aug 22, 2019

If I remember correctly most were LVM (iSCSI over LVM), but I think there were also some NFS ones. Unfortunately I cannot say for sure for NFS because we have migrated quite a few storages lately.
But this issue is not only backup related. The machines cannot be managed at all: no migration, not even VNC.

EvgenyKh · Aug 22, 2019

I use qcow2 on ext4 on RAID1.

fireon · Aug 26, 2019

Same problem here, using qcow2 on hardware RAID. On ZFS machines we did not have these problems.

Dominic · Aug 26, 2019

Unfortunately, we have had no luck reproducing this so far. Could you please post the following

Code:

cat /etc/pve/storage.cfg

For the relevant VM(s) with id X

Code:

qm config X

Code:

qm showcmd X --pretty
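
To gather all three outputs in one go, something like the following sketch could be used. The function name `collect_vm_diagnostics` and the default output path are hypothetical; only `cat /etc/pve/storage.cfg`, `qm config` and `qm showcmd --pretty` come from the request above.

```shell
# Write the requested diagnostics for each VM ID given as an argument into
# a single text file and print that file's path.
# OUT overrides the output path; the default path is an arbitrary choice.
# (collect_vm_diagnostics is a hypothetical helper name.)
collect_vm_diagnostics() {
  diag_out=${OUT:-/tmp/pve-qmp-diagnostics.txt}
  {
    echo "### /etc/pve/storage.cfg"
    cat /etc/pve/storage.cfg || echo "(could not read storage.cfg)"
    for vmid in "$@"; do
      echo "### qm config $vmid"
      qm config "$vmid" || echo "(qm config $vmid failed)"
      echo "### qm showcmd $vmid --pretty"
      qm showcmd "$vmid" --pretty || echo "(qm showcmd $vmid failed)"
    done
  } > "$diag_out" 2>&1
  echo "$diag_out"
}
```

Calling `collect_vm_diagnostics 118 142` on the affected node would then produce one file that can be reviewed for hostnames, usernames and other private data before posting.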

gradinaruvasile · Aug 26, 2019

Dominic said:
Unfortunately, we have had no luck reproducing this so far. Could you please post the following

Code:

cat /etc/pve/storage.cfg

For the relevant VM(s) with id X

Code:

qm config X

Code:

qm showcmd X --pretty

Unfortunately I have no known misbehaving VMs right now (I don't know how to issue bulk commands that might trigger it, and there is no other indication until you need a console, migration or backup). I restarted all of them, and the issue does not seem to have appeared again since.
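
A bulk check along these lines might help spot affected VMs before a backup or migration fails. This is an untested sketch under a stated assumption: that `qm status <vmid> --verbose` talks to QMP and prints a "qmp command ... failed" message when the monitor is stuck, as the pve-ha-lrm log lines earlier in this thread suggest. The function name `probe_qmp` and the `STUCK`/`ok` labels are made up for illustration.

```shell
# Probe every VM ID given as an argument and report whether QMP answered.
# ASSUMPTION: `qm status <vmid> --verbose` queries QMP and emits a
# "qmp command ... failed" message on timeout. Verify this on your version
# before relying on the result. (probe_qmp is a hypothetical helper name.)
probe_qmp() {
  for vmid in "$@"; do
    if qm status "$vmid" --verbose 2>&1 | grep -q "qmp command .* failed"; then
      echo "$vmid STUCK"
    else
      echo "$vmid ok"
    fi
  done
}
```

It could be run against all running VMs on a node with `probe_qmp $(qm list | awk '$3 == "running" { print $1 }')`, assuming the status is the third column of the `qm list` output. A stuck VM may take a while to report, since each probe waits for the QMP timeout.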

fireon · Aug 26, 2019

Code:

dir: local
        path /var/lib/vz
        content rootdir,images
        maxfiles 0
        shared 0

cifs: iso-images
        path /mnt/pve/iso-images
        server storage.supertux.lan
        share iso-images
        content vztmpl,snippets,iso
        domain supertux.lan
        username proxmox

cifs: sicherung
        path /mnt/pve/sicherung
        server srv-backup01.supertux.lan
        share sicherung
        content backup
        maxfiles 11
        username localbackup

cifs: archiv
        path /mnt/pve/archiv
        server srv-backup01.supertux.lan
        share archiv
        content backup
        maxfiles 5
        username localbackup

cifs: archiv-replica
        path /mnt/pve/archiv-replica
        server srv-backup01.supertux.lan
        share archiv-replica
        content backup
        maxfiles 1
        username localbackup

cifs: archiv02
        path /mnt/pve/archiv02
        server srv-backup02.supertux.lan
        share archiv02
        content backup
        maxfiles 1
        username localbackup

dir: ssd_vm
        path /mnt/pve/ssd_vm
        content rootdir,images
        is_mountpoint 1
        nodes srv-virtu02
        shared 0

Code:

agent: 1
boot: c
bootdisk: virtio0
cores: 12
cpu: kvm64,flags=+pcid
cpulimit: 4
description: VOIP-Server f%C3%BCr Telefonie
hotplug: disk,network,usb,memory
ide2: none,media=cdrom
memory: 4096
name: vsrv-voip.supertux.lan
net0: bridge=vmbr0,virtio=62:64:34:32:37:65,tag=13
numa: 1
onboot: 1
ostype: l26
protection: 1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=9cf6cdfa-8d21-48a9-b46a-c8eaf46ba6df
sockets: 2
usb0: spice
vga: qxl
virtio0: local:142/vm-142-disk-1.qcow2,discard=on,size=50G

Code:

/usr/bin/kvm \
  -id 142 \
  -name vsrv-voip.supertux.lan \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/142.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/142.pid \
  -daemonize \
  -smbios 'type=1,uuid=9cf6cdfa-8d21-48a9-b46a-c8eaf46ba6df' \
  -smp '24,sockets=2,cores=12,maxcpus=24' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc unix:/var/run/qemu-server/142.vnc,password \
  -cpu kvm64,+pcid,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce \
  -m 'size=1024,slots=255,maxmem=4194304M' \
  -object 'memory-backend-ram,id=ram-node0,size=512M' \
  -numa 'node,nodeid=0,cpus=0-11,memdev=ram-node0' \
  -object 'memory-backend-ram,id=ram-node1,size=512M' \
  -numa 'node,nodeid=1,cpus=12-23,memdev=ram-node1' \
  -object 'memory-backend-ram,id=mem-dimm0,size=512M' \
  -device 'pc-dimm,id=dimm0,memdev=mem-dimm0,node=0' \
  -object 'memory-backend-ram,id=mem-dimm1,size=512M' \
  -device 'pc-dimm,id=dimm1,memdev=mem-dimm1,node=1' \
  -object 'memory-backend-ram,id=mem-dimm2,size=512M' \
  -device 'pc-dimm,id=dimm2,memdev=mem-dimm2,node=0' \
  -object 'memory-backend-ram,id=mem-dimm3,size=512M' \
  -device 'pc-dimm,id=dimm3,memdev=mem-dimm3,node=1' \
  -object 'memory-backend-ram,id=mem-dimm4,size=512M' \
  -device 'pc-dimm,id=dimm4,memdev=mem-dimm4,node=0' \
  -object 'memory-backend-ram,id=mem-dimm5,size=512M' \
  -device 'pc-dimm,id=dimm5,memdev=mem-dimm5,node=1' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -readconfig /usr/share/qemu-server/pve-usb.cfg \
  -chardev 'spicevmc,id=usbredirchardev0,name=usbredir' \
  -device 'usb-redir,chardev=usbredirchardev0,id=usbredirdev0,bus=ehci.0' \
  -chardev 'socket,id=serial0,path=/var/run/qemu-server/142.serial0,server,nowait' \
  -device 'isa-serial,chardev=serial0' \
  -device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' \
  -chardev 'socket,path=/var/run/qemu-server/142.qga,server,nowait,id=qga0' \
  -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
  -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
  -spice 'tls-port=61004,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' \
  -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' \
  -chardev 'spicevmc,id=vdagent,name=vdagent' \
  -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:d389d4b77d5' \
  -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2' \
  -drive 'file=/var/lib/vz/images/142/vm-142-disk-1.qcow2,if=none,id=drive-virtio0,discard=on,format=qcow2,cache=none,aio=native,detect-zeroes=unmap' \
  -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
  -netdev 'type=tap,id=net0,ifname=tap142i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=62:64:34:32:37:65,netdev=net0,bus=pci.0,addr=0x12,id=net0' \
  -machine 'type=pc'

EvgenyKh · Aug 26, 2019

Code:

root@pm7:~# cat /etc/pve/storage.cfg
dir: ssd480
    path /mntssd
    content images
    nodes pm7
    shared 0

dir: local
    path /var/lib/vz
    content backup,vztmpl,rootdir,snippets,images,iso
    maxfiles 5
    shared 0

dir: only2copy
    path /var/lib/vz/only2copy
    content backup
    maxfiles 2
    shared 0

Code:

root@pm7:~# qm config 100
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
memory: 24000
name: deb-teleg-smsbanan
net0: e1000=AE:08:A0:43:25:CA,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: ssd480:100/vm-100-disk-0.qcow2,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=05b40800-c959-478a-a4b8-849ba8d7cb02
sockets: 2

Code:

root@pm7:~# qm showcmd 100 --pretty
/usr/bin/kvm \
  -id 100 \
  -name deb-teleg-smsbanan \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/100.pid \
  -daemonize \
  -smbios 'type=1,uuid=05b40800-c959-478a-a4b8-849ba8d7cb02' \
  -smp '2,sockets=2,cores=1,maxcpus=2' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc unix:/var/run/qemu-server/100.vnc,password \
  -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce \
  -m 24000 \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:48073f11533' \
  -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' \
  -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' \
  -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' \
  -drive 'file=/mntssd/images/100/vm-100-disk-0.qcow2,if=none,id=drive-scsi0,discard=on,format=qcow2,cache=none,aio=native,detect-zeroes=unmap' \
  -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
  -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' \
  -device 'e1000,mac=AE:08:A0:43:25:CA,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' \
  -machine 'type=pc'

EvgenyKh · Aug 26, 2019

I assume the VM goes into this state after trying to use the console in the admin panel.
Try switching console fast with several VMs

gradinaruvasile · Aug 27, 2019

EvgenyKh said:
I assume the VM goes into this state after trying to use the console in the admin panel.
Try switching console fast with several VMs

It doesn't work, I tried it. When the VM is in this state, console, migration and backup are all unusable. Sometimes I only found the affected machines after the backups had finished and I saw the failure emails. So it is not triggered by the VNC console. And nothing seems to work except stopping the VM and starting it again.
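
The stop-and-start workaround can be sketched as below. This is only an illustration: the function name `restart_vm` and the 180-second default are assumptions, and it has not been checked against HA-managed VMs, where the HA stack may handle the state change differently. A reboot from inside the guest is not equivalent, because it keeps the same QEMU process running.

```shell
# Cleanly shut down a VM and start it again, which replaces the QEMU
# process. SHUTDOWN_TIMEOUT (seconds) is an illustrative knob; 180 is an
# arbitrary default. The VM is only started if the shutdown succeeded.
# (restart_vm is a hypothetical helper name.)
restart_vm() {
  vmid=$1
  qm shutdown "$vmid" --timeout "${SHUTDOWN_TIMEOUT:-180}" && qm start "$vmid"
}
```

For example, `restart_vm 118` would shut down VM 118 and start it again; if the guest does not shut down within the timeout, the VM is left as it is for manual inspection.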

kdavies · Sep 10, 2019

I currently have the exact same behavior. 6-node cluster, all running on Proxmox-based Ceph storage.
It started this morning and I feel like I have a vague idea of what caused it.

I use InfiniBand for all Ceph + Proxmox traffic (which has been working fine). I decided to use InfiniBand for the NAS (FreeNAS via NFS) as well, and it was a bit hit and miss: some of my hosts couldn't reliably talk to the NAS. So last night, after most of my backups failed, I changed back to a standard 1 GbE copper link. Now one host has 11 machines affected by this (out of about 20 machines on the host).

140	bumblebee	FAILED	00:10:06	got timeout
141	0095-printsys-2	OK	00:06:37	2.95GB	/mnt/pve/backups/dump/vzdump-qemu-141-2019_09_10-05_03_38.vma.lzo
142	nagios-slave	FAILED	00:10:05	got timeout
145	proxmox-mail	FAILED	00:10:06	got timeout
147	0104-haproxy4	FAILED	00:10:05	got timeout
151	saltmaster	FAILED	00:10:09	got timeout

I cannot migrate to another host and I cannot use the VNC console. If I go onto the host itself and try a migration, it clearly tries:

root@hydra1:~# qm migrate 138 hydra3 --online
Requesting HA migration for VM 138 to node hydra3

but then the HA manager fails

task started by HA resource agent
2019-09-10 08:50:21 ERROR: migration aborted (duration 00:00:03): VM 138 qmp command 'query-machines' failed - got timeout
TASK ERROR: migration aborted

Unsure if it's related, but dmesg shows a lot of the interfaces got disabled, then enabled, then in blocking state at roughly the same time:

[Tue Sep 10 03:06:04 2019] fwbr118i0: port 2(tap118i0) entered blocking state
[Tue Sep 10 03:06:04 2019] fwbr118i0: port 2(tap118i0) entered disabled state
[Tue Sep 10 03:06:04 2019] fwbr118i0: port 2(tap118i0) entered blocking state
[Tue Sep 10 03:06:04 2019] fwbr118i0: port 2(tap118i0) entered forwarding state
[Tue Sep 10 03:11:10 2019] fwbr118i0: port 2(tap118i0) entered disabled state
[Tue Sep 10 03:11:10 2019] fwbr118i0: port 1(fwln118i0) entered disabled state
[Tue Sep 10 03:11:10 2019] vmbr0: port 19(fwpr118p0) entered disabled state
[Tue Sep 10 03:11:10 2019] device fwln118i0 left promiscuous mode
[Tue Sep 10 03:11:10 2019] fwbr118i0: port 1(fwln118i0) entered disabled state
[Tue Sep 10 03:11:10 2019] device fwpr118p0 left promiscuous mode
[Tue Sep 10 03:11:10 2019] vmbr0: port 19(fwpr118p0) entered disabled state

Also seeing this in the syslog

Sep 10 00:00:03 hydra1 systemd-udevd[38337]: Using default interface naming scheme 'v240'.
Sep 10 00:00:03 hydra1 systemd-udevd[38337]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 10 00:00:03 hydra1 systemd-udevd[38337]: Could not generate persistent MAC address for tap100i0: No such file or directory
Sep 10 00:00:04 hydra1 pmxcfs[1433]: [dcdb] notice: data verification successful
Sep 10 00:00:04 hydra1 kernel: [472579.807640] device tap100i0 entered promiscuous mode
Sep 10 00:00:04 hydra1 systemd-udevd[38342]: Using default interface naming scheme 'v240'.
Sep 10 00:00:04 hydra1 systemd-udevd[38342]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 10 00:00:04 hydra1 systemd-udevd[38342]: Could not generate persistent MAC address for fwbr100i0: No such file or directory
Sep 10 00:00:04 hydra1 systemd-udevd[38337]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 10 00:00:04 hydra1 systemd-udevd[38337]: Could not generate persistent MAC address for fwpr100p0: No such file or directory
Sep 10 00:00:04 hydra1 systemd-udevd[38340]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep 10 00:00:04 hydra1 systemd-udevd[38340]: Using default interface naming scheme 'v240'.
Sep 10 00:00:04 hydra1 systemd-udevd[38340]: Could not generate persistent MAC address for fwln100i0: No such file or directory
Sep 10 00:00:04 hydra1 kernel: [472579.849073] fwbr100i0: port 1(fwln100i0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.849077] fwbr100i0: port 1(fwln100i0) entered disabled state
Sep 10 00:00:04 hydra1 kernel: [472579.849232] device fwln100i0 entered promiscuous mode
Sep 10 00:00:04 hydra1 kernel: [472579.849324] fwbr100i0: port 1(fwln100i0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.849326] fwbr100i0: port 1(fwln100i0) entered forwarding state
Sep 10 00:00:04 hydra1 kernel: [472579.854674] vmbr0: port 19(fwpr100p0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.854677] vmbr0: port 19(fwpr100p0) entered disabled state
Sep 10 00:00:04 hydra1 kernel: [472579.854820] device fwpr100p0 entered promiscuous mode
Sep 10 00:00:04 hydra1 kernel: [472579.854881] vmbr0: port 19(fwpr100p0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.854883] vmbr0: port 19(fwpr100p0) entered forwarding state
Sep 10 00:00:04 hydra1 kernel: [472579.859972] fwbr100i0: port 2(tap100i0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.859974] fwbr100i0: port 2(tap100i0) entered disabled state
Sep 10 00:00:04 hydra1 kernel: [472579.860189] fwbr100i0: port 2(tap100i0) entered blocking state
Sep 10 00:00:04 hydra1 kernel: [472579.860191] fwbr100i0: port 2(tap100i0) entered forwarding state

EDIT:
This is a config of one of the affected machines:
/usr/bin/kvm -id 142 -name nagios-slave -chardev 'socket,id=qmp,path=/var/run/qemu-server/142.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/142.pid -daemonize -smbios 'type=1,uuid=d24a50e0-9b4e-478d-ba86-4974d37aa272' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/142.vnc,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 1024 -k en-gb -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:2022e4fa592' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:hdd-pool/vm-142-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/hdd-pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap142i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=D2:03:96:17:77:E9,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'

EDIT 2:
Can't monitor the VM from the host either:

root@hydra1:/var/run/qemu-server# qm monitor 142
Entering Qemu Monitor for VM 142 - type 'help' for help
qm> help
ERROR: VM 142 qmp command 'human-monitor-command' failed - got timeout

Sarel Pretorius · Sep 14, 2019

Good Day

I am having the same issue: some VMs back up and can be managed via noVNC, while others get a timeout when backing up and cannot be managed via noVNC.

Sarel Pretorius · Sep 17, 2019

Sarel Pretorius said:
Good Day

I am having the same issue: some VMs back up and can be managed via noVNC, while others get a timeout when backing up and cannot be managed via noVNC.

I just want to add the following backup log:
