[SOLVED] Incoherent VM status following HA failover (bug?)

Le PAH

Hello all,

I'm experimenting with HA clustering and I ran into a possible bug.

The VM is configured with an HA group so that it preferably runs on node 3:
Code:
root@srv-pve1:~# cat /etc/pve/ha/groups.cfg
group: PVE3_First
   comment PVE3 is prefered for this group
   nodes srv-pve3
   nofailback 0
   restricted 0

Code:
root@srv-pve1:~# cat /etc/pve/ha/resources.cfg
vm: 244
    comment Associates OMV4 on PVE3
    group PVE3_First
    state started
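
For reference, the same setup could be created from the CLI roughly as follows (a hedged sketch; the group and VM names are the ones used above):
Code:
# Create the HA group that prefers srv-pve3, then register VM 244 in it
ha-manager groupadd PVE3_First --nodes srv-pve3 --nofailback 0 --restricted 0
ha-manager add vm:244 --group PVE3_First --state started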


The cluster is healthy and quorate:
Code:
root@srv-pve1:~# ha-manager status
quorum OK
master srv-pve2 (active, Wed Mar 20 10:10:49 2019)
lrm srv-pve1 (active, Wed Mar 20 10:10:48 2019)
lrm srv-pve2 (active, Wed Mar 20 10:10:51 2019)
lrm srv-pve3 (active, Wed Mar 20 10:10:54 2019)
service vm:244 (srv-pve3, started)


I shut down node 3 and the VM is shut down as well, as per the default HA policy.


HA detects that node PVE3 is unreachable and that the VM is stopped, then tries and fails to start the VM…
Code:
task started by HA resource agent
TASK ERROR: start failed: command '/usr/bin/kvm -id 244 -name SRV-OMV-VIDEO4 -chardev 'socket,id=qmp,path=/var/run/qemu-server/244.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/244.pid -daemonize -smbios 'type=1,uuid=8868a218-d321-4b38-afc4-dc56108658b9' -smp '8,sockets=2,cores=4,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/244.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=449229ce-33c8-430e-b801-4132a9633ac7' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/244.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:9d7c74e48281' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-244-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -drive 'file=rbd:STD_POOL/vm-244-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1' -netdev 'type=tap,id=net0,ifname=tap244i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:67:F6:99:BD:4C,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap244i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=6A:33:04:C1:03:F9,netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301' -machine 'type=pc'' failed: got timeout


… but HA nevertheless considers the VM started:
Code:
root@srv-pve1:~# ha-manager status
quorum OK
master srv-pve2 (active, Wed Mar 20 10:23:09 2019)
lrm srv-pve1 (active, Wed Mar 20 10:23:07 2019)
lrm srv-pve2 (active, Wed Mar 20 10:23:09 2019)
lrm srv-pve3 (old timestamp - dead?, Wed Mar 20 10:13:39 2019)
service vm:244 (srv-pve1, started)


The same status is reported on the QEMU side:
Code:
root@srv-pve1:~# qm status 244
status: running


All the while, the PVE Cluster is healthy:
Code:
root@srv-pve1:~# pvecm status
Quorum information
------------------
Date:             Wed Mar 20 10:23:30 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/756
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.101 (local)
0x00000002          1 10.0.0.102


The VM is unreachable: no console, no ping, no services, no guest agent…
I didn't try to power-cycle the VM while it was in this incoherent state (HA vs. QM).
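
One check that might help pin down the incoherence (a hedged sketch; the pidfile path is taken from the KVM command line above):
Code:
# Is the KVM process actually alive, and in which state? (D or Z would explain the hang)
ps -o pid,stat,etime,cmd -p "$(cat /var/run/qemu-server/244.pid)"
# The verbose status queries the monitor, so it may show whether the QMP socket still answers
qm status 244 --verbose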


Everything works properly as soon as I power node 3 back on:
Code:
root@srv-pve3:~# qm status 244
status: running

Code:
root@srv-pve1:~# ha-manager status
quorum OK
master srv-pve2 (active, Wed Mar 20 10:49:49 2019)
lrm srv-pve1 (active, Wed Mar 20 10:49:48 2019)
lrm srv-pve2 (active, Wed Mar 20 10:49:51 2019)
lrm srv-pve3 (active, Wed Mar 20 10:49:54 2019)
service vm:244 (srv-pve3, started)

Code:
root@srv-pve1:~# pvecm status
Quorum information
------------------
Date:             Wed Mar 20 10:49:35 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/760
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.101 (local)
0x00000002          1 10.0.0.102
0x00000003          1 10.0.0.103

The VM is back on node 3; pings and services are operational.


Regarding the environment, the cluster is up-to-date and all nodes run the same configuration:
Code:
root@srv-pve1:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.3-11 (running version: 5.3-11/d4907f84)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.11-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-47
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-39
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-23
pve-cluster: 5.0-33
pve-container: 2.0-35
pve-docs: 5.3-3
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-8
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo1


Storage is provided by a healthy Ceph cluster:
Code:
root@srv-pve1:~# ceph-brag
{
  "cluster_creation_date": "2018-10-11 17:51:26.259055",
  "uuid": "dd52bfc1-5409-4730-8f3a-72637478418a",
  "components_count": {
    "num_data_bytes": 7485079178309,
    "num_mons": 3,
    "num_pgs": 672,
    "num_mdss": 1,
    "num_pools": 3,
    "num_osds": 18,
    "num_bytes_total": 72002146295808,
    "num_objects": 1788813
  },
  "crush_types": [
    {
      "count": 6,
      "type": "host"
    },
    {
      "count": 2,
      "type": "root"
    },
    {
      "count": 18,
      "type": "devices"
    }
  ],
  "ownership": {},
  "pool_metadata": [
    {
      "type": 1,
      "id": 6,
      "size": 2
    },
    {
      "type": 1,
      "id": 9,
      "size": 3
    },
    {
      "type": 1,
      "id": 10,
      "size": 3
    }
  ],
  "sysinfo": {
    "kernel_types": [
      {
        "count": 18,
        "type": "#1 SMP PVE 4.15.18-35 (Wed, 13 Mar 2019 08:24:42 +0100)"
      }
    ],
    "cpu_archs": [
      {
        "count": 18,
        "arch": "x86_64"
      }
    ],
    "cpus": [
      {
        "count": 18,
        "cpu": "Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz"
      }
    ],
    "kernel_versions": [
      {
        "count": 18,
        "version": "4.15.18-12-pve"
      }
    ],
    "ceph_versions": [
      {
        "count": 18,
        "version": "12.2.11(c96e82ac735a75ae99d4847983711e1f2dbf12e5)"
      }
    ],
    "os_info": [
      {
        "count": 18,
        "os": "Linux"
      }
    ],
    "distros": []
  }
}

Could this be a bug that I need to report, or is it a configuration problem?

BR, PAH.
 
I've got the exact same issue with a different VM, in a different HA group, on a different node:

Code:
task started by HA resource agent
TASK ERROR: start failed: command '/usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:7dbdeef98674' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
 
It seems that the QEMU process runs (thus HA sees it as 'running') but it does not fully boot for whatever reason.
You can check what state the VM is in with the QEMU monitor (qm monitor ID),
or you can try to launch the command line in the foreground to see if some error pops up
(qm showcmd ID gives you the command line; when you delete the '-daemonize' part of it, the VM starts in the foreground).
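
In concrete terms, that could look like the following (a hedged sketch for VM 106; qm showcmd prints the generated KVM command line):
Code:
# Dump the command line, drop -daemonize and run it in the foreground
qm showcmd 106 > /tmp/vm106-start.sh
sed -i 's/ -daemonize//' /tmp/vm106-start.sh
bash /tmp/vm106-start.sh    # any startup error now prints directly to the terminal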
 
Thanks for your reply :)

Using the QEMU monitor is not successful:
Code:
root@srv-pve3:~# qm monitor 106
Entering Qemu Monitor for VM 106 - type 'help' for help
qm> stop
ERROR: VM 106 qmp command 'human-monitor-command' failed - unable to connect to VM 106 qmp socket - timeout after 31 retries


Trying to interact with the VM by running the KVM command line directly fails as well:
Code:
root@srv-pve3:~# /usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:7dbdeef98674' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'
kvm: could not acquire pid file: Resource temporarily unavailable

Trying to stop it through the GUI fails until it falls back to SIGTERM:
Code:
task started by HA resource agent
VM quit/powerdown failed - terminating now with SIGTERM
TASK OK
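
If the process really cannot be stopped cleanly, here is a hedged sketch of clearing it by hand (the pidfile path comes from the KVM command line above; the HA state should be set to ignored/disabled first so HA does not immediately restart the service):
Code:
# Locate the stuck KVM process via its pidfile and kill it
cat /var/run/qemu-server/106.pid
kill -9 "$(cat /var/run/qemu-server/106.pid)"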

I can then change the VM state to 'ignored' in HA before trying to start it once again without the -daemonize flag, but nothing seems to happen; the command prints nothing back even after a good while…

Nevertheless, the GUI shows weird behaviour on RAM:
[attached screenshot: upload_2019-3-25_16-52-3.png]

I really wonder what could cause such a problem; migrating the VM by hand works almost all the time.
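
For comparison, migrating by hand means roughly the following (a hedged sketch using the IDs and node names from this thread):
Code:
# For an HA-managed VM, request the migration through the HA stack...
ha-manager migrate vm:106 srv-pve2
# ...or, for a VM not under HA, do a live migration directly
qm migrate 106 srv-pve2 --online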
 
Can you post the VM config?
 
Here you go:

Code:
root@srv-pve3:~# cat  /etc/pve/qemu-server/106.conf
#Serveur utilis%C3%A9 pour stocker une image des d%C3%A9p%C3%B4ts Debian.

agent: 1
balloon: 1024
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
memory: 2048
name: SRV-APT-REPO
net0: virtio=56:B4:5A:55:79:08,bridge=vmbr0,rate=1.4
numa: 0
ostype: l26
scsi0: STD_POOL_vm:vm-106-disk-0,size=180G
scsihw: virtio-scsi-pci
smbios1: uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632
sockets: 1
vmgenid: ec8aa96c-49bf-4d92-8869-7cfebb1075fd



root@srv-pve3:~# cat  /etc/pve/qemu-server/244.conf
agent: 1
balloon: 0
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: SRV-OMV-VIDEO4
net0: virtio=56:67:F6:99:BD:4C,bridge=vmbr0
net1: virtio=6A:33:04:C1:03:F9,bridge=vmbr0,tag=111
numa: 0
onboot: 1
ostype: l26
scsi0: STD_POOL_vm:vm-244-disk-0,size=32G
scsi1: STD_POOL_vm:vm-244-disk-1,size=10T
scsihw: virtio-scsi-pci
smbios1: uuid=8868a218-d321-4b38-afc4-dc56108658b9
sockets: 2
vmgenid: 449229ce-33c8-430e-b801-4132a9633ac7
 
For more information: the cluster sometimes fails to start a VM when migrating it between two nodes:
Code:
task started by HA resource agent
2019-03-26 10:45:25 use dedicated network address for sending migration traffic (10.0.0.102)
2019-03-26 10:45:26 starting migration of VM 106 to node 'srv-pve2' (10.0.0.102)
2019-03-26 10:45:26 copying disk images
2019-03-26 10:45:26 starting VM 106 on remote node 'srv-pve2'
2019-03-26 10:45:28 start remote tunnel
2019-03-26 10:45:29 ssh tunnel ver 1
2019-03-26 10:45:29 starting online/live migration on unix:/run/qemu-server/106.migrate
2019-03-26 10:45:29 migrate_set_speed: 8589934592
2019-03-26 10:45:29 migrate_set_downtime: 0.1
2019-03-26 10:45:29 set migration_caps
2019-03-26 10:45:29 set cachesize: 268435456
2019-03-26 10:45:29 start migrate command to unix:/run/qemu-server/106.migrate
2019-03-26 10:45:30 migration status: active (transferred 115096991, remaining 1073610752), total 2165121024)
2019-03-26 10:45:30 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:31 migration status: active (transferred 232592602, remaining 828686336), total 2165121024)
2019-03-26 10:45:31 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:32 migration status: active (transferred 350073219, remaining 709861376), total 2165121024)
2019-03-26 10:45:32 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:33 migration status: active (transferred 467616340, remaining 592412672), total 2165121024)
2019-03-26 10:45:33 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:34 migration status: active (transferred 584564562, remaining 475471872), total 2165121024)
2019-03-26 10:45:34 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:35 migration status: active (transferred 701914805, remaining 358207488), total 2165121024)
2019-03-26 10:45:35 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:36 migration status: active (transferred 819363589, remaining 240824320), total 2165121024)
2019-03-26 10:45:36 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:37 migration status: active (transferred 936121857, remaining 122740736), total 2165121024)
2019-03-26 10:45:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-26 10:45:38 migration speed: 227.56 MB/s - downtime 82 ms
2019-03-26 10:45:38 migration status: completed
2019-03-26 10:45:38 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 106 - no such machine' to command 'resume 106'
2019-03-26 10:45:41 ERROR: migration finished with problems (duration 00:00:16)
TASK ERROR: migration problems
 
For more information: the cluster sometimes fails to start a VM when migrating it between two nodes:
This looks like a different problem; you can work around it by issuing the 'resume' manually afterwards (and killing the source VM).
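
A hedged sketch of that workaround for VM 106 (the first command goes on the target node, the second on the old source node; --skiplock may be needed if a stale 'migrate' lock is left behind):
Code:
# On the target node: resume the VM that finished the memory copy but was never resumed
qm resume 106 --skiplock
# On the source node: make sure the leftover source process is stopped
qm stop 106 --skiplock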

I see nothing in your config that would indicate any problems when starting...
Is there anything in the syslog while such a VM is starting?
 
Hello,

You might have found the culprit.

On all nodes, one of the interfaces is flapping:
Code:
Mar 27 10:46:16 srv-pve2 corosync[2333]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.102 FAULTY
Mar 27 10:46:16 srv-pve2 corosync[2333]:  [TOTEM ] Marking ringid 1 interface 192.168.1.102 FAULTY
Mar 27 10:46:17 srv-pve2 corosync[2333]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 10:46:17 srv-pve2 corosync[2333]:  [TOTEM ] Automatically recovered ring 1


Here is the corosync.conf:
Code:
root@srv-pve1:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv-pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.101
    ring1_addr: 192.168.1.101

  }
  node {
    name: srv-pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.102
    ring1_addr: 192.168.1.102
  }
  node {
    name: srv-pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.103
    ring1_addr: 192.168.1.103
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 10
  rrp_mode: passive
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 10.0.0.101
    ringnumber: 0
  }

  interface {
    bindnetaddr: 192.168.1.101
    ringnumber: 1
  }

nodelist {
  node {
    name: srv-pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: srv-pve2-private
    ring1_addr: srv-pve2
  }

 node {
    name: srv-pve3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: srv-pve3-private
    ring1_addr: srv-pve3
  }

}

}

Should I try and reboot all nodes to see if this could be the root cause of the problem?
 
I meant the syslog of the host while a VM is hanging during start.


Here is the full syslog between the moment the initial host is stopped and the moment it is back on:
Code:
Mar 27 11:10:32 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:10:32 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:10:33 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:10:33 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:10:33 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 10b275
Mar 27 11:10:33 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 10b275
Mar 27 11:10:34 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 10b276
Mar 27 11:10:34 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 10b276
Mar 27 11:10:34 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 10b276
Mar 27 11:10:34 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 10b276
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [dcdb] notice: members: 1/1933, 3/2072
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [dcdb] notice: starting data syncronisation
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [status] notice: members: 1/1933, 3/2072
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [status] notice: starting data syncronisation
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [dcdb] notice: received sync request (epoch 1/1933/00000014)
Mar 27 11:10:35 srv-pve3 pmxcfs[2072]: [status] notice: received sync request (epoch 1/1933/00000014)
Mar 27 11:10:36 srv-pve3 corosync[2197]: notice  [TOTEM ] A processor failed, forming new configuration.
Mar 27 11:10:36 srv-pve3 corosync[2197]:  [TOTEM ] A processor failed, forming new configuration.
Mar 27 11:10:38 srv-pve3 corosync[2197]: notice  [TOTEM ] A new membership (10.0.0.101:856) was formed. Members left: 2
Mar 27 11:10:38 srv-pve3 corosync[2197]: notice  [TOTEM ] Failed to receive the leave message. failed: 2
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [TOTEM ] A new membership (10.0.0.101:856) was formed. Members left: 2
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [TOTEM ] Failed to receive the leave message. failed: 2
Mar 27 11:10:38 srv-pve3 corosync[2197]: warning [CPG   ] downlist left_list: 1 received
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [CPG   ] downlist left_list: 1 received
Mar 27 11:10:38 srv-pve3 corosync[2197]: warning [CPG   ] downlist left_list: 1 received
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [CPG   ] downlist left_list: 1 received
Mar 27 11:10:38 srv-pve3 corosync[2197]: notice  [QUORUM] Members[2]: 1 3
Mar 27 11:10:38 srv-pve3 corosync[2197]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [QUORUM] Members[2]: 1 3
Mar 27 11:10:38 srv-pve3 corosync[2197]:  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [dcdb] notice: received all states
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [dcdb] notice: leader is 1/1933
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [dcdb] notice: synced members: 1/1933, 3/2072
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [dcdb] notice: all data is up to date
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [status] notice: received all states
Mar 27 11:10:38 srv-pve3 pmxcfs[2072]: [status] notice: all data is up to date
Mar 27 11:10:38 srv-pve3 pve-ha-crm[3580]: node 'srv-pve2': state changed from 'online' => 'unknown'
Mar 27 11:10:39 srv-pve3 kernel: [577822.410594] tg3 0000:41:00.0 enp65s0f0: Link is down
Mar 27 11:10:39 srv-pve3 kernel: [577822.470457] bond0: link status definitely down for interface enp65s0f0, disabling it
Mar 27 11:10:41 srv-pve3 pvestatd[2512]: got timeout
Mar 27 11:10:42 srv-pve3 kernel: [577825.304974] tg3 0000:41:00.0 enp65s0f0: Link is up at 100 Mbps, full duplex
Mar 27 11:10:42 srv-pve3 kernel: [577825.304989] tg3 0000:41:00.0 enp65s0f0: Flow control is on for TX and on for RX
Mar 27 11:10:42 srv-pve3 kernel: [577825.304992] tg3 0000:41:00.0 enp65s0f0: EEE is enabled
Mar 27 11:10:42 srv-pve3 kernel: [577825.386427] bond0: link status definitely up for interface enp65s0f0, 100 Mbps full duplex
Mar 27 11:10:42 srv-pve3 kernel: [577826.115447] tg3 0000:41:00.0 enp65s0f0: Link is down
Mar 27 11:10:42 srv-pve3 kernel: [577826.218451] bond0: link status definitely down for interface enp65s0f0, disabling it
Mar 27 11:10:45 srv-pve3 kernel: [577828.982832] tg3 0000:41:00.0 enp65s0f0: Link is up at 100 Mbps, full duplex
Mar 27 11:10:45 srv-pve3 kernel: [577828.982857] tg3 0000:41:00.0 enp65s0f0: Flow control is on for TX and on for RX
Mar 27 11:10:45 srv-pve3 kernel: [577828.982859] tg3 0000:41:00.0 enp65s0f0: EEE is enabled
Mar 27 11:10:45 srv-pve3 kernel: [577829.026406] bond0: link status definitely up for interface enp65s0f0, 100 Mbps full duplex
Mar 27 11:10:46 srv-pve3 pvestatd[2512]: got timeout
Mar 27 11:10:49 srv-pve3 pvestatd[2512]: status update time (10.050 seconds)
Mar 27 11:11:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:11:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'started' to 'fence'
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: node 'srv-pve2': state changed from 'unknown' => 'fence'
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: fencing: acknowledged - got agent lock for node 'srv-pve2'
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: node 'srv-pve2': state changed from 'fence' => 'unknown'
Mar 27 11:11:28 srv-pve3 postfix/pickup[2387852]: E4D9B60A23: uid=0 from=<root>
Mar 27 11:11:28 srv-pve3 postfix/cleanup[2395795]: E4D9B60A23: message-id=<20190327101128.E4D9B60A23@srv-pve3.lechesnay.local>
Mar 27 11:11:28 srv-pve3 postfix/qmgr[1983]: E4D9B60A23: from=<root@srv-pve3.lechesnay.local>, size=1605, nrcpt=1 (queue active)
Mar 27 11:11:28 srv-pve3 postfix/pickup[2387852]: E6F0C60A5D: uid=0 from=<root>
Mar 27 11:11:28 srv-pve3 postfix/cleanup[2395795]: E6F0C60A5D: message-id=<20190327101128.E6F0C60A5D@srv-pve3.lechesnay.local>
Mar 27 11:11:28 srv-pve3 postfix/qmgr[1983]: E6F0C60A5D: from=<root@srv-pve3.lechesnay.local>, size=1671, nrcpt=1 (queue active)
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: recover service 'vm:106' from fenced node 'srv-pve2' to node 'srv-pve3'
Mar 27 11:11:28 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'fence' to 'started'  (node = srv-pve3)
Mar 27 11:11:29 srv-pve3 pvemailforward[2395802]: mail forward failed: user 'root@pam' does not have a email address
Mar 27 11:11:29 srv-pve3 pvemailforward[2395801]: mail forward failed: user 'root@pam' does not have a email address
Mar 27 11:11:29 srv-pve3 postfix/local[2395800]: E6F0C60A5D: to=<root@srv-pve3.lechesnay.local>, orig_to=<root>, relay=local, delay=0.34, delays=0/0.01/0/0.33, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Mar 27 11:11:29 srv-pve3 postfix/local[2395799]: E4D9B60A23: to=<root@srv-pve3.lechesnay.local>, orig_to=<root>, relay=local, delay=0.36, delays=0.01/0.01/0/0.34, dsn=2.0.0, status=sent (delivered to command: /usr/bin/pvemailforward)
Mar 27 11:11:29 srv-pve3 postfix/qmgr[1983]: E6F0C60A5D: removed
Mar 27 11:11:29 srv-pve3 postfix/qmgr[1983]: E4D9B60A23: removed
Mar 27 11:11:38 srv-pve3 pve-ha-lrm[2395838]: starting service vm:106
Mar 27 11:11:38 srv-pve3 pve-ha-lrm[2395839]: start VM 106: UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:
Mar 27 11:11:38 srv-pve3 pve-ha-lrm[2395838]: <root@pam> starting task UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:
Mar 27 11:11:39 srv-pve3 systemd[1]: Started 106.scope.
Mar 27 11:11:39 srv-pve3 systemd-udevd[2395850]: Could not generate persistent MAC address for tap106i0: No such file or directory
Mar 27 11:11:39 srv-pve3 kernel: [577882.967535] device tap106i0 entered promiscuous mode
Mar 27 11:11:39 srv-pve3 kernel: [577882.981691] vmbr0: port 3(tap106i0) entered blocking state
Mar 27 11:11:39 srv-pve3 kernel: [577882.981694] vmbr0: port 3(tap106i0) entered disabled state
Mar 27 11:11:39 srv-pve3 kernel: [577882.981893] vmbr0: port 3(tap106i0) entered blocking state
Mar 27 11:11:39 srv-pve3 kernel: [577882.981895] vmbr0: port 3(tap106i0) entered forwarding state
Mar 27 11:11:43 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:11:48 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:11:53 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:11:58 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:12:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:12:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:12:03 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:12:08 srv-pve3 pve-ha-lrm[2395838]: Task 'UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam:' still active, waiting
Mar 27 11:12:09 srv-pve3 pve-ha-lrm[2395839]: start failed: command '/usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:7dbdeef98674' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
Mar 27 11:12:09 srv-pve3 pve-ha-lrm[2395838]: <root@pam> end task UPID:srv-pve3:00248EBF:0371C9B0:5C9B4C5A:qmstart:106:root@pam: start failed: command '/usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:7dbdeef98674' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
Mar 27 11:12:09 srv-pve3 pve-ha-lrm[2395838]: service status vm:106 started
Mar 27 11:13:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:13:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:14:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:14:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:14:41 srv-pve3 kernel: [578064.909598] tg3 0000:41:00.0 enp65s0f0: Link is down
Mar 27 11:14:41 srv-pve3 kernel: [578064.911803] bond0: link status definitely down for interface enp65s0f0, disabling it
Mar 27 11:14:48 srv-pve3 kernel: [578071.731566] tg3 0000:41:00.0 enp65s0f0: Link is up at 1000 Mbps, full duplex
Mar 27 11:14:48 srv-pve3 kernel: [578071.731579] tg3 0000:41:00.0 enp65s0f0: Flow control is on for TX and on for RX
Mar 27 11:14:48 srv-pve3 kernel: [578071.731592] tg3 0000:41:00.0 enp65s0f0: EEE is enabled
Mar 27 11:14:48 srv-pve3 kernel: [578071.779731] bond0: link status definitely up for interface enp65s0f0, 1000 Mbps full duplex
Mar 27 11:15:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:15:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:15:08 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 775
Mar 27 11:15:08 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 775
Mar 27 11:15:08 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 775
Mar 27 11:15:08 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 775
Mar 27 11:15:08 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: 777
Mar 27 11:15:08 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: 777


(...)

Mar 27 11:19:29 srv-pve3 corosync[2197]: notice  [TOTEM ] Retransmit List: d81
Mar 27 11:19:29 srv-pve3 corosync[2197]:  [TOTEM ] Retransmit List: d81
Mar 27 11:19:45 srv-pve3 corosync[2197]: notice  [TOTEM ] A new membership (10.0.0.101:912) was formed. Members joined: 2
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [TOTEM ] A new membership (10.0.0.101:912) was formed. Members joined: 2
Mar 27 11:19:45 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:19:45 srv-pve3 corosync[2197]: warning [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]: warning [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]: warning [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]: notice  [QUORUM] Members[3]: 1 2 3
Mar 27 11:19:45 srv-pve3 corosync[2197]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [CPG   ] downlist left_list: 0 received
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [QUORUM] Members[3]: 1 2 3
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: members: 1/1933, 2/2120, 3/2072
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: starting data syncronisation
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: members: 1/1933, 2/2120, 3/2072
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: starting data syncronisation
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: received sync request (epoch 1/1933/00000015)
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: received sync request (epoch 1/1933/00000015)
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: received all states
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: leader is 1/1933
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: synced members: 1/1933, 3/2072
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: all data is up to date
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [dcdb] notice: dfsm_deliver_queue: queue length 3
Mar 27 11:19:45 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:19:45 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: received all states
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: all data is up to date
Mar 27 11:19:45 srv-pve3 pmxcfs[2072]: [status] notice: dfsm_deliver_queue: queue length 41
Mar 27 11:19:47 srv-pve3 pmxcfs[2072]: [status] notice: received log
Mar 27 11:19:47 srv-pve3 pmxcfs[2072]: [status] notice: received log
Mar 27 11:19:51 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:19:51 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:19:52 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:19:52 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:19:55 srv-pve3 pve-ha-crm[3580]: node 'srv-pve2': state changed from 'unknown' => 'online'
Mar 27 11:19:55 srv-pve3 pve-ha-crm[3580]: migrate service 'vm:106' to node 'srv-pve2' (running)
Mar 27 11:19:55 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'started' to 'migrate'  (node = srv-pve3, target = srv-pve2)
Mar 27 11:20:00 srv-pve3 systemd[1]: Starting Proxmox VE replication runner...
Mar 27 11:20:01 srv-pve3 systemd[1]: Started Proxmox VE replication runner.
Mar 27 11:20:04 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:04 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:05 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:05 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:05 srv-pve3 pve-ha-lrm[2398030]: <root@pam> starting task UPID:srv-pve3:0024974F:03728F7A:5C9B4E55:qmigrate:106:root@pam:
Mar 27 11:20:06 srv-pve3 kernel: [578389.753301] vmbr0: port 3(tap106i0) entered disabled state
Mar 27 11:20:06 srv-pve3 pve-ha-lrm[2398031]: VM 106 qmp command failed - VM 106 qmp command 'query-machines' failed - unable to connect to VM 106 qmp socket - Connection refused
Mar 27 11:20:06 srv-pve3 pve-ha-lrm[2398031]: migration aborted
Mar 27 11:20:06 srv-pve3 pve-ha-lrm[2398030]: <root@pam> end task UPID:srv-pve3:0024974F:03728F7A:5C9B4E55:qmigrate:106:root@pam: migration aborted
Mar 27 11:20:06 srv-pve3 pve-ha-lrm[2398030]: service vm:106 not moved (migration error)
Mar 27 11:20:07 srv-pve3 qmeventd[1328]: Starting cleanup for 106
Mar 27 11:20:07 srv-pve3 qmeventd[1328]: Finished cleanup for 106
Mar 27 11:20:15 srv-pve3 pve-ha-crm[3580]: service 'vm:106' - migration failed (exit code 1)
Mar 27 11:20:15 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'migrate' to 'started'  (node = srv-pve3)
Mar 27 11:20:15 srv-pve3 pve-ha-crm[3580]: migrate service 'vm:106' to node 'srv-pve2' (running)
Mar 27 11:20:15 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'started' to 'migrate'  (node = srv-pve3, target = srv-pve2)
Mar 27 11:20:15 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:15 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:15 srv-pve3 pve-ha-lrm[2398093]: <root@pam> starting task UPID:srv-pve3:0024978E:0372938B:5C9B4E5F:qmigrate:106:root@pam:
Mar 27 11:20:16 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:16 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:17 srv-pve3 pve-ha-lrm[2398093]: <root@pam> end task UPID:srv-pve3:0024978E:0372938B:5C9B4E5F:qmigrate:106:root@pam: OK
Mar 27 11:20:25 srv-pve3 pve-ha-crm[3580]: service 'vm:106': state changed from 'migrate' to 'started'  (node = srv-pve2)
Mar 27 11:20:26 srv-pve3 corosync[2197]: error   [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:26 srv-pve3 corosync[2197]:  [TOTEM ] Marking ringid 1 interface 192.168.1.103 FAULTY
Mar 27 11:20:27 srv-pve3 pmxcfs[2072]: [status] notice: received log
Mar 27 11:20:27 srv-pve3 corosync[2197]: notice  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:27 srv-pve3 corosync[2197]:  [TOTEM ] Automatically recovered ring 1
Mar 27 11:20:28 srv-pve3 pmxcfs[2072]: [status] notice: received log
 
Should I try and reboot all nodes to see if this could be the root cause of the problem?
I would rather check what the reason for the flapping interfaces is (switch, cable, NIC, etc.).
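
A few checks that might help narrow that down (a hedged sketch; the interface name is the one from the logs above):
Code:
# Negotiated speed/duplex - the logs show the link dropping to 100 Mbps, which points at the cable or switch port
ethtool enp65s0f0
# NIC error counters
ethtool -S enp65s0f0 | grep -iE 'err|drop|crc'
# Kernel messages about link state changes on this interface
journalctl -k | grep enp65s0f0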
 
Hello,

The flapping interface issue has been resolved. It was a misconfiguration of the interfaces as well as a problem with the corosync ring redundancy, which is now fixed.
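
A quick way to confirm both rings are healthy again (a hedged sketch):
Code:
# Per-node ring status; with rrp_mode passive, ring 0 and ring 1 should both show no faults
corosync-cfgtool -s
# Overall quorum and membership view
pvecm status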

The HA problem persists though, and the syslog of the node that tries to start the HA VM shows possible problems with the network interface of the VM:
Code:
Mar 28 18:52:44 srv-pve2 pve-ha-lrm[15047]: starting service vm:106
Mar 28 18:52:44 srv-pve2 pve-ha-lrm[15047]: <root@pam> starting task UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:
Mar 28 18:52:44 srv-pve2 pve-ha-lrm[15048]: start VM 106: UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:
Mar 28 18:52:44 srv-pve2 systemd[1]: Started 106.scope.
Mar 28 18:52:44 srv-pve2 systemd-udevd[15059]: Could not generate persistent MAC address for tap106i0: No such file or directory
Mar 28 18:52:44 srv-pve2 kernel: [ 2145.483456] device tap106i0 entered promiscuous mode
Mar 28 18:52:44 srv-pve2 kernel: [ 2145.495240] vmbr0: port 4(tap106i0) entered blocking state
Mar 28 18:52:44 srv-pve2 kernel: [ 2145.495243] vmbr0: port 4(tap106i0) entered disabled state
Mar 28 18:52:44 srv-pve2 kernel: [ 2145.495404] vmbr0: port 4(tap106i0) entered blocking state
Mar 28 18:52:44 srv-pve2 kernel: [ 2145.495406] vmbr0: port 4(tap106i0) entered forwarding state
Mar 28 18:52:49 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:52:54 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:52:59 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:53:00 srv-pve2 systemd[1]: Starting Proxmox VE replication runner...
Mar 28 18:53:00 srv-pve2 systemd[1]: Started Proxmox VE replication runner.
Mar 28 18:53:04 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:53:09 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:53:14 srv-pve2 pve-ha-lrm[15047]: Task 'UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam:' still active, waiting
Mar 28 18:53:14 srv-pve2 pve-ha-lrm[15048]: start failed: command '/usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:192a6c1e59f4' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
Mar 28 18:53:14 srv-pve2 pve-ha-lrm[15047]: <root@pam> end task UPID:srv-pve2:00003AC8:000345C4:5C9D09EC:qmstart:106:root@pam: start failed: command '/usr/bin/kvm -id 106 -name SRV-APT-REPO -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=d2cb6d15-c93e-41d5-a87d-230f193b5632' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=ec8aa96c-49bf-4d92-8869-7cfebb1075fd' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:192a6c1e59f4' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:STD_POOL/vm-106-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/STD_POOL_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=56:B4:5A:55:79:08,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
Mar 28 18:53:14 srv-pve2 pve-ha-lrm[15047]: service status vm:106 started
 
Can someone explain how this was actually solved?
I created a setup using nested virtualization to simulate a cluster I want to implement.

It is based on 4 nodes: 2 nodes run Ceph, with a rule allowing a minimum of one online Ceph node. If I have an HA group based on node 1 and node 2 (the two with Ceph), with a VM running on node 2, I get the following behaviour:
- Shutting down node 2: the VM is shut down and restarted on node 1 as expected.
- Stopping node 2 (simulating a malfunction): I get the behaviour described in this thread.
 
