Last Friday I started experiencing some inconsistencies on the cluster.
An overview of my environment (no HA):
[root@hq-proxmox-04 ~]# pvecm status
Quorum information
------------------
Date: Mon Jul 31 11:32:03 2017
Quorum provider: corosync_votequorum
Nodes: 9
Node ID: 0x00000004
Ring ID: 1/111708
Quorate: Yes
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.20.30.228
0x00000002 1 10.20.30.229
0x00000003 1 10.20.30.230
0x00000004 1 10.20.30.231 (local)
0x00000005 1 10.20.30.232
0x00000006 1 10.20.30.233
0x00000007 1 10.20.30.234
0x00000008 1 10.20.30.235
0x00000009 1 10.20.30.236
[root@hq-proxmox-04 ~]# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.44-1-pve: 4.4.44-84
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
I accessed the web UI on node 1 as I normally do and it was unreachable, although SSH and the VMs were still working (I realized these last two too late). So I accessed node 2, and suddenly all nodes started blinking; within a matter of seconds the nodes flipped between online and offline several times until they all stayed offline. Fortunately I still had access to the management console (web UI) and SSH, and the VMs were running.
I tried all the obvious stuff and whatever else I could find around. On node 1 I tried restarting the services:
#service pve-cluster restart
#service pvestatd restart
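(For completeness: I believe the other PVE daemons can be restarted the same way, though at the time I only restarted the two above, so this is more a note of what I could still try than what I actually did:)
#service pvedaemon restart
#service pveproxy restart
#service corosync restart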
Then I tried rebooting both nodes (1 and 2), but nothing seemed to pick up.
It's difficult to say what happened next, as it has been a few days and everything is starting to blur. I tried to upgrade node 1 to Proxmox 5, which failed (it didn't go all the way through); not happy with that, I tried to upgrade another node (one with no running VMs). So my situation now is: two nodes where the Proxmox upgrade didn't go all the way through:
[root@hq-proxmox-07 ~]# pveversion -v
proxmox-ve: not correctly installed (running kernel: 4.4.67-1-pve)
pve-manager: not correctly installed (running version: 5.0-23/af4267bf)
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.10.17-1-pve: 4.10.17-16
libpve-http-server-perl: 2.0-5
lvm2: 2.02.168-pve2
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: not correctly installed
qemu-server: not correctly installed
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: not correctly installed
libpve-access-control: not correctly installed
libpve-storage-perl: 5.0-12
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-2
pve-container: not correctly installed
pve-firewall: not correctly installed
pve-ha-manager: not correctly installed
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve2
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
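My assumption is that on these half-upgraded nodes I would first need to let dpkg/apt finish what it started, something along these lines (I haven't dared to run it yet, so please correct me if that would make things worse):
#dpkg --configure -a
#apt-get update
#apt-get dist-upgrade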
Plus a couple of nodes (1 and 2) with VMs that I'm unable to start or migrate/move to another node.
If I try moving a VM via the CLI, when I try to browse the node directory
#ls /etc/pve/nodes/NODE/qemu-server/
it just "freezes" and hangs there without doing anything.
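My guess is that pmxcfs itself is hanging, since /etc/pve is its FUSE mount. This is roughly how I intend to check that (just my assumption of the right commands, I haven't gone through it yet):
#service pve-cluster status
#pvecm status
#journalctl -u pve-cluster -n 50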
When I tried starting one of the VMs on node 2, I got this:
TASK ERROR: start failed: command '/usr/bin/kvm -id 104 -chardev 'socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/104.pid -daemonize -smbios 'type=1,uuid=c842a898-0d71-436f-9bc5-93dd580ee0e2' -name hq-qa-02 -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/104.vnc,x509,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,enforce' -m 16384 -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:2ef1e0ee357b' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-1.qcow2,if=none,id=drive-sata0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-2.qcow2,if=none,id=drive-sata1,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.1,drive=drive-sata1,id=sata1' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-3.qcow2,if=none,id=drive-sata2,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.2,drive=drive-sata2,id=sata2' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-4.qcow2,if=none,id=drive-sata3,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.3,drive=drive-sata3,id=sata3' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-5.qcow2,if=none,id=drive-sata4,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.4,drive=drive-sata4,id=sata4' -drive 'file=/mnt/pve/HQ_RNFS_01_SSD/images/104/vm-104-disk-6.qcow2,if=none,id=drive-sata5,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'ide-drive,bus=ahci0.5,drive=drive-sata5,id=sata5' -netdev 'type=tap,id=net0,ifname=tap104i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=4A:9C:14:A0:66:5B,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'' failed: got timeout
Task details from the task viewer:
Status: stopped: start failed (same kvm command and "got timeout" error as above)
Task type: qmstart
User name: root@pam
Node: hq-proxmox-02
Process ID: 7031
Start Time: 2017-07-31 15:19:28
Unique task ID: UPID:hq-proxmox-02:00001B77:00046000:597F3C70:qmstart:104:root@pam:
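All of this VM's disks sit on /mnt/pve/HQ_RNFS_01_SSD, so I'm also wondering whether the storage is responding at all. I assume something like this would show it (haven't verified yet):
#pvesm status
#ls /mnt/pve/HQ_RNFS_01_SSD/images/104/
#mount | grep HQ_RNFS_01_SSD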
I also remember getting this a lot:
Jul 28 10:13:10 hq-proxmox-01 pmxcfs[963]: [status] notice: cpg_send_message retried 1 times
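If it helps, I can pull more of those messages out of the logs; I assume something like this would do it (default syslog location):
#grep cpg_send_message /var/log/syslog
#journalctl -u corosync -u pve-cluster --since "2017-07-28"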
What logs would you advise me to look at, and which would you like me to post? Also, any advice on what I could be looking at?
The last thing I did before everything started to go crazy was to re-add a node (one that had been deleted earlier, rebuilt from scratch). It was working fine like that overnight, and then again, it is still "working fine".
Kind regards.
\M