Dear Members, Dear Staff,
I have to test the disaster recovery procedure on a 3-node cluster (pve1, pve2, pve3) with Ceph (RBD) storage.
Everything works fine; in the case of a single node failure, the cluster behaves as expected.
I would also like to test starting VMs on a single node while the rest of the cluster is down.
This is my workaround:
1. All nodes are online, and the VMs are running on pve2.
2. I unplug the network cables from pve2 and pve3, so only pve1 is available on the network.
3. pve1 restarts automatically.
4. I log in to pve1 via SSH and run:
pvecm expected 1
... based on a forum entry:
"You can temporarily set expected votes to a lower value:
# pvecm expected <number_of_nodes_online>
But only do that if you are sure the other nodes are really offline."
5. I move the VMs' config files from /etc/pve/nodes/pve2/qemu-server to /etc/pve/nodes/pve1/qemu-server (see the command sketch after this list).
6. The VMs show up on pve1 in the web interface; their status is powered off.
7. I try to start a VM with the Start button, but the progress indicator just keeps spinning.
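For reference, this is roughly what steps 4, 5 and 7 look like from the shell. The VMID 100 is only an example, not one of my actual VMs, and in the test I started the VM from the GUI rather than with qm:

root@pve1:~# pvecm expected 1
root@pve1:~# mv /etc/pve/nodes/pve2/qemu-server/*.conf /etc/pve/nodes/pve1/qemu-server/
root@pve1:~# qm start 100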
Here are the journalctl log details:
root@pve1:~# journalctl -f
-- Logs begin at Mon 2020-11-16 19:02:17 CET. --
Nov 16 19:28:34 pve1 pvestatd[1543]: status update time (5.309 seconds)
Nov 16 19:28:38 pve1 ceph-mon[1451]: 2020-11-16 19:28:38.333 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:28:39 pve1 pvedaemon[1564]: <root@pam> successful auth for user 'root@pam'
Nov 16 19:28:43 pve1 pvestatd[1543]: got timeout
Nov 16 19:28:43 pve1 ceph-mon[1451]: 2020-11-16 19:28:43.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:28:43 pve1 pvestatd[1543]: status update time (5.332 seconds)
Nov 16 19:28:48 pve1 ceph-mon[1451]: 2020-11-16 19:28:48.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:28:53 pve1 ceph-mon[1451]: 2020-11-16 19:28:53.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:28:53 pve1 pvestatd[1543]: got timeout
Nov 16 19:28:53 pve1 pvestatd[1543]: status update time (5.316 seconds)
Nov 16 19:28:58 pve1 ceph-mon[1451]: 2020-11-16 19:28:58.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:29:00 pve1 systemd[1]: Starting Proxmox VE replication runner...
Nov 16 19:29:00 pve1 systemd[1]: pvesr.service: Succeeded.
Nov 16 19:29:00 pve1 systemd[1]: Started Proxmox VE replication runner.
Nov 16 19:29:03 pve1 ceph-mon[1451]: 2020-11-16 19:29:03.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:29:04 pve1 pvestatd[1543]: got timeout
Nov 16 19:29:04 pve1 pvestatd[1543]: status update time (5.321 seconds)
Nov 16 19:29:08 pve1 ceph-mon[1451]: 2020-11-16 19:29:08.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:29:13 pve1 ceph-mon[1451]: 2020-11-16 19:29:13.337 7f6ff1591700 -1 mon.pve1@0(probing) e3 get_health_metrics reporting 2 slow ops, oldest is auth(proto 0 73 bytes epoch 0)
Nov 16 19:29:13 pve1 pvestatd[1543]: got timeout
What is the right method to start the VMs on the pve1 node?
PVE version details:
===================================================
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-19
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
===================================================
Thank you,