Proxmox 4.0 and 4.1 testing HA (pulled power cable, ipmitool power off) problem

The problem is that node 1 cannot start it because the config was not moved, which is strange.

Wait until all nodes are online and active (maybe wait another two minutes) then execute the following command in the terminal/shell of one of your nodes as root:

Code:
echo "{}" > /etc/pve/ha/manager_status

After that everything should be okay; wait about two minutes, then test again.
This is a really strange error, which can normally only be constructed by hand (by manually editing some status files), but I will look into it more.
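The reset above overwrites the current manager state, so it may be worth keeping a copy of the old file first, for example to attach to a report. A minimal sketch, assuming a POSIX shell run as root; the function name reset_manager_status and the default backup location under /root are made up for illustration and are not part of Proxmox:

```shell
# Sketch only: back up the HA manager status file, then reset it to an
# empty JSON object, as in the command from the thread.
# reset_manager_status is a hypothetical helper name, not a Proxmox tool.
reset_manager_status() {
    # Default path is the one quoted in the thread.
    status_file="${1:-/etc/pve/ha/manager_status}"
    # Backup location is an assumption; any writable path will do.
    backup_file="${2:-/root/manager_status.$(date +%Y%m%d-%H%M%S).bak}"

    # Keep a copy of the old state so it can be inspected later.
    if [ -f "$status_file" ]; then
        cp "$status_file" "$backup_file" || return 1
    fi

    # Same write as the original command: an empty JSON object.
    echo "{}" > "$status_file"
}
```

Calling reset_manager_status with no arguments on one node does the same write as the command above, plus the backup copy.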

Thanks for all the information. A bit more of the log from shortly before the power plug was pulled would also be helpful, so that we can see the whole process and what happens.
 

No luck. I tried the command, but nothing changed. The status is the same as in the picture All_at_same_picture.JPG.
The log is in the attached file. I ran the command on two nodes, the last time at 17:34:20.
One thing I know works, in case it helps: removing vm100 from HA and adding it back. I can try that tomorrow.
I can also repeat the procedure from the start with detailed logs.
 

Attachments

  • log-after-command.txt
 
That's really strange!

I'm not really sure how that could have happened, and even less sure how it can persist after the manager rebuilds its status.
I tried really hard but couldn't reproduce it.

Please try:
Code:
ha-manager disable vm:100
ha-manager enable vm:100
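The two commands above can be wrapped with a short pause between them. This is only a sketch: it uses exactly the disable and enable subcommands quoted in this thread (other releases may name them differently), and the helper name ha_cycle, the HA_MANAGER override and the 30 second default pause are assumptions for illustration.

```shell
# Sketch only: disable an HA service, wait, then enable it again.
# ha_cycle is a hypothetical helper name. HA_MANAGER can be set to
# another command (for example "echo") for a dry run.
ha_cycle() {
    sid="$1"                          # service ID, e.g. vm:100
    delay="${2:-30}"                  # pause in seconds (arbitrary default)
    ha_cmd="${HA_MANAGER:-ha-manager}"

    # Stop managing the service; abort if the command fails.
    "$ha_cmd" disable "$sid" || return 1

    # Give the manager some time to process the request.
    sleep "$delay"

    # Hand the service back to the HA manager.
    "$ha_cmd" enable "$sid"
}
```

For example, ha_cycle vm:100 runs both steps for the VM discussed here; setting HA_MANAGER=echo first only prints the commands instead of running them.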


Finally, vm100 started.
Today I will try to reproduce the "problem" with detailed logs. I hope it will help with debugging.

syslog from proxmox1
Dec 21 10:40:07 proxmox1test pve-ha-lrm[25875]: service 'vm:100' not on this node
Dec 21 10:40:17 proxmox1test pve-ha-lrm[25891]: service 'vm:100' not on this node
Dec 21 10:40:27 proxmox1test pve-ha-lrm[25913]: service 'vm:100' not on this node
Dec 21 10:40:37 proxmox1test pve-ha-lrm[25929]: service 'vm:100' not on this node
Dec 21 10:40:47 proxmox1test pve-ha-lrm[25945]: service 'vm:100' not on this node
Dec 21 10:40:57 proxmox1test pve-ha-lrm[25961]: service 'vm:100' not on this node
Dec 21 10:41:07 proxmox1test pve-ha-lrm[25977]: service 'vm:100' not on this node
Dec 21 10:41:17 proxmox1test pve-ha-lrm[25993]: service 'vm:100' not on this node
Dec 21 10:41:27 proxmox1test pve-ha-lrm[26015]: service 'vm:100' not on this node
Dec 21 10:41:37 proxmox1test pve-ha-lrm[26031]: service 'vm:100' not on this node
Dec 21 10:41:47 proxmox1test pve-ha-lrm[26047]: service 'vm:100' not on this node
Dec 21 10:41:57 proxmox1test pve-ha-lrm[26065]: service 'vm:100' not on this node
Dec 21 10:42:07 proxmox1test pve-ha-lrm[26081]: service 'vm:100' not on this node
Dec 21 10:42:17 proxmox1test pve-ha-lrm[26097]: service 'vm:100' not on this node
Dec 21 10:42:27 proxmox1test pve-ha-lrm[26119]: service 'vm:100' not on this node
Dec 21 10:42:37 proxmox1test pve-ha-lrm[26135]: service 'vm:100' not on this node
Dec 21 10:42:47 proxmox1test pve-ha-lrm[26151]: service 'vm:100' not on this node
Dec 21 10:42:57 proxmox1test pve-ha-lrm[26167]: service 'vm:100' not on this node
Dec 21 10:43:07 proxmox1test pve-ha-lrm[26183]: service 'vm:100' not on this node
Dec 21 10:43:17 proxmox1test pve-ha-lrm[26199]: service 'vm:100' not on this node
Dec 21 10:43:27 proxmox1test pve-ha-lrm[26221]: service 'vm:100' not on this node
Dec 21 10:43:37 proxmox1test pve-ha-lrm[26237]: service 'vm:100' not on this node
Dec 21 10:43:47 proxmox1test pve-ha-lrm[26253]: service 'vm:100' not on this node
Dec 21 10:43:57 proxmox1test pve-ha-lrm[26269]: service 'vm:100' not on this node
Dec 21 10:44:07 proxmox1test pve-ha-lrm[26285]: service 'vm:100' not on this node
Dec 21 10:44:17 proxmox1test pve-ha-lrm[26301]: service 'vm:100' not on this node
Dec 21 10:44:27 proxmox1test pve-ha-lrm[26323]: service 'vm:100' not on this node
Dec 21 10:44:37 proxmox1test pve-ha-lrm[26339]: service 'vm:100' not on this node
Dec 21 10:44:47 proxmox1test pve-ha-lrm[26355]: service 'vm:100' not on this node
Dec 21 10:44:57 proxmox1test pve-ha-lrm[26371]: service 'vm:100' not on this node
Dec 21 10:45:01 proxmox1test CRON[26373]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 21 10:45:07 proxmox1test pve-ha-lrm[26390]: service 'vm:100' not on this node
Dec 21 10:45:17 proxmox1test pve-ha-lrm[26406]: service 'vm:100' not on this node
Dec 21 10:45:27 proxmox1test pve-ha-lrm[26429]: service 'vm:100' not on this node
Dec 21 10:45:37 proxmox1test pve-ha-lrm[26445]: service 'vm:100' not on this node
Dec 21 10:45:47 proxmox1test pve-ha-lrm[26461]: service 'vm:100' not on this node
Dec 21 10:45:57 proxmox1test pve-ha-lrm[26478]: service 'vm:100' not on this node
Dec 21 10:46:01 proxmox1test pvedaemon[522]: <root@pam> successful auth for user 'root@pam'
Dec 21 10:46:07 proxmox1test pve-ha-lrm[26494]: service 'vm:100' not on this node
Dec 21 10:46:17 proxmox1test pve-ha-lrm[26514]: service 'vm:100' not on this node
Dec 21 10:46:27 proxmox1test pve-ha-lrm[26543]: service 'vm:100' not on this node
Dec 21 10:46:37 proxmox1test pve-ha-lrm[26559]: service 'vm:100' not on this node
Dec 21 10:46:47 proxmox1test pve-ha-lrm[26575]: service 'vm:100' not on this node
Dec 21 10:46:57 proxmox1test pve-ha-lrm[26595]: service 'vm:100' not on this node
Dec 21 10:47:07 proxmox1test pve-ha-lrm[26612]: service 'vm:100' not on this node
Dec 21 10:47:10 proxmox1test pve-ha-crm[2476]: service 'vm:100': state changed from 'started' to 'request_stop'
Dec 21 10:47:17 proxmox1test pve-ha-lrm[26628]: service 'vm:100' not on this node
Dec 21 10:47:20 proxmox1test pve-ha-crm[2476]: service 'vm:100': state changed from 'request_stop' to 'error'
Dec 21 10:47:20 proxmox1test pve-ha-crm[2476]: service 'vm:100': state changed from 'error' to 'stopped'
Dec 21 10:47:20 proxmox1test pve-ha-crm[2476]: fixup service 'vm:100' location (proxmox1test => proxmox3test
Dec 21 10:47:29 proxmox1test pvedaemon[6234]: <root@pam> successful auth for user 'root@pam'
Dec 21 10:47:40 proxmox1test pve-ha-crm[2476]: service 'vm:100': state changed from 'stopped' to 'started' (node = proxmox3test)
Dec 21 10:47:47 proxmox1test pmxcfs[2338]: [status] notice: received log
Dec 21 10:47:48 proxmox1test pmxcfs[2338]: [status] notice: received log
Dec 21 10:48:00 proxmox1test pvedaemon[30170]: <root@pam> successful auth for user 'root@pam'
Dec 21 10:48:00 proxmox1test pveproxy[24927]: worker exit
Dec 21 10:48:00 proxmox1test pveproxy[24924]: worker 24927 finished
Dec 21 10:48:00 proxmox1test pveproxy[24924]: starting 1 worker(s)
Dec 21 10:48:00 proxmox1test pveproxy[24924]: worker 26703 started

syslog from proxmox3
Dec 21 10:45:01 proxmox3test CRON[5200]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 21 10:46:01 proxmox3test pmxcfs[3557]: [status] notice: received log
Dec 21 10:47:29 proxmox3test pmxcfs[3557]: [status] notice: received log
Dec 21 10:47:47 proxmox3test pve-ha-lrm[5493]: starting service vm:100
Dec 21 10:47:47 proxmox3test pve-ha-lrm[5494]: start VM 100: UPID:proxmox3test:00001576:027109A8:5677CAC3:qmstart:100:root@pam:
Dec 21 10:47:47 proxmox3test pve-ha-lrm[5493]: <root@pam> starting task UPID:proxmox3test:00001576:027109A8:5677CAC3:qmstart:100:root@pam:
Dec 21 10:47:48 proxmox3test pve-ha-lrm[5493]: Task still active, waiting
Dec 21 10:47:48 proxmox3test kernel: [409626.079137] device tap100i0 entered promiscuous mode
Dec 21 10:47:48 proxmox3test kernel: [409626.096801] vmbr151: port 3(tap100i0) entered forwarding state
Dec 21 10:47:48 proxmox3test kernel: [409626.096819] vmbr151: port 3(tap100i0) entered forwarding state
Dec 21 10:47:48 proxmox3test pve-ha-lrm[5493]: <root@pam> end task UPID:proxmox3test:00001576:027109A8:5677CAC3:qmstart:100:root@pam: OK
Dec 21 10:47:48 proxmox3test pve-ha-lrm[5493]: service status vm:100 started
Dec 21 10:47:51 proxmox3test kernel: [409629.185887] kvm: zapping shadow pages for mmio generation wraparound
Dec 21 10:47:51 proxmox3test kernel: [409629.193191] kvm: zapping shadow pages for mmio generation wraparound
Dec 21 10:48:00 proxmox3test pmxcfs[3557]: [status] notice: received log
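Most of the proxmox1 log is the same LRM line repeated every ten seconds, which makes the CRM state changes easy to miss. A small filter along these lines could pull out only the HA events for one service. It is a sketch: ha_events is a made-up helper name, and /var/log/syslog as the default log file is an assumption about the node's logging setup.

```shell
# Sketch only: print pve-ha-crm / pve-ha-lrm lines for one service,
# dropping the repeated "not on this node" message.
# ha_events is a hypothetical helper name, not a Proxmox tool.
ha_events() {
    sid="$1"                              # service ID, e.g. vm:100
    logfile="${2:-/var/log/syslog}"       # default path is an assumption

    # Keep only HA manager daemons, then only lines mentioning the service
    # (plain substring match, so vm:100 would also match vm:1000),
    # then drop the periodic LRM noise.
    grep -E 'pve-ha-(crm|lrm)\[' "$logfile" \
        | grep -F "$sid" \
        | grep -v 'not on this node'
}
```

Run as ha_events vm:100 on each node, this would leave the state-change, fixup and start lines from the logs above and skip the rest.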
 

OK, that would be great. Better too much information than too little.

Please also post the logs in [code] logs here ... [/code] tags, which makes them a lot easier to read, or attach them to the post.
 
