Interface broken after update - tasks aren't carried out

Ulrar

Hi,

I just updated a three-node cluster to the latest version (4.1-22 from the repo). I did that one node at a time; the first node went fine, but the second and third nodes crashed during the update and I had to run dpkg --configure -a to finish it.
I did that over the course of today, and those two crashes caused me a lot of trouble since they forced two heals of the GlusterFS volume installed on the machines, one after the other, locking all the VMs, but that's not the point.

For each update, I live migrated all the VMs to the other two nodes and that worked fine.
Now that all three are up to date, I wanted to move some machines back to the third node and discovered that I can't actually do anything: when I try to live migrate a VM, stop a VM, or start a VM, I just get an "OK" in the task list a few seconds later and nothing happens.
In the log I just get this:

pvedaemon[11735]: <root@pam> starting task UPID:ns3624511:00003668:005A2E0E:57118C87:hastart:101:root@pam:

And absolutely nothing else after that. I tried to move VMs from each node to the others, both online and offline. I ended up powering off a VM through SSH to test, and I can't start it up again; I just get an OK again.
Any idea what I can do to unblock this? I can't just reboot a node since that would cause another Gluster heal.

It does look like I can move a disk between storages though, but of course the VM gets rebooted at the end of the operation and just never starts again.

Thanks

EDIT: To be sure, I tried hooking a disk from a different storage back up to a powered-down VM (in case the GlusterFS is broken somehow), but it still won't start. Seems to be a legitimate Proxmox problem this time.

EDIT 2: When I use qm start 101, I only get this output: "Executing HA start for VM 101". In the logs I see the starting task and, one second later, the end task, with no errors.
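For what it's worth, the "Executing HA start for VM 101" output suggests qm is handing the start request off to the HA stack rather than starting the VM directly; assuming the ha-manager CLI shipped with PVE 4.x, the HA side can be inspected with:

ha-manager status    # state of the CRM and of each HA-managed service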
 
1. From what version are you upgrading?
2. Can you try to remove HA from the VM and see if it starts?
3. Do you see any errors about a locked VM? If yes, try to execute "qm unlock <VMID>"
4. What does "pvecm status" show? (A command sketch for these checks follows below.)
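A minimal sketch of checks 2-4 (plus a version check) from the shell, using VM 101 from the first post as the example and assuming the PVE 4.x CLI syntax:

pveversion -v               # show the currently installed Proxmox package versions
ha-manager remove vm:101    # take the VM out of HA management (check 2)
qm unlock 101               # clear a possible stale lock (check 3)
pvecm status                # quorum and cluster membership (check 4)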
 
1) Not sure, something like 4.1-8? It was installed in early February, so whatever version was in the repository back then. I do have backups of the full filesystems if need be; is there a file I could check the versions in?

2) I tried; it didn't change anything.

3) No errors like that. As I mentioned, I really don't have any kind of output in the logs apart from the starting task and, when using the CLI, the end task. At least nothing I can see with journalctl -b. I tried to unlock one anyway, no change.
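For reference, the journal can be narrowed down to just the two HA daemons, which makes their activity easier to spot than in the full boot log (plain journalctl options, nothing Proxmox-specific):

journalctl -b -u pve-ha-crm -u pve-ha-lrm    # current boot, HA manager units only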

4)

Quorum information
------------------
Date: Sat Apr 16 10:49:57 2016
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 27436
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.0.1 (local)
0x00000002 1 10.10.0.2
0x00000003 1 10.10.0.3


Thanks for the help!
 
Hmm, very strange. Are you absolutely sure there are no files currently being healed by Gluster (gluster volume heal <DATASTORE> info)? If you create a new VM, will that one start? Can you show the .conf of a VM that doesn't start (/etc/pve/nodes/<NODENAME>/qemu-server/<VMID>.conf)?
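As a side note, qm config should print essentially the same information without digging into /etc/pve by hand, e.g.:

qm config 101    # dump the configuration of VM 101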
 
Yep, I'm sure. I even copied the disks of a VM to another Proxmox host outside the cluster, and it boots fine there (I couldn't use vzdump because it does the same OK-with-no-output thing, so I just copied the disks).

I created a new VM with a disk on the Gluster and it booted fine! That's a good start. Right after that I tried to start VM 101 again (the one I powered down to test and can't bring back up) and I got this in the logs:
qm[19331]: VM 101 qmp command failed - VM 101 not running

It still didn't start, I still got an OK in the interface, and now when I try to start it I'm back to no output.
Here is its .conf file:

bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 16384
name: Some_Name
net0: e1000=3A:31:32:66:66:62,bridge=vmbr1
numa: 0
ostype: l26
smbios1: uuid=fc35de6e-c207-435b-8697-e43e14979674
sockets: 1
unused0: vm-storage:101/vm-101-disk-1.qcow2
virtio0: gluster:101/vm-101-disk-1.qcow2,iothread=1,size=25G
 
Hmm, I don't see anything odd in the .conf of your non-booting VM. However, given that a new VM boots fine, you might think the problem must be somewhere in the non-booting VM's config. As much as I want to help you solve this, I have no idea right now. I hope someone from the Proxmox team, or another forum member, comes along soon to help you out. Sorry.
 
Well, thanks anyway!
Since a new VM is starting up fine, I guess worst case I will just re-create the VMs and move the disks; that should work.
I'd rather avoid it, as it means downtime for what looks like no reason, but at least now I don't have to be scared of a node crash bringing down some VMs; I have a way to bring them back up!
 
Looks like the problem is the first node.
I re-created a VM and moved its disk to the new ID; it booted fine, but after that I got the same problem as before - no actions on it do anything.
Then I tried re-creating the VM but migrating it to the third node before booting it; that VM is fine: it can be started, stopped, live migrated to the second node and back.
But if I try to live migrate that VM to the first node, it doesn't do it, and now the VM is "broken": still running on the third node, but I can't do anything with it.

It looks like, for some reason, VMs are broken as soon as they run or try to run on the first node. It's weird since the first node was the one I updated first, and it worked fine for a whole day before I had the opportunity to update the second, and again a few hours before I updated the third. This problem only started when the third got updated.
But it's not as critical as it was; now at least I can manually bring up the VMs if needed, I just can't migrate them away from the first node if I run them there.

EDIT: Not that simple. If I just leave the VM alone it seems to lose its ability to migrate anyway, so I guess it has nothing to do with the first node.
 
Hi,

Just did some more testing and I figured out the exact cause: enabling HA. If I enable HA on a VM, I can't do anything with it. If I disable it (as in, delete the entry in the HA menu) the VM is back to normal.
That's very problematic.
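For reference, the CLI equivalent of adding or deleting that HA entry should be the following (assuming the PVE 4.x ha-manager syntax, with vm:101 as the example from this thread):

ha-manager add vm:101       # put the VM under HA management
ha-manager remove vm:101    # delete the entry again, same as removing it in the GUI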
 
Hi
I read from your edited first post that the issue (start tasks not being executed by the HA resource manager) happens whether you're using the GUI or the command line.
If yes, this would mean that the problem comes from the HA stack and not from the new GUI.

What is the output of the following?
service pve-ha-crm status
service pve-ha-lrm status
 
Hi,



Does seem that way, yes.

Here is the output from the first node:

● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
   Active: active (running) since Fri 2016-04-15 10:27:09 CEST; 3 days ago
 Main PID: 2300 (pve-ha-crm)
   CGroup: /system.slice/pve-ha-crm.service
           └─2300 pve-ha-crm

Apr 15 10:27:09 first_node pve-ha-crm[2300]: starting server
Apr 15 10:27:09 first_node pve-ha-crm[2300]: status change startup => wait_for_quorum
Apr 15 10:27:09 first_node systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Apr 15 10:27:19 first_node pve-ha-crm[2300]: status change wait_for_quorum => slave
Apr 15 11:46:13 first_node pve-ha-crm[2300]: status change slave => wait_for_quorum
Apr 15 11:46:23 first_node pve-ha-crm[2300]: status change wait_for_quorum => slave
Apr 15 16:28:29 first_node pve-ha-crm[2300]: status change slave => wait_for_quorum
Apr 15 16:28:39 first_node pve-ha-crm[2300]: status change wait_for_quorum => slave
Apr 15 16:59:23 first_node pve-ha-crm[2300]: status change slave => wait_for_quorum
Apr 15 16:59:33 first_node pve-ha-crm[2300]: status change wait_for_quorum => slave

● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
   Active: active (running) since Fri 2016-04-15 10:27:10 CEST; 3 days ago
 Main PID: 2309 (pve-ha-lrm)
   CGroup: /system.slice/pve-ha-lrm.service
           └─2309 pve-ha-lrm

Apr 15 21:21:04 first_node pve-ha-lrm[29441]: Task 'UPID:first_node:00007302:003BA99B:57113E67:qmigrate:102:root@pam:' still active, waiting
Apr 15 21:21:09 first_node pve-ha-lrm[29441]: Task 'UPID:first_node:00007302:003BA99B:57113E67:qmigrate:102:root@pam:' still active, waiting
Apr 15 21:21:14 first_node pve-ha-lrm[29441]: Task 'UPID:first_node:00007302:003BA99B:57113E67:qmigrate:102:root@pam:' still active, waiting
Apr 15 21:21:19 first_node pve-ha-lrm[29441]: Task 'UPID:first_node:00007302:003BA99B:57113E67:qmigrate:102:root@pam:' still active, waiting
Apr 15 21:21:23 first_node pve-ha-lrm[29441]: <root@pam> end task UPID:first_node:00007302:003BA99B:57113E67:qmigrate:102:root@pam: OK
Apr 16 01:51:48 first_node pve-ha-lrm[13220]: starting service vm:101
Apr 16 01:51:48 first_node pve-ha-lrm[13220]: <root@pam> starting task UPID:first_node:000033A7:0054BB2E:57117E94:qmstart:101:root@pam:
Apr 16 01:51:48 first_node pve-ha-lrm[13223]: start VM 101: UPID:first_node:000033A7:0054BB2E:57117E94:qmstart:101:root@pam:
Apr 16 01:51:50 first_node pve-ha-lrm[13220]: <root@pam> end task UPID:first_node:000033A7:0054BB2E:57117E94:qmstart:101:root@pam: OK
Apr 16 01:51:50 first_node pve-ha-lrm[13220]: service status vm:101 started

From the second node, seems more interesting:

● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
   Active: active (running) since Fri 2016-04-15 23:03:36 CEST; 2 days ago
  Process: 2380 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 2382 (pve-ha-crm)
   CGroup: /system.slice/pve-ha-crm.service
           └─2382 pve-ha-crm

Apr 18 10:38:34 second_node pve-ha-crm[2382]: fencing: acknowleged - got agent lock for node 'third_node'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'fence' => 'unknown'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: recover service 'vm:104' from fenced node 'third_node' to node 'second_node'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: got unexpected error - rename '/etc/pve/nodes/third_node/qemu-server/104.conf' to '/etc/pve/nodes/second_node/qemu-server/104.conf' f...or directory
Apr 18 10:38:44 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'unknown' => 'online'
Apr 18 10:38:44 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'online' => 'fence'
Apr 18 10:38:44 second_node pve-ha-crm[2382]: fencing: acknowleged - got agent lock for node 'third_node'
Apr 18 10:38:44 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'fence' => 'unknown'
Apr 18 10:38:44 second_node pve-ha-crm[2382]: recover service 'vm:104' from fenced node 'third_node' to node 'second_node'
Apr 18 10:38:44 second_node pve-ha-crm[2382]: got unexpected error - rename '/etc/pve/nodes/third_node/qemu-server/104.conf' to '/etc/pve/nodes/second_node/qemu-server/104.conf' f...or directory

The line in full, found in the logs:

Apr 18 10:46:04 second_node pve-ha-crm[2382]: got unexpected error - rename '/etc/pve/nodes/third_node/qemu-server/104.conf' to '/etc/pve/nodes/second_node/qemu-server/104.conf' failed - No such file or directory

● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
   Active: active (running) since Fri 2016-04-15 23:03:36 CEST; 2 days ago
  Process: 2383 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 2389 (pve-ha-lrm)
   CGroup: /system.slice/pve-ha-lrm.service
           └─2389 pve-ha-lrm

Apr 15 23:03:36 second_node pve-ha-lrm[2389]: starting server
Apr 15 23:03:36 second_node pve-ha-lrm[2389]: status change startup => wait_for_agent_lock
Apr 15 23:03:36 second_node systemd[1]: Started PVE Local HA Ressource Manager Daemon.
Apr 16 01:30:35 second_node pve-ha-lrm[2389]: successfully acquired lock 'ha_agent_second_node_lock'
Apr 16 01:30:35 second_node pve-ha-lrm[2389]: watchdog active
Apr 16 01:30:35 second_node pve-ha-lrm[2389]: status change wait_for_agent_lock => active

And the third node:

● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
   Active: active (running) since Sat 2016-04-16 01:58:14 CEST; 2 days ago
  Process: 2301 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 2303 (pve-ha-crm)
   CGroup: /system.slice/pve-ha-crm.service
           └─2303 pve-ha-crm

Apr 16 01:58:14 third_node pve-ha-crm[2303]: starting server
Apr 16 01:58:14 third_node pve-ha-crm[2303]: status change startup => wait_for_quorum
Apr 16 01:58:14 third_node systemd[1]: Started PVE Cluster Ressource Manager Daemon.

● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
   Active: active (running) since Sat 2016-04-16 01:58:15 CEST; 2 days ago
  Process: 2304 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 2310 (pve-ha-lrm)
   CGroup: /system.slice/pve-ha-lrm.service
           └─2310 pve-ha-lrm

Apr 16 01:58:15 third_node pve-ha-lrm[2310]: starting server
Apr 16 01:58:15 third_node pve-ha-lrm[2310]: status change startup => wait_for_agent_lock
Apr 16 01:58:15 third_node systemd[1]: Started PVE Local HA Ressource Manager Daemon.
 
Looks like the third node is 'blinking' for some reason, being fenced and then put back online every few seconds. The VM 104 it mentions isn't even on that node; it's already on the second one.
The problem did start when I rebooted the third node, so I suppose that makes sense, but it was working fine before. I don't see any packet loss between the second and the third node; it's very strange!
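A quick way to check which node directory actually holds that VM's config (the failing rename above is just a move inside /etc/pve, so a shell glob shows where 104.conf currently sits):

ls -l /etc/pve/nodes/*/qemu-server/104.conf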
 
What would happen if I just tried to remove HA for VM 104? Any risk of making a node crash or something? I'm thinking it would stop trying to migrate the VM from somewhere it isn't, and maybe that would free up the HA stack to do something else (hopefully not all the tasks I tried to queue since Friday, that would be bad).
 
Hi Ulrar
Disabling HA for a VM should not make the node crash; it will just shut down the VM.

It looks like you're having a cluster communication problem, so it would be better in your case to deactivate the HA features, restart the VMs, and see if you can correctly migrate the VMs between the nodes.
If everything is fine at that point and your cluster communication is correct, then re-enable HA.
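A minimal sketch of that sequence from the shell, assuming the PVE 4.x CLI, with VM 101 and <target_node> as placeholders:

ha-manager remove vm:101                  # take the VM out of HA management
qm stop 101 && qm start 101               # restart it with HA out of the picture
qm migrate 101 <target_node> --online     # check that live migration works again
ha-manager add vm:101                     # re-enable HA once cluster communication looks healthy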
 
Okay, just did this and everything seems to have worked.
No VMs were halted after removing HA, but the logs did say it removed a few stale config files, so I guess the update broke something in HA. I'm re-enabling it now and can still migrate the VMs, so everything seems okay! I didn't even need to reboot anything, that's perfect.

I just noticed something, though: I moved the disk of a VM from one Gluster to another, and the VM did not reboot at the end of the transfer. Every transfer I did these last few days ended with the VM rebooting; was that not normal? Or is it some recent change? I must say it's great, having to do these moves at night was annoying! I like Proxmox more and more!

EDIT: Oh, almost forgot, thanks! :)
I'll keep an eye on it tomorrow, but it seems solved!
 
A small follow-up trying to explain what happened regarding the HA stuff.

Apr 18 10:38:34 second_node pve-ha-crm[2382]: fencing: acknowleged - got agent lock for node 'third_node'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'fence' => 'unknown'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: recover service 'vm:104' from fenced node 'third_node' to node 'second_node'
Apr 18 10:38:34 second_node pve-ha-crm[2382]: got unexpected error - rename '/etc/pve/nodes/third_node/qemu-server/104.conf' to '/etc/pve/nodes/second_node/qemu-server/104.conf' f...or directory
Apr 18 10:38:44 second_node pve-ha-crm[2382]: node 'third_node': state changed from 'unknown' => 'online'

Hmm, the reason for this was that the third node took too long to update (or hung), so it was marked to be fenced; after that, recovering its service failed (which is really strange to me) and the HA state got out of sync.

Can it be that you started the upgrade on all nodes at the same time? We had some bugs which could lead to problems in the HA stack on a mass update of all nodes: it could be that the HA manager of a node took too long, or that updating the current master at the same time as another node caused problems. This is fixed now, but it's still safer to update one node after the other.
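For reference, that one-node-at-a-time procedure looks roughly like this (standard PVE/Debian commands; VM IDs and node names are placeholders):

qm migrate <vmid> <other_node> --online    # repeat until the node to be updated is empty
apt-get update && apt-get dist-upgrade     # upgrade this node
reboot                                     # let it rejoin the cluster, then move on to the next node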
 
No, I didn't update them at the same time. I had never updated a Proxmox cluster before, so I emptied the first node, updated it, rebooted it, then moved all the VMs from node 2 to node 1, updated, rebooted, and then did the same for node 3.
It was a good call, since both node 2 and node 3 crashed during the update; I had to finish it by hand with dpkg after their reboot.
 
I had never updated a Proxmox cluster before, so I emptied the first node, updated it, rebooted it, then moved all the VMs from node 2 to node 1, updated, rebooted, and then did the same for node 3.
Ah OK, really good practice, sorry for the implied "you updated all at once" accusation :)

OK, then the watchdog triggered because it wasn't being updated anymore and reset the node, so either the node froze (unlikely) or the HA manager took too long and you ran into a bug we fixed in the newer version.
 
OK, then the watchdog triggered because it wasn't being updated anymore and reset the node, so either the node froze (unlikely) or the HA manager took too long and you ran into a bug we fixed in the newer version.

Sorry for hijacking this thread, but is it recommended to stop watchdog fencing when performing an upgrade, to prevent this kind of issue?
 
Sorry for hijacking this thread, but is it recommended to stop watchdog fencing when performing an upgrade, to prevent this kind of issue?

This kind of issue should be fixed by now; I couldn't trigger it anymore, and I tried really hard to provoke it. But there is one thing you really have to pay attention to: there must not be any service in the error state when updating. You must fix those before updating, e.g. look at what went wrong and disable those services or remove them from HA.
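A quick pre-update check along those lines, assuming the PVE 4.x ha-manager CLI (vm:104 here is just a placeholder for whichever service shows up as errored):

ha-manager status           # look for any service in the 'error' state
ha-manager remove vm:104    # remove an errored service from HA before updating
                            # (re-add it once the underlying problem is fixed)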
 
