Host was reset randomly

Alexander Nilsson

New Member
May 26, 2019
Hi all! I've been having some trouble with my hosts randomly rebooting on me. It has happened a few times now and I don't know how to investigate it. Below are the syslogs immediately before and after the last reset. Notice the "^@^@^@^@", which I would guess is where something (the watchdog?) reset the server forcefully.

The last time this happened I had been running a backup job, but that had finished more than an hour before. The logs mention the replication runner just before the reset, but that has worked flawlessly as far as I know; at least I see no errors in the GUI.

The good news is that HA (with replication) worked as it should and migrated the most important VMs and CTs to another host, so now I have that tested :).

Just to be clear about my actual question: How should I proceed with my investigation? I realize that you probably need more information than I have provided, but what do you require?

Code:
May 26 14:04:00 vmh1 systemd[1]: Starting Proxmox VE replication runner...
May 26 14:04:00 vmh1 systemd[1]: Started Proxmox VE replication runner.
May 26 14:04:50 vmh1 corosync[1969]: error   [TOTEM ] FAILED TO RECEIVE
May 26 14:04:50 vmh1 corosync[1969]:  [TOTEM ] FAILED TO RECEIVE
May 26 14:04:52 vmh1 corosync[1969]: notice  [TOTEM ] A new membership (192.168.1.51:2424) was formed. Members left: 1 3
May 26 14:04:52 vmh1 corosync[1969]: notice  [TOTEM ] Failed to receive the leave message. failed: 1 3
May 26 14:04:52 vmh1 corosync[1969]: warning [CPG   ] downlist left_list: 2 received
May 26 14:04:52 vmh1 corosync[1969]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 26 14:04:52 vmh1 corosync[1969]: notice  [QUORUM] Members[1]: 2
May 26 14:04:52 vmh1 corosync[1969]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 26 14:04:52 vmh1 corosync[1969]:  [TOTEM ] A new membership (192.168.1.51:2424) was formed. Members left: 1 3
May 26 14:04:52 vmh1 corosync[1969]:  [TOTEM ] Failed to receive the leave message. failed: 1 3
May 26 14:04:52 vmh1 corosync[1969]:  [CPG   ] downlist left_list: 2 received
May 26 14:04:52 vmh1 pmxcfs[1860]: [dcdb] notice: members: 2/1860
May 26 14:04:52 vmh1 pmxcfs[1860]: [status] notice: members: 2/1860
May 26 14:04:52 vmh1 corosync[1969]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 26 14:04:52 vmh1 corosync[1969]:  [QUORUM] Members[1]: 2
May 26 14:04:52 vmh1 corosync[1969]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 26 14:04:52 vmh1 pmxcfs[1860]: [status] notice: node lost quorum
May 26 14:04:53 vmh1 pve-ha-lrm[2099]: lost lock 'ha_agent_vmh1_lock - cfs lock update failed - Permission denied
May 26 14:04:57 vmh1 pve-ha-crm[2064]: status change slave => wait_for_quorum
May 26 14:04:58 vmh1 pve-ha-lrm[2099]: status change active => lost_agent_lock
May 26 14:05:00 vmh1 systemd[1]: Starting Proxmox VE replication runner...
May 26 14:05:00 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:01 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:02 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:03 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:04 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:05 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:06 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:07 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:08 vmh1 pvesr[7662]: trying to acquire cfs lock 'file-replication_cfg' ...
May 26 14:05:09 vmh1 pvesr[7662]: error with cfs lock 'file-replication_cfg': no quorum!
May 26 14:05:09 vmh1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 26 14:05:09 vmh1 systemd[1]: Failed to start Proxmox VE replication runner.
May 26 14:05:09 vmh1 systemd[1]: pvesr.service: Unit entered failed state.
May 26 14:05:09 vmh1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@May 26 14:06:55 vmh1 systemd-modules-load[486]: Inserted module 'iscsi_tcp'
May 26 14:06:55 vmh1 systemd-modules-load[486]: Inserted module 'ib_iser'
May 26 14:06:55 vmh1 systemd-modules-load[486]: Inserted module 'vhost_net'
May 26 14:06:55 vmh1 systemd[1]: Starting Flush Journal to Persistent Storage...
May 26 14:06:55 vmh1 systemd[1]: Started Set the console keyboard layout.
May 26 14:06:55 vmh1 systemd[1]: Started Flush Journal to Persistent Storage.
May 26 14:06:55 vmh1 systemd[1]: Started udev Coldplug all Devices.
May 26 14:06:55 vmh1 systemd[1]: Starting udev Wait for Complete Device Initialization...
May 26 14:06:55 vmh1 systemd[1]: Found device /dev/disk/by-uuid/2611-EE05.
May 26 14:06:55 vmh1 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
May 26 14:06:55 vmh1 systemd[1]: Found device /dev/pve/swap.
May 26 14:06:55 vmh1 systemd[1]: Activating swap /dev/pve/swap...
May 26 14:06:55 vmh1 systemd[1]: Activated swap /dev/pve/swap.
May 26 14:06:55 vmh1 systemd[1]: Reached target Swap.
May 26 14:06:55 vmh1 kernel: [    0.000000] Linux version 4.15.18-14-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-39 (Wed, 15 May 2019 06:56:23 +0200) ()
May 26 14:06:55 vmh1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-14-pve root=/dev/mapper/pve-root ro quiet
 
May 26 14:04:52 vmh1 pmxcfs[1860]: [status] notice: node lost quorum
Your node lost quorum, and since you have HA enabled, the node fenced itself to prevent a split-brain situation.
I would look at your network to find out why the node could not keep quorum with the other nodes.
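
If you want to rule out the cluster network, something like the commands below would be a starting point. The node names and log file names are just examples for your setup, and omping has to be installed separately on all nodes first.

Code:
# current quorum and membership view, plus local corosync ring status
pvecm status
corosync-cfgtool -s

# pull the corosync / pmxcfs messages from around the incident out of the
# rotated syslog files on the node that was fenced
grep -E 'corosync|pmxcfs' /var/log/syslog /var/log/syslog.1

# stress-test multicast between all cluster nodes (run on every node at the
# same time); you want to see a steady ~100% response rate
omping -c 10000 -i 0.001 -F -q vmh1 vmh2 vmh3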
 
Ah, yes, I gathered as much, but I haven't been able to find anything wrong with the network yet. Apart from the 3 cluster hosts and their VMs there is only one more PC on that network segment (for administration), and the traffic was low during that time as well... Are there more logs I can look into? Or maybe some logging I can turn on to better investigate the next time this happens?

Mostly I'm looking for troubleshooting tips here.
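
For example, would turning up corosync's own logging be a sensible next step? Something along these lines in /etc/pve/corosync.conf is just my guess from the corosync.conf(5) man page, I haven't changed anything yet:

Code:
logging {
  # default is "debug: off"; "on" logs debug-level messages from all
  # corosync subsystems to syslog
  debug: on
  to_syslog: yes
}
# the file is cluster-wide under /etc/pve, so if I understand the docs
# right, config_version in the totem section also has to be bumped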
 
