Hi,
We've recently configured a Proxmox cluster with two nodes and shared storage. All three servers are connected directly via 10Gbit SFP+ links (Intel 82599EB NICs). The storage server runs the latest OpenMediaVault and provides 32TB of RAID10 (mdadm based) space over NFS for backups, ISOs and VMs; it also hosts a 1GB quorum disk shared via iSCSI (fileio backend). The entire configuration is pretty simple and looks like this: NODE1->STORAGE<-NODE2. Unfortunately we couldn't afford a decent 10Gbit switch, which is why the storage server contains a dual-port card and acts as a switch between NODE1 and NODE2 - it's a simple bridge configured with brctl. All three machines are Supermicro Xeon based servers, with 16GB of RAM in the storage box and 64GB in each node.
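For reference, the bridge on the storage side is nothing fancy - roughly the equivalent of the /etc/network/interfaces stanza below (the interface names and the address are placeholders, ours may differ slightly):
Code:
# Storage server: bridge the two SFP+ ports so NODE1 and NODE2 can reach
# each other through this box (no dedicated 10Gbit switch available).
# eth2/eth3 and 10.10.10.220 are placeholders.
auto br0
iface br0 inet static
        address 10.10.10.220
        netmask 255.255.255.0
        bridge_ports eth2 eth3
        bridge_stp off
        bridge_fd 0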
Every machine is absolutely up to date, and there's only one VM: a Windows 2012R2 guest configured in HA mode. Everything works perfectly fine except for backups (we're using LZO compression, if that matters). During the backup task something strange happens: the node currently hosting the VM gets evicted from the cluster, the backup job is interrupted, the VM gets stuck in a "locked" state and everything goes south.
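Once the evicted node is back up, the only way I know of to recover the guest is to clear the stale backup lock by hand, roughly like this (100 is our only VMID):
Code:
# On the node that currently owns VM 100: drop the stale "backup" lock
# left behind by the interrupted vzdump job, then start the guest again.
qm unlock 100
qm start 100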
Could anybody please take a look at the logs below and help us track down the problem?
In this specific case sul-node0001 was hosting the VM and performing the backup, while sul-node0002 was completely idle, acting as the failover host (and part of the cluster, of course).
Logs for NODE1:
Code:
Nov 3 12:22:05 sul-node0001 pvedaemon[4016]: <root@pam> successful auth for user 'root@pam'
Nov 3 12:22:06 sul-node0001 rgmanager[11957]: [pvevm] VM 100 is running
Nov 3 12:22:41 sul-node0001 pvestatd[4037]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_BACKUP' failed: got timeout
Nov 3 12:22:43 sul-node0001 pvestatd[4037]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_MSSQL' failed: got timeout
Nov 3 12:22:45 sul-node0001 pvestatd[4037]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_STORAGE' failed: got timeout
Nov 3 12:22:45 sul-node0001 pvestatd[4037]: status update time (6.066 seconds)
Nov 3 12:22:46 sul-node0001 rgmanager[12045]: [pvevm] VM 100 is running
Nov 3 12:23:06 sul-node0001 rgmanager[12099]: [pvevm] VM 100 is running
Nov 3 12:23:26 sul-node0001 rgmanager[12152]: [pvevm] VM 100 is running
Nov 3 12:23:57 sul-node0001 rgmanager[12284]: [pvevm] VM 100 is running
Nov 3 12:24:06 sul-node0001 rgmanager[12320]: [pvevm] VM 100 is running
Nov 3 12:24:09 sul-node0001 pveproxy[4042]: worker 9158 finished
Nov 3 12:24:09 sul-node0001 pveproxy[4042]: starting 1 worker(s)
Nov 3 12:24:09 sul-node0001 pveproxy[4042]: worker 12353 started
Nov 3 12:24:36 sul-node0001 kernel: kvm: exiting hardware virtualization
Nov 3 12:24:36 sul-node0001 kernel: sd 0:0:1:0: [sdb] Synchronizing SCSI cache
Logs for NODE2:
Code:
Nov 3 12:18:51 sul-node0002 pmxcfs[3247]: [status] notice: received log
Nov 3 12:19:29 sul-node0002 qdiskd[3448]: qdisk cycle took more than 1 second to complete (1.340000)
Nov 3 12:22:05 sul-node0002 pmxcfs[3247]: [status] notice: received log
Nov 3 12:22:34 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_BACKUP' failed: got timeout
Nov 3 12:22:36 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_MSSQL' failed: got timeout
Nov 3 12:22:38 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_STORAGE' failed: got timeout
Nov 3 12:22:38 sul-node0002 pvestatd[4430]: status update time (6.056 seconds)
Nov 3 12:22:44 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_BACKUP' failed: got timeout
Nov 3 12:22:46 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_MSSQL' failed: got timeout
Nov 3 12:22:48 sul-node0002 pvestatd[4430]: WARNING: command 'df -P -B 1 /mnt/pve/NFS_STORAGE' failed: got timeout
Nov 3 12:22:48 sul-node0002 pvestatd[4430]: status update time (6.056 seconds)
Nov 3 12:24:30 sul-node0002 qdiskd[3448]: Assuming master role
Nov 3 12:24:31 sul-node0002 qdiskd[3448]: Writing eviction notice for node 1
Nov 3 12:24:32 sul-node0002 qdiskd[3448]: Node 1 evicted
Nov 3 12:25:29 sul-node0002 corosync[3397]: [TOTEM ] A processor failed, forming new configuration.
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] CLM CONFIGURATION CHANGE
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] New Configuration:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] #011r(0) ip(10.10.10.222)
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] Members Left:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] #011r(0) ip(10.10.10.221)
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] Members Joined:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [QUORUM] Members[1]: 2
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] CLM CONFIGURATION CHANGE
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] New Configuration:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] #011r(0) ip(10.10.10.222)
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] Members Left:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CLM ] Members Joined:
Nov 3 12:25:31 sul-node0002 corosync[3397]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 3 12:25:31 sul-node0002 rgmanager[3888]: State change: sul-node0001 DOWN
Nov 3 12:25:31 sul-node0002 corosync[3397]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.222) ; members(old:2 left:1)
Nov 3 12:25:31 sul-node0002 pmxcfs[3247]: [dcdb] notice: members: 2/3247
Nov 3 12:25:31 sul-node0002 pmxcfs[3247]: [dcdb] notice: members: 2/3247
Nov 3 12:25:31 sul-node0002 kernel: dlm: closing connection to node 1
Nov 3 12:25:31 sul-node0002 corosync[3397]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 3 12:25:31 sul-node0002 fenced[3603]: fencing node sul-node0001
Nov 3 12:25:32 sul-node0002 fence_ipmilan: Parse error: Ignoring unknown option 'nodename=sul-node0001
Nov 3 12:25:47 sul-node0002 fenced[3603]: fence sul-node0001 success
Nov 3 12:25:48 sul-node0002 rgmanager[3888]: Starting stopped service pvevm:100
Nov 3 12:25:48 sul-node0002 rgmanager[9284]: [pvevm] Move config for VM 100 to local node
Nov 3 12:25:48 sul-node0002 pvevm: <root@pam> starting task UPID:sul-node0002:00002458:00050553:5457663C:qmstart:100:root@pam:
Nov 3 12:25:48 sul-node0002 task UPID:sul-node0002:00002458:00050553:5457663C:qmstart:100:root@pam:: start VM 100: UPID:sul-node0002:00002458:00050553:5457663C:qmstart:100:root@pam:
Nov 3 12:25:48 sul-node0002 task UPID:sul-node0002:00002458:00050553:5457663C:qmstart:100:root@pam:: VM is locked (backup)
Nov 3 12:25:48 sul-node0002 pvevm: <root@pam> end task UPID:sul-node0002:00002458:00050553:5457663C:qmstart:100:root@pam: VM is locked (backup)
Nov 3 12:25:48 sul-node0002 rgmanager[3888]: start on pvevm "100" returned 1 (generic error)
Nov 3 12:25:48 sul-node0002 rgmanager[3888]: #68: Failed to start pvevm:100; return value: 1
Nov 3 12:25:48 sul-node0002 rgmanager[3888]: Stopping service pvevm:100
Nov 3 12:25:49 sul-node0002 rgmanager[9306]: [pvevm] VM 100 is already stopped
Nov 3 12:25:49 sul-node0002 rgmanager[3888]: Service pvevm:100 is recovering
Nov 3 12:25:49 sul-node0002 rgmanager[3888]: #71: Relocating failed service pvevm:100
Nov 3 12:25:49 sul-node0002 rgmanager[3888]: Service pvevm:100 is stopped
There's nothing extraordinary in the storage server's log - nfsd doesn't get messed up, and everything seems perfectly OK on that side. Could it be that the 10Gbit link gets over-saturated and drops some vital packets, which in turn leads node0002 to believe node0001 has gone down? That seems both possible and highly unlikely to me, because peak HDD transfers on the storage server oscillate around 350MB/s, which shouldn't choke those Intel 82599s, right?
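If saturation really is the culprit, I suppose we could check for drops on the bridged ports and throttle the backup itself - something along these lines (the interface name and the bandwidth limit are just guesses on my part):
Code:
# On the storage server: look for rx/tx drops on the bridged 10Gbit ports
ethtool -S eth2 | grep -i -E 'drop|discard|error'
ip -s link show br0

# On the node running the backup: cap vzdump bandwidth (in KB/s) so cluster
# and quorum traffic isn't starved; 200000 KB/s is roughly 200MB/s
vzdump 100 --storage NFS_BACKUP --compress lzo --bwlimit 200000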
Could you please provide me with any tips?