Cluster Node unreachable during 'Move Disk'

BORNXenon

I've recently moved from a single node on RAID10 to a 3-node cluster on CEPH, and since then one particular VM has had intermittent speed issues.
The guest in question is a Windows Server 2008 R2. It serves web applications through IIS which pull data from a separate, dedicated SQL Server (Linux), hosts a legacy app that pulls data from an Access database local to the server, and also stores and serves the data for our accounts package (Sage 50).

Since moving to CEPH, the server seems to have some issues. It serves the SQL-based applications with no problems, really quite speedily; however, the legacy app which uses a local Access database would not load initially, and accessing the accounts data is sluggish. While these issues were present, I accessed the console of the server and there was a noticeable delay between clicking on something (like the Start button) and anything actually happening. It does eventually speed up, but some time later exhibits similar behaviour.

The performance metrics on the PVE Summary of the guest suggest that it isn't being worked hard, and the metrics inside the VM suggest the same, so I'm a bit stumped.
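In case it helps with diagnosis, the guest's virtual hardware (disk bus, cache mode, backing storage) can be dumped like this; 100 is just a placeholder VMID:

Code:
# show the VM's hardware config, including the disk line (bus, cache mode, storage)
qm config 100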
In order to see if CEPH is the culprit, I moved the disk of the VM to the ZFS local storage of the node, and that's when the fun began!
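If I have to repeat the move, I'm tempted to throttle it so the node isn't starved. My understanding is something like this works, provided the PVE version supports a bandwidth limit on disk moves (VMID 100, disk scsi0 and target storage local-zfs are placeholders):

Code:
# move the disk with a ~50 MiB/s cap (bwlimit is in KiB/s)
qm move_disk 100 scsi0 local-zfs --bwlimit 51200

# or set a cluster-wide default for all move operations in /etc/pve/datacenter.cfg
bwlimit: move=51200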

The node became only intermittently available, I was unable to log on to the node's GUI, and the CEPH monitor on the node kept dropping; the VMs running on the node did, however, remain accessible throughout with no loss of service.
Once the move finished, the node became fully accessible again and the CEPH monitor came back up.
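For what it's worth, I assume the monitor drops should show up in the logs on the affected node; something along these lines ought to reveal whether it was losing quorum or just timing out:

Code:
# overall cluster state and quorum
ceph -s
# monitor daemon log on the affected node (mon id mrpve3)
journalctl -u ceph-mon@mrpve3 --since "1 hour ago"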

I'm not sure what the cause of these issues may be, whether network related or disk related. The I/O delay metric showed in the region of 80% during the move (when I did manage to get onto the node) and stayed at about 12% after the move completed; it seems to have settled back down to near 0% again now though.
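Next time I'll try to capture per-disk stats while the I/O delay is that high; iostat from the sysstat package should show whether it's the two local ZFS mirror disks or the CEPH OSDs being saturated:

Code:
# extended per-device stats, refreshed every 2 seconds
# %util close to 100 on the ZFS mirror pair would point at the local disks
iostat -xm 2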


Node specs are as below:

Node 1:
Fujitsu Primergy RX300 S6
Xeon X5647 @ 2.93GHz (2 Sockets) 16 cores total
96GB RAM
2x Onboard 1Gbps
Fujitsu D2755-A11 Dual 10GBE network card
LSI 9200-8i HBA in IT Mode
2x 146GB 10K 2.5" SAS disks configured in ZFS Mirror
6x 300GB 10K 2.5" SAS disks for CEPH

Node 2:
Fujitsu Primergy RX300 S6
Xeon X5645 @ 2.40GHz (1 Socket) 12 cores total
80GB RAM
2x Onboard 1Gbps
Fujitsu D2755-A11 Dual 10GBE network card
LSI 9200-8i HBA in IT Mode
2x 146GB 10K 2.5" SAS disks configured in ZFS Mirror
6x 300GB 10K 2.5" SAS disks for CEPH

Node 3 (Node that experienced issues):
Fujitsu Primergy RX300 S6
Xeon X5645 @ 2.40GHz (1 Socket) 12 cores total
80GB RAM
2x Onboard 1Gbps
Fujitsu D2755-A11 Dual 10GBE network card
LSI 9200-8i HBA in IT Mode
2x 146GB 10K 2.5" SAS disks configured in ZFS Mirror
6x 300GB 10K 2.5" SAS disks for CEPH

Proxmox public network is on one of the 1Gbps links
Corosync is on the other
CEPH public and cluster networks are on one of the 10GBE links, connected to a 4-port Mikrotik 10GBE switch using SFP+ DAC cables
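I haven't yet confirmed whether corosync stayed healthy while the node was struggling; as far as I know the dedicated ring can be checked like this:

Code:
# show corosync ring/link status on the node
corosync-cfgtool -s
# recent membership / totem messages
journalctl -u corosync --since "1 hour ago"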

The rbd pool has the default 128 PGs, which I believe isn't sufficient, so I was planning on upping this to 512 PGs this weekend, although the autoscaler isn't providing any warnings about placement group sizing.
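For the PG change itself, my understanding is it's just the following (on Nautilus pgp_num should follow automatically, on older releases it may need setting separately):

Code:
# see what the autoscaler thinks of the current pool sizing
ceph osd pool autoscale-status
# raise the PG count on the rbd pool
ceph osd pool set rbd pg_num 512
# only needed on pre-Nautilus clusters
ceph osd pool set rbd pgp_num 512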

Config and Crush map below:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.0.3.201/24
     fsid = 68a9fad2-615a-4ba6-a32d-dd6e108b8ad8
     mon_allow_pool_delete = true
     mon_host = 10.0.3.201 10.0.3.202 10.0.3.203
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.0.3.201/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.mrpve1]
     host = mrpve1
     mds_standby_for_name = pve

[mds.mrpve2]
     host = mrpve2
     mds_standby_for_name = pve

[mds.mrpve3]
     host = mrpve3
     mds_standby_for_name = pve

[mon.mrpve1]
     public_addr = 10.0.3.201

[mon.mrpve2]
     public_addr = 10.0.3.202

[mon.mrpve3]
     public_addr = 10.0.3.203

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host mrpve1 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 1.637
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.273
    item osd.1 weight 0.273
    item osd.2 weight 0.273
    item osd.3 weight 0.273
    item osd.4 weight 0.273
    item osd.16 weight 0.273
}
host mrpve2 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 1.637
    alg straw2
    hash 0    # rjenkins1
    item osd.5 weight 0.273
    item osd.6 weight 0.273
    item osd.7 weight 0.273
    item osd.8 weight 0.273
    item osd.9 weight 0.273
    item osd.17 weight 0.273
}
host mrpve3 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 1.637
    alg straw2
    hash 0    # rjenkins1
    item osd.10 weight 0.273
    item osd.11 weight 0.273
    item osd.12 weight 0.273
    item osd.13 weight 0.273
    item osd.14 weight 0.273
    item osd.15 weight 0.273
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 4.910
    alg straw2
    hash 0    # rjenkins1
    item mrpve1 weight 1.637
    item mrpve2 weight 1.637
    item mrpve3 weight 1.637
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

Any ideas, firstly, why the VM may be slow given my config (all Linux guests are fast and responsive), and secondly, why Node 3 went off-grid during the 'Move Disk' operation?
I was planning on getting a second 10GBE switch and moving the Proxmox connection to the spare 10GBE link so that I can connect to a FreeNAS for fast VM backups, but now I'm wondering if I need to separate the CEPH public and cluster networks?
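If separating them does turn out to be worthwhile, my understanding is it's just a matter of pointing cluster_network at a different subnet on the second 10GBE port and then restarting the OSDs one node at a time (10.0.4.0/24 below is a made-up subnet for the example):

Code:
# /etc/pve/ceph.conf (excerpt)
[global]
     public_network = 10.0.3.0/24
     cluster_network = 10.0.4.0/24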
 
