Hi,
I recently moved from Proxmox + iSCSI ZFS storage to a 3-node hyper-converged Proxmox cluster running Proxmox 6.3 and Ceph Octopus.
The cluster has 1 GbE interfaces for VM traffic and uses a 40 Gbps InfiniBand network for the Proxmox cluster and the Ceph cluster.
I have a redundant pair of InfiniBand switches, with different partitions for the Proxmox cluster, the Ceph frontend and the Ceph backend interfaces. I run the Ceph frontend and backend across different switches to make sure they take different paths.
The Ceph cluster has 2 storage pools with 66 disks in total (33 SSD & 30 HDD), equally distributed across the 3 nodes.
I tested the cluster before bringing it into operation (iperf, rados bench, dd) and got good read and write performance of over 1 GB/s, with no issues. But once I started loading it with VMs under normal operations, I began to get slow ops whenever VMs handle medium loads of data, such as copying large files or, most recently, installing a new VM. The slow ops show up on pretty much all OSDs, seemingly at random, across all nodes. The affected OSDs tend to 'hang', basically locking every VM on the cluster; restarting the OSDs in question solves the problem.
I've also seen slow OSD heartbeat messages, so I suspected the InfiniBand network. I set up a continuous ping and was able to trigger the issue while it ran, but the ping does not show a single RTT above 0.15 ms on the front network, and the back network occasionally goes up to 3 ms but is mostly also around 0.1 ms. Traffic on the interfaces does not go over 4-5 Gbps. The OSD ping delays are much bigger and again seem to be resolved by restarting the OSDs, so it looks to me like the OSD processes themselves are hanging. I upgraded from Nautilus to Octopus, but that did not improve the situation. I would appreciate any help or pointers.
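For reference, this is roughly what I run when an incident hits; the peer IP and OSD id below are placeholders, not values from my cluster:
# ceph health detail                          # lists the OSDs currently reporting slow ops
# ceph osd perf                               # per-OSD commit/apply latency at that moment
# ping -D -i 0.2 <backend-ip-of-peer-node>    # continuous, timestamped ping on the Ceph back network
# systemctl restart ceph-osd@<id>.service     # restarting the affected OSD clears the hang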
Here is my pveversion -v:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.6-pve1
ceph-fuse: 15.2.6-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
Benchmarks (no issues):
(inside a VM)
# dd if=/dev/zero of=here bs=20G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 4.9323 s, 435 MB/s
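Note that with bs=20G count=1 the kernel caps a single write at about 2 GiB, which is why only 2147479552 bytes were actually copied, and a single large sequential write is not very representative of VM I/O anyway. If it helps, this is the kind of fio run I could repeat inside a guest for small random writes; the fio parameters are just example values, not tuned ones:
# fio --name=randwrite --filename=/root/fio.test --size=4G --bs=4k --rw=randwrite \
      --ioengine=libaio --direct=1 --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting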
(on cluster nodes)
# rados bench -p ssd_pool 30 write --no-cleanup
Total time run: 30.0452
Total writes made: 8675
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1154.93
Stddev Bandwidth: 56.661
Max bandwidth (MB/sec): 1256
Min bandwidth (MB/sec): 1028
Average IOPS: 288
Stddev IOPS: 14.1652
Max IOPS: 314
Min IOPS: 257
Average Latency(s): 0.055374
Stddev Latency(s): 0.0133843
Max latency(s): 0.275931
Min latency(s): 0.0245358
# rados bench -p ssd_pool 30 seq
Total time run: 26.2841
Total reads made: 8675
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1320.19
Average IOPS: 330
Stddev IOPS: 23.8767
Max IOPS: 397
Min IOPS: 281
Average Latency(s): 0.0474705
Max latency(s): 0.315668
Min latency(s): 0.0128199
# rados bench -p ssd_pool 30 rand
Total time run: 30.0651
Total reads made: 10072
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1340.02
Average IOPS: 335
Stddev IOPS: 25.6896
Max IOPS: 396
Min IOPS: 297
Average Latency(s): 0.046748
Max latency(s): 0.249803
Min latency(s): 0.00535442
During normal operation, when the issue occurs:
2020-12-23T16:14:56.492856+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:97068 5.e 5:700fc2e6:::rbd_data.cf37edf65bf59f.0000000000001b6c:head [set-alloc-hint object_size 4194304 write_size 4194304,write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:23.322144+0100 currently waiting for sub ops
2020-12-23T16:14:56.492875+0100 osd.57 [WRN] slow request osd_op(client.10751556.0:1506598 5.e 5:70493bc6:::rbd_data.ebc2a19f8930c.0000000000000818:head [write 32768~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:54.920373+0100 currently delayed
2020-12-23T16:14:56.492887+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:98407 5.1b6 5:6deb6e6f:::rbd_data.cf37edf65bf59f.0000000000001d18:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:09:01.693417+0100 currently delayed
2020-12-23T16:14:56.492905+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:96803 5.1b6 5:6d96f34a:::rbd_data.cf37edf65bf59f.0000000000001b1c:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:19.537915+0100 currently waiting for sub ops
2020-12-23T16:14:56.492923+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:97522 5.1b6 5:6dec277f:::rbd_data.cf37edf65bf59f.0000000000001bbb:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:31.146953+0100 currently waiting for sub ops
2020-12-23T16:14:56.492935+0100 osd.57 [WRN] slow request osd_op(client.10437540.0:2942813 5.1b6 5:6de767e2:::rbd_data.450f3449e0b4c.00000000000000eb:head [write 3878912~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:08:27.815367+0100 currently delayed
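To dig into where these requests spend their time after the fact, I can also pull the historic ops from the admin socket of the OSD that logged them (osd.57 above), on the node that hosts it; dump_historic_slow_ops may not be available on every release, so I fall back to dump_historic_ops:
# ceph daemon osd.57 dump_historic_slow_ops    # recently completed ops that were flagged as slow
# ceph daemon osd.57 dump_historic_ops         # last ops with their full event timeline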
When I run: ceph daemon osd.33 dump_ops_in_flight
Some ops are delayed, last event: waiting for readable
Some ops are waiting for subops, last event: sub_op_commit_rec
"ops": [
{
"description": "osd_op(client.10475257.0:213323 5.11e 5:78ef0274:::rbd_header.85751850770c50:head [watch ping cookie 140573457871744 gen 2] snapc 0=[] ondisk+write+known_if_redirected e5619)",
"initiated_at": "2020-12-23T09:10:47.469872+0100",
"age": 307.06381716800001,
"duration": 307.06390489900002,
"type_data": {
"flag_point": "delayed",
"client_info": {
"client": "client.10475257",
"client_addr": "10.2.20.112:0/1863972183",
"tid": 213323
},
"events": [
{
"event": "initiated",
"time": "2020-12-23T09:10:47.469872+0100",
"duration": 0
},
{
"event": "throttled",
"time": "2020-12-23T09:10:47.469872+0100",
"duration": 2.8059999999999999e-06
},
{
"event": "header_read",
"time": "2020-12-23T09:10:47.469875+0100",
"duration": 5.2959999999999998e-06
},
{
"event": "all_read",
"time": "2020-12-23T09:10:47.469880+0100",
"duration": 6.0100000000000005e-07
},
{
"event": "dispatched",
"time": "2020-12-23T09:10:47.469880+0100",
"duration": 6.2639999999999997e-06
},
{
"event": "queued_for_pg",
"time": "2020-12-23T09:10:47.469887+0100",
"duration": 5.1799999999999999e-05
},
{
"event": "reached_pg",
"time": "2020-12-23T09:10:47.469939+0100",
"duration": 2.7971999999999999e-05
},
{
"event": "waiting for readable",
"time": "2020-12-23T09:10:47.469966+0100",
"duration": 9.4739000000000002e-05
}
]
}
},
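To get an overview across all OSDs on a node rather than querying one at a time, I summarize the flag_point of in-flight ops from each admin socket (assuming jq is installed; this has to be run on each node, since ceph daemon only reaches local sockets):
# for s in /var/run/ceph/ceph-osd.*.asok; do echo "== $s"; ceph daemon "$s" dump_ops_in_flight | jq -r '.ops[].type_data.flag_point' | sort | uniq -c; done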
Slow heartbeat warnings:
2020-12-23T17:04:24.087770+0100 mon.node1 [WRN] Health check failed: Slow OSD heartbeats on back (longest 1258.257ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:24.087853+0100 mon.node1 [WRN] Health check failed: Slow OSD heartbeats on front (longest 1060.077ms) (OSD_SLOW_PING_TIME_FRONT)
2020-12-23T17:04:24.087890+0100 mon.node1 [WRN] Health check failed: 1 slow ops, oldest one blocked for 31 sec, osd.55 has slow ops (SLOW_OPS)
2020-12-23T17:04:28.116156+0100 mon.node1 [INF] Health check cleared: SLOW_OPS (was: 1 slow ops, oldest one blocked for 31 sec, osd.55 has slow ops)
2020-12-23T17:04:30.179199+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on back (longest 1851.623ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:38.283704+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on front (longest 1295.957ms) (OSD_SLOW_PING_TIME_FRONT)
2020-12-23T17:04:54.435691+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on back (longest 2171.717ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:54.435744+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on front (longest 1901.092ms) (OSD_SLOW_PING_TIME_FRONT)
# ceph daemon /var/run/ceph/ceph-mgr.node1.asok dump_osd_network 0|more
{
"threshold": 0,
"entries": [
{
"last update": "Wed Dec 23 17:08:08 2020",
"stale": false,
"from osd": 62,
"to osd": 19,
"interface": "back",
"average": {
"1min": 0.709,
"5min": 480.570,
"15min": 179.110
},
"min": {
"1min": 0.548,
"5min": 0.524,
"15min": 0.502
},
"max": {
"1min": 0.917,
"5min": 13553.647,
"15min": 13553.647
},
"last": 0.663
},
{
"last update": "Wed Dec 23 17:07:53 2020",
"stale": false,
"from osd": 62,
"to osd": 18,
"interface": "back",
"average": {
"1min": 0.728,
"5min": 480.443,
"15min": 171.896
},
"min": {
"1min": 0.537,
"5min": 0.486,
"15min": 0.486
},
"max": {
"1min": 0.961,
"5min": 13553.526,
"15min": 13553.526
},
"last": 0.682
},
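The same dump can be filtered by passing a threshold instead of 0 (the values appear to be in milliseconds, so 1000 should leave only the bad pairs) and compacted with jq, assuming jq is installed, to one line per OSD pair:
# ceph daemon /var/run/ceph/ceph-mgr.node1.asok dump_osd_network 1000
# ceph daemon /var/run/ceph/ceph-mgr.node1.asok dump_osd_network 1000 | jq '.entries[] | {from: ."from osd", to: ."to osd", interface: .interface, max15min: .max."15min"}'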
InfiniBand tunables, per the Proxmox wiki recommendations:
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem=32768 131072 1280000
net.ipv4.tcp_rmem=32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
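Since I keep coming back to the IPoIB layer as a suspect, these are the checks I run on the InfiniBand interfaces; the interface name ib0 is just an example from my setup, and ibstat needs the infiniband-diags package:
# cat /sys/class/net/ib0/mode     # "connected" vs "datagram" IPoIB mode
# ip -s link show ib0             # MTU plus RX/TX error and drop counters
# ibstat                          # port state, rate and link width (infiniband-diags)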