CEPH causes Windows 2012 VM freezes

Raymond Burns

I have 4 CEPH servers; they all run:
CPU(s) --> 24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 Sockets)
Kernel Version --> Linux 4.4.40-1-pve #1 SMP Wed Feb 8 16:13:20 CET 2017
PVE Manager Version --> pve-manager/4.4-12/e71b7a74

My Ceph.conf:
Code:
cat /etc/ceph/ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = ##.##.###.87.0/24
         filestore xattr use omap = true
         fsid = ccf2ef03-0a0f-42dd-a553-##############
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 12288
         osd max backfills = 1
         osd pool default min size = 1
         osd recovery max active = 1
         public network = ##.##.###.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.1]
         host = pct2-prox-h
         mon addr = ##.##.###.27:6789

[mon.6]
         host = pct2-Prox-G
         mon addr = ##.##.###.26:6789

[mon.0]
         host = pct2-prox-e
         mon addr = ##.##.###..24:6789

[mon.5]
         host = pct2-prox-f
         mon addr = ##.##.###.25:6789

The errors that I get when the Windows VMs freeze:
Code:
2017-04-13 09:07:36.162869 mon.0 [INF] pgmap v31909319: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 7833 B/s wr, 2 op/s
2017-04-13 09:07:37.240578 mon.0 [INF] pgmap v31909320: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail
2017-04-13 09:07:28.984650 osd.52 [WRN] 4 slow requests, 1 included below; oldest blocked for > 69.549736 secs
2017-04-13 09:07:28.984658 osd.52 [WRN] slow request 60.048898 seconds old, received at 2017-04-13 09:06:28.935692: osd_op(client.89742621.1:181374135 9.c05e0889 rbd_data.1ad87a8238e1f29.0000000000001693 [read 3502080~4096] snapc 0=[] read e161505) currently waiting for rw locks
2017-04-13 09:07:33.833850 osd.127 [WRN] 4 slow requests, 1 included below; oldest blocked for > 58.140064 secs
2017-04-13 09:07:33.833855 osd.127 [WRN] slow request 30.795260 seconds old, received at 2017-04-13 09:07:03.038549: osd_op(client.89732041.1:580992642 9.398d4db4 rbd_data.1d68211238e1f29.00000000000013b5 [set-alloc-hint object_size 4194304 write_size 4194304,write 1069056~8192] snapc 0=[] ondisk+write e161505) currently waiting for subops from 44,72
2017-04-13 09:07:35.834153 osd.127 [WRN] 4 slow requests, 2 included below; oldest blocked for > 60.140352 secs
2017-04-13 09:07:35.834161 osd.127 [WRN] slow request 60.140352 seconds old, received at 2017-04-13 09:06:35.693745: osd_op(client.89742621.1:181374140 9.4e51373d rbd_data.1ad87a8238e1f29.0000000000002e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 2203648~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 3,59
2017-04-13 09:07:35.834171 osd.127 [WRN] slow request 60.140230 seconds old, received at 2017-04-13 09:06:35.693867: osd_op(client.89742621.1:181374141 9.4e51373d rbd_data.1ad87a8238e1f29.0000000000002e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 2215936~8192] snapc 0=[] ondisk+write e161505) currently waiting for subops from 3,59
2017-04-13 09:07:35.985652 osd.52 [WRN] 4 slow requests, 1 included below; oldest blocked for > 76.550747 secs
2017-04-13 09:07:35.985667 osd.52 [WRN] slow request 60.289877 seconds old, received at 2017-04-13 09:06:35.695723: osd_repop(client.89742621.1:181374142 9.567 9:e6bc7465:::rbd_data.1ad87a8238e1f29.0000000000003a93:head v 161505'414662) currently started
2017-04-13 09:07:36.040023 osd.33 [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.349378 secs
2017-04-13 09:07:36.040033 osd.33 [WRN] slow request 60.349378 seconds old, received at 2017-04-13 09:06:35.690568: osd_op(client.89742621.1:181374142 9.a62e3d67 rbd_data.1ad87a8238e1f29.0000000000003a93 [set-alloc-hint object_size 4194304 write_size 4194304,write 770048~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 46,52
2017-04-13 09:07:38.295496 mon.0 [INF] pgmap v31909321: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail
2017-04-13 09:07:33.484763 osd.71 [WRN] 3 slow requests, 1 included below; oldest blocked for > 74.982940 secs
2017-04-13 09:07:33.484774 osd.71 [WRN] slow request 30.043067 seconds old, received at 2017-04-13 09:07:03.441639: osd_op(client.89738833.1:218935633 9.d1ec8b52 rbd_data.9d2a3b238e1f29.000000000000052f [read 2957312~4096] snapc 0=[] read e161505) currently started
2017-04-13 09:07:39.018395 osd.114 [WRN] 5 slow requests, 4 included below; oldest blocked for > 79.583112 secs
2017-04-13 09:07:39.018403 osd.114 [WRN] slow request 30.714863 seconds old, received at 2017-04-13 09:07:08.303481: osd_op(client.89732041.1:580992649 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018407 osd.114 [WRN] slow request 30.714796 seconds old, received at 2017-04-13 09:07:08.303548: osd_op(client.89732041.1:580992650 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 860160~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018412 osd.114 [WRN] slow request 30.714743 seconds old, received at 2017-04-13 09:07:08.303601: osd_op(client.89732041.1:580992651 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 1167360~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018416 osd.114 [WRN] slow request 30.714959 seconds old, received at 2017-04-13 09:07:08.303385: osd_op(client.89732041.1:580992648 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 28672~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.331535 mon.0 [INF] pgmap v31909322: 2048 pgs: 1 active+clean+scrubbing+deep, 2047 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 5771 B/s rd, 7695 B/s wr, 4 op/s
2017-04-13 09:07:33.774701 osd.44 [WRN] 2 slow requests, 1 included below; oldest blocked for > 75.272316 secs
2017-04-13 09:07:33.774708 osd.44 [WRN] slow request 30.735683 seconds old, received at 2017-04-13 09:07:03.038959: osd_repop(client.89732041.1:580992642 9.5b4 9:2db2b19c:::rbd_data.1d68211238e1f29.00000000000013b5:head v 161505'585906) currently started
2017-04-13 09:07:34.361450 osd.93 [WRN] 6 slow requests, 3 included below; oldest blocked for > 75.334009 secs
2017-04-13 09:07:34.361455 osd.93 [WRN] slow request 30.440228 seconds old, received at 2017-04-13 09:07:03.921185: osd_op(client.89738833.1:218935636 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3465216~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:34.361461 osd.93 [WRN] slow request 30.440104 seconds old, received at 2017-04-13 09:07:03.921308: osd_op(client.89738833.1:218935637 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3674112~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:34.361464 osd.93 [WRN] slow request 30.440415 seconds old, received at 2017-04-13 09:07:03.920997: osd_op(client.89738833.1:218935635 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3407872~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:35.759245 osd.59 [WRN] 5 slow requests, 2 included below; oldest blocked for > 74.995369 secs
2017-04-13 09:07:35.759250 osd.59 [WRN] slow request 60.064942 seconds old, received at 2017-04-13 09:06:35.694254: osd_repop(client.89742621.1:181374140 9.73d 9:bcec8a72:::rbd_data.1ad87a8238e1f29.0000000000002e01:head v 161505'908763) currently started
2017-04-13 09:07:35.759256 osd.59 [WRN] slow request 60.064298 seconds old, received at 2017-04-13 09:06:35.694898: osd_repop(client.89742621.1:181374141 9.73d 9:bcec8a72:::rbd_data.1ad87a8238e1f29.0000000000002e01:head v 161505'908764) currently started
2017-04-13 09:07:35.946268 osd.74 [WRN] 5 slow requests, 3 included below; oldest blocked for > 77.568043 secs
2017-04-13 09:07:35.946277 osd.74 [WRN] slow request 60.426785 seconds old, received at 2017-04-13 09:06:35.519389: osd_repop(client.89732041.1:580992605 9.7a7 9:e5f450fe:::rbd_data.5cbd322ae8944a.0000000000068f1b:head v 161505'322945) currently started
2017-04-13 09:07:35.946285 osd.74 [WRN] slow request 60.426784 seconds old, received at 2017-04-13 09:06:35.519390: osd_repop(client.89732041.1:580992603 9.5a 9:5a0c5ed8:::rbd_data.5cbd322ae8944a.000000000001930e:head v 161505'1015212) currently started
2017-04-13 09:07:35.946291 osd.74 [WRN] slow request 60.426357 seconds old, received at 2017-04-13 09:06:35.519817: osd_repop(client.89732041.1:580992604 9.5a 9:5a0c5ed8:::rbd_data.5cbd322ae8944a.000000000001930e:head v 161505'1015213) currently started
2017-04-13 09:07:36.051909 osd.76 [INF] 9.4d2 deep-scrub starts
2017-04-13 09:07:38.224054 osd.123 [WRN] 1 slow requests, 1 included below; oldest blocked for > 60.063173 secs
2017-04-13 09:07:38.224069 osd.123 [WRN] slow request 60.063173 seconds old, received at 2017-04-13 09:06:38.160850: osd_op(client.89732041.1:580992607 9.5ed19b59 rbd_data.1d68211238e1f29.0000000000000094 [set-alloc-hint object_size 4194304 write_size 4194304,write 2269184~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 68,106
2017-04-13 09:07:38.630063 osd.64 [WRN] 4 slow requests, 1 included below; oldest blocked for > 79.217787 secs
2017-04-13 09:07:38.630072 osd.64 [WRN] slow request 60.537214 seconds old, received at 2017-04-13 09:06:38.092795: osd_repop(client.89742621.1:181374148 9.6b9 9:9d64fe07:::rbd_data.b86bb1238e1f29.0000000000001016:head v 161505'451185) currently started
2017-04-13 09:07:38.989690 osd.48 [WRN] 3 slow requests, 1 included below; oldest blocked for > 79.581230 secs
2017-04-13 09:07:38.989785 osd.48 [WRN] slow request 60.899234 seconds old, received at 2017-04-13 09:06:38.090366: osd_op(client.89742621.1:181374150 9.760aceed rbd_data.b86bb1238e1f29.0000000000001056 [set-alloc-hint object_size 4194304 write_size 4194304,write 3788800~65536] snapc 0=[] ondisk+write e161505) currently waiting for subops from 67,126
2017-04-13 09:07:40.363151 mon.0 [INF] pgmap v31909323: 2048 pgs: 1 active+clean+scrubbing+deep, 2047 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 5889 B/s rd, 7852 B/s wr, 4 op/s

My CEPH health is good:
Code:
# ceph -s
    cluster ccf2ef03-0a0f-42dd-a553-7############
     health HEALTH_WARN
            120 requests are blocked > 32 sec
            crush map has legacy tunables (require bobtail, min is firefly)
            all OSDs are running jewel or later but the 'require_jewel_osds' osdmap flag is not set
     monmap e23: 4 mons at {0=##.##.###.24:6789/0,1=##.##.###..27:6789/0,5=##.##.###.25:6789/0,6=##.##.###.26:6789/0}
            election epoch 9926, quorum 0,1,2,3 0,5,6,1
     osdmap e161505: 128 osds: 128 up, 128 in
      pgmap v31909375: 2048 pgs, 1 pools, 43408 GB data, 10878 kobjects
            127 TB used, 228 TB / 356 TB avail
                2047 active+clean
                   1 active+clean+scrubbing+deep
  client io 11597 kB/s rd, 1095 kB/s wr, 349 op/s rd, 248 op/s wr
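The per-OSD breakdown of those blocked requests can be pulled with the standard health command (nothing cluster-specific, just for reference):
Code:
# list the slow/blocked requests and the OSDs reporting them
ceph health detail | grep -i blocked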

My CEPH performance test results:
Code:
rados -p Ceph3 bench 60 write -t 8

  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60       8      3192      3184   212.239       240   0.0713033    0.150371
Total time run:         60.634442
Total writes made:      3192
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     210.573
Stddev Bandwidth:       39.9109
Max bandwidth (MB/sec): 296
Min bandwidth (MB/sec): 116
Average IOPS:           52
Stddev IOPS:            9
Max IOPS:               74
Min IOPS:               29
Average Latency(s):     0.151147
Stddev Latency(s):      0.160218
Max latency(s):         2.62365
Min latency(s):         0.0440302
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :22.218130
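Note that rados bench defaults to 4 MB writes, while the slow requests in the log above are mostly 4 KB writes, so a small-block run may be closer to what the VMs actually do. A sketch with the same tool, just a smaller write size:
Code:
# 4 KB writes, 8 concurrent ops, against the same pool
rados -p Ceph3 bench 60 write -b 4096 -t 8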

The config for the Windows VM is:
Code:
# cat /etc/pve/qemu-server/103.conf
#
#Windows Server 2012
agent: 1
boot: dcn
bootdisk: virtio0
cores: 6
memory: 6144
name: #####
net0: virtio=66:62:36:30:66:66,bridge=vmbr0
numa: 0
onboot: 1
ostype: win8
protection: 1
smbios1: uuid=da877aac-3ec4-41d9-a8cf-48c90d0548a2
sockets: 1
virtio0: Ceph3:vm-103-disk-2,cache=writeback,size=60G
virtio1: Ceph3:vm-103-disk-1,backup=0,cache=writeback,size=7500G
virtio2: Ceph3:vm-103-disk-4,backup=0,cache=writeback,size=8000G
virtio3: Ceph3:vm-103-disk-3,backup=0,cache=writeback,size=60G

The error in the Windows Event Viewer that relates to this freeze:
Code:
Log Name:      Application
Source:        ESENT
Date:          4/13/2017 8:39:42 AM
Event ID:      533
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      ######.######.####.###
Description:
svchost (1744) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 1617920 (0x000000000018b000) for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">533</EventID>
    <Level>3</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2017-04-13T13:39:42.000000000Z" />
    <EventRecordID>17175</EventRecordID>
    <Channel>Application</Channel>
    <Computer>######.#####.###.##.###</Computer>
    <Security />
  </System>
  <EventData>
    <Data>svchost</Data>
    <Data>1744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Svc.log</Data>
    <Data>1617920 (0x000000000018b000)</Data>
    <Data>4096 (0x00001000)</Data>
    <Data>36</Data>
  </EventData>
</Event>

I have a ZFS storage connection that does not generate this error. If you need any more info, please let me know. This issue is an emergency; I am getting corrupt databases within my Windows VMs.
 
I have run two more CEPH RADOS benchmark tests:
Code:
# rados -p Ceph3 bench 60 write -t 24
Maintaining 24 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects

2017-04-13 09:23:00.793229 min lat: 0.0488313 max lat: 2.44422 avg lat: 0.226033
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      24      6380      6356   423.673       420    0.356573    0.226033
   61      19      6380      6361   417.055        20    0.449474    0.226161
Total time run:         61.716950
Total writes made:      6380
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     413.501
Stddev Bandwidth:       94.6689
Max bandwidth (MB/sec): 624
Min bandwidth (MB/sec): 20
Average IOPS:           103
Stddev IOPS:            23
Max IOPS:               156
Min IOPS:               5
Average Latency(s):     0.231071
Stddev Latency(s):      0.233439
Max latency(s):         2.44422
Min latency(s):         0.0488313
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :21.217585

and the other:
Code:
# rados -p Ceph3 bench 60 write -t 48
Maintaining 48 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects

2017-04-13 09:25:00.610383 min lat: 0.189367 max lat: 2.65904 avg lat: 0.430375
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      47      6687      6640   442.614       352    0.483863    0.430375
   61      32      6688      6656   436.407        64     1.16417    0.431048
Total time run:         61.402467
Total writes made:      6688
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     435.683
Stddev Bandwidth:       106.741
Max bandwidth (MB/sec): 684
Min bandwidth (MB/sec): 64
Average IOPS:           108
Stddev IOPS:            26
Max IOPS:               171
Min IOPS:               16
Average Latency(s):     0.437381
Stddev Latency(s):      0.260999
Max latency(s):         2.65904
Min latency(s):         0.189367
Cleaning up (deleting benchmark objects)
 
Hi Raymond,
it looks like an OSD (journal?) is not fast enough to write the data. Do you see higher latencies with
Code:
ceph osd perf
on single disks? And is it the same with deep-scrub disabled?
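Deep-scrub can be paused cluster-wide for such a test with the standard flags (remember to unset them afterwards); a quick sketch:
Code:
# temporarily stop new scrubs/deep-scrubs
ceph osd set noscrub
ceph osd set nodeep-scrub
# re-enable when the test is done
ceph osd unset noscrub
ceph osd unset nodeep-scrub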

Your perf values for 120 OSDs suggest that you don't use journal SSDs?! Are all journals on the OSDs?

And you have 120 OSDs in 4 servers?? How much RAM do you have?

Your tunables show that you started with an old Ceph version. Which version do you have running now?
Perhaps the Easter weekend is a good time to set your tunables to hammer?! But be careful: if you have a faulty OSD you should replace it first. And you should add some tuning to your ceph.conf to avoid big trouble for your clients.
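Checking and changing the tunables is done with the standard CRUSH commands, roughly like this - but again, the change triggers a lot of rebalancing, so plan for it:
Code:
# show the currently active tunables profile
ceph osd crush show-tunables
# switch to the hammer profile (expect heavy data movement!)
ceph osd crush tunables hammer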

BTW, four mons are suboptimal - use an odd number (three are enough).
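Removing one mon to get back to an odd number is roughly this sketch (assuming the PVE 4.x tooling; mon.5 is just an example, pick the one you want to retire):
Code:
# via the Proxmox tooling, run on the node hosting that mon
pveceph destroymon 5
# or with plain Ceph: stop the mon daemon on that node first, then
ceph mon remove 5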

Udo
 
Code:
ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
136                     9                    9
135                    22                   22
134                    38                   38
 45                   171                  174
 44                    77                   84
133                   101                  105
132                    47                   47
130                   109                  116
 84                    20                   22
 15                    68                   78
 99                    64                   76
 98                    46                   56
 97                    10                   10
 96                    10                   10
 95                    28                   43
 94                    54                   58
 93                    40                   43
 92                    18                   20
 91                    19                   20
 90                    37                   41
  9                     7                    9
 89                    40                   44
 88                    31                   34
 87                    43                   42
 86                    27                   30
 83                    35                   39
 82                    44                   53
 81                    81                   87
 80                    11                   11
  8                    10                   16
 79                    15                   17
 78                    46                   51
 77                    11                   11
 76                    16                   19
 75                    70                   75
 74                    29                   29
 73                     9                   11
 72                     6                    7
 71                    33                   37
 70                    19                   21
  7                    13                   13
 69                    22                   23
  2                    15                   17
129                    15                   16
 32                    11                   12
128                     7                    7
 31                   113                  137
127                    54                   59
 30                    16                   20
126                    23                   26
 29                    58                   67
125                    50                   54
 28                    49                   67
124                    38                   43
123                    48                   52
 26                    67                   68
122                    54                   58
 25                    20                   21
121                    19                   19
 24                    10                   10
120                    39                   39
 23                     7                    8
119                     7                    7
 22                    56                   66
118                    69                   80
 21                    33                   41
 12                    29                   32
109                    56                   81
106                    16                   17
103                    17                   20
105                    79                   86
 11                    12                   13
108                    12                   13
102                   144                  158
  5                    11                   12
104                    16                   17
 10                   183                  205
107                    30                   31
101                    46                   54
  4                    13                   15
100                   146                  153
  3                    80                   93
  1                   111                  123
116                    73                   77
 19                     7                    8
  0                   130                  137
115                    39                   42
 18                    14                   15
 13                     5                   12
110                    43                   49
 14                    37                   40
111                   100                  111
112                    82                   93
 16                    40                   50
113                    67                   69
 17                    24                   27
114                    64                   68
 20                   174                  191
117                    12                   13
 33                    20                   28
 34                    26                   32
 35                    13                   14
 36                    67                   86
 37                    41                   49
 38                    19                   22
 39                    10                   12
 40                    15                   18
 41                    14                   18
 46                     6                    7
 47                    17                   19
 48                    56                   70
 49                    12                   13
 50                    31                   32
 52                    50                   54
 53                    19                   24
 54                    57                   59
 56                    17                   21
 57                    21                   23
 58                    47                   49
 59                   105                  112
 60                    48                   52
 61                    15                   23
 62                    46                   57
 63                   102                  109
 64                    36                   38
 65                    26                   29
 67                    47                   53
 68                    26                   28
Yes, all journals are on the SSD.
We had horrible luck using Kingston SSDs with 2 journals on each SSD. When an SSD failed, we lost 8TB OSDs every time, causing very long rebuilds.
The SSDs seemed to fail very fast. So now the journals are on the OSDs.
Also, with the journal on the OSD, I am able to migrate the entire disk easily to a different server. That's not the case with SSD journaling.
Even still, the performance is there for a couple of Windows VMs.
I will set the tunables and report back with my findings.
 
Code:
ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
136                     9                    9
135                    22                   22
134                    38                   38
 45                   171                  174
...
Hi,
is the perf output with or without deep-scrub enabled?
Yes, all journals are on the SSD.
We had horrible luck using Kingston SSDs with 2 journals on each SSD. When an SSD failed, we lost 8TB OSDs every time, causing very long rebuilds.
The SSDs seemed to fail very fast. So now the journals are on the OSDs.
What kind of SSDs do you use? It sounds like no DC-grade drives are in use! I guess that is the issue behind your slow writes.
Use something like the Intel DC S3700 (or the cheaper DC S3610).
Also, with the journal on the OSD, I am able to migrate the entire disk easily to a different server. That's not the case with SSD journaling.
Even still, the performance is there for a couple of Windows VMs.
Now I'm lost - do you use SSD journaling, or not?!
I will set the tunables and report back with my findings.
!!ATTENTION!! Do not do this before you have sorted out your journal SSDs and enabled some tunings!!
Otherwise you can lose data!! This can produce a lot of data movement - and I mean really a lot (like 70% of your OSD data).
If you use consumer SSDs and they break during this, you can LOSE ALL DATA!

Unfortunately you haven't answered all the questions - RAM/OSDs on the 4 nodes?
With 120 8TB OSDs on 4 nodes you will run into trouble!!

Udo
 
The perf is with deep-scrub enabled. I have not done anything to disable deep-scrub.

I do not use SSD journals. All journals are on the OSD disks.
Each node has 74GB of usable DDR4 RAM or more.
Each node started with 36 drives, but some have been removed because of bad SMART status. Whenever a drive shows SMART errors, I reweight the OSD to 0.0000 and then remove it. So now I have 120 drives left (31 on node A, 30 on node B, 30 on node C, 29 on node D).
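For reference, the sequence I follow is roughly the standard drain-and-remove steps, sketched here with osd.N standing in for the failing OSD:
Code:
# drain the OSD by taking its CRUSH weight to zero, then wait for rebalancing
ceph osd crush reweight osd.N 0
# once the data has moved off, take it out and remove it for good
ceph osd out N
systemctl stop ceph-osd@N      # on the node hosting the OSD
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N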

I did not have good SSDs to start with, so I abandoned SSD journals and just keep the journal on each OSD disk.

You mention enabling some tunings. I have not read about anything specific that would help other than limiting the number of active recovery operations per OSD to reduce load. Do you have a link to additional tunings I can read up on?
 
The perf is with deep-scrub enabled. I have not done anything to disable deep-scrub.

I do not use SSD journals. All journals are on the OSD disks.
Each node has 74GB of usable DDR4 RAM or more.
Each node started with 36 drives, but some have been removed because of bad SMART status. Whenever a drive shows SMART errors, I reweight the OSD to 0.0000 and then remove it. So now I have 120 drives left (31 on node A, 30 on node B, 30 on node C, 29 on node D).
IMHO that is too much. I had a cluster with 12 x 4TB OSDs per node (48GB RAM, Ceph only, without virtualisation), which was not optimal but OK.
Ceph performs better with more OSD nodes - so if you add nodes and reduce your OSD/node ratio, you will speed up your IO.
I did not have good SSDs to start with, so I abandoned SSD journals and just keep the journal on each OSD disk.
ok.
You mention enabling some tunings. I have not read about anything specific that would help other than limiting the number of active recovery operations per OSD to reduce load. Do you have a link to additional tunings I can read up on?
The most important ones (osd max backfills + osd recovery max active) are already set.
Other helpful settings (the op_threads values depend on the configuration and should perhaps be tested on your own cluster):
Code:
[OSD]
osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
osd mkfs options xfs = "-f -i size=2048"

osd_scrub_load_threshold = 2.5

osd_op_threads = 4
osd_disk_threads = 1

filestore_op_threads = 4
osd_enable_op_tracker = false

osd_disk_thread_ioprio_class  = idle
osd_disk_thread_ioprio_priority = 7

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
Some settings can be injected so that they take effect at runtime.
The mount options (helpful for big OSDs) can be set manually with "mount -o remount ...".
They help avoid fragmentation, which is perhaps a bit late now, but can still be helpful before you change the tunables.
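For example (a sketch - adjust the values and the mount point to your own OSDs; some XFS options only take effect on a fresh mount):
Code:
# inject a setting into all running OSDs without restarting them
ceph tell osd.* injectargs '--osd_scrub_load_threshold 2.5'
# remount an OSD filesystem with the new options (here osd.0)
mount -o remount,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M /var/lib/ceph/osd/ceph-0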

Udo
 
I already had (osd max backfills + osd recovery max active) set.
It actually turned out to be an issue with DFS and bad tuning on the Windows side with regard to the file system freezing.
The errors about the storage controller resetting were still present, but after we solved the DFS issue, the system and file shares worked just fine, even with those error events in the Windows logs.

As soon as I dealt with these two warnings:
Code:
           crush map has legacy tunables (require bobtail, min is firefly)
           all OSDs are running jewel or later but the 'require_jewel_osds' osdmap flag is not set
the Proxmox CEPH logs cleaned up with regard to slow requests.
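For anyone finding this thread later: the flag side of that is a single command (the tunables change is the one Udo sketched above, and it does trigger rebalancing):
Code:
# only once every OSD in the cluster runs jewel or newer
ceph osd set require_jewel_osds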

My current ceph.conf:
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network =  xx.xxx.xxx.xxx/xx
         filestore xattr use omap = true
         fsid = ccf2ef03-0axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 12288
         osd max backfills = 1
         osd pool default min size = 1
         osd recovery max active = 1
         public network = 1

Those other tuning options didn't seem to have a great effect on performance.

My end goal is to have 7 duplicate CEPH systems with 36 OSDs each. Not so much for speed as for stability. Being able to have a disk fail at any time with little to no impact on the system is AWESOME! Even with a good ZFS setup, I had problems with memory going bad and such, which resulted in hours of downtime. With CEPH I can lose entire servers at a time and still have all of my VMs operational. The stability is what matters most!
 
It actually turned out to be an issue with DFS and bad tuning on the Windows side with regard to the file system freezing.
The errors about the storage controller resetting were still present, but after we solved the DFS issue, the system and file shares worked just fine, even with those error events in the Windows logs.
Hi,
can you explain the issue with DFS and the bad tuning?
I need DFS in the next few weeks and hope to use it without trouble.

Udo
 
Yep!
We had to go a little outside the box on this one. All of our users have shared personal folders on a DFS namespace split between two separate CEPH systems at two different locations, plus 3 other standalone bare-metal servers.
Since DFS Replication would not stay stable for us, we used Resilio, better known as BTSync, to keep the files in sync. That part worked BEAUTIFULLY!

One problem we had was that our main CEPH system had recently been upgraded to Jewel. None of the tunables were set, and not all of the disks were operating at 100%: 4 disks had SMART errors that we thought we could ignore.

We have over 500GB in .pst files.
When one DFS target server disconnects, there is a small timeout window and then the next DFS target server picks up. When the original server comes back, the namespace flips back to the primary server.

The issue came in with the .pst files flapping back and forth.

CEPH would hang up for about 20 seconds due to the tunables not being set, and definitely due to the drives with SMART issues. That 20 seconds would flip the namespace to a secondary remote server, which is the way it is supposed to happen; however, .pst files and Outlook cannot work that way. So the user computers would freeze up as the .pst files were disconnected.

The SMART error that I thought I could ignore was:
Code:
The following warning/error was logged by the smartd daemon:
Device: /dev/sdl [SAT], ATA error count increased from 30069 to 30075
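A quick way to keep an eye on whether that error count keeps climbing is the standard smartmontools output for the same device:
Code:
# ATA error log and attribute table for the drive from the warning above
smartctl -l error /dev/sdl
smartctl -A /dev/sdl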

So overall, no problems with DFS itself, other than user ignorance of DFS Replication, in addition to using .pst files in the DFS system.
 