I have 4 CEPH servers, all running:
CPU(s) --> 24 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 Sockets)
Kernel Version --> Linux 4.4.40-1-pve #1 SMP Wed Feb 8 16:13:20 CET 2017
PVE Manager Version --> pve-manager/4.4-12/e71b7a74
My Ceph.conf:
Code:
cat /etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = ##.##.###.87.0/24
filestore xattr use omap = true
fsid = ccf2ef03-0a0f-42dd-a553-##############
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 12288
osd max backfills = 1
osd pool default min size = 1
osd recovery max active = 1
public network = ##.##.###.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.1]
host = pct2-prox-h
mon addr = ##.##.###.27:6789
[mon.6]
host = pct2-Prox-G
mon addr = ##.##.###.26:6789
[mon.0]
host = pct2-prox-e
mon addr = ##.##.###..24:6789
[mon.5]
host = pct2-prox-f
mon addr = ##.##.###.25:6789
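The config throttles recovery with osd max backfills = 1 and osd recovery max active = 1. If it matters, I believe the values each OSD is actually running with can be read from its admin socket on the node that hosts it (osd.52 below is just an example):
Code:
# run on the node hosting the OSD in question
ceph daemon osd.52 config get osd_max_backfills
ceph daemon osd.52 config get osd_recovery_max_active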
The errors that I get when the Windows VM freezes:
Code:
2017-04-13 09:07:36.162869 mon.0 [INF] pgmap v31909319: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 7833 B/s wr, 2 op/s
2017-04-13 09:07:37.240578 mon.0 [INF] pgmap v31909320: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail
2017-04-13 09:07:28.984650 osd.52 [WRN] 4 slow requests, 1 included below; oldest blocked for > 69.549736 secs
2017-04-13 09:07:28.984658 osd.52 [WRN] slow request 60.048898 seconds old, received at 2017-04-13 09:06:28.935692: osd_op(client.89742621.1:181374135 9.c05e0889 rbd_data.1ad87a8238e1f29.0000000000001693 [read 3502080~4096] snapc 0=[] read e161505) currently waiting for rw locks
2017-04-13 09:07:33.833850 osd.127 [WRN] 4 slow requests, 1 included below; oldest blocked for > 58.140064 secs
2017-04-13 09:07:33.833855 osd.127 [WRN] slow request 30.795260 seconds old, received at 2017-04-13 09:07:03.038549: osd_op(client.89732041.1:580992642 9.398d4db4 rbd_data.1d68211238e1f29.00000000000013b5 [set-alloc-hint object_size 4194304 write_size 4194304,write 1069056~8192] snapc 0=[] ondisk+write e161505) currently waiting for subops from 44,72
2017-04-13 09:07:35.834153 osd.127 [WRN] 4 slow requests, 2 included below; oldest blocked for > 60.140352 secs
2017-04-13 09:07:35.834161 osd.127 [WRN] slow request 60.140352 seconds old, received at 2017-04-13 09:06:35.693745: osd_op(client.89742621.1:181374140 9.4e51373d rbd_data.1ad87a8238e1f29.0000000000002e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 2203648~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 3,59
2017-04-13 09:07:35.834171 osd.127 [WRN] slow request 60.140230 seconds old, received at 2017-04-13 09:06:35.693867: osd_op(client.89742621.1:181374141 9.4e51373d rbd_data.1ad87a8238e1f29.0000000000002e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 2215936~8192] snapc 0=[] ondisk+write e161505) currently waiting for subops from 3,59
2017-04-13 09:07:35.985652 osd.52 [WRN] 4 slow requests, 1 included below; oldest blocked for > 76.550747 secs
2017-04-13 09:07:35.985667 osd.52 [WRN] slow request 60.289877 seconds old, received at 2017-04-13 09:06:35.695723: osd_repop(client.89742621.1:181374142 9.567 9:e6bc7465:::rbd_data.1ad87a8238e1f29.0000000000003a93:head v 161505'414662) currently started
2017-04-13 09:07:36.040023 osd.33 [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.349378 secs
2017-04-13 09:07:36.040033 osd.33 [WRN] slow request 60.349378 seconds old, received at 2017-04-13 09:06:35.690568: osd_op(client.89742621.1:181374142 9.a62e3d67 rbd_data.1ad87a8238e1f29.0000000000003a93 [set-alloc-hint object_size 4194304 write_size 4194304,write 770048~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 46,52
2017-04-13 09:07:38.295496 mon.0 [INF] pgmap v31909321: 2048 pgs: 2048 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail
2017-04-13 09:07:33.484763 osd.71 [WRN] 3 slow requests, 1 included below; oldest blocked for > 74.982940 secs
2017-04-13 09:07:33.484774 osd.71 [WRN] slow request 30.043067 seconds old, received at 2017-04-13 09:07:03.441639: osd_op(client.89738833.1:218935633 9.d1ec8b52 rbd_data.9d2a3b238e1f29.000000000000052f [read 2957312~4096] snapc 0=[] read e161505) currently started
2017-04-13 09:07:39.018395 osd.114 [WRN] 5 slow requests, 4 included below; oldest blocked for > 79.583112 secs
2017-04-13 09:07:39.018403 osd.114 [WRN] slow request 30.714863 seconds old, received at 2017-04-13 09:07:08.303481: osd_op(client.89732041.1:580992649 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 65536~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018407 osd.114 [WRN] slow request 30.714796 seconds old, received at 2017-04-13 09:07:08.303548: osd_op(client.89732041.1:580992650 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 860160~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018412 osd.114 [WRN] slow request 30.714743 seconds old, received at 2017-04-13 09:07:08.303601: osd_op(client.89732041.1:580992651 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 1167360~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.018416 osd.114 [WRN] slow request 30.714959 seconds old, received at 2017-04-13 09:07:08.303385: osd_op(client.89732041.1:580992648 9.333209a4 rbd_data.576ef49238e1f29.0000000000003400 [set-alloc-hint object_size 4194304 write_size 4194304,write 28672~4096] snapc a9=[] ondisk+write e161505) currently waiting for subops from 68,103
2017-04-13 09:07:39.331535 mon.0 [INF] pgmap v31909322: 2048 pgs: 1 active+clean+scrubbing+deep, 2047 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 5771 B/s rd, 7695 B/s wr, 4 op/s
2017-04-13 09:07:33.774701 osd.44 [WRN] 2 slow requests, 1 included below; oldest blocked for > 75.272316 secs
2017-04-13 09:07:33.774708 osd.44 [WRN] slow request 30.735683 seconds old, received at 2017-04-13 09:07:03.038959: osd_repop(client.89732041.1:580992642 9.5b4 9:2db2b19c:::rbd_data.1d68211238e1f29.00000000000013b5:head v 161505'585906) currently started
2017-04-13 09:07:34.361450 osd.93 [WRN] 6 slow requests, 3 included below; oldest blocked for > 75.334009 secs
2017-04-13 09:07:34.361455 osd.93 [WRN] slow request 30.440228 seconds old, received at 2017-04-13 09:07:03.921185: osd_op(client.89738833.1:218935636 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3465216~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:34.361461 osd.93 [WRN] slow request 30.440104 seconds old, received at 2017-04-13 09:07:03.921308: osd_op(client.89738833.1:218935637 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3674112~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:34.361464 osd.93 [WRN] slow request 30.440415 seconds old, received at 2017-04-13 09:07:03.920997: osd_op(client.89738833.1:218935635 9.44570db rbd_data.5c8c462ae8944a.0000000000001060 [set-alloc-hint object_size 4194304 write_size 4194304,write 3407872~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 90,132
2017-04-13 09:07:35.759245 osd.59 [WRN] 5 slow requests, 2 included below; oldest blocked for > 74.995369 secs
2017-04-13 09:07:35.759250 osd.59 [WRN] slow request 60.064942 seconds old, received at 2017-04-13 09:06:35.694254: osd_repop(client.89742621.1:181374140 9.73d 9:bcec8a72:::rbd_data.1ad87a8238e1f29.0000000000002e01:head v 161505'908763) currently started
2017-04-13 09:07:35.759256 osd.59 [WRN] slow request 60.064298 seconds old, received at 2017-04-13 09:06:35.694898: osd_repop(client.89742621.1:181374141 9.73d 9:bcec8a72:::rbd_data.1ad87a8238e1f29.0000000000002e01:head v 161505'908764) currently started
2017-04-13 09:07:35.946268 osd.74 [WRN] 5 slow requests, 3 included below; oldest blocked for > 77.568043 secs
2017-04-13 09:07:35.946277 osd.74 [WRN] slow request 60.426785 seconds old, received at 2017-04-13 09:06:35.519389: osd_repop(client.89732041.1:580992605 9.7a7 9:e5f450fe:::rbd_data.5cbd322ae8944a.0000000000068f1b:head v 161505'322945) currently started
2017-04-13 09:07:35.946285 osd.74 [WRN] slow request 60.426784 seconds old, received at 2017-04-13 09:06:35.519390: osd_repop(client.89732041.1:580992603 9.5a 9:5a0c5ed8:::rbd_data.5cbd322ae8944a.000000000001930e:head v 161505'1015212) currently started
2017-04-13 09:07:35.946291 osd.74 [WRN] slow request 60.426357 seconds old, received at 2017-04-13 09:06:35.519817: osd_repop(client.89732041.1:580992604 9.5a 9:5a0c5ed8:::rbd_data.5cbd322ae8944a.000000000001930e:head v 161505'1015213) currently started
2017-04-13 09:07:36.051909 osd.76 [INF] 9.4d2 deep-scrub starts
2017-04-13 09:07:38.224054 osd.123 [WRN] 1 slow requests, 1 included below; oldest blocked for > 60.063173 secs
2017-04-13 09:07:38.224069 osd.123 [WRN] slow request 60.063173 seconds old, received at 2017-04-13 09:06:38.160850: osd_op(client.89732041.1:580992607 9.5ed19b59 rbd_data.1d68211238e1f29.0000000000000094 [set-alloc-hint object_size 4194304 write_size 4194304,write 2269184~4096] snapc 0=[] ondisk+write e161505) currently waiting for subops from 68,106
2017-04-13 09:07:38.630063 osd.64 [WRN] 4 slow requests, 1 included below; oldest blocked for > 79.217787 secs
2017-04-13 09:07:38.630072 osd.64 [WRN] slow request 60.537214 seconds old, received at 2017-04-13 09:06:38.092795: osd_repop(client.89742621.1:181374148 9.6b9 9:9d64fe07:::rbd_data.b86bb1238e1f29.0000000000001016:head v 161505'451185) currently started
2017-04-13 09:07:38.989690 osd.48 [WRN] 3 slow requests, 1 included below; oldest blocked for > 79.581230 secs
2017-04-13 09:07:38.989785 osd.48 [WRN] slow request 60.899234 seconds old, received at 2017-04-13 09:06:38.090366: osd_op(client.89742621.1:181374150 9.760aceed rbd_data.b86bb1238e1f29.0000000000001056 [set-alloc-hint object_size 4194304 write_size 4194304,write 3788800~65536] snapc 0=[] ondisk+write e161505) currently waiting for subops from 67,126
2017-04-13 09:07:40.363151 mon.0 [INF] pgmap v31909323: 2048 pgs: 1 active+clean+scrubbing+deep, 2047 active+clean; 43408 GB data, 127 TB used, 228 TB / 356 TB avail; 5889 B/s rd, 7852 B/s wr, 4 op/s
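To see whether particular OSDs keep reporting these warnings, I can tally them from the cluster log with a quick one-liner (assuming the default log location /var/log/ceph/ceph.log on a monitor node):
Code:
# count how many slow request warnings each OSD has reported
grep 'slow request' /var/log/ceph/ceph.log | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn | head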
My CEPH health is otherwise good, apart from the blocked requests and two osdmap warnings:
Code:
# ceph -s
cluster ccf2ef03-0a0f-42dd-a553-7############
health HEALTH_WARN
120 requests are blocked > 32 sec
crush map has legacy tunables (require bobtail, min is firefly)
all OSDs are running jewel or later but the 'require_jewel_osds' osdmap flag is not set
monmap e23: 4 mons at {0=##.##.###.24:6789/0,1=##.##.###..27:6789/0,5=##.##.###.25:6789/0,6=##.##.###.26:6789/0}
election epoch 9926, quorum 0,1,2,3 0,5,6,1
osdmap e161505: 128 osds: 128 up, 128 in
pgmap v31909375: 2048 pgs, 1 pools, 43408 GB data, 10878 kobjects
127 TB used, 228 TB / 356 TB avail
2047 active+clean
1 active+clean+scrubbing+deep
client io 11597 kB/s rd, 1095 kB/s wr, 349 op/s rd, 248 op/s wr
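The two osdmap warnings (legacy tunables and the missing require_jewel_osds flag) are still outstanding. I have not changed anything yet, but my understanding is that they would be cleared with the commands below, and that the tunables change can trigger a large rebalance, so I want to confirm before touching it:
Code:
ceph osd set require_jewel_osds       # declare that all OSDs run jewel or later
ceph osd crush tunables optimal       # update the legacy CRUSH tunables; can cause heavy data movement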
My CEPH performance test results:
Code:
rados -p Ceph3 bench 60 write -t 8
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
60 8 3192 3184 212.239 240 0.0713033 0.150371
Total time run: 60.634442
Total writes made: 3192
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 210.573
Stddev Bandwidth: 39.9109
Max bandwidth (MB/sec): 296
Min bandwidth (MB/sec): 116
Average IOPS: 52
Stddev IOPS: 9
Max IOPS: 74
Min IOPS: 29
Average Latency(s): 0.151147
Stddev Latency(s): 0.160218
Max latency(s): 2.62365
Min latency(s): 0.0440302
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :22.218130
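This bench uses the default 4 MB object size, while the slow requests above are mostly 4 KB and 8 KB writes. If a small-block run would be more representative, I can repeat the test with something like:
Code:
# 4 KB writes for 60 seconds with 8 concurrent ops, closer to the VM's I/O pattern
rados -p Ceph3 bench 60 write -t 8 -b 4096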
The config for the Windows VM is:
Code:
# cat /etc/pve/qemu-server/103.conf
#
#Windows Server 2012
agent: 1
boot: dcn
bootdisk: virtio0
cores: 6
memory: 6144
name: #####
net0: virtio=66:62:36:30:66:66,bridge=vmbr0
numa: 0
onboot: 1
ostype: win8
protection: 1
smbios1: uuid=da877aac-3ec4-41d9-a8cf-48c90d0548a2
sockets: 1
virtio0: Ceph3:vm-103-disk-2,cache=writeback,size=60G
virtio1: Ceph3:vm-103-disk-1,backup=0,cache=writeback,size=7500G
virtio2: Ceph3:vm-103-disk-4,backup=0,cache=writeback,size=8000G
virtio3: Ceph3:vm-103-disk-3,backup=0,cache=writeback,size=60G
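All four disks are RBD images on the Ceph3 pool with cache=writeback. If it helps, I can pull the image details and current watchers from a node (disk-1 below is just an example, and I assume rbd status is available in this Jewel release):
Code:
rbd -p Ceph3 info vm-103-disk-1      # image size, order and enabled features
rbd -p Ceph3 status vm-103-disk-1    # clients that currently hold a watch on the image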
The error in the Windows Event Viewer that relates to this freeze:
Code:
Log Name: Application
Source: ESENT
Date: 4/13/2017 8:39:42 AM
Event ID: 533
Task Category: General
Level: Warning
Keywords: Classic
User: N/A
Computer: ######.######.####.###
Description:
svchost (1744) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 1617920 (0x000000000018b000) for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="ESENT" />
<EventID Qualifiers="0">533</EventID>
<Level>3</Level>
<Task>1</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2017-04-13T13:39:42.000000000Z" />
<EventRecordID>17175</EventRecordID>
<Channel>Application</Channel>
<Computer>######.#####.###.##.###</Computer>
<Security />
</System>
<EventData>
<Data>svchost</Data>
<Data>1744</Data>
<Data>
</Data>
<Data>C:\Windows\system32\LogFiles\Sum\Svc.log</Data>
<Data>1617920 (0x000000000018b000)</Data>
<Data>4096 (0x00001000)</Data>
<Data>36</Data>
</EventData>
</Event>
I have a ZFS-backed storage connection that does not generate this error. If you need any more info, please let me know. This issue is an emergency: I am getting corrupt databases inside my Windows VM.