broken pipe (599)

HuHu

Member
Oct 2, 2013
Hi all,

I run a 4-node PVE cluster. Everything looks OK when I work alone on the system. But when 20 or more students work with it (KVM), the error "broken pipe (599)" appears again and again. The students connect to their VMs with noVNC and/or SPICE. What can I do to make the system run smoothly?

Thank you very much in advance for your efforts and

best regards, Christian
 
Do you have some more information about your setup for us?

First, I would check the PVE version and update to the latest, if possible. We need more information to understand when exactly this situation occurs. What information do you have in your monitoring/metrics?
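As a quick sketch of that first step (the version strings below are made-up placeholders; on a real cluster you would collect them from each node with `pveversion`, e.g. over ssh):

```shell
# Hypothetical check: all four nodes should report the identical
# pve-manager release before debugging further. Replace the hard-coded
# placeholder strings with the real output of 'pveversion' per node.
versions='pve-manager/5.2-10
pve-manager/5.2-10
pve-manager/5.2-10
pve-manager/5.2-10'

# Number of distinct version strings; 1 means every node matches.
distinct=$(printf '%s\n' "$versions" | sort -u | grep -c .)
echo "$distinct"
```

If the count is greater than 1, bring the lagging nodes up to date first (the usual `apt update` / `apt dist-upgrade` flow) before chasing the 599 errors.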
 
Thanks sb-jw for your quick reply.
I think it's PVE 5.2. The error has existed for quite a while (1.5 years). Sorry, I'm not sitting in front of the machine right now. Tomorrow I can give the necessary details.
 
Here is the version information. All nodes have the same software version.


Is this the monitoring/metrics you mean, or what information exactly do you need?
 

Attachments

  • Node71_version.PNG
  • overall_status.PNG
Is this the monitoring/metrics you mean, or what information exactly do you need?
I mean a monitoring system like check_mk, Icinga, or Grafana. Otherwise it's not easy to see what's happening in the cluster when the problem occurs.

But there is a lot more info I'm missing here, like: what hardware do you use? What storage? Are HA VMs set up? What does the VM config look like? Is there any VM that doesn't have the problem? How do the students connect (directly, or do you use HAProxy or another LB in front)?
 
Right now, I do not monitor the PVE cluster.

HW: 4x double HP Blade BL360 G6, CPU/cores/RAM (see image above), 10 GbE network
Storage: external Ceph cluster, 8 nodes (a mix of HP DL380 G4/G6/G8 and HP BL360 G6 + storage blade), approximately 60 OSDs, 10 GbE public/private network
VM config: will follow tomorrow
Connection: no HAProxy, no LB


I have two guesses:
A) the Ceph cluster is too slow, ...
B) there is a configuration issue with pveproxy, because when I am alone on the system everything works as expected. With more than ~5-10 concurrent users the trouble begins.
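To get data on guess B, one thing worth checking is how many 599 responses pveproxy is actually serving, and to whom. The sketch below parses invented sample entries in the access-log style pveproxy uses (status code in field 9); on a real node you would read `/var/log/pveproxy/access.log` instead of the `printf` lines:

```shell
# Count 599 responses per client IP. The printf lines are hypothetical
# sample log entries; the status code sits in whitespace-field 9.
printf '%s\n' \
  '10.0.0.5 - student1@pve [01/10/2019:10:00:01 +0200] "GET /api2/json/nodes HTTP/1.1" 599 -' \
  '10.0.0.6 - student2@pve [01/10/2019:10:00:02 +0200] "GET /api2/json/nodes HTTP/1.1" 200 312' \
  '10.0.0.5 - student1@pve [01/10/2019:10:00:05 +0200] "GET /api2/json/nodes HTTP/1.1" 599 -' \
| awk '$9 == 599 { n[$1]++ } END { for (ip in n) print ip, n[ip] }'
```

If the 599s cluster on one node or one set of clients, that narrows things down considerably; if they spike exactly when the student group logs in, that supports the concurrency theory.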
 
CPU/cores/RAM (see image above)
That doesn't give us the detailed information.

A) the Ceph cluster is too slow, ...
With 60 OSDs I don't think that's the problem, but maybe. For this you need to push some metrics to Grafana; maybe the latency is too high. But before we look at the hardware details, let's wait for the metrics.
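For pushing those metrics: Proxmox VE can send node/VM statistics to an external metric server via `/etc/pve/status.cfg` (see the "External Metric Server" chapter of the PVE docs). A minimal InfluxDB entry looks roughly like this; the host and port are placeholders and the exact syntax should be checked against the docs for your PVE release:

```
# /etc/pve/status.cfg -- sketch only; server/port are placeholders
influxdb:
        server 192.168.0.10
        port 8089
```

Grafana can then graph those series, which is the quickest way to see whether Ceph latency spikes line up with the 599 errors.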

With more than ~5-10 concurrent users the trouble begins.
Maybe some other system is the problem, like NAT, a proxy, or other services in front. Maybe a misconfiguration on one of the switches?
 
That doesn't give us the detailed information.

4x HP BL365 G6
============
CPU:
===
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 4
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 16
Model: 8
Model name: Six-Core AMD Opteron(tm) Processor 8431
Stepping: 0
CPU MHz: 2400.025
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 5118K
NUMA node0 CPU(s): 0,4,8,12,16,20
NUMA node1 CPU(s): 1,5,9,13,17,21
NUMA node2 CPU(s): 2,6,10,14,18,22
NUMA node3 CPU(s): 3,7,11,15,19,23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pgeopt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pst

RAM:
===
1x 196 GB
3x 128 GB


With 60 OSDs I don't think that's the problem, but maybe. For this you need to push some metrics to Grafana; maybe the latency is too high. But before we look at the hardware details, let's wait for the metrics.

Maybe some other system is the problem, like NAT, a proxy, or other services in front. Maybe a misconfiguration on one of the switches?

No NAT, no proxy; the switches, hm, I'll check.

You wanted to see a VM configuration. Here is a standard KVM config:
Code:
agent: 1
bootdisk: virtio0
cores: 2
ide2: none,media=cdrom
memory: 2048
name: deb95-23
net0: virtio=02:28:61:1C:A3:53,bridge=vmbr1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=ba68e772-d536-4f6b-a203-c11dde5f6de8
sockets: 1
vga: qxl
virtio0: service:base-100-disk-1/vm-2861023-disk-1,cache=writethrough,size=12G
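One editor's observation (not from the thread, and worth benchmarking before adopting): the disk above uses `cache=writethrough`, which makes every guest write wait for the Ceph backend to acknowledge it. On RBD-backed storage that is already suspected of being slow, `cache=writeback` (or `cache=none`) is a common alternative to test. A hypothetical variant of that line:

```
# hypothetical variant of the virtio0 line above -- test before adopting
virtio0: service:base-100-disk-1/vm-2861023-disk-1,cache=writeback,size=12G
```

This would only affect disk latency, not the pveproxy/console path, so it speaks to guess A rather than guess B.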
 
Sorry, not now. I have a hunch it has something to do with the storage. I run my own Ceph cluster and it has some performance problems.

Best regards, Christian
 
