broken pipe (599)

HuHu

Hi all,

I run a 4-node PVE cluster. Everything looks OK when I work with the system alone. But when 20 or more students work with the system (KVM), the error "broken pipe (599)" comes up again and again. The students connect to their VMs with noVNC and/or SPICE. What can I do to make the system work smoothly?

Thank you very much in advance for your efforts and

best regards, Christian
 
Could you give us some more information about your setup?

The first thing I would do is check the PVE version and update to the latest, if possible. Beyond that, we need more information to understand when exactly this situation occurs. What information do you have in your monitoring/metrics?
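For reference, checking and updating can be done with the standard commands on each node; a quick sketch, assuming the usual PVE repositories are configured:
Code:
pveversion -v      # list installed Proxmox VE package versions
apt update         # refresh package lists
apt dist-upgrade   # apply pending PVE updates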
 
Thanks, sb-jw, for your quick reply.
I guess it's PVE 5.2. The error has existed for quite a while (1.5 years). Sorry, I'm not sitting in front of the machine right now. Tomorrow I can provide the necessary details.
 
Here is the version information. All nodes run the same software version.


Is this the monitoring/metrics you mean, or what information exactly do you need?
 

Attachments

  • Node71_version.PNG (24.5 KB)
  • overall_status.PNG (35 KB)
Is this the monitoring/metrics you mean, or what information exactly do you need?
I mean a monitoring system like check_mk, Icinga, or Grafana. Otherwise it's not easy to see what happened in the cluster when the problem occurs.

But there is a lot more information missing here: What hardware do you use? What storage? Are HA VMs set up? What does the VM config look like? Is there any VM that doesn't have the problem? How do the students connect (directly, or do you use HAProxy or other load balancers in front)?
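If setting up a full monitoring stack is too much for now, PVE can also push its stats to an external metric server configured in /etc/pve/status.cfg, which Grafana can then read. A rough sketch, assuming an InfluxDB instance listening for UDP on 192.168.1.10 (hypothetical address; the exact syntax depends on the PVE version, see the External Metric Server chapter of the admin guide):
Code:
# /etc/pve/status.cfg -- example values, adjust to your network
influxdb:
        server 192.168.1.10
        port 8089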
 
I mean a monitoring system like check_mk, Icinga, or Grafana. Otherwise it's not easy to see what happened in the cluster when the problem occurs.

But there is a lot more information missing here: What hardware do you use? What storage? Are HA VMs set up? What does the VM config look like? Is there any VM that doesn't have the problem? How do the students connect (directly, or do you use HAProxy or other load balancers in front)?

Right now, I am not monitoring the PVE cluster.

HW: 4x double HP Blade BL360 G6, CPU/cores/RAM (see image above), 10 Gb network
Storage: external Ceph cluster, 8 nodes (mix of HP DL380 G4/G6/G8, HP BL360 G6 + storage blade), approximately 60 OSDs, 10 Gb public/private network
VM config: will follow tomorrow
Connection: no HAProxy, no load balancer


I have two guesses:
A) the Ceph cluster is too slow, ...
B) there is a configuration issue with pveproxy, because when I am alone in the system everything works as I expect. If there are more than ~5 or 10 concurrent users, the trouble begins.
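To narrow down guess B, one option might be to watch pveproxy's access log on the node the students connect to while the error occurs; a minimal sketch (the log path is the PVE default, and whether the 599 shows up there is an assumption worth verifying):
Code:
# follow the pveproxy access log and show only failing requests
tail -f /var/log/pveproxy/access.log | grep ' 599 '
# errors from the proxy daemons themselves usually land in the journal
journalctl -u pveproxy -u spiceproxy -f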
 
CPU/cores/RAM (see image above)
That doesn't contain detailed information.

A) the Ceph cluster is too slow, ...
With 60 OSDs I don't think that's the problem, but it's possible. For this you would need to push some metrics to Grafana; maybe the latency is too high. But before we take a look at the hardware details, let's wait for the metrics.
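Until the metrics are in place, Ceph itself can give a quick read on latency; a minimal sketch with standard Ceph tools (run on a Ceph node; the pool name 'rbd' is an assumption, point the benchmark at a test pool):
Code:
ceph -s                      # overall cluster health and status
ceph osd perf                # per-OSD commit/apply latency in ms
rados bench -p rbd 10 write  # 10-second write benchmark against pool 'rbd'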

If there are more than ~5 or 10 concurrent users, the trouble begins.
Maybe some other system is the problem, like NAT, a proxy, or other services in front. Maybe a misconfiguration on one of the switches?
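A quick sanity check on the network side would be to look at the error and drop counters on the nodes' interfaces; a small sketch with standard Linux tools (the interface name is an example):
Code:
ip -s link show   # RX/TX statistics per interface, incl. errors and drops
ethtool -S eno1   # detailed NIC counters for a specific interface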
 
That doesn't contain detailed information.

4x HP BL365 G6
============
CPU:
===
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 4
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 16
Model: 8
Model name: Six-Core AMD Opteron(tm) Processor 8431
Stepping: 0
CPU MHz: 2400.025
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 5118K
NUMA node0 CPU(s): 0,4,8,12,16,20
NUMA node1 CPU(s): 1,5,9,13,17,21
NUMA node2 CPU(s): 2,6,10,14,18,22
NUMA node3 CPU(s): 3,7,11,15,19,23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pgeopt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate

RAM:
===
1x 196 GB
3x 128 GB


With 60 OSDs I don't think that's the problem, but it's possible. For this you would need to push some metrics to Grafana; maybe the latency is too high. But before we take a look at the hardware details, let's wait for the metrics.

Maybe some other system is the problem, like NAT, a proxy, or other services in front. Maybe a misconfiguration on one of the switches?

No NAT, no proxy. The switches, hmm, I'll check.

You wanted to see a VM configuration. Here is a standard KVM config:
Code:
agent: 1
bootdisk: virtio0
cores: 2
ide2: none,media=cdrom
memory: 2048
name: deb95-23
net0: virtio=02:28:61:1C:A3:53,bridge=vmbr1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=ba68e772-d536-4f6b-a203-c11dde5f6de8
sockets: 1
vga: qxl
virtio0: service:base-100-disk-1/vm-2861023-disk-1,cache=writethrough,size=12G
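Regarding guess A, it might be worth measuring disk latency from inside one of these VMs; a minimal sketch with fio on a Debian guest like the one above (package name and target file are assumptions, size kept small for the 12G disk):
Code:
apt install fio
fio --name=lat-test --filename=/tmp/fio.test --size=512M \
    --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio --direct=1
# high completion latencies (clat) under student load would point
# at the external Ceph cluster rather than at pveproxy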
 
Sorry, not now. I have a hunch it has something to do with the storage: I run my own Ceph cluster, and it has some performance problems.

Best regards, Christian
 
