Recent content by Sascha72036

  1. VM network interruptions (TX dropped on tap interface)

    Both on the host and inside the VM, the logs (journalctl, dmesg -T) look normal. What stands out are very high context switch rates (over 300k per second). The VM is running about 100 Docker containers, which could be relevant. Maybe this is part of the problem, but I don't see an obvious way... (a diagnostic sketch for tap TX drops and context-switch rates follows after this list)
  2. pve vm tap Node drop packet

    I have a similar problem. Did you find a solution?
  3. VM network interruptions (TX dropped on tap interface)

    Hello, I am observing recurring network interruptions on one of my VMs. Monitoring shows packet loss spikes up to 70–80% as well as high round-trip time outliers. An interesting observation is that the problem only appears after 3–4 days of VM uptime. Other VMs on the same host are not...
  4. 48 node pve cluster

    We have one dual 10G NIC in each node. The 10G NIC is shared between corosync & ceph traffic (separated into two different vlans). The floods start for no apparent reason. With sctp instead of knet we have no more floods, but after we restart corosync, some NICs in our cluster are resetting due to tx...
  5. Alternative to corosync for large clusters

    We have exactly the same problem with our 48 node cluster. Some nodes start udp floods. As a result, the 10G NICs of other nodes enter a blocking state. We tried sctp, but that's not the solution. Have you found a way to run corosync stably in large clusters?
  6. 48 node pve cluster

    Hello everyone, we're running a 48 node pve cluster with this setup: AMD EPYC 7402P, 512 GB memory, Intel X520-DA2 or Mellanox ConnectX-3 NIC, Ceph pool with NVMe only, 2x 10Gbit/s interfaces (for cluster traffic) + 2x 1G (for public traffic). As a few others have recently reported in the forum...
  7. high steal time on amd epyc-Rome

    Hello everyone, I use the AMD EPYC 7402P to virtualize with Proxmox. The problem is that from around 45% host CPU usage onward, the steal time in the VMs rises to more than 5%. Every VM is started with the CPU flag "host" (see the steal-time sketch after this list). root@VM24:~# cat /proc/cpuinfo processor : 0 vendor_id ...
  8. corosync udp flood

    We switched the transport to sctp (see the corosync.conf sketch after this list). After that, the corosync flood problems in the cluster no longer occurred. A few days ago we upgraded all servers to the latest PVE version. During the upgrade process, however, the network cards on random hosts suddenly switched off again with the same...
  9. corosync udp flood

    Hello, we are running a 42 node proxmox cluster with ceph. Our nodes are connected via Intel X520-DA2 (2x 10G) to two separate Arista 7050QX switches. Corosync and Ceph are separated into two different vlans. The normal traffic of the VMs runs over the onboard NIC. We have big problems with...
  10. Problems with the Proxmox network

    Here is another excerpt from the syslog: Mar 20 10:15:37 prox39 kernel: [1843705.835893] vmbr0: port 5(tap148i0) entered disabled state Mar 20 10:15:37 prox39 pmxcfs[1711]: [status] notice: received log Mar 20 10:15:37 prox39 pmxcfs[1711]: [status] notice: received log Mar 20 10:15:37 prox39...
  11. Problems with the Proxmox network

    Hello, we are running a large cluster with 41 nodes (CPU: AMD EPYC 7402P) and separate 2x 10G NICs for Ceph + cluster traffic (corosync etc.). Cluster, Ceph and so on are all separated from the normal VM traffic. We use two redundant switches (Arista) for Ceph and two redundant...
  12. EXT4-fs Error with Raid 10

    Hello, I have a similar issue. Is there a recommendation from Proxmox on how to rescue the VM data? Best regards, S. Gericke
  13. LXC ARP issue

    Hello, I run several LXC containers in my cluster with more than one IPv4 address. Every LXC container has one IPv4 address in the Proxmox "network" configuration. We add each additional address in the interfaces file via: up ip addr add 45.132.89.134/24 dev eth0 (see the interfaces sketch after this list). My problem is: it seems as if the LXC...
  14. Disable HA

    Thank you very much for the answer. All servers are set to "idle". The problem is that one node restarts when another node in the cluster reboots and comes back up. I blamed HA for this; can there be other causes for this behaviour? I have attached a syslog... (see the HA/quorum sketch after this list)
  15. Node reboot causes other node reboot when in cluster

    In my cluster there are 30 nodes. Maybe my issue is caused by other problems. Thank you for your answer.
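
For the tap TX-drop and context-switch reports in items 1 and 3, a minimal diagnostic sketch on the Proxmox host; tapXXXi0 is a placeholder, not an interface name from the affected system:

    # TX drop counters for the VM's tap interface
    # (replace tapXXXi0 with the real name shown by "ip link")
    ip -s link show tapXXXi0
    cat /sys/class/net/tapXXXi0/statistics/tx_dropped

    # System-wide context switches per second ("cs" column)
    vmstat 1 5

    # Per-process voluntary/involuntary context switches (pidstat from the sysstat package)
    pidstat -w 1 5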
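
For the corosync flood threads in items 4, 5, 8 and 9, a minimal /etc/pve/corosync.conf totem sketch showing the sctp knet transport those posts mention; the cluster name, config_version and link number are illustrative, not values from the cluster in question:

    totem {
      cluster_name: examplecluster   # placeholder
      config_version: 42             # must be raised above the current value on every edit
      version: 2
      transport: knet
      # knet carries its traffic over udp by default; sctp is the alternative
      knet_transport: sctp
      interface {
        linknumber: 0
      }
    }

Link health can then be checked per node with corosync-cfgtool -s.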
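
For the steal-time post in item 7, a short sketch for observing steal inside a VM and checking the guest CPU type on the host; the VMID 100 is a placeholder:

    # Inside the VM: the "st" column is the percentage of time stolen by the hypervisor
    vmstat 1 5

    # Raw counters: steal is the 8th field of the "cpu" line in /proc/stat
    grep '^cpu ' /proc/stat

    # On the Proxmox host: confirm the configured CPU type (the post uses "host")
    qm config 100 | grep '^cpu'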
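
For the LXC post in item 13, a sketch of the container-side /etc/network/interfaces approach the excerpt describes; the primary address and gateway are placeholders from the documentation range, and only 45.132.89.134/24 is taken from the post:

    auto eth0
    iface eth0 inet static
        address 192.0.2.10/24        # placeholder for the primary address managed by Proxmox
        gateway 192.0.2.1            # placeholder
        # additional address from the post, added once the interface is up
        up ip addr add 45.132.89.134/24 dev eth0
        down ip addr del 45.132.89.134/24 dev eth0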
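
For the HA and unexpected-reboot posts in items 14 and 15, a short checklist of standard Proxmox/corosync tools for narrowing down fencing-related reboots:

    # HA manager state; with no active HA resources the nodes should report "idle"
    ha-manager status

    # Quorum and membership as seen by this node
    pvecm status

    # Corosync link status per node
    corosync-cfgtool -s

    # Logs of the HA services and the watchdog multiplexer around the reboot
    journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --since "1 hour ago"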