Slow migrations

RobFantini

Hello
I am seeing slower migrations with pve7 than with pve6.

We do have a network issue that I have been trying to track down over the last week, which is probably the cause.

However I wanted to see if others have noticed slower migrations.
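
While I chase the network issue, one thing I am double-checking is which network and mode live migration actually uses. A rough sketch of the check (the subnet below is only an example, and I may be misremembering the exact option syntax):

Code:
# check which network / mode live migration is using:
cat /etc/pve/datacenter.cfg
# a dedicated migration network would show up as something like:
#   migration: type=secure,network=10.9.9.0/24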

thank you for reading this.
 
Could you tell us more details about the storage type that you are using with migrations?
 
Same issue with ZFS raidz (used with replication).
We have 6 NICs (1 GbE):
NIC 1 = web GUI, node IPs
NIC 2 = VM/CT traffic with external
NIC 3 = dedicated cluster link
NIC 4 = unused
NIC 4+6 = LAG to the PBS server (restores and backups are fast enough for us ;-)

Which link is used for replication?
 
Could you tell us more details about the storage type that you are using with migrations?

1 - I am fairly sure it is related to this, seen in dmesg on the pve hosts:
Code:
# dmesg|grep hung
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

2 - The Mellanox switches (we use MLAG) had dmesg lines with 'RTM_NEWNEIGH' every couple of seconds.

After restarting the Mellanox switches we had the hung-task lines on the pve hosts.

In the past the only way we knew to deal with those was to reboot each of the 5 nodes, so that is in progress.
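
To see whether the RTM_NEWNEIGH flood comes back after the reboots, I am keeping an eye on it with something like this on the switches (Cumulus is plain Linux underneath, so these are just standard tools):

Code:
# follow kernel messages live and only show the neighbour notifications:
dmesg -wT | grep RTM_NEWNEIGH
# or watch neighbour-table churn directly:
ip monitor neigh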

To answer your question:

We use Ceph with seven 4 TB Intel data-center-grade NVMe drives per node, so 35 OSDs in total.
 
Also, we get emails when ceph -s shows warnings, and we saw these:
Code:
 cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 82322 sec, mon.pve4 has slow ops
 
  services:
    mon: 3 daemons, quorum pve15,pve11,pve4 (age 22h)
    mgr: pve4(active, since 24h), standbys: pve11, pve15
    osd: 35 osds: 35 up (since 22h), 35 in (since 10d)
 
  data:
    pools:   1 pools, 512 pgs
    objects: 2.22M objects, 7.7 TiB
    usage:   24 TiB used, 104 TiB / 127 TiB avail
    pgs:     512 active+clean
 
  io:
    client:   0 B/s rd, 11 MiB/s wr, 0 op/s rd, 122 op/s wr
    
## and zabbix:
Subject: PROBLEM: Ceph cluster in WARN state

Trigger: Ceph cluster in WARN state
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:

Item values:

1. Overal Ceph status (numeric) (ceph:ceph.overall_status_int): 1
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
 
So in our case we think the network was the cause.

I will leave the thread open as someone else posted their issue, and will wait a few days to make sure there is not a repeat.
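
In case someone else hits the same 'mon.pve4 has slow ops' warning: in our experience the counter does not always clear on its own, so once the underlying network problem is fixed we restart the monitor named in the warning (double-check health first; this is just what has worked for us):

Code:
ceph health detail
# on the node named in the warning (pve4 in our case):
systemctl restart ceph-mon@pve4.service
ceph -s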
 
# dmesg|grep hung
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Anything else in dmesg? (No driver bug or other info?) What is your NIC model?
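
For example, something like this should show the full context around the hang plus the NIC driver/firmware (replace the interface name with yours):

Code:
dmesg -T | grep -B 2 -A 15 "blocked for more than"
lspci | grep -i ethernet
ethtool -i <nic>    # driver + firmware version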
 
- We use ConnectX-4 and ConnectX-5 cards.
- The MLAG switches are Mellanox SN2700s (MSN2700-B) running Cumulus.

As for a possible loop, I am certain the cables go straight from the ConnectX-* cards to the switches.

Cumulus has a command that shows the LLDP neighbor of each connection. I ran it on both switches to verify that the port used on each switch is correct. Note we have 2 ConnectX cards per host.

Code:
# switch 1 :
root@mel1:[~]:# net show interface |grep pve
State  Name           Spd  MTU    Mode          LLDP                         Summary
-----  -------------  ---  -----  ------------  ---------------------------  -----------------------
UP     swp2           40G  9216   BondMember    pve7 (0c:42:a1:f3:a1:98)     Master: bond2(UP)
UP     swp4           40G  9216   BondMember    pve2 (0c:42:a1:f3:a1:19)     Master: bond4(UP)
UP     swp6           40G  9216   BondMember    pve4 (0c:42:a1:f3:a1:40)     Master: bond6(UP)
UP     swp7           40G  9216   BondMember    pve15 (0c:42:a1:f3:a1:88)    Master: bond7(UP)
UP     swp8           10G  9216   BondMember    pve11 (50:6b:4b:44:07:4b)    Master: bond8(UP)

## ceph ports
UP     swp27          10G  9216   BondMember    pve15 (3c:ec:ef:30:63:b3)    Master: bond27(UP)
UP     swp28          10G  9216   BondMember    pve7 (3c:ec:ef:30:64:6e)     Master: bond28(UP)
UP     swp29          10G  9216   BondMember    pve2 (3c:ec:ef:30:64:6a)     Master: bond29(UP)
UP     swp30          10G  9216   BondMember    pve4 (3c:ec:ef:30:61:89)     Master: bond30(UP)
UP     swp32          10G  9216   BondMember    pve11 (3c:ec:ef:30:67:cd)    Master: bond32(UP)



# switch 2:
root@mel2:[~]:# net show interface |grep pve
State  Name           Spd  MTU    Mode          LLDP                         Summary
-----  -------------  ---  -----  ------------  ---------------------------  -----------------------
UP     swp2           40G  9216   BondMember    pve7 (0c:42:a1:f3:a1:99)     Master: bond2(UP)
UP     swp4           40G  9216   BondMember    pve2 (0c:42:a1:f3:a1:18)     Master: bond4(UP)
UP     swp6           40G  9216   BondMember    pve4 (0c:42:a1:f3:a1:41)     Master: bond6(UP)
UP     swp7           40G  9216   BondMember    pve15 (0c:42:a1:f3:a1:89)    Master: bond7(UP)
UP     swp8           10G  9216   BondMember    pve11 (50:6b:4b:44:07:4a)    Master: bond8(UP)

##  ceph ports
UP     swp27          10G  9216   BondMember    pve15 (3c:ec:ef:30:63:b2)    Master: bond27(UP)
UP     swp28          10G  9216   BondMember    pve7 (3c:ec:ef:30:64:6f)     Master: bond28(UP)
UP     swp29          10G  9216   BondMember    pve2 (3c:ec:ef:30:64:6b)     Master: bond29(UP)
UP     swp30          10G  9216   BondMember    pve4 (3c:ec:ef:30:61:88)     Master: bond30(UP)
UP     swp32          10G  9216   BondMember    pve11 (3c:ec:ef:30:67:cc)    Master: bond32(UP)
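
For completeness I also cross-checked the same mapping from the PVE side; a rough sketch of what I ran (this assumes lldpd is installed on the hosts, and the bond name is only an example):

Code:
# what the host sees via LLDP:
lldpctl | grep -E "SysName|PortID"
# LACP / bond state as the kernel sees it:
cat /proc/net/bonding/bond0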
 
However, it is quite possible that I have (or had) something misconfigured in the software settings, or that there is a bug in the switch, etc. Usually operator errors occur more often than bugs.
 
- Yes, they are used as routers, per the following LLDP output seen from pfSense:
Code:
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    em0, via: LLDP, RID: 1, Time: 0 day, 01:11:06
  Chassis:     
    ChassisID:    mac 0c:42:a1:0a:47:5a
    SysName:      mel1
    SysDescr:     Cumulus Linux version 4.3.0 running on Mellanox Technologies Ltd. MSN2700-B
    MgmtIP:       10.200.10.1
    MgmtIP:       fe80::e42:a1ff:fe0a:475b
    Capability:   Bridge, on
    Capability:   Router, on

Also, all the bonds are configured as part of a bridge in /etc/network/interfaces on Cumulus:
Code:
auto bridge
iface bridge
    bridge-ports swp22 swp24 peerlink bond1 bond2 bond3 bond4 bond5 bond6 bond7 bond8 bond17 bond18 bond19 bond20 bond21 bond25 bond26 bond27 bond28 bond29 bond30 bond31 bond32
    bridge-pvid 8
    bridge-vids 2-250
    bridge-vlan-aware yes
    mstpctl-treeprio 4096

I am unsure how routing is enabled or configured. On Netgear managed switches we just click something on the switch web page and routing is enabled.

Probably it is here:
Code:
root@mel1:[/etc/frr]:# cat frr.conf
frr version 7.4+cl4.2.1u1
frr defaults datacenter
hostname mel-1
log syslog informational
hostname mel1
service integrated-vtysh-config
ip route 0.0.0.0/0 10.1.0.2 10
ip route 0.0.0.0/0 10.1.140.2 20
ip route 0.0.0.0/0 10.1.8.202 100
line vty
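
If it helps, this is roughly how I have been checking what routing is actually active on the switches (I believe these are the right Cumulus/FRR commands, but corrections welcome):

Code:
# routing table as NCLU shows it:
net show route
# or straight from FRR:
vtysh -c "show ip route"
vtysh -c "show running-config"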
 
So the Cumulus switches are the gateway for the VMs?

I don't know if you are doing active-passive (VRRP) or active-active (VRR).

For VRRP, I think it should work without problems with static routes.

If you use VRR, be careful, because I don't think it works fine without BGP and ECMP paths.


https://docs.nvidia.com/networking-...yer-2/Virtual-Router-Redundancy-VRR-and-VRRP/
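
A quick way to tell which one you are running: on Cumulus, VRR (active-active) shows up as an `address-virtual` line on the SVI. Roughly like this (the VLAN and addresses below are only an example, not your config):

Code:
grep -n "address-virtual" /etc/network/interfaces
# a VRR SVI stanza typically looks something like:
#   auto vlan8
#   iface vlan8
#       address 10.1.8.2/24
#       address-virtual 00:00:5e:00:01:08 10.1.8.1/24
#       vlan-raw-device bridge
#       vlan-id 8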
 
Hello Spirit

As far as I know (I will recheck), we use VRR and do NOT have BGP and ECMP set up.

Would you suggest we just use VRRP? My understanding of networking is not great and I could use advice on which way to set things up.

We have 5 pve hosts, 1 PBS host, pfSense, and 4 Netgear managed switches attached to the pair of Cumulus switches.
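
For the recheck I plan to run something like this on both switches (I think these are the right commands, but as I said my networking knowledge is limited):

Code:
# should complain / come back empty if BGP is not configured:
net show bgp summary
# and look for BGP or ECMP (maximum-paths) in the FRR config:
vtysh -c "show running-config" | grep -iE "bgp|maximum-paths"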
 
Hello, I added this to the /etc/pve/pve-local crontab and got a hit from a standalone pve system we use for off-site backups.
cron:
Code:
55    */4   *  *   *  root  dmesg -T |  grep hung

Email from cron:
Code:
[Sun Sep 12 09:45:13 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Parts of dmesg from that system:
Code:
[Sep12 07:19]  zd32: p1
[Sep12 09:45] INFO: task txg_sync:7329 blocked for more than 120 seconds.
[  +0.000052]       Tainted: P           O      5.11.22-3-pve #1
[  +0.000022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000026] task:txg_sync        state:D stack:    0 pid: 7329 ppid:     2 flags:0x00004000
[  +0.000004] Call Trace:
[  +0.000007]  __schedule+0x2ca/0x880
[  +0.000013]  schedule+0x4f/0xc0
[  +0.000003]  cv_wait_common+0xfd/0x130 [spl]
[  +0.000013]  ? wait_woken+0x80/0x80
[  +0.000006]  __cv_wait+0x15/0x20 [spl]
[  +0.000006]  arc_read+0x1ba/0x12b0 [zfs]
[  +0.000100]  ? arc_can_share+0x80/0x80 [zfs]
[  +0.000054]  dsl_scan_visitbp.isra.0+0x289/0xbf0 [zfs]
[  +0.000076]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x58c/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000090]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000095]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000094]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000084]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visitbp.isra.0+0x7d3/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visit_rootbp.isra.0+0x125/0x1b0 [zfs]
[  +0.000075]  dsl_scan_visitds+0x1a8/0x510 [zfs]
[  +0.000074]  ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
[  +0.000005]  ? __kmalloc_node+0x144/0x2b0
[  +0.000002]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000021]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000006]  ? tsd_hash_search.isra.0+0x47/0xa0 [spl]
[  +0.000012]  ? tsd_set+0x19b/0x4c0 [spl]
[  +0.000007]  ? rrw_enter_read_impl+0xcc/0x180 [zfs]
[  +0.000076]  dsl_scan_sync+0x88b/0x13c0 [zfs]
[  +0.000072]  spa_sync+0x5df/0x1000 [zfs]
[  +0.000076]  ? mutex_lock+0x13/0x40
[  +0.000004]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  +0.000078]  txg_sync_thread+0x2d3/0x460 [zfs]
[  +0.000127]  ? txg_init+0x260/0x260 [zfs]
[  +0.000089]  thread_generic_wrapper+0x79/0x90 [spl]
[  +0.000010]  kthread+0x12f/0x150
[  +0.000005]  ? __thread_exit+0x20/0x20 [spl]
[  +0.000006]  ? __kthread_bind_mask+0x70/0x70
[  +0.000002]  ret_from_fork+0x22/0x30

I thought the hung-task messages at the system consoles on the cluster were caused by network issues. However, having the hang show up on a standalone system at a different site is worth mentioning.



pveversion:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-3-pve)
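
As far as I understand, the dsl_scan_* frames in that trace are the ZFS scrub/resilver code path, so I also checked whether a scrub was running on that box at the time:

Code:
zpool status          # the "scan:" line shows a scrub in progress / when one last ran
journalctl -k --since "2021-09-12" | grep -i txg_sync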
 
I just checked the kernel log on one of the cluster nodes, and the dmesg info is very different. There are lines like these:
Code:
Sep 12 07:39:41 pve2 kernel: [ 3385.014666] INFO: task jbd2/rbd0-8:4096 blocked for more than 120 seconds.
Sep 12 07:39:41 pve2 kernel: [ 3385.014698] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

So the offsite system hang is just a coincidence.
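
If I read it right, the blocked task on the cluster nodes is jbd2/rbd0-8, i.e. the ext4 journal thread of a mapped RBD device, so on the cluster it points back at Ceph and the network rather than local disks. To see which image is behind rbd0:

Code:
rbd showmapped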
 
Hello Spirit

As far as I know (I will recheck), we use VRR and do NOT have BGP and ECMP set up.

Would you suggest we just use VRRP? My understanding of networking is not great and I could use advice on which way to set things up.

We have 5 pve hosts, 1 PBS host, pfSense, and 4 Netgear managed switches attached to the pair of Cumulus switches.
Yes, use VRRP in this case.

See here for more details:
https://www.redpill-linpro.com/techblog/2018/02/26/layer3-cumulus-mlag.html
 
Hello, I added this to the /etc/pve/pve-local crontab and got a hit from a standalone pve system we use for off-site backups.
[...]
Code:
[Sep12 09:45] INFO: task txg_sync:7329 blocked for more than 120 seconds.

I thought the hung-task messages at the system consoles on the cluster were caused by network issues. However, having the hang show up on a standalone system at a different site is worth mentioning.

pveversion:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-3-pve)
The dmesg shows a ZFS hang; that could hang the kernel, and then the network. (Really not sure about that, chicken-and-egg problem.)
 
