slow migrations

RobFantini · Sep 12, 2021

Hello
I am seeing slower migrations with pve7 then pve6 .

We do have a network issue that I have been trying to track down over the last week, which is probably the cause.

However I wanted to see if others have noticed slower migrations.

thank you for reading this.

Moayad · Sep 13, 2021

Could you tell us more details about the storage type that you are using with migrations?

ledufakademy · Sep 13, 2021

same issue with ZFS raidz (used with replication)
we have 6 nic (1GB)
NIC 1 = WEB gui , nodes ip ,
NIC 2 = VM/CT trafic with external,
NIC 3 = cluster dedicated link
NIC 4 = unsused
NIC 4+6 = LAG to PBS server (restore or backup are fast enough for us ;-)

Which link is used for replication ?

RobFantini · Sep 13, 2021

Moayad said:
Could you tell us more details about the storage type that you are using with migrations?

1- I am fairly sure it is related to this seen at dmesg on pve hosts

Code:

# dmesg|grep hung
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

2- and the mellonax switches - we use mlag had dmesg lines with 'RTM_NEWNEIGH ' every coupls of seconds.

after restarting the mellanox switches we had the hung lines at pve hosts.

so in the past the only way we knew to deal with those was reboot each of the 5 nodes, so that is in progress.

To answer your question.

we use ceph with seven 4TB intel data center grade nvme's per node. so 35 osd's

RobFantini · Sep 13, 2021

also we get emails when ceph -s shows warnings and saw these:

Code:

 cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 82322 sec, mon.pve4 has slow ops
 
  services:
    mon: 3 daemons, quorum pve15,pve11,pve4 (age 22h)
    mgr: pve4(active, since 24h), standbys: pve11, pve15
    osd: 35 osds: 35 up (since 22h), 35 in (since 10d)
 
  data:
    pools:   1 pools, 512 pgs
    objects: 2.22M objects, 7.7 TiB
    usage:   24 TiB used, 104 TiB / 127 TiB avail
    pgs:     512 active+clean
 
  io:
    client:   0 B/s rd, 11 MiB/s wr, 0 op/s rd, 122 op/s wr
    
## and zabbix:
Subject: PROBLEM: Ceph cluster in WARN state

Trigger: Ceph cluster in WARN state
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:

Item values:

1. Overal Ceph status (numeric) (ceph:ceph.overall_status_int): 1
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*

RobFantini · Sep 13, 2021

so in our case the network we think was the cause.

I will leave the thread open as someone else posted their issue, and will wait a few days to make sure there is not a repeat.

spirit · Sep 13, 2021

2- and the mellonax switches - we use mlag had dmesg lines with 'RTM_NEWNEIGH ' every coupls of seconds.

are you sure that you don't have a network loop or something like that ?

spirit · Sep 13, 2021

# dmesg|grep hung
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 12 07:39:32 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

anything else in dmesg ? (no driver bug or other info ?) what is your nic model ?

RobFantini · Sep 13, 2021

- we use connect-x4 and -x5
- the mlag switches are mellanox 2500sn running cumulus .

as for the loop i am certain the cables go from the connect-x* cards to the switch.

cumulus has this command that shows the lldp of each connection. I ran this on both switches to verify that the port used at each switch is correct. Note we have 2 connect-x cards per host.

Code:

# switch 1 :
root@mel1:[~]:# net show interface |grep pve
State  Name           Spd  MTU    Mode          LLDP                         Summary
-----  -------------  ---  -----  ------------  ---------------------------  -----------------------
UP     swp2           40G  9216   BondMember    pve7 (0c:42:a1:f3:a1:98)     Master: bond2(UP)
UP     swp4           40G  9216   BondMember    pve2 (0c:42:a1:f3:a1:19)     Master: bond4(UP)
UP     swp6           40G  9216   BondMember    pve4 (0c:42:a1:f3:a1:40)     Master: bond6(UP)
UP     swp7           40G  9216   BondMember    pve15 (0c:42:a1:f3:a1:88)    Master: bond7(UP)
UP     swp8           10G  9216   BondMember    pve11 (50:6b:4b:44:07:4b)    Master: bond8(UP)

## ceph ports
UP     swp27          10G  9216   BondMember    pve15 (3c:ec:ef:30:63:b3)    Master: bond27(UP)
UP     swp28          10G  9216   BondMember    pve7 (3c:ec:ef:30:64:6e)     Master: bond28(UP)
UP     swp29          10G  9216   BondMember    pve2 (3c:ec:ef:30:64:6a)     Master: bond29(UP)
UP     swp30          10G  9216   BondMember    pve4 (3c:ec:ef:30:61:89)     Master: bond30(UP)
UP     swp32          10G  9216   BondMember    pve11 (3c:ec:ef:30:67:cd)    Master: bond32(UP)



# switch 2:
root@mel2:[~]:# net show interface |grep pve
State  Name           Spd  MTU    Mode          LLDP                         Summary
-----  -------------  ---  -----  ------------  ---------------------------  -----------------------
UP     swp2           40G  9216   BondMember    pve7 (0c:42:a1:f3:a1:99)     Master: bond2(UP)
UP     swp4           40G  9216   BondMember    pve2 (0c:42:a1:f3:a1:18)     Master: bond4(UP)
UP     swp6           40G  9216   BondMember    pve4 (0c:42:a1:f3:a1:41)     Master: bond6(UP)
UP     swp7           40G  9216   BondMember    pve15 (0c:42:a1:f3:a1:89)    Master: bond7(UP)
UP     swp8           10G  9216   BondMember    pve11 (50:6b:4b:44:07:4a)    Master: bond8(UP)

##  ceph ports
UP     swp27          10G  9216   BondMember    pve15 (3c:ec:ef:30:63:b2)    Master: bond27(UP)
UP     swp28          10G  9216   BondMember    pve7 (3c:ec:ef:30:64:6f)     Master: bond28(UP)
UP     swp29          10G  9216   BondMember    pve2 (3c:ec:ef:30:64:6b)     Master: bond29(UP)
UP     swp30          10G  9216   BondMember    pve4 (3c:ec:ef:30:61:88)     Master: bond30(UP)
UP     swp32          10G  9216   BondMember    pve11 (3c:ec:ef:30:67:cc)    Master: bond32(UP)

RobFantini · Sep 13, 2021

however I am certain that it is likely that I had or have something mis-configured in software settings. Or there is a bug at the switch etc. usually operator errors occur more often then bugs.

spirit · Sep 13, 2021

don't have too much experience with mlag && cumulus (only with onyx),
but "RTM_NEWNEIGH" , is related to arp, not layer2/bridge mac address table.

(do you use them as router ? )

RobFantini · Sep 13, 2021

- yes they are used as router. per the following in pfsense lldp .

Code:

-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    em0, via: LLDP, RID: 1, Time: 0 day, 01:11:06
  Chassis:     
    ChassisID:    mac 0c:42:a1:0a:47:5a
    SysName:      mel1
    SysDescr:     Cumulus Linux version 4.3.0 running on Mellanox Technologies Ltd. MSN2700-B
    MgmtIP:       10.200.10.1
    MgmtIP:       fe80::e42:a1ff:fe0a:475b
    Capability:   Bridge, on
    Capability:   Router, on

also all the bonds are configured as part of a bridge at /etc/network/interfaces on cumulus:

Code:

auto bridge
iface bridge
    bridge-ports swp22 swp24 peerlink bond1 bond2 bond3 bond4 bond5 bond6 bond7 bond8 bond17 bond18 bond19 bond20 bond21 bond25 bond26 bond27 bond28 bond29 bond30 bond31 bond32
    bridge-pvid 8
    bridge-vids 2-250
    bridge-vlan-aware yes
    mstpctl-treeprio 4096

I am unsure how routing is enabled or configured . On netgear managed switches we just click something on the switch web page and routing is enabled.

probably it is here:

Code:

oot@mel1:[/etc/frr]:# cat frr.conf
frr version 7.4+cl4.2.1u1
frr defaults datacenter
hostname mel-1
log syslog informational
hostname mel1
service integrated-vtysh-config
ip route 0.0.0.0/0 10.1.0.2 10
ip route 0.0.0.0/0 10.1.140.2 20
ip route 0.0.0.0/0 10.1.8.202 100
line vty

spirit · Sep 13, 2021

RobFantini said:

- yes they are used as router. per the following in pfsense lldp .

Code:

-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    em0, via: LLDP, RID: 1, Time: 0 day, 01:11:06
  Chassis:    
    ChassisID:    mac 0c:42:a1:0a:47:5a
    SysName:      mel1
    SysDescr:     Cumulus Linux version 4.3.0 running on Mellanox Technologies Ltd. MSN2700-B
    MgmtIP:       10.200.10.1
    MgmtIP:       fe80::e42:a1ff:fe0a:475b
    Capability:   Bridge, on
    Capability:   Router, on

also all the bonds are configured as part of a bridge at /etc/network/interfaces on cumulus:

Code:

auto bridge
iface bridge
    bridge-ports swp22 swp24 peerlink bond1 bond2 bond3 bond4 bond5 bond6 bond7 bond8 bond17 bond18 bond19 bond20 bond21 bond25 bond26 bond27 bond28 bond29 bond30 bond31 bond32
    bridge-pvid 8
    bridge-vids 2-250
    bridge-vlan-aware yes
    mstpctl-treeprio 4096

I am unsure how routing is enabled or configured . On netgear managed switches we just click something on the switch web page and routing is enabled.

probably it is here:

Code:

oot@mel1:[/etc/frr]:# cat frr.conf
frr version 7.4+cl4.2.1u1
frr defaults datacenter
hostname mel-1
log syslog informational
hostname mel1
service integrated-vtysh-config
ip route 0.0.0.0/0 10.1.0.2 10
ip route 0.0.0.0/0 10.1.140.2 20
ip route 0.0.0.0/0 10.1.8.202 100
line vty

so the cumulus switches are the gateway of the vms ?

I don't known if you do active-passive (vrrp), or active-active (vrr).

For vrrp, I think it should work without problem with static route

If you use vrr, be carefull because I don't think it's working fine without bgp and ecmp paths.

https://docs.nvidia.com/networking-...yer-2/Virtual-Router-Redundancy-VRR-and-VRRP/

RobFantini · Sep 13, 2021

Hello Spirit

As far as I know, and will recheck we use VRR and do NOT have bgp and ecmp set up.

would you suggest we just use VRRP ? my understanding of networking is not great and I could use advice on which way to set things up.

we have 5 pve hosts, 1 pbs host, pfsense and 4 netgear managed switches attached to the Cumulus pair of switches.

RobFantini · Sep 14, 2021

Hello, i added this to /etc/pve/pve-local crontab and got a hit from a standalone pve system we use for off site backups.
cron:

Code:

55    */4   *  *   *  root  dmesg -T |  grep hung

email from cron

Code:

[Sun Sep 12 09:45:13 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

parts of dmesg from system

Code:

Sep12 07:19]  zd32: p1
[Sep12 09:45] INFO: task txg_sync:7329 blocked for more than 120 seconds.
[  +0.000052]       Tainted: P           O      5.11.22-3-pve #1
[  +0.000022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000026] task:txg_sync        state:D stack:    0 pid: 7329 ppid:     2 flags:0x00004000
[  +0.000004] Call Trace:
[  +0.000007]  __schedule+0x2ca/0x880
[  +0.000013]  schedule+0x4f/0xc0
[  +0.000003]  cv_wait_common+0xfd/0x130 [spl]
[  +0.000013]  ? wait_woken+0x80/0x80
[  +0.000006]  __cv_wait+0x15/0x20 [spl]
[  +0.000006]  arc_read+0x1ba/0x12b0 [zfs]
[  +0.000100]  ? arc_can_share+0x80/0x80 [zfs]
[  +0.000054]  dsl_scan_visitbp.isra.0+0x289/0xbf0 [zfs]
[  +0.000076]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x58c/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000090]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000095]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000094]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000084]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visitbp.isra.0+0x7d3/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visit_rootbp.isra.0+0x125/0x1b0 [zfs]
[  +0.000075]  dsl_scan_visitds+0x1a8/0x510 [zfs]
[  +0.000074]  ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
[  +0.000005]  ? __kmalloc_node+0x144/0x2b0
[  +0.000002]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000021]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000006]  ? tsd_hash_search.isra.0+0x47/0xa0 [spl]
[  +0.000012]  ? tsd_set+0x19b/0x4c0 [spl]
[  +0.000007]  ? rrw_enter_read_impl+0xcc/0x180 [zfs]
[  +0.000076]  dsl_scan_sync+0x88b/0x13c0 [zfs]
[  +0.000072]  spa_sync+0x5df/0x1000 [zfs]
[  +0.000076]  ? mutex_lock+0x13/0x40
[  +0.000004]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  +0.000078]  txg_sync_thread+0x2d3/0x460 [zfs]
[  +0.000127]  ? txg_init+0x260/0x260 [zfs]
[  +0.000089]  thread_generic_wrapper+0x79/0x90 [spl]
[  +0.000010]  kthread+0x12f/0x150
[  +0.000005]  ? __thread_exit+0x20/0x20 [spl]
[  +0.000006]  ? __kthread_bind_mask+0x70/0x70
[  +0.000002]  ret_from_fork+0x22/0x30

I thought the hung messages at system consoles on cluster were caused by network issues . however having the hang show up on a stand alone system at a different site is worth mentioning.

pveversion:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-3-pve)

RobFantini · Sep 14, 2021

I just check kernlog at one of the cluster nodes, and the dmesg info is very different. there are lines like these:

Code:

Sep 12 07:39:41 pve2 kernel: [ 3385.014666] INFO: task jbd2/rbd0-8:4096 blocked for more than 120 seconds.
Sep 12 07:39:41 pve2 kernel: [ 3385.014698] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

so the offsite system hang is just a coincidence

spirit · Sep 14, 2021

RobFantini said:
Hello Spirit

As far as I know, and will recheck we use VRR and do NOT have bgp and ecmp set up.

would you suggest we just use VRRP ? my understanding of networking is not great and I could use advice on which way to set things up.

we have 5 pve hosts, 1 pbs host, pfsense and 4 netgear managed switches attached to the Cumulus pair of switches.

yes, use vrrp in this case.

see for more details:
https://www.redpill-linpro.com/techblog/2018/02/26/layer3-cumulus-mlag.html

spirit · Sep 14, 2021

RobFantini said:

Hello, i added this to /etc/pve/pve-local crontab and got a hit from a standalone pve system we use for off site backups.
cron:

Code:

55    */4   *  *   *  root  dmesg -T |  grep hung

email from cron

Code:

[Sun Sep 12 09:45:13 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

parts of dmesg from system

Code:

Sep12 07:19]  zd32: p1
[Sep12 09:45] INFO: task txg_sync:7329 blocked for more than 120 seconds.
[  +0.000052]       Tainted: P           O      5.11.22-3-pve #1
[  +0.000022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000026] task:txg_sync        state:D stack:    0 pid: 7329 ppid:     2 flags:0x00004000
[  +0.000004] Call Trace:
[  +0.000007]  __schedule+0x2ca/0x880
[  +0.000013]  schedule+0x4f/0xc0
[  +0.000003]  cv_wait_common+0xfd/0x130 [spl]
[  +0.000013]  ? wait_woken+0x80/0x80
[  +0.000006]  __cv_wait+0x15/0x20 [spl]
[  +0.000006]  arc_read+0x1ba/0x12b0 [zfs]
[  +0.000100]  ? arc_can_share+0x80/0x80 [zfs]
[  +0.000054]  dsl_scan_visitbp.isra.0+0x289/0xbf0 [zfs]
[  +0.000076]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x58c/0xbf0 [zfs]
[  +0.000075]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000090]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000095]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000094]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000084]  dsl_scan_visitbp.isra.0+0x314/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visitbp.isra.0+0x7d3/0xbf0 [zfs]
[  +0.000074]  dsl_scan_visit_rootbp.isra.0+0x125/0x1b0 [zfs]
[  +0.000075]  dsl_scan_visitds+0x1a8/0x510 [zfs]
[  +0.000074]  ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
[  +0.000005]  ? __kmalloc_node+0x144/0x2b0
[  +0.000002]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000021]  ? spl_kmem_alloc_impl+0xb5/0x100 [spl]
[  +0.000006]  ? tsd_hash_search.isra.0+0x47/0xa0 [spl]
[  +0.000012]  ? tsd_set+0x19b/0x4c0 [spl]
[  +0.000007]  ? rrw_enter_read_impl+0xcc/0x180 [zfs]
[  +0.000076]  dsl_scan_sync+0x88b/0x13c0 [zfs]
[  +0.000072]  spa_sync+0x5df/0x1000 [zfs]
[  +0.000076]  ? mutex_lock+0x13/0x40
[  +0.000004]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  +0.000078]  txg_sync_thread+0x2d3/0x460 [zfs]
[  +0.000127]  ? txg_init+0x260/0x260 [zfs]
[  +0.000089]  thread_generic_wrapper+0x79/0x90 [spl]
[  +0.000010]  kthread+0x12f/0x150
[  +0.000005]  ? __thread_exit+0x20/0x20 [spl]
[  +0.000006]  ? __kthread_bind_mask+0x70/0x70
[  +0.000002]  ret_from_fork+0x22/0x30

I thought the hung messages at system consoles on cluster were caused by network issues . however having the hang show up on a stand alone system at a different site is worth mentioning.

pveversion:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-3-pve)

the dmesg show zfs hang, it could hang the kernel, then network. (Really not sure about that, chicken-egg problem)

RobFantini · Sep 14, 2021

spirit said:
yes, use vrrp in this case.

see for more details:
https://www.redpill-linpro.com/techblog/2018/02/26/layer3-cumulus-mlag.html

Spirit
that link mentions vrr not vrrp . did you mean use vrr ?

spirit · Sep 14, 2021

RobFantini said:
Spirit
that link mentions vrr not vrrp . did you mean use vrr ?

I mean, this show why you need bgp with vrr. if not, use vrrp

slow migrations

Famous Member

Proxmox Staff Member

Member

Famous Member

Famous Member

Famous Member

Distinguished Member

Distinguished Member

Famous Member

Famous Member

Distinguished Member

Famous Member

Distinguished Member

Famous Member

Famous Member

Famous Member

Distinguished Member

Distinguished Member

Famous Member

Distinguished Member

We value your privacy