PVESM got timeout 6.1

Dexogen
Hi,
We have 18 nodes in our cluster. After upgrading from version 5.4 to 6.1-7 we ran into a timeout problem: any operation that lists storages is delayed by up to 30 seconds, although the list does load eventually (see screenshot). The old 5.4 nodes show no problems until they are updated.

Bash:
[root@compute-5 ~]$ pvesm status
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd     active     14881893314      9299124162      5582769152   62.49%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%

1st run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd   inactive               0               0               0    0.00%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso
2nd run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd     active     14881893314      9299124162      5582769152   62.49%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%
3rd run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-BUILD-SSD                        rbd   inactive               0               0               0    0.00%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%

Interestingly, each run of the command shows a different set of RBD storages as inactive. This does not happen with the LVM storages.

Update: I forgot to clarify that the Ceph cluster is external.
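
(The pools are defined as plain external RBD storages in /etc/pve/storage.cfg; the entries look roughly like the sketch below, where the monitor addresses, pool name and user name are placeholders.)

Bash:
cat /etc/pve/storage.cfg
rbd: CEPH-BUILD-SAS
        content images
        krbd 0
        monhost 192.0.2.11 192.0.2.12 192.0.2.13
        pool build-sas
        username admin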
 

Attachments

  • timeout.jpg (17 KB)
What version does the external Ceph cluster run on?
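
(If unsure, the daemon versions can be listed from any host with admin access to that Ceph cluster, and the client version on the PVE node, e.g.:)

Bash:
# on a host with admin access to the external cluster: versions of all running daemons
ceph versions
# on the PVE node: version of the installed Ceph client
ceph --version
dpkg -l | grep -E 'ceph|librados|librbd'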
 
I think I have the same issue: sometimes pvesm status reports a timeout, sometimes it works fine.

In my cluster I have multiple Ceph pools on the same Ceph storage, and all of them work. However, roughly one in three requests shows a pool as "inactive" and prints "got timeout" on stderr.
 
Can all nodes in the cluster reach all nodes of the Ceph cluster? And is the MTU set the same for all interfaces?
 
Yep, the MTU is the same and all nodes are reachable. If a node is reinstalled with 5.4, the problem disappears; it appears only on 6.1, both on nodes upgraded from 5.4 and on clean installs.
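
(For what it's worth, this is how it can be double-checked; the interface name and monitor address below are placeholders:)

Bash:
# verify that full-size frames reach a Ceph monitor unfragmented
# (8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers; use 1472 for a 1500 MTU)
ping -M do -s 8972 -c 3 192.0.2.11
# confirm the MTU actually configured on the storage interface
ip -d link show bond0 | grep -o 'mtu [0-9]*'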
 
Are there any entries in the kernel log? Are there any non-default settings on the cluster/client? Does it make a difference with a different kernel?
 
No, all other settings are default. The only changes we apply are:

YAML:
- name: Network tuning
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    sysctl_set: yes
    state: present
    ignoreerrors: yes
  with_items:
    - { name: "net.ipv4.ip_forward", value: "1" }
    - { name: "net.ipv4.conf.default.rp_filter", value: "0" }
    - { name: "net.ipv4.conf.all.rp_filter", value: "2" }
    - { name: "net.core.rmem_max", value: "56623104" }
    - { name: "net.core.wmem_max", value: "56623104" }
    - { name: "net.core.rmem_default", value: "56623104" }
    - { name: "net.core.wmem_default", value: "56623104" }
    - { name: "net.ipv4.tcp_rmem", value: "4096 87380 56623104" }
    - { name: "net.ipv4.tcp_wmem", value: "4096 65536 56623104" }
    - { name: "net.core.somaxconn", value: "1024" }
    - { name: "net.ipv4.udp_rmem_min", value: "8192" }
    - { name: "net.ipv4.udp_wmem_min", value: "8192" }

- name: Emulex dropping packets fix
  lineinfile:
    dest: /etc/modprobe.d/be2net.conf
    line: 'options be2net rx_frag_size=4096'
    state: present
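
(In case it helps to rule these tunables out: the effective values can be compared against a node that still behaves. Kernel defaults vary between versions, so diffing against a known-good 5.4 node is safer than hard-coding defaults; the keys below are just the ones from the task above.)

Bash:
# print the currently effective values of the tuned keys on this node;
# run the same loop on an unaffected node and diff the output
for k in net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default \
         net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.somaxconn; do
    printf '%s = %s\n' "$k" "$(sysctl -n "$k")"
done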
 
Apparently this is no longer needed; the default value is now 8192. We will gradually restart the cluster nodes to apply the changes. There is nothing interesting in the system log apart from pvestatd timeout errors:

Code:
Apr 29 16:28:42 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:47 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:47 compute-14 pvestatd[2206]: status update time (15.818 seconds)
Apr 29 16:28:52 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:57 compute-14 pvestatd[2206]: got timeout
 
Then I guess you may want to revisit the other settings as well.

Apr 29 16:28:42 compute-14 pvestatd[2206]: got timeout
This just shows the same picture as pvesm.
 
I turned on debug logging and tested on a cluster of two nodes with just a single Ceph pool; the network settings were returned to their default values. I executed the command several times and still sometimes got a timeout.

Code:
[root@compute14 ~]$ date && pvesm status && date
Wed 29 Apr 2020 05:55:08 PM MSK
got timeout
Name                           Type     Status           Total            Used       Available        %
CEPH-DEBUG-SAS                  rbd   inactive               0               0               0    0.00%
backup                          nfs   disabled               0               0               0      N/A
local                           dir   disabled               0               0               0      N/A
local-lvm                   lvmthin     active        79896576               0        79896576    0.00%
iso                             nfs     active       314569728       280108928        34460800   89.05%
Wed 29 Apr 2020 05:55:14 PM MSK

Perhaps a fragment of the log will help.
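
(To narrow down whether the delay sits in the Ceph client itself rather than in pvestatd, the same calls can also be timed directly from the node. The pool name is taken from the test above; the -m/--id/--keyring options are only needed if the node has no /etc/ceph/ceph.conf for the external cluster, and the monitor address is a placeholder.)

Bash:
# time a plain pool listing with the same client libraries pvesm uses
time rbd ls -p CEPH-DEBUG-SAS
# with explicit monitor/auth, using the keyring PVE keeps for external RBD storages
time rbd ls -p CEPH-DEBUG-SAS -m 192.0.2.11 --id admin \
    --keyring /etc/pve/priv/ceph/CEPH-DEBUG-SAS.keyring
# a slow or hanging 'rbd ls' here points at the client<->cluster path, not at PVE itself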
 

Attachments

  • syslog.log (127.5 KB)
It seems the problem was the version mismatch between the Ceph client (14.2.8) and the Ceph server (12.2.12). I rolled back the client packages and the timeout problem was solved.

Bash:
#workaround for pve 6.1-7 with old ceph-cluster 
apt install ceph-common=12.2.11+dfsg1-2.1+b1 \
librbd1=12.2.11+dfsg1-2.1+b1 \
python-cephfs=12.2.11+dfsg1-2.1+b1 \
python-rados=12.2.11+dfsg1-2.1+b1 \
python-rbd=12.2.11+dfsg1-2.1+b1 \
librados2=12.2.11+dfsg1-2.1+b1 \
libradosstriper1=12.2.11+dfsg1-2.1+b1 \
ceph-fuse=12.2.11+dfsg1-2.1+b1 \
libcephfs2=12.2.11+dfsg1-2.1+b1
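
(Optionally, the downgraded packages can be put on hold so that a later apt upgrade does not pull the 14.x client back in:)

Bash:
# optional: keep the downgraded client at 12.2 across future upgrades
apt-mark hold ceph-common ceph-fuse librbd1 librados2 libradosstriper1 libcephfs2 \
    python-cephfs python-rados python-rbd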
 
