PVESM got timeout 6.1

Dexogen
Hi,
We have 18 nodes in our cluster. After upgrading from version 5.4 to 6.1-7 we ran into a timeout problem: any operation that lists storages is delayed by up to 30 seconds, although the list does load eventually (see screenshot). The old 5.4 nodes show no problems until they are updated.

Bash:
[root@compute-5 ~]$ pvesm status
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd     active     14881893314      9299124162      5582769152   62.49%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%

1st run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd   inactive               0               0               0    0.00%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso
2nd run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd     active     14881893314      9299124162      5582769152   62.49%
CEPH-BUILD-SSD                        rbd     active     28997599635     21123504531      7874095104   72.85%
CEPH-CUSTOM-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%
3rd run
Bash:
[root@compute-14 ~]$ pvesm status
got timeout
got timeout
Name                                Type     Status           Total            Used       Available        %
HPE-3PAR-STOR-01                    lvm     active     12884885504     10674503680      2210381824   82.85%
HPE-3PAR-STOR-02                    lvm     active      8992571392      7933526016      1059045376   88.22%
HPE-3PAR-STOR-03                    lvm     active      6442434560       830472192      5611962368   12.89%
CEPH-BUILD-SAS                        rbd   inactive               0               0               0    0.00%
CEPH-BUILD-SSD                        rbd   inactive               0               0               0    0.00%
CEPH-CUSTOM-SAS                        rbd     active      6694895406      1112126254      5582769152   16.61%
CEPH-CUSTOM-SSD                        rbd     active     10040456603      2166507419      7873949184   21.58%
backup                                nfs     active     87816765440     72777157632     15039607808   82.87%
iso                                    nfs     active       314569728       278293536        36276192   88.47%

Interestingly, each run of the command shows a different set of RBD storages as inactive. This does not happen with the LVM storages.

Update: I forgot to clarify that the Ceph cluster is external.
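
(The pools are defined as plain external RBD storages in /etc/pve/storage.cfg; the entries look roughly like the sketch below, where the monitor addresses, pool name and user name are placeholders.)

Bash:
cat /etc/pve/storage.cfg
rbd: CEPH-BUILD-SAS
        content images
        krbd 0
        monhost 192.0.2.11 192.0.2.12 192.0.2.13
        pool build-sas
        username admin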
 

Attachments

  • timeout.jpg (17 KB)
What version does the external Ceph cluster run on?
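
(If unsure, the daemon versions can be listed from any host with admin access to that Ceph cluster, and the client version on the PVE node, e.g.:)

Bash:
# on a host with admin access to the external cluster: versions of all running daemons
ceph versions
# on the PVE node: version of the installed Ceph client
ceph --version
dpkg -l | grep -E 'ceph|librados|librbd'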
 
I think I have the same issue: sometimes pvesm status reports a timeout, sometimes it works fine.

In my cluster I have multiple Ceph pools on the same Ceph storage, and all of them work. However, roughly one in three requests shows a pool as "inactive" and prints "got timeout" on stderr.
 
Can all nodes in the cluster reach all nodes of the Ceph cluster? And is the MTU set the same for all interfaces?
 
Yep, the MTU is the same and all nodes are reachable. If a node is reinstalled with 5.4, the problem disappears; it appears only on 6.1, both on nodes upgraded from 5.4 and on clean installs.
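
(For what it's worth, this is how it can be double-checked; the interface name and monitor address below are placeholders:)

Bash:
# verify that full-size frames reach a Ceph monitor unfragmented
# (8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers; use 1472 for a 1500 MTU)
ping -M do -s 8972 -c 3 192.0.2.11
# confirm the MTU actually configured on the storage interface
ip -d link show bond0 | grep -o 'mtu [0-9]*'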
 
Are there any entries in the kernel log? Are there any non-default settings on the cluster/client? Does it make a difference with a different kernel?
 
No, all other settings are default. The only changes we apply are:

YAML:
- name: Network tuning
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    sysctl_set: yes
    state: present
    ignoreerrors: yes
  with_items:
    - { name: "net.ipv4.ip_forward", value: "1" }
    - { name: "net.ipv4.conf.default.rp_filter", value: "0" }
    - { name: "net.ipv4.conf.all.rp_filter", value: "2" }
    - { name: "net.core.rmem_max", value: "56623104" }
    - { name: "net.core.wmem_max", value: "56623104" }
    - { name: "net.core.rmem_default", value: "56623104" }
    - { name: "net.core.wmem_default", value: "56623104" }
    - { name: "net.ipv4.tcp_rmem", value: "4096 87380 56623104" }
    - { name: "net.ipv4.tcp_wmem", value: "4096 65536 56623104" }
    - { name: "net.core.somaxconn", value: "1024" }
    - { name: "net.ipv4.udp_rmem_min", value: "8192" }
    - { name: "net.ipv4.udp_wmem_min", value: "8192" }

- name: Emulex dropping packets fix
  lineinfile:
    dest: /etc/modprobe.d/be2net.conf
    line: 'options be2net rx_frag_size=4096'
    state: present
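
(In case it helps to rule these tunables out: the effective values can be compared against a node that still behaves. Kernel defaults vary between versions, so diffing against a known-good 5.4 node is safer than hard-coding defaults; the keys below are just the ones from the task above.)

Bash:
# print the currently effective values of the tuned keys on this node;
# run the same loop on an unaffected node and diff the output
for k in net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default \
         net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.somaxconn; do
    printf '%s = %s\n' "$k" "$(sysctl -n "$k")"
done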
 
Apparently this is no longer needed; the default value is now 8192. We will gradually restart the cluster nodes to apply the changes. There is nothing interesting in the system log apart from pvestatd timeout errors:

Code:
Apr 29 16:28:42 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:47 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:47 compute-14 pvestatd[2206]: status update time (15.818 seconds)
Apr 29 16:28:52 compute-14 pvestatd[2206]: got timeout
Apr 29 16:28:57 compute-14 pvestatd[2206]: got timeout
 
Then I guess you may want to revisit the other settings as well.

Apr 29 16:28:42 compute-14 pvestatd[2206]: got timeout
This just shows the same picture as pvesm.
 
I turned on debug logging and tested on a cluster of two nodes with just a single Ceph pool; the network settings were returned to their default values. I executed the command several times and still sometimes got a timeout.

Code:
[root@compute14 ~]$ date && pvesm status && date
Wed 29 Apr 2020 05:55:08 PM MSK
got timeout
Name                           Type     Status           Total            Used       Available        %
CEPH-DEBUG-SAS                  rbd   inactive               0               0               0    0.00%
backup                          nfs   disabled               0               0               0      N/A
local                           dir   disabled               0               0               0      N/A
local-lvm                   lvmthin     active        79896576               0        79896576    0.00%
iso                             nfs     active       314569728       280108928        34460800   89.05%
Wed 29 Apr 2020 05:55:14 PM MSK

Perhaps a fragment of the log will help.
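
(To narrow down whether the delay sits in the Ceph client itself rather than in pvestatd, the same calls can also be timed directly from the node. The pool name is taken from the test above; the -m/--id/--keyring options are only needed if the node has no /etc/ceph/ceph.conf for the external cluster, and the monitor address is a placeholder.)

Bash:
# time a plain pool listing with the same client libraries pvesm uses
time rbd ls -p CEPH-DEBUG-SAS
# with explicit monitor/auth, using the keyring PVE keeps for external RBD storages
time rbd ls -p CEPH-DEBUG-SAS -m 192.0.2.11 --id admin \
    --keyring /etc/pve/priv/ceph/CEPH-DEBUG-SAS.keyring
# a slow or hanging 'rbd ls' here points at the client<->cluster path, not at PVE itself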
 

Attachments

  • syslog.log (127.5 KB)
It seems the problem was the version mismatch between the Ceph client (14.2.8) and the Ceph server (12.2.12). I rolled back the client packages and the timeout problem was solved.

Bash:
#workaround for pve 6.1-7 with old ceph-cluster 
apt install ceph-common=12.2.11+dfsg1-2.1+b1 \
librbd1=12.2.11+dfsg1-2.1+b1 \
python-cephfs=12.2.11+dfsg1-2.1+b1 \
python-rados=12.2.11+dfsg1-2.1+b1 \
python-rbd=12.2.11+dfsg1-2.1+b1 \
librados2=12.2.11+dfsg1-2.1+b1 \
libradosstriper1=12.2.11+dfsg1-2.1+b1 \
ceph-fuse=12.2.11+dfsg1-2.1+b1 \
libcephfs2=12.2.11+dfsg1-2.1+b1
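
(Optionally, the downgraded packages can be put on hold so that a later apt upgrade does not pull the 14.x client back in:)

Bash:
# optional: keep the downgraded client at 12.2 across future upgrades
apt-mark hold ceph-common ceph-fuse librbd1 librados2 libradosstriper1 libcephfs2 \
    python-cephfs python-rados python-rbd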
 
