LVM commands hang and node is marked with a question mark

Karistea

New Member
Jan 20, 2021
Hi,
in a 12-node Proxmox 6.2 cluster, I often experience problems with some hosts turning grey. This usually happens during migrations or after failed migrations.
During that time I am still able to browse the VMs running on the node. While trying to investigate this, I noticed that LVM-related commands hang or time out:


Code:
# lvs
^C  Interrupted...
  Giving up waiting for lock.
  Can't get lock for pve
  Cannot process volume group pve

Sometimes I also get the following message during migrations on this node:


Code:
WARNING: Device /dev/dm-17 not initialized in udev database even after waiting 10000000 microseconds.

The node usually recovers (turns green) when I restart pvestatd.
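For reference, recovering it is really just restarting the status daemon and checking that it comes back up:

Code:
# restart the daemon that reports node/VM status to the GUI, then verify it is running again
systemctl restart pvestatd
systemctl status pvestatd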

Any help would be appreciated.

thank you
 
hi,

can you provide some details (one way to gather everything in one go is sketched after the list):

* pveversion -v
* lsblk
* lvs -a
* journalctl output if you can reproduce the issue
* /var/log/syslog and dmesg can also be of interest
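
A minimal collection sketch, assuming you can still log in over SSH; lvs is wrapped in timeout since it may hang, and /tmp/node-diag.txt is just an example path:

Code:
# gather the requested diagnostics into a single file
{
  pveversion -v
  lsblk
  timeout 60 lvs -a
  journalctl -b --since "2 hours ago"
  dmesg -T | tail -n 200
  tail -n 200 /var/log/syslog
} > /tmp/node-diag.txt 2>&1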
 
Hi Oguz,

here is the info you have asked for:

Code:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Code:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 838.3G  0 disk 
├─sda1                         8:1    0  1007K  0 part 
├─sda2                         8:2    0   512M  0 part 
└─sda3                         8:3    0 837.9G  0 part 
  ├─pve-swap                 253:0    0     4G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    16G  0 lvm  /
  ├─pve-data_tmeta           253:2    0     8G  0 lvm  
  │ └─pve-data-tpool         253:4    0 785.8G  0 lvm  
  │   ├─pve-data             253:5    0 785.8G  0 lvm  
  │   ├─pve-vm--175--disk--0 253:6    0   100G  0 lvm  
  │   ├─pve-vm--175--disk--1 253:7    0   100G  0 lvm  
  │   ├─pve-vm--105--disk--0 253:8    0    60G  0 lvm  
  │   ├─pve-vm--103--disk--0 253:9    0    60G  0 lvm  
  │   ├─pve-vm--103--disk--1 253:10   0    60G  0 lvm  
  │   └─pve-vm--104--disk--0 253:12   0    22G  0 lvm  
  └─pve-data_tdata           253:3    0 785.8G  0 lvm  
    └─pve-data-tpool         253:4    0 785.8G  0 lvm  
      ├─pve-data             253:5    0 785.8G  0 lvm  
      ├─pve-vm--175--disk--0 253:6    0   100G  0 lvm  
      ├─pve-vm--175--disk--1 253:7    0   100G  0 lvm  
      ├─pve-vm--105--disk--0 253:8    0    60G  0 lvm  
      ├─pve-vm--103--disk--0 253:9    0    60G  0 lvm  
      ├─pve-vm--103--disk--1 253:10   0    60G  0 lvm  
      └─pve-vm--104--disk--0 253:12   0    22G  0 lvm  
pve-vm--104--disk--1         253:11   0    22G  0 lvm

Code:
# time lvs -a
^C  Interrupted...
  Giving up waiting for lock.
  Can't get lock for pve
  Cannot process volume group pve

real    2m16.163s
user    0m0.010s
sys    0m0.007s
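
Next time it happens I can also check what is sitting on the LVM lock; roughly along these lines, assuming the default file-locking directory /run/lock/lvm and the V_pve lock file name:

Code:
# see which lock files exist and which process holds the one for VG "pve"
ls -l /run/lock/lvm/
fuser -v /run/lock/lvm/V_pve
# read the VG metadata without taking any locks, to confirm the VG itself is intact
lvs --readonly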

Some interesting logs:
Code:
pvesr[6388]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout


kernel: [64382.007177] Buffer I/O error on dev dm-11, logical block 5767152, async page read

Please note that dm-11 is pve-vm--104--disk--1, which must have been left behind after a failed migration.
Also, I noticed that this host is running version 6.3 while the other nodes have 6.2 installed. Could this be the problem?

thank you
 
thanks for the details

Also, I noticed that this host is running version 6.3 while the other nodes have 6.2 installed. Could this be the problem?
unlikely, 6.2 and 6.3 should normally work together (still better to have them all on the same versions though)

kernel: [64382.007177] Buffer I/O error on dev dm-11, logical block 5767152, async page read
there must be more where this came from. Could you post the dmesg output?
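
Filtering it a bit already helps; something along these lines, with dm-11 being the device from your log and /tmp/dmesg.txt just an example path:

Code:
# full kernel log with readable timestamps, plus the lines around the dm-11 errors
dmesg -T > /tmp/dmesg.txt
dmesg -T | grep -i -B2 -A2 "dm-11"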
 
I'm not sure if this kvm trace is related:
Code:
[Tue Jan 19 22:39:41 2021] INFO: task kvm:28271 blocked for more than 604 seconds.
[Tue Jan 19 22:39:41 2021]       Tainted: P          IO      5.4.78-2-pve #1
[Tue Jan 19 22:39:41 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Jan 19 22:39:41 2021] kvm             D    0 28271      1 0x00000000
[Tue Jan 19 22:39:41 2021] Call Trace:
[Tue Jan 19 22:39:41 2021]  __schedule+0x2e6/0x6f0
[Tue Jan 19 22:39:41 2021]  schedule+0x33/0xa0
[Tue Jan 19 22:39:41 2021]  schedule_timeout+0x205/0x330
[Tue Jan 19 22:39:41 2021]  ? dm_make_request+0x56/0xb0
[Tue Jan 19 22:39:41 2021]  ? generic_make_request+0xcf/0x310
[Tue Jan 19 22:39:41 2021]  io_schedule_timeout+0x1e/0x50
[Tue Jan 19 22:39:41 2021]  wait_for_completion_io+0xb7/0x140
[Tue Jan 19 22:39:41 2021]  ? wake_up_q+0x80/0x80
[Tue Jan 19 22:39:41 2021]  submit_bio_wait+0x61/0x90
[Tue Jan 19 22:39:41 2021]  blkdev_issue_flush+0x8e/0xc0
[Tue Jan 19 22:39:41 2021]  blkdev_fsync+0x35/0x50
[Tue Jan 19 22:39:41 2021]  vfs_fsync_range+0x48/0x80
[Tue Jan 19 22:39:41 2021]  ? __fget_light+0x59/0x70
[Tue Jan 19 22:39:41 2021]  do_fsync+0x3d/0x70
[Tue Jan 19 22:39:41 2021]  __x64_sys_fdatasync+0x17/0x20
[Tue Jan 19 22:39:41 2021]  do_syscall_64+0x57/0x190
[Tue Jan 19 22:39:41 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Tue Jan 19 22:39:41 2021] RIP: 0033:0x7f1be1d472e7
[Tue Jan 19 22:39:41 2021] Code: Bad RIP value.
[Tue Jan 19 22:39:41 2021] RSP: 002b:00007f18c19804c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[Tue Jan 19 22:39:41 2021] RAX: ffffffffffffffda RBX: 000000000000001a RCX: 00007f1be1d472e7
[Tue Jan 19 22:39:41 2021] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001a
[Tue Jan 19 22:39:41 2021] RBP: 00007f1bd4c6d030 R08: 0000000000000000 R09: 00000000ffffffff
[Tue Jan 19 22:39:41 2021] R10: 00007f18c19804b0 R11: 0000000000000293 R12: 0000563f77abe45a
[Tue Jan 19 22:39:41 2021] R13: 00007f1bd4c6d098 R14: 00007f1bd470ae00 R15: 00007f18cf4dae10
[Tue Jan 19 23:38:53 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Tue Jan 19 23:38:53 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 00:00:29 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 00:08:14 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 00:30:14 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 00:38:14 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 01:00:46 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 01:08:14 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read
[Wed Jan 20 01:31:19 2021] Buffer I/O error on dev dm-11, logical block 5767152, async page read

Keep in mind that dm-11 causes all the udev queries to hang and makes the host go grey periodically. I have tried to remove it using dmsetup, but I couldn't. Using lsof and fuser, I can't find which process keeps this device busy.
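
In case someone can spot what I'm missing, this is roughly the kind of check I'm doing (the mapping name is taken from the lsblk output above):

Code:
# kernel-side view of the leftover mapping (dm-11 = pve-vm--104--disk--1)
dmsetup info pve-vm--104--disk--1    # "Open count" shows whether anything still has it open
ls /sys/block/dm-11/holders/         # other device-mapper devices stacked on top of it
# if the open count is 0 and there are no holders, the removal can be retried
dmsetup remove pve-vm--104--disk--1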
 
We have had the same thing occur for a full shutdown VM backup.

Code:
INFO: task vgs:7437 blocked for more than 120 seconds.
      Tainted: P          IO      5.4.78-2-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This repeats; the VMs are shut down, but the backup hangs without actually backing anything up.
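
While the node is in that state, listing the processes stuck in uninterruptible sleep (D state) should show whether vgs and the backup worker are blocked on I/O; a quick way to pull that:

Code:
# list processes in uninterruptible sleep (D state) and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'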

dmesg output is attached.

pveversion:
Code:
~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 

could you run smartctl -a /dev/sda and post the output here?
 
@oguz sorry - are you referring to OP or me?

If not me, please ignore. If me, here is one:

Code:
# smartctl -d cciss,3 -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black Mobile
Device Model:     WDC WD7500BPKX-22HPJT0
Serial Number:    WD-WX61AC4L422L
LU WWN Device Id: 5 0014ee 6b01f96dc
Firmware Version: 01.01A01
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Jan 26 13:57:35 2021 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (12900) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 128) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       797
  3 Spin_Up_Time            0x0027   193   191   021    Pre-fail  Always       -       1341
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   032   032   000    Old_age   Always       -       49640
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       51
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   166   166   000    Old_age   Always       -       102300
194 Temperature_Celsius     0x0022   125   115   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
