[SOLVED] Grey unknown status

urbanator · Dec 3, 2019

Hi,

I have been having an issue with the Proxmox web interface showing my node and all VMs/containers running on it as having an "unknown" status with grey ? marks.
This seems to happen a few hours after every reboot of the server. Restarting pvedaemon, pveproxy, pvestatd does not seem to help.

From what I can tell from the syslog, this seems to be a problem with pvestatd (the blocked task is lvs, PID 4707):

Dec 03 06:32:09 charlie kernel: INFO: task lvs:4707 blocked for more than 120 seconds.
Dec 03 06:32:09 charlie kernel: Tainted: P O 5.0.21-5-pve #1
Dec 03 06:32:09 charlie kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 03 06:32:09 charlie kernel: lvs D 0 4707 1656 0x00000000
Dec 03 06:32:09 charlie kernel: Call Trace:
Dec 03 06:32:09 charlie kernel: __schedule+0x2d4/0x870
Dec 03 06:32:09 charlie kernel: ? get_page_from_freelist+0xefe/0x1440
Dec 03 06:32:09 charlie kernel: schedule+0x2c/0x70
Dec 03 06:32:09 charlie kernel: schedule_timeout+0x258/0x360
Dec 03 06:32:09 charlie kernel: wait_for_completion+0xb7/0x140
Dec 03 06:32:09 charlie kernel: ? wake_up_q+0x80/0x80
Dec 03 06:32:09 charlie kernel: __flush_work+0x138/0x1f0
Dec 03 06:32:09 charlie kernel: ? worker_detach_from_pool+0xb0/0xb0
Dec 03 06:32:09 charlie kernel: ? get_work_pool+0x40/0x40
Dec 03 06:32:09 charlie kernel: __cancel_work_timer+0x115/0x190
Dec 03 06:32:09 charlie kernel: ? exact_lock+0x11/0x20
Dec 03 06:32:09 charlie kernel: cancel_delayed_work_sync+0x13/0x20
Dec 03 06:32:09 charlie kernel: disk_block_events+0x78/0x80
Dec 03 06:32:09 charlie kernel: __blkdev_get+0x73/0x550
Dec 03 06:32:09 charlie kernel: ? bd_acquire+0xd0/0xd0
Dec 03 06:32:09 charlie kernel: blkdev_get+0x10c/0x330
Dec 03 06:32:09 charlie kernel: ? bd_acquire+0xd0/0xd0
Dec 03 06:32:09 charlie kernel: blkdev_open+0x92/0x100
Dec 03 06:32:09 charlie kernel: do_dentry_open+0x143/0x3a0
Dec 03 06:32:09 charlie kernel: vfs_open+0x2d/0x30
Dec 03 06:32:09 charlie kernel: path_openat+0x2bf/0x1570
Dec 03 06:32:09 charlie kernel: ? filename_lookup.part.61+0xe0/0x170
Dec 03 06:32:09 charlie kernel: ? strncpy_from_user+0x57/0x1c0
Dec 03 06:32:09 charlie kernel: do_filp_open+0x93/0x100
Dec 03 06:32:09 charlie kernel: ? strncpy_from_user+0x57/0x1c0
Dec 03 06:32:09 charlie kernel: ? __alloc_fd+0x46/0x150
Dec 03 06:32:09 charlie kernel: do_sys_open+0x177/0x280
Dec 03 06:32:09 charlie kernel: __x64_sys_openat+0x20/0x30
Dec 03 06:32:09 charlie kernel: do_syscall_64+0x5a/0x110
Dec 03 06:32:09 charlie kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 03 06:32:09 charlie kernel: RIP: 0033:0x7fae96c151ae
Dec 03 06:32:09 charlie kernel: Code: Bad RIP value.
Dec 03 06:32:09 charlie kernel: RSP: 002b:00007ffe72629630 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Dec 03 06:32:09 charlie kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fae96c151ae
Dec 03 06:32:09 charlie kernel: RDX: 0000000000044000 RSI: 0000557630d80698 RDI: 00000000ffffff9c
Dec 03 06:32:09 charlie kernel: RBP: 00007ffe72629790 R08: 0000557630dbe010 R09: 0000000000000000
Dec 03 06:32:09 charlie kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe7262be95
Dec 03 06:32:09 charlie kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Does anyone know why/what could be causing this to happen? I have updated everything to the latest versions.

I am also having an issue with scheduled backups failing, which seems to be related: snapshots work fine when the web GUI is showing green status symbols, but fail when showing grey ? marks. The backup log goes as far as "starting backup of VM 10x (lxc)" but does not progress to writing to the backup location. It will stay in this status until I reboot the server (and only then will it produce an interrupt error in the log). I then have to manually unlock/start the container to get it running again.

Package versions:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-15 (running version: 6.0-15/52b91481)
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-12
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.0-9
pve-container: 3.0-14
pve-docs: 6.0-9
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-8
pve-firmware: 3.0-4
pve-ha-manager: 3.0-5
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-1
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

aaron · Dec 4, 2019

How is your storage configured? If the task 'lvs' (which just lists logical volumes) is hanging for more than 2 minutes there must be something wrong.
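
To see at a glance which tasks the kernel has reported as hung, the "blocked for more than" lines can be pulled out of the kernel log. The following is only a minimal sketch: the function name hung_tasks is made up for illustration, and it assumes the log lines have the same format as the ones posted above.

```shell
# Read kernel log text on stdin and print "task-name pid" for every
# "blocked for more than N seconds" report, one unique pair per line.
# hung_tasks is a hypothetical helper name, not part of any package.
hung_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p' | sort -u
}

# Typical use on the affected node (not run here):
#   dmesg | hung_tasks
#   journalctl -k | hung_tasks
```

If only LVM tools such as lvs, pvs or vgs show up, that points at device scanning rather than at the Proxmox daemons themselves.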

urbanator · Dec 4, 2019

As far as I am aware (I didn't set this server up), the VMs/containers are on LVM logical volumes, and the backups are stored on a RAID1 array (sda + sdb) using a ZFS pool.

I've not managed to find any errors with the ZFS pool, or with the RAID drives themselves when running long SMART scans.

Are there any other logs which may help find the cause of the task hanging?

Many thanks.

aaron · Dec 5, 2019

Can you show the content of your /etc/pve/storage.cfg?

Do the commands pvs and lvs run fine and how long does it take them?
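
One way to answer the timing question without tying up a terminal is to start the command in the background and poll it. This is a rough sketch, not a Proxmox tool: the function name probe and the time limits are assumptions. A task stuck in uninterruptible sleep (state D, as in the trace above) cannot be killed, so the sketch only reports on it and does not try to stop it.

```shell
# Start a command in the background and report whether it finished
# within a time limit (in seconds). probe is a hypothetical helper name.
probe() {
    limit="$1"; shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # Poll once per second until the limit is reached or the process is gone.
    while [ "$waited" -lt "$limit" ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        # Process state as shown by ps; D means uninterruptible sleep.
        state=$(ps -o stat= -p "$pid" 2>/dev/null) || state="?"
        echo "$1: still running after ${limit}s (state: ${state})"
    else
        echo "$1: finished within ${waited}s"
    fi
}

# Typical use on the node (not run here):
#   probe 30 pvs
#   probe 30 lvs
```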

urbanator · Dec 5, 2019

Both pvs and lvs hang and do not output anything (I left them both running for ~30 mins).

Here's the storage.cfg contents:

dir: local
        path /var/lib/vz
        content snippets,iso,backup,vztmpl
        maxfiles 2
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

dir: HDDRAID1_DUMPS
        path /HDDRAID1/vz
        content vztmpl,snippets,backup,iso
        maxfiles 10
        shared 1

ViennaTux · Dec 5, 2019

I did find something similar: when inserting an RDX cassette, with e.g. 4TB, the OS starts scanning the device, reports it (correctly) and about 10 mins later the node is unavailable and its VMs show "state unknown". Mind, the RDX is not mounted.
When rebooting with the RDX inserted, the boot process hangs. When the RDX is unmounted before booting, all is well (again).

ViennaTux · Dec 7, 2019

Update: even without the RDX, the status changes to "unknown" out of the blue.
The messages log shows:

Dec 5 20:33:19 denobula kernel: [ 60.158232] hrtimer: interrupt took 6474 ns
Dec 5 20:34:32 denobula kernel: [ 133.137384] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Dec 5 20:34:59 denobula kernel: [ 160.413685] device tap1009i0 entered promiscuous mode
Dec 5 20:34:59 denobula kernel: [ 160.426753] vmbr1: port 4(tap1009i0) entered blocking state
Dec 5 20:34:59 denobula kernel: [ 160.426757] vmbr1: port 4(tap1009i0) entered disabled state
Dec 5 20:34:59 denobula kernel: [ 160.427977] vmbr1: port 4(tap1009i0) entered blocking state
Dec 5 20:34:59 denobula kernel: [ 160.427981] vmbr1: port 4(tap1009i0) entered forwarding state
Dec 5 20:36:13 denobula kernel: [ 234.119770] perf: interrupt took too long (3147 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Dec 5 20:47:09 denobula kernel: [ 889.756439] perf: interrupt took too long (3935 > 3933), lowering kernel.perf_event_max_sample_rate to 50750
Dec 5 21:44:37 denobula kernel: [ 4337.724500] perf: interrupt took too long (4922 > 4918), lowering kernel.perf_event_max_sample_rate to 40500
Dec 6 00:00:03 denobula rsyslogd: [origin software="rsyslogd" swVersion="8.1901.0" x-pid="737" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Dec 7 00:00:03 denobula rsyslogd: [origin software="rsyslogd" swVersion="8.1901.0" x-pid="737" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Dec 7 07:48:54 denobula kernel: [126993.548772] vgs D 0 10377 1131 0x00000000
Dec 7 07:48:54 denobula kernel: [126993.548775] Call Trace:
Dec 7 07:48:54 denobula kernel: [126993.548785] __schedule+0x2bb/0x660
Dec 7 07:48:54 denobula kernel: [126993.548790] ? enqueue_task_fair+0x121/0x450
Dec 7 07:48:54 denobula kernel: [126993.548792] schedule+0x33/0xa0
Dec 7 07:48:54 denobula kernel: [126993.548794] schedule_timeout+0x205/0x300
Dec 7 07:48:54 denobula kernel: [126993.548796] ? ttwu_do_activate+0x5a/0x70
Dec 7 07:48:54 denobula kernel: [126993.548798] wait_for_completion+0xb7/0x140
Dec 7 07:48:54 denobula kernel: [126993.548799] ? wake_up_q+0x80/0x80
Dec 7 07:48:54 denobula kernel: [126993.548802] __flush_work+0x131/0x1e0
Dec 7 07:48:54 denobula kernel: [126993.548804] ? worker_detach_from_pool+0xb0/0xb0
Dec 7 07:48:54 denobula kernel: [126993.548805] ? work_busy+0x90/0x90
Dec 7 07:48:54 denobula kernel: [126993.548806] __cancel_work_timer+0x115/0x190
Dec 7 07:48:54 denobula kernel: [126993.548809] ? exact_lock+0x11/0x20
Dec 7 07:48:54 denobula kernel: [126993.548812] ? kobj_lookup+0xec/0x160
Dec 7 07:48:54 denobula kernel: [126993.548814] cancel_delayed_work_sync+0x13/0x20
Dec 7 07:48:54 denobula kernel: [126993.548815] disk_block_events+0x78/0x80
Dec 7 07:48:54 denobula kernel: [126993.548819] __blkdev_get+0x73/0x550
Dec 7 07:48:54 denobula kernel: [126993.548820] blkdev_get+0xe0/0x140
Dec 7 07:48:54 denobula kernel: [126993.548822] ? bd_acquire+0xd0/0xd0
Dec 7 07:48:54 denobula kernel: [126993.548823] blkdev_open+0x92/0x100
Dec 7 07:48:54 denobula kernel: [126993.548827] do_dentry_open+0x143/0x3a0
... this goes on until reboot.

I have not the slightest idea why...

theleeski · Dec 9, 2019

I have exactly the same issue. Originally the server was upgraded from PVE5 to PVE6, so I re-installed fresh and I am still experiencing these same symptoms (grey unknown statuses, hung backups, pvs/lvs not responding, syslog entry for "task lvs:6204 blocked for more than 120 seconds."). Storage is local so nothing complex going on there.

After a reboot, storage.cfg and pvs/lvs outputs are as follows:

root@lon1hyp1:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup
        maxfiles 3
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

root@lon1hyp1:~# pvs
PV         VG  Fmt  Attr PSize   PFree
/dev/sda3  pve lvm2 a--  928.95g <420.96g

root@lon1hyp1:~# lvs
LV            VG  Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
data          pve twi-aotz-- <294.00g             17.58  1.41
root          pve -wi-ao----  200.00g
swap          pve -wi-ao----    8.00g
vm-100-disk-0 pve Vwi-aotz--   16.00g data        12.18
vm-101-disk-0 pve Vwi-aotz--   16.00g data        20.17
vm-102-disk-0 pve Vwi-aotz--    8.00g data        27.20
vm-103-disk-0 pve Vwi-aotz--   80.00g data        5.88
vm-104-disk-0 pve Vwi-aotz--    8.00g data        17.65
vm-105-disk-0 pve Vwi-aotz--    8.00g data        20.25
vm-106-disk-0 pve Vwi-aotz--   16.00g data        11.20
vm-107-disk-0 pve Vwi-aotz--    8.00g data        22.90
vm-108-disk-0 pve Vwi-a-tz--   16.00g data        10.83
vm-110-disk-0 pve Vwi-aotz--   16.00g data        12.31
vm-111-disk-0 pve Vwi-a-tz--    8.00g data        16.92
vm-199-disk-0 pve Vwi-aotz--   32.00g data        87.27

The syslog entry for the blocked lvs task is as follows:

Dec 7 07:08:02 lon1hyp1 pvedaemon[4847]: <root@pam> successful auth for user 'root@pam'
Dec 7 07:08:04 lon1hyp1 kernel: [306286.717833] INFO: task lvs:6204 blocked for more than 120 seconds.
Dec 7 07:08:04 lon1hyp1 kernel: [306286.718324] Tainted: P O 5.0.21-5-pve #1
Dec 7 07:08:04 lon1hyp1 kernel: [306286.718713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719105] lvs D 0 6204 1656 0x00000000
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719109] Call Trace:
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719122] __schedule+0x2d4/0x870
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719125] schedule+0x2c/0x70
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719128] schedule_timeout+0x258/0x360
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719133] ? ttwu_do_activate+0x67/0x90
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719135] wait_for_completion+0xb7/0x140
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719137] ? wake_up_q+0x80/0x80
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719141] __flush_work+0x138/0x1f0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719143] ? worker_detach_from_pool+0xb0/0xb0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719145] ? get_work_pool+0x40/0x40
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719147] __cancel_work_timer+0x115/0x190
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719153] ? exact_lock+0x11/0x20
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719155] cancel_delayed_work_sync+0x13/0x20
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719157] disk_block_events+0x78/0x80
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719162] __blkdev_get+0x73/0x550
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719164] ? bd_acquire+0xd0/0xd0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719166] blkdev_get+0x10c/0x330
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719168] ? bd_acquire+0xd0/0xd0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719169] blkdev_open+0x92/0x100
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719174] do_dentry_open+0x143/0x3a0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719176] vfs_open+0x2d/0x30
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719179] path_openat+0x2bf/0x1570
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719183] ? __do_page_fault+0x25a/0x4c0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719186] ? mem_cgroup_try_charge+0x8b/0x190
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719187] do_filp_open+0x93/0x100
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719192] ? strncpy_from_user+0x57/0x1c0
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719195] ? __alloc_fd+0x46/0x150
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719197] do_sys_open+0x177/0x280
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719201] ? __x64_sys_io_submit+0xa9/0x190
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719203] __x64_sys_openat+0x20/0x30
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719209] do_syscall_64+0x5a/0x110
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719211] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719214] RIP: 0033:0x7f69cd4e51ae
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719222] Code: Bad RIP value.
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719223] RSP: 002b:00007ffffefceda0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719225] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f69cd4e51ae
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719226] RDX: 0000000000044000 RSI: 000055b3447b7dd0 RDI: 00000000ffffff9c
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719227] RBP: 00007ffffefcef00 R08: 000055b344865000 R09: 0000000000000000
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719228] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffffefd0e95
Dec 7 07:08:04 lon1hyp1 kernel: [306286.719228] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Any ideas what else I can check here?

Thanks

aaron · Dec 10, 2019

urbanator said:
Both pvs and lvs hang and do not output anything (I left them both running for ~30 mins).

That is interesting. Are your disks okay? (SMART status)
Do you see any I/O errors in the output of dmesg?

Do you have any other LVs that are not part of the storage.cfg?

aaron · Dec 10, 2019

Wolfgang Leithner said:
I did find something similar: when inserting an RDX cassette, with e.g. 4TB, the OS starts scanning the device, reports it (correctly) and about 10 mins later the node is unavailable and its VMs show "state unknown". Mind, the RDX is not mounted.
When rebooting with the RDX inserted, the boot process hangs. When the RDX is unmounted before booting, all is well (again).

Can you please try to add a new filter for LVM to ignore the Tandberg drive?

Should be, AFAIU,

Code:

"r|/dev/<tandberg>|",

in front of the other filters in the global_filter option in the /etc/lvm/lvm.conf file. Replace <tandberg> with whatever name / path the Tandberg is showing up as on your system. We don't have any of those around here to test that.

Please let us know if this helps after a reboot of the server.

Looks like someone in the German forum is having the same problem with Tandberg drives: https://forum.proxmox.com/threads/blockierendes-pvestatd-vgs.61362/

urbanator · Dec 12, 2019

I noticed that a removable drive that is in my server is also a Tandberg drive, and edited the lvm.conf as suggested to try out the workaround.

However, after restarting the server, I get errors at boot stating that it has "failed to create a global regex device filter" and that there are syntax/regex pattern errors, and the boot process stops!

I created a backup of the original lvm.conf, which I have restored, but the same errors are still being thrown. I read that a copy of the lvm.conf file is stored within the initrd for bootup, so I have also done the following to try to update it, but this does not seem to have changed anything either (the same errors are shown, referencing the edited lvm.conf file):

Code:

rm /etc/lvm/cache/.cache
mv initrd-`uname -r`.img initrd-`uname -r`.img.orig
mkinitrd -v -f /boot/initrd-`uname -r`.img `uname -r`

Does anyone have any ideas how I can get the system to boot using the original lvm.conf file? I need to get the server back up and running.

aaron · Dec 12, 2019

Did you replace <tandberg> with the actual drive? I don't have a Tandberg drive at hand to test it myself.

If it is a removable USB device, you might want to do the same as in the other thread and ignore all USB devices.

Code:

devices {
        ...
        global_filter = [ "r|/dev/disk/by-id/usb.*|", ... ]
        ...
}
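
To check which devices a pattern like the one above would cover, the by-id symlinks can be listed together with the device node each one points to. This is just a sketch: list_usb_ids is a made-up helper name, and it assumes the usual udev layout under /dev/disk/by-id.

```shell
# List by-id symlinks whose name starts with "usb" and the device each
# one resolves to. $1: directory to scan (defaults to /dev/disk/by-id).
# list_usb_ids is a hypothetical helper name.
list_usb_ids() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/usb*; do
        # Skip the literal pattern when nothing matches.
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Typical use on the node (not run here):
#   list_usb_ids
```

If the removable drive shows up in that list, the usb.* reject pattern should cover it without having to name /dev/sdX directly, which can change between boots.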

Regarding your boot problem: How does the boot process stop? Do you end up in a shell?

urbanator · Dec 12, 2019

Yes, I replaced <tandberg> with sdc, as this was the Tandberg drive. However, I think there may have been a typo in there, and I think the issue is that it's filtering out the system disk.

It is dropping out to a shell on bootup.

aaron · Dec 12, 2019

I recreated a similar situation here where I ended up with the same error message in the rescue shell.
What helped was to edit the lvm.conf in the rescue shell.

Code:

vi /etc/lvm/lvm.conf

I removed the incorrect part I had added to the filter, saved the file and ran

Code:

vgchange -ay pve

The last line of the output said that 5 logical volumes were now active.
I then exited the rescue shell and the system booted up.

I hope you are comfortable with vi as there are no other editors available at that stage AFAIK.

The next step would be to fix the lvm.conf in the running system and then run update-initramfs -u.

Did you by any chance miss the closing | symbol at the end of the filter? That's how I was able to trigger the problem.
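
A quick structural check of a single filter entry before rebooting could look like the sketch below. It is deliberately simplistic and the function name filter_entry_ok is an assumption: it only verifies that the entry starts with a (accept) or r (reject) and that the delimiter right after it also closes the pattern. It does not validate the regular expression itself.

```shell
# Return 0 if an LVM filter entry looks structurally complete: it starts
# with a or r, followed by a delimiter character, and ends with that same
# delimiter. filter_entry_ok is a hypothetical helper name.
filter_entry_ok() {
    entry="$1"
    case "$entry" in
        [ar]?*) ;;
        *) return 1 ;;
    esac
    rest="${entry#?}"               # entry without the leading a/r
    delim="${rest%"${rest#?}"}"     # first character after a/r
    last="${entry#"${entry%?}"}"    # last character of the entry
    [ "${#rest}" -ge 2 ] && [ "$delim" = "$last" ]
}

# Example:
#   filter_entry_ok 'r|/dev/sdc|' && echo "looks complete"
#   filter_entry_ok 'r|/dev/sdc'  || echo "closing delimiter missing"
```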

urbanator · Dec 12, 2019

Thanks, I will give this a try. I think I did miss the closing "|".
I did notice this after I edited the lvm.conf and ran the update process before the planned reboot, and the same syntax error was shown. I corrected the error then, but I'm guessing during the update the incorrect version had already been copied to initrd.
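
Whether the initramfs still carries an older lvm.conf can be checked by unpacking it and comparing the embedded copy with the file on disk. The following is a sketch under assumptions: initramfs_conf_matches is a made-up helper name, and the usage lines assume a Debian-based system where unmkinitramfs (from initramfs-tools) is available and the image is named /boot/initrd.img-<kernel version>.

```shell
# Compare the lvm.conf inside an unpacked initramfs tree with a live copy.
# $1: directory containing the unpacked initramfs, $2: path to the live file.
# Returns 0 if identical, 1 if different, 2 if no embedded copy was found.
# initramfs_conf_matches is a hypothetical helper name.
initramfs_conf_matches() {
    tree="$1"
    live="$2"
    embedded=$(find "$tree" -type f -path '*/etc/lvm/lvm.conf' | head -n 1)
    if [ -z "$embedded" ]; then
        echo "no lvm.conf found under $tree"
        return 2
    fi
    if cmp -s "$embedded" "$live"; then
        echo "initramfs copy matches $live"
    else
        echo "initramfs copy differs from $live"
        return 1
    fi
}

# Typical use as root (not run here):
#   tmp=$(mktemp -d)
#   unmkinitramfs "/boot/initrd.img-$(uname -r)" "$tmp"
#   initramfs_conf_matches "$tmp" /etc/lvm/lvm.conf
#   rm -rf "$tmp"
```

If the two copies differ, running update-initramfs -u after correcting /etc/lvm/lvm.conf, as suggested above, should bring the embedded copy back in line.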

urbanator · Dec 12, 2019

Thanks Aaron, that did the job and got the server booting up once again.

We have decided to try the system without the Tandberg drive plugged in for the time being (as it is not currently being used anyway).
I will report back on whether this solves any of the instability problems we have had with the GUI/lvs task after a few days of the server up and running.

urbanator · Dec 15, 2019

Update: In the 3 days since disconnecting the Tandberg USB drive, the server has been completely stable.

I have re-enabled the scheduled daily backups and they now complete successfully, and the web interface no longer shows an unknown status.

The drive was definitely the cause of the issues. I'm not sure exactly why, but the server seemed to be stable for a number of hours after a restart with the drive connected, so my guess is that the issue could be related to the drive spinning down/going to sleep (as it was not being used on my server).

theleeski · Dec 15, 2019

For me the issue can take a week or more to manifest, normally somewhere between 3 and 10 days. It never happened at all on PVE version 5. It started happening as soon as I upgraded to 6.0, and persists on 6.1. Based on this thread, I have noted that my server has an onboard SD card present which shows up as /dev/sdc. I've disabled it in the BIOS, as it is not being used, and will see if it helps.

ViennaTux · Dec 17, 2019

I have the same issue as theleeski. Under 5.x it worked like a charm for more than a year; after the update, the system gets woozy after a few days.
Unfortunately the Tandberg is built in, so booting with it disconnected is not an option, and neither is disabling the device. When there is no cassette in the drive, the system runs quite OK (for how long exactly I will have to determine), but as soon as the cassette is inserted, the trouble begins. Maybe a defective driver?

aaron · Dec 17, 2019

Hmm, I don't know what exactly is the cause, especially since I cannot try to reproduce it here without a Tandberg drive.
My guess is that they are using LVM for their tapes / disks and that is somehow interfering with the newer LVM that is present in PVE 6.x.

Are they tape drives or disk drives? AFAIR Tandberg offers both.

I don't have much experience with recent tape drives under Linux myself.

As a shot in the dark: if you put in a tape/disk and run an lvs or vgs command right away, can you tell whether the drive is working?

Maybe it is trying to read the whole tape for the LVM info?
