[SOLVED] Upgrade to PVE6: dev-pve-data.device/start timed out.

mollien

Active Member
Jun 18, 2015
Last Saturday, we tried to upgrade our 4-node cluster environment from PVE5 to PVE6, using the guide and the pve5to6 tool. Following the guide, we upgraded Corosync on the 4 nodes, then proceeded to upgrade the first node (PMX01) to buster. All smooth, until the system rebooted and came up in maintenance mode.
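For reference, the per-node part of that procedure boils down to roughly this (an outline only; the corosync 3 repository change and the exact PVE repository updates are as described in the official upgrade guide):

Code:
pve5to6                                               # run the checklist tool before and after
sed -i 's/stretch/buster/g' /etc/apt/sources.list     # also switch the PVE repo list to buster per the guide
apt update && apt dist-upgrade
reboot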

Looking through the boot logs, it appears that mounting /dev/pve/data timed out, which in turn caused a bunch of dependencies to fail:

Code:
Aug 25 11:26:16 arwpmx01 systemd[1]: dev-pve-data.device: Job dev-pve-data.device/start timed out.
Aug 25 11:26:16 arwpmx01 systemd[1]: Timed out waiting for device /dev/pve/data.
-- Subject: A start job for unit dev-pve-data.device has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit dev-pve-data.device has finished with a failure.
--
-- The job identifier is 14 and the job result is timeout.

We can get the system to boot to a functional multi-user state by:
- Logging in to maintenance mode as root
- Editing /etc/fstab and commenting out the /dev/pve/data mount
- Rebooting
- Editing /etc/fstab and uncommenting the /dev/pve/data mount
- Running mount -a (command sketch below)

After this, the server seems to work as expected. All the images are there, they can be started, migrated, etc. However, each subsequent reboot will require us to do the same thing.
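For completeness, the workaround boils down to something like this (the sed patterns are just one way to toggle the line; they assume the entry begins with /dev/pve/data, as in our fstab):

Code:
# in the maintenance shell, as root:
sed -i 's|^/dev/pve/data|#/dev/pve/data|' /etc/fstab   # comment the mount out
reboot
# once the node is back up in multi-user mode:
sed -i 's|^#/dev/pve/data|/dev/pve/data|' /etc/fstab   # put the entry back
mount -a                                               # mount /var/lib/vz by hand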

To see if it was a machine-specific issue, we also upgraded the second server. The same issue happened, the same work-around got it going. Right now, we are running a half-upgraded cluster, 2 nodes with PVE6 and 2 nodes with PVE5, which is not ideal.

I found one other thread in this forum from somebody who described the same issue, but no resolution was posted and the last entry was from early August. I spent the whole day yesterday banging my head against the wall trying to find a fix, but I have not found anything that even points me in the right direction.

Short of trying to rebuild one of the 'broken' cluster nodes, which may lead to rebuilding the whole Proxmox environment, does anyone here have any idea where to look?

Any help is greatly appreciated!
 
please post your:
* '/etc/pve/storage.cfg'
* '/etc/fstab'
* output of `pvs -a`
* output of `vgs -a`
* output of `lvs -a`

anything relevant in the journal: `journalctl --since 2019-08-25`
 
Hello Stoiko, thanks for the response. See the requested files below:

* '/etc/pve/storage.cfg'
Code:
dir: local
    path /var/lib/vz
    content backup,images,iso,vztmpl,rootdir
    maxfiles 0

nfs: ARWNAS01
    export /volume1/ProxMox
    path /mnt/pve/ARWNAS01
    server 192.168.21.16
    content iso,vztmpl,rootdir,backup
    maxfiles 3
    options vers=3,nolock

* '/etc/fstab'
Code:
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/pve/root / ext4 errors=remount-ro 0 1
/dev/pve/data /var/lib/vz ext4 defaults 0 2
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0

* output of 'pvs -a'
Code:
  PV         VG  Fmt  Attr PSize  PFree
  /dev/sda2           ---      0      0
  /dev/sda3  pve lvm2 a--  <1.82t 16.00g

* output of `vgs -a`
Code:
  VG  #PV #LV #SN Attr   VSize  VFree
  pve   1   3   0 wz--n- <1.82t 16.00g

* output of `lvs -a`
Code:
  LV              VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data            pve twi-aotz-- <1.67t             0.00   0.15                          
  [data_tdata]    pve Twi-ao---- <1.67t                                                  
  [data_tmeta]    pve ewi-ao---- 16.00g                                                  
  [lvol0_pmspare] pve ewi------- 16.00g                                                  
  root            pve -wi-ao---- 96.00g                                                  
  swap            pve -wi-ao----  8.00g

anything relevant in the journal: `journalctl --since 2019-08-25`
Code:
Aug 25 16:54:34 arwpmx01 systemd[1]: Started LXC Container Initialization and Autoboot Code.
Aug 25 16:54:34 arwpmx01 rrdcached[1261]: rrdcached started.
Aug 25 16:54:34 arwpmx01 systemd[1]: Started LSB: start or stop rrdcached.
Aug 25 16:54:34 arwpmx01 systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 25 16:54:34 arwpmx01 postfix/postfix-script[1396]: warning: symlink leaves directory: /etc/postfix/./makedefs.out
Aug 25 16:54:34 arwpmx01 systemd[1]: Started Proxmox VE Login Banner.
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [quorum] crit: quorum_initialize failed: 2
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [quorum] crit: can't initialize service
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [confdb] crit: cmap_initialize failed: 2
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [confdb] crit: can't initialize service
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [dcdb] crit: cpg_initialize failed: 2
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [dcdb] crit: can't initialize service
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [status] crit: cpg_initialize failed: 2
Aug 25 16:54:34 arwpmx01 pmxcfs[1433]: [status] crit: can't initialize service
Aug 25 16:54:34 arwpmx01 postfix/postfix-script[1437]: starting the Postfix mail system
Aug 25 16:54:34 arwpmx01 postfix/master[1439]: daemon started -- version 3.4.5, configuration /etc/postfix
Aug 25 16:54:34 arwpmx01 iscsid[1240]: iSCSI daemon with pid=1241 started!
Aug 25 16:54:34 arwpmx01 systemd[1]: Started Postfix Mail Transport Agent (instance -).
 
This is a screenshot of my boot screen. The ACPI errors have always shown up (even before the upgrade), and they still appear when the machine boots with the fstab entry commented out.

[screenshot of the boot screen]
 
Thanks for your assistance, Stoiko! That definitely solved the issue.

What I did to resolve it on my 2 servers:

Clean server (all VMs and CTs migrated off):
- Removed the entry from fstab
- Rebooted
- Migrated the VMs and CTs back onto the machine

Server with VMs and CTs still on it (command sketch below):
- Removed the line from fstab
- Rebooted
- Mounted /dev/pve/data on a temporary mountpoint </mnt/tmp>
- Moved the whole directory structure from </mnt/tmp> to </var/lib/vz>, keeping permissions
- Unmounted </mnt/tmp> and removed the mountpoint
- Rebooted
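In shell terms, the second procedure looks roughly like this (cp -a is my choice here to preserve ownership and permissions; rsync -a would do the same job):

Code:
# after removing the /dev/pve/data line from /etc/fstab and rebooting:
mkdir -p /mnt/tmp
mount /dev/pve/data /mnt/tmp
cp -a /mnt/tmp/. /var/lib/vz/    # copy the directory structure, keeping permissions
umount /mnt/tmp
rmdir /mnt/tmp
reboot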

Thanks again! The Proxmox team rocks!
 
Glad you solved your issue!
Mounted /dev/pve/data on a temporary mountpoint </mnt/tmp>
hmm - this sounds like '/dev/pve/data' is a thin pool on one of your servers and a filesystem on the other?
While this can work, I'd find it confusing. In that case you could consider renaming the LV that has the filesystem on it, or recreating it as an LVM thin pool (and storage) - a sketch of the rename option is below.
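A minimal sketch of the rename option, assuming the ext4 LV keeps serving /var/lib/vz under a less confusing name ('vzdata' is only an example):

Code:
umount /var/lib/vz             # stop using the LV first
lvrename pve/data pve/vzdata   # rename the LV that carries the ext4 filesystem
# then change the /etc/fstab entry to /dev/pve/vzdata and re-mount:
mount -a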

In any case if your issue is resolved - please mark the thread as 'SOLVED', so that others know what to expect.
Thanks!
 
I know the thread is a few months old, but I too ran into this issue with one node in a six-node cluster. /etc/fstab was trying to mount the LVM-thin pool as ext4.

It was a head-scratcher why one of the six was different. Researching led me to https://pve.proxmox.com/wiki/Instal....2Fvar.2Flib.2Fvz_.28Proxmox_4.2_and_later.29, which indicates the on-disk layout of /var/lib/vz changed in version 4.2. This node was rebuilt at some point after 4.2, while the others had simply been upgraded. At some point /etc/fstab must have been copied from one node to another, but the hypervisor was never rebooted.
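A quick way to check which layout a given node actually has (standard LVM and util-linux tools; device names as they appear in this thread):

Code:
lvs -o lv_name,lv_attr pve/data   # an attribute string starting with 't' means thin pool, '-' means plain LV
blkid /dev/mapper/pve-data        # prints an ext4 signature if a filesystem was created on the LV
cat /etc/pve/storage.cfg          # shows whether the node expects a 'dir' or an 'lvmthin' storage on it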
 
Same problem here after upgrading to 6.2-9.

Additionally, if you remove the entry from fstab and mount the volume manually with mount /dev/pve/data /directory/of/disk, it works correctly: I can create directories, write data, and live-migrate machines to the node. The machines come up and run correctly, but then the disk disappears.

It's a magical feeling.

More info:

If I boot into the emergency shell (without removing the entry from /etc/fstab), the disk is not mounted after the boot error, and if I try to mount it manually:
Code:
# mount /dev/pve/data /disk/path -v
mount: /dev/mapper/pve-data mounted on /disk/path.

But when I then run df -h, /disk/path is not mounted! :c

If I run:
Code:
mount /dev/pve/data /disk/path -v; df -h
mount: /dev/mapper/pve-data mounted on /disk/path.

followed by a big df output (which I won't copy here because I'm working over VNC :) ), the interesting line is:

/dev/mapper/pve-data    size-of-disk    size-used    size-available    %-of-use    /disk/path

So the disk is mounted correctly, but it is unmounted again almost immediately.


I have left the node in this broken state for a day, in case anyone needs more info about the error or how to fix it correctly.


Regards
 
Hello Stoiko,

Code:
# pvs
  PV         VG  Fmt  Attr PSize  PFree 
  /dev/sda3  pve lvm2 a--  <1.82t <16.38g

# vgs
  VG  #PV #LV #SN Attr   VSize  VFree  
  pve   1   3   0 wz--n- <1.82t <16.38g

# lvs
  LV   VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-a-tz--  1.71t             0.00   0.15                            
  root pve -wi-ao---- 50.00g                                                    
  swap pve -wi-ao----  8.00g   

# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                  8:0    0  1.8T  0 disk 
|-sda1               8:1    0 1007K  0 part 
|-sda2               8:2    0  512M  0 part 
`-sda3               8:3    0  1.8T  0 part 
  |-pve-swap       253:0    0    8G  0 lvm  [SWAP]
  |-pve-root       253:1    0   50G  0 lvm  /
  |-pve-data_tmeta 253:2    0 15.8G  0 lvm  
  | `-pve-data     253:4    0  1.7T  0 lvm  
  `-pve-data_tdata 253:3    0  1.7T  0 lvm  
    `-pve-data     253:4    0  1.7T  0 lvm  

# mount |grep /dev/pve/data


Regards,
 
Sounds like /dev/pve/data is an LVM thin pool on which, somehow, an ext4 filesystem was also created.

I would make a backup of the disk (using dd or similar), then mount it as ext4, copy off all important data, and migrate it to a new disk - see the sketch below.
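A sketch of that route, with the backup and mount paths purely as placeholders (verify the image before changing anything, and make sure the target has at least as much free space as the LV):

Code:
dd if=/dev/mapper/pve-data of=/backup/pve-data.img bs=4M status=progress   # raw image backup
mkdir -p /mnt/rescue
mount -o ro -t ext4 /dev/mapper/pve-data /mnt/rescue   # mount read-only, just to copy data off
rsync -aH /mnt/rescue/ /path/to/new/disk/              # copy the important data to the new disk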
 
Hello,

I formatted the whole disk and recreated everything without LVM :D - no LVM, no problem.

I don't want to be afraid of every update; in my experience, setups without LVM don't have problems this serious, and this is not the first problem I have had with LVM.

Thanks for the review,
Regards,
 
Hello,

If anyone else has this problem: removing the disk from fstab and adding this to crontab works for me:

Code:
@reboot mount /dev/pve/data /directory_of_disk
(I feel dirty for using this)

I think this happens because Linux can't mount the disk at boot time (obviously).

For some reason, whatever LVM needs is not yet available when fstab tries to mount the disk during boot, at least when an LVM volume is used for a directory storage.

If you run the mount after the OS has booted, it works correctly; but when fstab tries to mount the disk during boot, the mount fails and the system drops into the emergency shell, and because the failed mount blocks the remaining boot steps, the node never comes up correctly.

Regards,
 
@reboot mount /dev/pve/data /directory_of_disk
Please don't do that.

In those cases the LV /dev/pve/data is both formatted as ext4 and configured as an LVM thin pool - this is a very brittle configuration and I would expect it to lead to problems eventually, including data loss.

Just back up your machines, wipe the LV, set it up either as ext4 (directory storage) or as an LVM thin pool, and add it to your storage configuration (rough sketch below).
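A rough sketch of the thin-pool variant of that advice (the storage name 'local-lvm', the content types, and the 100%FREE sizing are illustrative choices, not requirements):

Code:
vzdump <vmid> --storage <your-backup-storage>       # back up every guest first and verify the backups
lvremove pve/data                                   # destroys the LV and everything on it
lvcreate --type thin-pool -l 100%FREE -n data pve   # recreate it as a proper thin pool
pvesm add lvmthin local-lvm --vgname pve --thinpool data --content images,rootdir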
 