[SOLVED] Proxmox 8 - systemd-shutdown[1] - Failed to get MD_LEVEL property

emunt6

Active Member
Oct 3, 2022
Hi!

I upgraded a test server from PVE 7 to PVE 8 and ran into the following annoying bug: when I issue a server shutdown (init 6), the shutdown process hangs, looping the following message:
"systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/mdX, ignoring: No such file or directory"

The storage is assembled at boot by a script because of its complexity (a minimal sketch of such a script follows the list below).
I'm using stacked/layered/nested storage:
1st layer: 24-bay disk shelf with a dual controller connected to the HBA card (HDDs with dual SAS connectors)
2nd layer: multipath maps created for each HDD ( /dev/dm-0, /dev/dm-1, ... )
3rd layer: "3-way RAID-1" arrays created with mdadm (8 separate RAID-1 arrays, each with 3 HDDs)
4th layer: LVM (striped) created on top of the RAID-1 arrays (the storage pool)
5th layer: EXT4 filesystem created on top of the LVM
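For illustration, a minimal sketch of what such a boot-time assembly script might look like, following the layers above; the volume group, logical volume and mount point names are hypothetical placeholders, not taken from this post:
Code:
#!/bin/bash
# Hypothetical assembly sketch for the stacked storage described above.
# Names (storagepool, data, /srv/storage) are placeholders.

# 2nd layer: refresh the multipath maps for the dual-ported SAS disks
multipath -r

# 3rd layer: assemble the RAID-1 arrays defined in /etc/mdadm/mdadm.conf
mdadm --assemble --scan

# 4th layer: activate the striped LVM volume group built on the arrays
vgchange -ay storagepool

# 5th layer: mount the ext4 filesystem
mount /dev/storagepool/data /srv/storage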

This setup worked and was robust, without errors, up to Proxmox 7; since the upgrade, "shutdown" no longer works and the improper shutdown leaves the arrays in a "degraded state".
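For reference, the array state after such an unclean shutdown can be checked with (md0 is just an example device):
Code:
$> cat /proc/mdstat
$> mdadm --detail /dev/md0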

I found the following about this:
https://github.com/canonical/probert/issues/125
According to that report, it is a udev/systemd bug:
"udevadm does not show the MD_LEVEL information on an inactive RAID"


Is this not fixed/patched in the Debian 12 version?
Any help?


Code:
Hardware:
HPE ML350p gen8

~$ pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx2
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.1
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
***Update
proxmox-pve8:
pve-kernel-6.2.16-3-pve 6.2.16-3
mdadm 4.2-5
systemd 252.6-1

Still investigating; I have narrowed it down to "mdadm+udev" on the pve-kernel (a bug?).
Problem: the Proxmox host is unable to restart/reboot because the "MD: mdX stopped." message loops at high speed.
 

Attachments

  • export.jpg (356.7 KB)
I have the same problem after upgrading from PVE 7 to 8, and editing /usr/lib/systemd/system-shutdown/mdadm.shutdown didn't help...
 
Update: 2023-04-17

It seems the Proxmox developers have addressed it; they released new custom systemd packages for PVE 8 (-pmx1):

Code:
$> dpkg --list | grep -i pmx1

ii  libnss-myhostname:amd64                                     252.12-pmx1                                          amd64        nss module providing fallback resolution for the current hostname
ii  libnss-systemd:amd64                                        252.12-pmx1                                          amd64        nss module providing dynamic user and group name resolution
ii  libpam-systemd:amd64                                        252.12-pmx1                                          amd64        system and service manager - PAM module
ii  libsystemd-shared:amd64                                     252.12-pmx1                                          amd64        systemd shared private library
ii  libsystemd0:amd64                                           252.12-pmx1                                          amd64        systemd utility library
ii  libudev-dev:amd64                                           252.12-pmx1                                          amd64        libudev development files
ii  libudev1:amd64                                              252.12-pmx1                                          amd64        libudev shared library
ii  systemd                                                     252.12-pmx1                                          amd64        system and service manager
ii  systemd-sysv                                                252.12-pmx1                                          amd64        system and service manager - SysV compatibility symlinks
ii  udev                                                        252.12-pmx1                                          amd64        /dev/ and hotplug management daemon
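For reference, the systemd build actually running can be confirmed with:
Code:
$> systemctl --version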
 
After the update it looked like the problem was solved on the first reboot, but after extensive testing it is not. I still see these errors:
Code:
54.882310] watchdog: watchdog0: watchdog did not stop!
55.047108] systemd-shutdown[1]: Could not stop MD /dev/md1: No such device
55.047134] watchdog: watchdog0: watchdog did not stop!
55.055696] systemd-shutdown[1]: Failed to finalize MD devices, ignoring.
Or this one spamming the console:
Code:
43.811775] systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/md0, ignoring: No such file or directory
43.812342] systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/md0, ignoring: No such file or directory
43.812905] systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/md0, ignoring: No such file or directory
43.813468] systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/md0, ignoring: No such file or directory
43.814082] systemd-shutdown[1]: Failed to get MD_LEVEL property for /dev/md0, ignoring: No such file or directory
The server hangs on shutdown and needs a manual reset...
 

Hi!

I'm testing again, and I may have found the problem.
First, it is necessary to install this package (the mdadm package has dependencies in its systemd service files that are not correctly declared in the package's dependency list):
Code:
$> apt-get install dracut-core

As far as I can tell, the /etc/fstab mount option "errors=remount-ro" is causing this loop problem (I'm using this option on every mount).
When systemd reaches "reboot.target"/"shutdown.target", it tries to execute "sd-remount.service" and "sd-mount.service", and this causes the loop (the mapper devices are already stopped/deleted, so it is impossible to mount/remount the filesystem).
This is a systemd bug.
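For reference, whether such units exist on a given system can be checked with something like:
Code:
$> systemctl list-unit-files | grep -E 'sd-(re)?mount'
$> systemctl cat sd-remount.service    # shows the unit definition if it exists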

Can you show me your "/etc/fstab" file?
 
Sure, here it is, and indeed it contains errors=remount-ro but just for md2:
Code:
# / was on /dev/md2 during installation
UUID=45380162-b151-4e50-9af3-a9a549ca1757 /               ext4    discard,noatime,nodiratime,relatime,errors=remount-ro 0       1
# /boot was on /dev/md0 during installation
UUID=c5bc77dd-33b6-4691-bcf0-ba540d755bde /boot           ext4    discard,noatime,nodiratime,relatime 0       2
# swap was on /dev/md1 during installation
UUID=b6067ac4-c4fa-451e-835e-29a4e2deb606 none            swap    sw              0       0
 
could you please post the full journal output of a failed shutdown? thanks! note that we don't support MDRAID officially (for other reasons than this issue, but still something you might want to be aware of!)
 
There is no log; it won't finish the reboot/shutdown process.
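For reference, the upstream systemd debugging documentation suggests that late shutdown messages, which usually never reach the persistent journal, can be captured on the console/kernel log by booting with debug options on the kernel command line; this is a general hint, not something taken from this thread:
Code:
# kernel command line additions suggested by the systemd debugging guide
systemd.log_level=debug systemd.log_target=kmsg printk.devkmsg=on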

I attached a video from iLO (2 MB):
(download the file; the web browser player makes it low quality)
video-link

After the video, I am testing with the following:
Code:
$> systemctl mask sd-mount.service
$> systemctl mask sd-remount.service
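For reference, these masks can be reverted later with:
Code:
$> systemctl unmask sd-mount.service sd-remount.service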
 

Attachments

  • Screenshot.png (790.7 KB)
I am not sure, but in the video, I see blk-availability being stopped at the same time - possibly that triggers a premature stopping of the MD devices? without a full (textual ;)) log of the shutdown it's hard to tell what's going on..
 

After some testing, it seems solved:

Code:
$> apt-get install dracut-core
$> systemctl edit blk-availability.service
[Unit]
After=mdadm-shutdown.service
Requires=mdadm-shutdown.service

Code:
$> systemctl edit mdadm-shutdown.service
[Unit]
After=multipathd.service
Requires=multipathd.service

Code:
$> systemctl daemon-reload

It seems this modification resolves the mdadm RAID problem.
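For reference, "systemctl edit" stores such snippets as drop-in files (e.g. /etc/systemd/system/blk-availability.service.d/override.conf); the effective units can be reviewed with:
Code:
$> systemctl cat blk-availability.service
$> systemctl cat mdadm-shutdown.service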


There is another issue: this kernel and systemd version are "unstable", not ready for a production system:
Code:
-systemd 252.12-pmx1
-pve-kernel-6.2.16-4-pve 6.2.16-5

It has happened multiple times; it is deadlocking:
Code:
root@proxmox2:~# systemctl
Failed to list units: Transport endpoint is not connected

root@proxmox2:~# systemctl daemon-reexec
Failed to reload daemon: Transport endpoint is not connected

root@proxmox2:~# systemctl daemon-reload
Failed to reload daemon: Transport endpoint is not connected
 
After some testing, it seems solved: [...]
great!
There is another issue: this kernel and systemd version are "unstable" [...]
could you post more info about that? e.g., the journal leading up to the problem starting? it sounds like an issue with dbus, but I am not sure..
 
Hi!

This happened when I edited the same ".service" file multiple times and called "systemctl daemon-reload"; after that, the above error message appears. I did some more testing; it is hard to reproduce.

Another problem: there is a .service file that is executed properly about 8 times out of 10, but the rest of the time it is not.
Code:
$> systemctl status pve-qm-start.service

● pve-qm-start.service - PVE-QM-START
     Loaded: loaded (/etc/systemd/system/pve-qm-start.service; enabled; preset: enabled)
     Active: inactive (dead)

The .service file:

Code:
[Unit]
Description=PVE-QM-START
After=pve-guests.service multipathd.service
Requires=pve-guests.service multipathd.service

[Service]
Type=oneshot
TimeoutSec=0
KillMode=
RemainAfterExit=yes
Restart=no
ExecStart=/bin/bash /etc/init.d/pve-qm start

[Install]
WantedBy=multi-user.target
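For reference, when a oneshot unit like this only runs on some boots, the unit's journal for the current boot and its ordering dependencies can be inspected with:
Code:
$> journalctl -b -u pve-qm-start.service
$> systemctl list-dependencies --after pve-qm-start.service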
 
that service file is not provided by us, I have no idea what it does ;)
 
