Lost utility power this morning and the UPS drained -- and no, unfortunately (in hindsight), I don't have Proxmox set up to shut the host down while on UPS.
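(Once this is cleaned up I plan to fix that with NUT. A minimal sketch of the standalone setup as I understand it -- 'myups' and the password are placeholders, not anything from an actual config:)

Code:
apt install nut

# /etc/nut/nut.conf
MODE=standalone

# /etc/nut/ups.conf -- assumes a USB HID UPS; 'myups' is a placeholder name
[myups]
    driver = usbhid-ups
    port = auto

# /etc/nut/upsd.users -- local account that upsmon logs in with
[monuser]
    password = changeme
    upsmon master

# /etc/nut/upsmon.conf -- power the host off once the UPS reports low battery
MONITOR myups@localhost 1 monuser changeme master
SHUTDOWNCMD "/sbin/shutdown -h +0"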
On power-up of pnode-a, I noticed none of my VMs had started, and I couldn't log in to the web UI even with a 2FA recovery code. After SSHing into the system, I found login errors like the one below, even though 'tfa.cfg' exists and looks fine (I tested creating a file and hit the same permission denied).
"2023-08-22T08:46:28.425683-04:00 pnode-a pvedaemon[11082]: authentication failure; rhost=::ffff:192.168.71.10 user=root@pam msg=unable to open file '/etc/pve/priv/tfa.cfg.tmp.11082' - Permission denied"
Based on 'df', it seems like I may have lost one of my ZFS mirrors (the pool is a 'raid10' -- two mirrored vdevs striped), but ZFS itself reports everything as fine. My 2nd node (pnode-b) is configured the same way ZFS-wise, so I expected ~21TB for the size reported on pnode-a as well.
Any helpful hints for troubleshooting whatever is going on with 'df'? From the output below, ZFS seems to show everything is good.
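In case it helps, these are the extra commands I was planning to run for a per-vdev and per-dataset space breakdown (standard ZFS tooling as far as I know, nothing Proxmox-specific):

Code:
# per-vdev breakdown -- both mirrors should show up with their own SIZE/ALLOC
zpool list -v rpool

# dataset-level space accounting (snapshots, children, reservations)
zfs list -r -o space rpool

# any quota/reservation set that could skew what df reports?
zfs get -r quota,refquota,reservation,refreservation rpool | grep -v default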
Code:
root@pnode-a:/var/log# zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 1 days 11:04:36 with 0 errors on Mon Aug 14 11:28:37 2023
config:
  NAME                                          STATE     READ WRITE CKSUM
  rpool                                         ONLINE       0     0     0
    mirror-0                                    ONLINE       0     0     0
      ata-WDC_WD120EDAZ-11F3RA0_5PK6VKLB-part3  ONLINE       0     0     0
      ata-WDC_WD120EDBZ-11B1HA0_5QG2ZJ8E-part3  ONLINE       0     0     0
    mirror-1                                    ONLINE       0     0     0
      ata-WDC_WD120EDAZ-11F3RA0_5PK8N47F-part3  ONLINE       0     0     0
      ata-WDC_WD120EDAZ-11F3RA0_5PK82TLB-part3  ONLINE       0     0     0
root@pnode-a:/var/log# zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
rpool                        10.2T  11.5T   104K  /rpool
rpool/ROOT                   66.3G  11.5T    96K  /rpool/ROOT
rpool/ROOT/pve-1             66.3G  11.5T  61.3G  /
rpool/data                   10.2T  11.5T   104K  /rpool/data
rpool/data/base-101-disk-0    140K  11.5T   132K  -
rpool/data/base-101-disk-1   17.5G  11.5T  17.5G  -
rpool/data/base-111-disk-0    168K  11.5T   160K  -
rpool/data/base-111-disk-1   19.4G  11.5T  19.4G  -
root@pnode-a:/var/log# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  21.8T  10.2T  11.6T        -         -    43%    46%  1.00x  ONLINE  -
root@pnode-a:/var/log# df -h
Filesystem                    Size  Used  Avail Use% Mounted on
udev                           63G     0    63G   0% /dev
tmpfs                          13G  2.1M    13G   1% /run
rpool/ROOT/pve-1               12T   62G    12T   1% /
tmpfs                          63G   46M    63G   1% /dev/shm
tmpfs                         5.0M     0   5.0M   0% /run/lock
rpool                          12T  128K    12T   1% /rpool
rpool/ROOT                     12T  128K    12T   1% /rpool/ROOT
rpool/data                     12T  128K    12T   1% /rpool/data
rpool/data/subvol-801-disk-0   16G  402M    16G   3% /rpool/data/subvol-801-disk-0
rpool/data/subvol-401-disk-0   50G  387M    50G   1% /rpool/data/subvol-401-disk-0
/dev/fuse                     128M   52K   128M   1% /etc/pve
tmpfs                          13G     0    13G   0% /run/user/0
root@pnode-b:~# zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 01:47:00 with 0 errors on Sun Aug 13 02:11:01 2023
config:
  NAME                                          STATE     READ WRITE CKSUM
  rpool                                         ONLINE       0     0     0
    mirror-0                                    ONLINE       0     0     0
      ata-WDC_WD120EDBZ-11B1HA0_5QG3KX8E-part3  ONLINE       0     0     0
      ata-WDC_WD120EMFZ-11A6JA0_QGJTW39T-part3  ONLINE       0     0     0
    mirror-1                                    ONLINE       0     0     0
      ata-WDC_WD120EMFZ-11A6JA0_QGH7TJ5T-part3  ONLINE       0     0     0
      ata-WDC_WD120EMFZ-11A6JA0_9RJU9VPC-part3  ONLINE       0     0     0
errors: No known data errors
root@pnode-b:~# df -h
Filesystem                    Size  Used  Avail Use% Mounted on
udev                           63G     0    63G   0% /dev
tmpfs                          13G  1.6M    13G   1% /run
rpool/ROOT/pve-1               21T  7.0G    21T   1% /
tmpfs                          63G   72M    63G   1% /dev/shm
tmpfs                         5.0M     0   5.0M   0% /run/lock
rpool                          21T  128K    21T   1% /rpool
rpool/ROOT                     21T  128K    21T   1% /rpool/ROOT
rpool/data                     21T  128K    21T   1% /rpool/data
rpool/data/subvol-802-disk-0   32G  402M    32G   2% /rpool/data/subvol-802-disk-0
rpool/data/subvol-821-disk-0   32G  401M    32G   2% /rpool/data/subvol-821-disk-0
/dev/fuse                     128M   52K   128M   1% /etc/pve
tmpfs                          13G     0    13G   0% /run/user/0
root@pnode-b:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  21.8T  1023G  20.8T        -         -     5%     4%  1.00x  ONLINE  -
root@pnode-b:~# zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
rpool                        1023G  20.7T   104K  /rpool
rpool/ROOT                   6.97G  20.7T    96K  /rpool/ROOT
rpool/ROOT/pve-1             6.97G  20.7T  6.97G  /
rpool/data                   1015G  20.7T   104K  /rpool/data
rpool/data/subvol-802-disk-0  401M  31.6G   401M  /rpool/data/subvol-802-disk-0
rpool/data/subvol-821-disk-0  401M  31.6G   401M  /rpool/data/subvol-821-disk-0
rpool/data/vm-201-disk-0     91.0G  20.7T  87.9G  -
rpool/data/vm-202-disk-0     81.0G  20.7T  81.0G  -
rpool/data/vm-262-disk-0     6.73G  20.7T  6.73G  -
rpool/data/vm-271-disk-0     6.81G  20.7T  6.81G  -
rpool/data/vm-291-disk-0      208K  20.7T   136K  -
--------------------

Code:
Aug 22 07:41:46 pnode-a dbus-daemon[9580]: [system] Successfully activated service 'org.freedesktop.systemd1'
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Aug 22 07:41:46 pnode-a systemd[1]: Started ksmtuned.service - Kernel Samepage Merging (KSM) Tuning Daemon.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD120EDBZ_11B1HA0-5QG2ZJ8E.ata.state
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], opened
Aug 22 07:41:46 pnode-a ovs-ctl[9379]: Creating empty database /etc/openvswitch/conf.db.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], WDC WD120EDAZ-11F3RA0, S/N:5PK8N47F, WWN:5-000cca-291ee3512, FW:81.00A81, 12.0 TB
Aug 22 07:41:46 pnode-a systemd[1]: e2scrub_reap.service: Deactivated successfully.
Aug 22 07:41:46 pnode-a systemd[1]: Finished e2scrub_reap.service - Remove Stale Online ext4 Metadata Check Snapshots.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], not found in smartd database 7.3/5319.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD120EDAZ_11F3RA0-5PK8N47F.ata.state
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd, type changed from 'scsi' to 'sat'
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], opened
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], WDC WD120EDAZ-11F3RA0, S/N:5PK82TLB, WWN:5-000cca-291edf3f7, FW:81.00A81, 12.0 TB
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], not found in smartd database 7.3/5319.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], is SMART capable. Adding to "monitor" list.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD120EDAZ_11F3RA0-5PK82TLB.ata.state
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sde [SAT], opened
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sde [SAT], not ATA, no IDENTIFY DEVICE Structure
Aug 22 07:41:46 pnode-a smartd[9587]: Monitoring 4 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 171
Aug 22 07:41:46 pnode-a ovs-ctl[9379]: Starting ovsdb-server.
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 108 to 158
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 162
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 159 to 163
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 144 to 196
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD120EDAZ_11F3RA0-5PK6VKLB.ata.state
Aug 22 07:41:46 pnode-a ovs-vsctl[9723]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=8.3.1
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD120EDBZ_11B1HA0-5QG2ZJ8E.ata.state
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD120EDAZ_11F3RA0-5PK8N47F.ata.state
Aug 22 07:41:46 pnode-a smartd[9587]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD120EDAZ_11F3RA0-5PK82TLB.ata.state
Aug 22 07:41:46 pnode-a systemd[1]: Started smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon.
Aug 22 07:41:46 pnode-a ovs-vsctl[9728]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=3.1.0 "external-ids:system-id=\"7dc0fa63-5c09-410a-ac09-04b70ae91426\"" "external-ids:rundir=\"/var/run/openvswitch\"" "system-type=\"debian\"" "system-version=\"12\""
Aug 22 07:41:46 pnode-a ovs-ctl[9379]: Configuring Open vSwitch system IDs.
...
Aug 22 07:41:53 pnode-a systemd[1]: Finished lxc.service - LXC Container Initialization and Autoboot Code.
Aug 22 07:41:53 pnode-a iscsid[10355]: iSCSI daemon with pid=10356 started!
Aug 22 07:41:53 pnode-a fail2ban-server[10300]: Server ready
Aug 22 07:41:54 pnode-a rrdcached[10384]: rrdcached started.
Aug 22 07:41:54 pnode-a systemd[1]: Started rrdcached.service - LSB: start or stop rrdcached.
Aug 22 07:41:54 pnode-a systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Aug 22 07:41:54 pnode-a pmxcfs[10518]: [main] notice: resolved node name 'pnode-a' to '192.168.99.101' for default node IP address
Aug 22 07:41:54 pnode-a pmxcfs[10518]: [main] notice: resolved node name 'pnode-a' to '192.168.99.101' for default node IP address
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [quorum] crit: quorum_initialize failed: 2
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [quorum] crit: can't initialize service
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [confdb] crit: cmap_initialize failed: 2
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [confdb] crit: can't initialize service
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [dcdb] crit: cpg_initialize failed: 2
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [dcdb] crit: can't initialize service
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [status] crit: cpg_initialize failed: 2
Aug 22 07:41:54 pnode-a pmxcfs[10524]: [status] crit: can't initialize service
Aug 22 07:41:54 pnode-a systemd[1]: Started postfix@-.service - Postfix Mail Transport Agent (instance -).
Aug 22 07:41:54 pnode-a kernel: igb 0000:05:00.3 enp5s0f3: igb: enp5s0f3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Aug 22 07:41:54 pnode-a kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f3: link becomes ready
Aug 22 07:41:54 pnode-a systemd[1]: Starting postfix.service - Postfix Mail Transport Agent...
Aug 22 07:41:54 pnode-a systemd[1]: Finished postfix.service - Postfix Mail Transport Agent.
Aug 22 07:41:55 pnode-a systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Aug 22 07:41:55 pnode-a systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Aug 22 07:41:55 pnode-a systemd[1]: Started cron.service - Regular background program processing daemon.
Aug 22 07:41:55 pnode-a systemd[1]: Starting pve-firewall.service - Proxmox VE firewall...
Aug 22 07:41:55 pnode-a systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Aug 22 07:41:55 pnode-a cron[11031]: (CRON) INFO (pidfile fd = 3)
Aug 22 07:41:55 pnode-a cron[11031]: (CRON) INFO (Running @reboot jobs)
Aug 22 07:41:55 pnode-a corosync[11030]: [MAIN ] Corosync Cluster Engine starting up
Aug 22 07:41:55 pnode-a corosync[11030]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 22 07:41:55 pnode-a corosync[11030]: [MAIN ] parse error in config: This totem parser can only parse version 2 configurations.
Aug 22 07:41:55 pnode-a corosync[11030]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1445.
Aug 22 07:41:55 pnode-a systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Aug 22 07:41:55 pnode-a systemd[1]: corosync.service: Failed with result 'exit-code'.
Aug 22 07:41:55 pnode-a systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.
Aug 22 07:41:55 pnode-a systemd[1]: Dependency failed for corosync-qdevice.service - Corosync Qdevice daemon.
Aug 22 07:41:55 pnode-a systemd[1]: corosync-qdevice.service: Job corosync-qdevice.service/start failed with result 'dependency'.
Aug 22 07:41:55 pnode-a systemd[1]: Starting pvedaemon.service - PVE API Daemon...
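My working guess from the log above: corosync fails to parse its config, so pmxcfs comes up without quorum and mounts /etc/pve read-only, which would also explain the tfa.cfg.tmp permission errors -- but I haven't confirmed that yet. Next things I intend to check (standard Proxmox paths; I believe the totem section should contain 'version: 2'):

Code:
# corosync reads the local copy; did the totem section survive the power loss?
head -n 20 /etc/corosync/corosync.conf

# cluster-wide copy (may not be readable while pmxcfs is degraded)
cat /etc/pve/corosync.conf

# quorum / membership state
pvecm status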