Proxmox freezes probably due to SSD being full

Sandbo

I am running Proxmox 6.4-13. Lately I have been experiencing system freezes; in particular, right before it freezes completely, I get input/output errors when running df/nano/lsblk. I have ruled out all the add-on disks, which leaves the system SSD to blame.

Besides a possible hardware failure, I remember seeing similar I/O errors when a disk is full.
Checking through the UI, at Datacenter --> PVE --> Disks --> LVM,
it says the Usage is 97%, which seems rather high.
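For reference, the same figure can be cross-checked from the shell with the standard LVM tools (the VG name pve is the default; I am only sketching the commands here, not pasting their output):
Code:
root@pve:~# vgs pve   # VSize/VFree: how much of the volume group is already allocated to LVs (likely what the UI "Usage" reflects)
root@pve:~# lvs pve   # Data%/Meta% on the "data" thin pool: how full the thin pool actually is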

I looked at the space assignments, however, and they do not look that full:
CT100: 16G+128G
CT105: 128G
VM101: 128G
VM104: 32G

Then I checked from the console, but I couldn't see anything out of the ordinary.
I would appreciate any hints you could give me.

Code:
root@pve:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1                      259:0    0   477G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0   512M  0 part
└─nvme0n1p3                  259:3    0 476.4G  0 part
  ├─pve-swap                 253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           253:2    0   3.6G  0 lvm 
  │ └─pve-data-tpool         253:4    0 349.3G  0 lvm 
  │   ├─pve-data             253:5    0 349.3G  0 lvm 
  │   ├─pve-vm--104--disk--0 253:6    0    32G  0 lvm 
  │   ├─pve-vm--104--disk--1 253:7    0     4M  0 lvm 
  │   ├─pve-vm--100--disk--1 253:8    0    16G  0 lvm 
  │   ├─pve-vm--105--disk--0 253:9    0   128G  0 lvm 
  │   ├─pve-vm--101--disk--0 253:11   0     4M  0 lvm 
  │   └─pve-vm--101--disk--1 253:12   0   128G  0 lvm 
  └─pve-data_tdata           253:3    0 349.3G  0 lvm 
    └─pve-data-tpool         253:4    0 349.3G  0 lvm 
      ├─pve-data             253:5    0 349.3G  0 lvm 
      ├─pve-vm--104--disk--0 253:6    0    32G  0 lvm 
      ├─pve-vm--104--disk--1 253:7    0     4M  0 lvm 
      ├─pve-vm--100--disk--1 253:8    0    16G  0 lvm 
      ├─pve-vm--105--disk--0 253:9    0   128G  0 lvm 
      ├─pve-vm--101--disk--0 253:11   0     4M  0 lvm 
      └─pve-vm--101--disk--1 253:12   0   128G  0 lvm
Code:
root@pve:~# df
Filesystem           1K-blocks     Used Available Use% Mounted on
udev                  16376308        0  16376308   0% /dev
tmpfs                  3280832     1184   3279648   1% /run
/dev/mapper/pve-root  98559220 64049392  29460280  69% /
tmpfs                 16404148    43680  16360468   1% /dev/shm
tmpfs                     5120        0      5120   0% /run/lock
tmpfs                 16404148        0  16404148   0% /sys/fs/cgroup
/dev/fuse                30720       20     30700   1% /etc/pve
tmpfs                  3280828        0   3280828   0% /run/user/0
Code:
root@pve:~# lvdisplay
  --- Logical volume ---
  LV Path                /dev/pve/swap
  LV Name                swap
  VG Name                pve
  LV UUID                CPHFWE-5fuD-JJhi-k9pP-DexF-D9B2-XLf4uT
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:21 +0900
  LV Status              available
  # open                 2
  LV Size                8.00 GiB
  Current LE             2048
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
  
  --- Logical volume ---
  LV Path                /dev/pve/root
  LV Name                root
  VG Name                pve
  LV UUID                LNNM1L-NucB-yxea-4W7E-pEfh-LpYM-0xBeip
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:21 +0900
  LV Status              available
  # open                 1
  LV Size                96.00 GiB
  Current LE             24576
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
  
  --- Logical volume ---
  LV Name                data
  VG Name                pve
  LV UUID                KGD0gn-Iimv-vYD4-hjkr-n2eB-ecM3-ZfAhjp
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:22 +0900
  LV Pool metadata       data_tmeta
  LV Pool data           data_tdata
  LV Status              available
  # open                 7
  LV Size                <349.31 GiB
  Allocated pool data    67.53%
  Allocated metadata     3.69%
  Current LE             89423
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-104-disk-0
  LV Name                vm-104-disk-0
  VG Name                pve
  LV UUID                4jddzv-pFsw-8tmZ-xMxo-D0Id-flfd-45JiwG
  LV Write Access        read/write
  LV Creation host, time pve, 2020-01-01 13:20:44 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                32.00 GiB
  Mapped size            39.96%
  Current LE             8192
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:6
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-104-disk-1
  LV Name                vm-104-disk-1
  VG Name                pve
  LV UUID                pwAOnd-Kyzc-Ktpc-2hVx-CCZQ-erdw-vDlmpJ
  LV Write Access        read/write
  LV Creation host, time pve, 2020-01-01 13:20:44 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Mapped size            3.12%
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:7
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-100-disk-1
  LV Name                vm-100-disk-1
  VG Name                pve
  LV UUID                rcRfZY-boHF-G55k-VhDf-Di2g-1ttD-2x6NLO
  LV Write Access        read/write
  LV Creation host, time pve, 2021-02-26 22:09:59 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                16.00 GiB
  Mapped size            99.11%
  Current LE             4096
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:8
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-105-disk-0
  LV Name                vm-105-disk-0
  VG Name                pve
  LV UUID                OWQIOx-doXE-E2xQ-J5sd-C91k-fMBm-RaCQcE
  LV Write Access        read/write
  LV Creation host, time pve, 2021-03-20 20:16:59 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                128.00 GiB
  Mapped size            98.70%
  Current LE             32768
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:9
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-101-disk-0
  LV Name                vm-101-disk-0
  VG Name                pve
  LV UUID                eE7Dij-wPm8-G4V2-fiL4-yGzv-LlWQ-snhXSF
  LV Write Access        read/write
  LV Creation host, time pve, 2021-04-10 14:44:20 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Mapped size            3.12%
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-101-disk-1
  LV Name                vm-101-disk-1
  VG Name                pve
  LV UUID                KCMfqv-2PWy-iPXN-Eulx-eAPo-3a9G-B6u4Qz
  LV Write Access        read/write
  LV Creation host, time pve, 2021-04-10 14:44:20 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                128.00 GiB
  Mapped size            63.22%
  Current LE             32768
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:12
 
You should also use smartctl to monitor your system disk and maybe initiate a long SMART self-test. If it is a hardware problem, you might see high error counts there.
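For example (the device name is an assumption, adjust to your system disk; NVMe self-test support also needs a reasonably recent smartmontools and a drive that implements it):
Code:
root@pve:~# smartctl -a /dev/nvme0           # overall health, error counts, wear level
root@pve:~# smartctl -t long /dev/nvme0      # start a long self-test
root@pve:~# smartctl -l selftest /dev/nvme0  # check the result once it has finished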
 
I tried to run a self-test with smartctl, but it does not seem to work for NVMe drives.
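As a quick alternative, the drive's own health and error logs can be read with nvme-cli (just a sketch, device name assumed):
Code:
root@pve:~# nvme smart-log /dev/nvme0   # media errors, percentage used, critical warnings
root@pve:~# nvme error-log /dev/nvme0   # recent error log entries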
Moreover, I compared the nvme-cli output across my Proxmox systems:

The problematic system shows this (see 512.11 / 512.11 GB):
Code:
root@pve:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     HBSE39140500891      HP SSD EX950 512GB                       1         512.11  GB / 512.11  GB    512   B +  0 B   R1106C

Another system shows this instead, which makes sense as it has not used up all of its space (yet).
Code:
root@csct1:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7683D09045     KINGSTON SA2000M81000G                   1         300.25  GB /   1.00  TB    512   B +  0 B   S5Z42105

Yet another system shows this, which again makes sense.
Code:
root@pve:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7683E319C5     KINGSTON SA2000M81000G                   1          85.10  GB /   1.00  TB    512   B +  0 B   S5Z42105
/dev/nvme1n1     S463NF0K510637F      Samsung SSD 970 PRO 512GB                1         405.26  GB / 512.11  GB    512   B +  0 B   1B2QEXP7
/dev/nvme2n1     50026B728266FA48     KINGSTON SA2000M81000G                   1          94.58  GB /   1.00  TB    512   B +  0 B   S5Z42105

So it does seem I am somehow running out of space on the affected system.
At the moment, the root partition is nearly empty at 7.23% used (7.30 GB of 100.92 GB),
and the LVM-thin pool is only 67.85% used (254.48 GB of 375.07 GB).
It looks like something does not add up.
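One thing I want to check is where nvme list gets that Usage figure from; as far as I can tell it comes from the namespace utilisation reported by the drive itself rather than from the filesystems, and it can be dumped directly (device name assumed):
Code:
root@pve:~# nvme id-ns /dev/nvme0n1 | grep -E "nsze|ncap|nuse"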
 
Update on 2022/01/08: Looks like the system SSD is indeed failing.

I have formatted the SSD and put the latest Proxmox onto it.
As I was slowly restoring from my backups, the system froze again and threw these errors before it died.
It is likely time to replace the drive.
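For reference, errors like the ones below can be pulled back out of the kernel log after a reboot with something along these lines (requires a persistent journal; the boot offset is an assumption):
Code:
root@csqt:~# journalctl -k -b -1 | grep -iE "aer|nvme"   # kernel messages from the previous boot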

Code:
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:00:00.0
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1:    [ 6] BadTLP               
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000081/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 7] BadDLLP               
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr