Proxmox freezes, probably due to the SSD being full

Sandbo

I am running Proxmox 6.4-13. Lately I have been experiencing system freezes; in particular, just before the system freezes completely, commands like df/nano/lsblk start returning input/output errors. I have ruled out all add-on disks, which leaves the system SSD as the suspect.

Apart from a possible hardware failure, I remember seeing similar I/O errors when a disk is full.
Checking the UI under Datacenter --> PVE --> Disks --> LVM,
the usage is reported as 97%, which seems rather high.
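For reference, the same numbers should be visible from the shell as well; something like the following (assuming the default pve volume group) prints the thin pool's data and metadata usage:
Code:
# Volume group totals: size vs. free extents
vgs pve
# Thin pool and thin volumes: Data% shows how full pve/data really is
lvs -a pve
# Physical volume view of the same SSD
pvs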

I looked at how the space is assigned, but it does not look that full:
CT100: 16G+128G
CT105: 128G
VM101: 128G
VM104: 32G

Then I checked from the console, but I couldn't see anything out of the ordinary.
I would appreciate any hints.

Code:
root@pve:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1                      259:0    0   477G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0   512M  0 part
└─nvme0n1p3                  259:3    0 476.4G  0 part
  ├─pve-swap                 253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 253:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           253:2    0   3.6G  0 lvm 
  │ └─pve-data-tpool         253:4    0 349.3G  0 lvm 
  │   ├─pve-data             253:5    0 349.3G  0 lvm 
  │   ├─pve-vm--104--disk--0 253:6    0    32G  0 lvm 
  │   ├─pve-vm--104--disk--1 253:7    0     4M  0 lvm 
  │   ├─pve-vm--100--disk--1 253:8    0    16G  0 lvm 
  │   ├─pve-vm--105--disk--0 253:9    0   128G  0 lvm 
  │   ├─pve-vm--101--disk--0 253:11   0     4M  0 lvm 
  │   └─pve-vm--101--disk--1 253:12   0   128G  0 lvm 
  └─pve-data_tdata           253:3    0 349.3G  0 lvm 
    └─pve-data-tpool         253:4    0 349.3G  0 lvm 
      ├─pve-data             253:5    0 349.3G  0 lvm 
      ├─pve-vm--104--disk--0 253:6    0    32G  0 lvm 
      ├─pve-vm--104--disk--1 253:7    0     4M  0 lvm 
      ├─pve-vm--100--disk--1 253:8    0    16G  0 lvm 
      ├─pve-vm--105--disk--0 253:9    0   128G  0 lvm 
      ├─pve-vm--101--disk--0 253:11   0     4M  0 lvm 
      └─pve-vm--101--disk--1 253:12   0   128G  0 lvm
Code:
root@pve:~# df
Filesystem           1K-blocks     Used Available Use% Mounted on
udev                  16376308        0  16376308   0% /dev
tmpfs                  3280832     1184   3279648   1% /run
/dev/mapper/pve-root  98559220 64049392  29460280  69% /
tmpfs                 16404148    43680  16360468   1% /dev/shm
tmpfs                     5120        0      5120   0% /run/lock
tmpfs                 16404148        0  16404148   0% /sys/fs/cgroup
/dev/fuse                30720       20     30700   1% /etc/pve
tmpfs                  3280828        0   3280828   0% /run/user/0
Code:
root@pve:~# lvdisplay
  --- Logical volume ---
  LV Path                /dev/pve/swap
  LV Name                swap
  VG Name                pve
  LV UUID                CPHFWE-5fuD-JJhi-k9pP-DexF-D9B2-XLf4uT
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:21 +0900
  LV Status              available
  # open                 2
  LV Size                8.00 GiB
  Current LE             2048
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
  
  --- Logical volume ---
  LV Path                /dev/pve/root
  LV Name                root
  VG Name                pve
  LV UUID                LNNM1L-NucB-yxea-4W7E-pEfh-LpYM-0xBeip
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:21 +0900
  LV Status              available
  # open                 1
  LV Size                96.00 GiB
  Current LE             24576
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
  
  --- Logical volume ---
  LV Name                data
  VG Name                pve
  LV UUID                KGD0gn-Iimv-vYD4-hjkr-n2eB-ecM3-ZfAhjp
  LV Write Access        read/write
  LV Creation host, time proxmox, 2019-08-05 07:41:22 +0900
  LV Pool metadata       data_tmeta
  LV Pool data           data_tdata
  LV Status              available
  # open                 7
  LV Size                <349.31 GiB
  Allocated pool data    67.53%
  Allocated metadata     3.69%
  Current LE             89423
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-104-disk-0
  LV Name                vm-104-disk-0
  VG Name                pve
  LV UUID                4jddzv-pFsw-8tmZ-xMxo-D0Id-flfd-45JiwG
  LV Write Access        read/write
  LV Creation host, time pve, 2020-01-01 13:20:44 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                32.00 GiB
  Mapped size            39.96%
  Current LE             8192
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:6
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-104-disk-1
  LV Name                vm-104-disk-1
  VG Name                pve
  LV UUID                pwAOnd-Kyzc-Ktpc-2hVx-CCZQ-erdw-vDlmpJ
  LV Write Access        read/write
  LV Creation host, time pve, 2020-01-01 13:20:44 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Mapped size            3.12%
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:7
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-100-disk-1
  LV Name                vm-100-disk-1
  VG Name                pve
  LV UUID                rcRfZY-boHF-G55k-VhDf-Di2g-1ttD-2x6NLO
  LV Write Access        read/write
  LV Creation host, time pve, 2021-02-26 22:09:59 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                16.00 GiB
  Mapped size            99.11%
  Current LE             4096
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:8
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-105-disk-0
  LV Name                vm-105-disk-0
  VG Name                pve
  LV UUID                OWQIOx-doXE-E2xQ-J5sd-C91k-fMBm-RaCQcE
  LV Write Access        read/write
  LV Creation host, time pve, 2021-03-20 20:16:59 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                128.00 GiB
  Mapped size            98.70%
  Current LE             32768
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:9
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-101-disk-0
  LV Name                vm-101-disk-0
  VG Name                pve
  LV UUID                eE7Dij-wPm8-G4V2-fiL4-yGzv-LlWQ-snhXSF
  LV Write Access        read/write
  LV Creation host, time pve, 2021-04-10 14:44:20 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                4.00 MiB
  Mapped size            3.12%
  Current LE             1
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11
  
  --- Logical volume ---
  LV Path                /dev/pve/vm-101-disk-1
  LV Name                vm-101-disk-1
  VG Name                pve
  LV UUID                KCMfqv-2PWy-iPXN-Eulx-eAPo-3a9G-B6u4Qz
  LV Write Access        read/write
  LV Creation host, time pve, 2021-04-10 14:44:20 +0900
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                128.00 GiB
  Mapped size            63.22%
  Current LE             32768
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:12
 
You should also use smartctl to monitor your system disk and maybe initiate a long SMART self-test. If it is a hardware problem, you might see high error counts there.
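For an NVMe system disk that would be something along these lines (assuming the disk is /dev/nvme0 and a reasonably recent smartmontools, since NVMe self-test support is newer):
Code:
# Health summary: media errors, available spare, percentage used
smartctl -a /dev/nvme0
# Start an extended (long) self-test; re-run smartctl -a later to see the result
smartctl -t long /dev/nvme0
# Equivalent health data via nvme-cli if smartctl cannot talk to the drive
nvme smart-log /dev/nvme0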
 
I tried to run a test with smartctl, but it seems it cannot run self-tests on my NVMe drive.
Moreover, I compared the nvme-cli output across my Proxmox systems:

The problematic system shows this (see 512.11 / 512.11 GB):
Code:
root@pve:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     HBSE39140500891      HP SSD EX950 512GB                       1         512.11  GB / 512.11  GB    512   B +  0 B   R1106C

Another system shows this instead, which makes sense, as it has not used up all of its space (yet).
Code:
root@csct1:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7683D09045     KINGSTON SA2000M81000G                   1         300.25  GB /   1.00  TB    512   B +  0 B   S5Z42105

Yet another system shows this, which again makes sense.
Code:
root@pve:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     50026B7683E319C5     KINGSTON SA2000M81000G                   1          85.10  GB /   1.00  TB    512   B +  0 B   S5Z42105
/dev/nvme1n1     S463NF0K510637F      Samsung SSD 970 PRO 512GB                1         405.26  GB / 512.11  GB    512   B +  0 B   1B2QEXP7
/dev/nvme2n1     50026B728266FA48     KINGSTON SA2000M81000G                   1          94.58  GB /   1.00  TB    512   B +  0 B   S5Z42105

So it does seem I am somehow running out of space on the affected system.
At the moment, the root partition is nearly empty at 7.23% used (7.30 GB of 100.92 GB),
and the thin LVM is only 67.85% used (254.48 GB of 375.07 GB).
It looks like something does not add up.
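One thing I came across is that the Usage column of nvme list is the namespace utilisation (NUSE) reported by the drive itself: it only goes down when blocks are discarded/TRIMmed, and some drives simply report it as equal to the namespace size, so it does not necessarily mean the filesystems are full. In case discards are simply not reaching the SSD, something like this should release the unused blocks (a sketch, assuming ext4 on pve-root and the stock lvm.conf):
Code:
# Trim every mounted filesystem that supports discard
fstrim -av
# Check whether LVM passes discards down when LVs are removed or shrunk
# (issue_discards = 1 in /etc/lvm/lvm.conf)
grep issue_discards /etc/lvm/lvm.conf
# For VM disks, the Discard option must be enabled on the virtual disk in
# Proxmox so that fstrim inside the guest reaches the physical SSD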
 
Update on 2022/01/08: it looks like the system SSD is indeed failing.

I formatted the SSD and installed the latest Proxmox on it.
While I was slowly restoring from my backups, the system froze again and threw the errors below before it died.
It is likely time to replace the drive.

Code:
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:00:00.0
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
Jan 08 21:36:37 csqt kernel: pcieport 0000:00:01.1:    [ 6] BadTLP               
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000081/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 7] BadDLLP               
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1:    [12] Timeout               
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr                 
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Jan 08 21:36:39 csqt kernel: pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:   device [126f:2262] error status/mask=00000001/0000e000
Jan 08 21:36:39 csqt kernel: nvme 0000:01:00.0:    [ 0] RxErr
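For what it's worth, the drive's own error log and health counters can also be checked with nvme-cli before replacing it (assuming the disk is still /dev/nvme0):
Code:
# SMART/health page: media errors, percentage used, available spare
nvme smart-log /dev/nvme0
# Error log entries recorded by the controller
nvme error-log /dev/nvme0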
 
