Preliminary cause analysis of the incident: while Proxmox VE was running a backup job to an SMB/CIFS storage, the SMB/CIFS mount went down, which triggered endless read and write operations on the local disk. At the same time, it also triggered gvt-related errors.
Time window of the incident: `journalctl --since "2024-05-12 03:00" --until "2024-05-12 11:00"`
Direct impact of the incident:
1. Proxmox VE service interruption
2. Proxmox VE could not be rebooted and had to be forcibly powered off
3. The local-thin (NVMe) disk consumed approximately 22 TB of its write endurance (about 5 TB had been used previously)
ServerInfo
Hardware configuration: ThinkStation P340 (W480)
Code:
root@pve01
----------
OS: Proxmox VE 8.2.2 x86_64
Host: 30DHCTO1WW ThinkStation P340
Kernel: 6.8.4-3-pve
Uptime: 41 mins
Packages: 864 (dpkg)
Shell: bash 5.2.15
Resolution: 1920x1080
Terminal: /dev/pts/0
CPU: Intel Xeon W-1290 (20) @ 5.200GHz
GPU: NVIDIA Quadro P400
GPU: Intel Comet Lake-S GT2 [UHD Graphics P630]
Memory: 26540MiB / 64010MiB
Partial log information
Code:
...
May 12 03:01:07 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 139 (00:00:05)
May 12 03:01:07 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 150 (qemu)
May 12 03:01:21 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 150 (00:00:14)
May 12 03:01:21 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 161 (qemu)
May 12 03:01:27 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 161 (00:00:06)
May 12 03:01:28 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 300 (lxc)
May 12 03:01:28 pve01 dmeventd[987]: No longer monitoring thin pool pve-data-tpool.
May 12 03:01:28 pve01 dmeventd[987]: Monitoring thin pool pve-data-tpool.
May 12 03:01:28 pve01 kernel: EXT4-fs (dm-79): write access unavailable, skipping orphan cleanup
May 12 03:01:28 pve01 kernel: EXT4-fs (dm-79): mounted filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f ro without journal. Quota mode: none.
May 12 03:01:36 pve01 kernel: EXT4-fs (dm-79): unmounting filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f.
May 12 03:01:36 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 300 (00:00:08)
May 12 03:01:36 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 301 (lxc)
May 12 03:01:37 pve01 dmeventd[987]: No longer monitoring thin pool pve-data-tpool.
May 12 03:01:37 pve01 dmeventd[987]: Monitoring thin pool pve-data-tpool.
May 12 03:01:37 pve01 kernel: EXT4-fs (dm-79): write access unavailable, skipping orphan cleanup
May 12 03:01:37 pve01 kernel: EXT4-fs (dm-79): mounted filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f ro without journal. Quota mode: none.
May 12 03:01:45 pve01 kernel: EXT4-fs (dm-79): unmounting filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f.
May 12 03:01:45 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 301 (00:00:09)
May 12 03:01:45 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 302 (lxc)
May 12 03:01:45 pve01 dmeventd[987]: No longer monitoring thin pool pve-data-tpool.
May 12 03:01:45 pve01 dmeventd[987]: Monitoring thin pool pve-data-tpool.
May 12 03:01:45 pve01 kernel: EXT4-fs (dm-79): write access unavailable, skipping orphan cleanup
May 12 03:01:45 pve01 kernel: EXT4-fs (dm-79): mounted filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f ro without journal. Quota mode: none.
May 12 03:01:53 pve01 kernel: EXT4-fs (dm-79): unmounting filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f.
May 12 03:01:53 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 302 (00:00:08)
May 12 03:01:53 pve01 pvescheduler[1495573]: INFO: Starting Backup of VM 303 (lxc)
May 12 03:01:54 pve01 dmeventd[987]: No longer monitoring thin pool pve-data-tpool.
May 12 03:01:54 pve01 dmeventd[987]: Monitoring thin pool pve-data-tpool.
May 12 03:01:54 pve01 kernel: EXT4-fs (dm-79): write access unavailable, skipping orphan cleanup
May 12 03:01:54 pve01 kernel: EXT4-fs (dm-79): mounted filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f ro without journal. Quota mode: none.
May 12 03:02:01 pve01 kernel: EXT4-fs (dm-79): unmounting filesystem eb2f5ff8-7b91-4809-a123-5556b3091e8f.
May 12 03:02:01 pve01 pvescheduler[1495573]: INFO: Finished Backup of VM 303 (00:00:08)
May 12 03:02:02 pve01 pvescheduler[1495573]: INFO: Backup job finished successfully
May 12 03:02:02 pve01 pvescheduler[1495572]: INFO: got global lock
May 12 03:02:02 pve01 postfix/pickup[1438881]: 0911A4C13E4: uid=0 from=<root>
May 12 03:02:02 pve01 pvescheduler[1495572]: INFO: starting new backup job: vzdump 150 --notes-template '{{guestname}}' --compress zstd --mailnotification failure --mailto im>
May 12 03:02:02 pve01 postfix/cleanup[1499399]: 0911A4C13E4: message-id=<20240511190202.0911A4C13E4@pve01.insilen.com>
May 12 03:02:02 pve01 postfix/qmgr[2056]: 0911A4C13E4: from=<root@pve01.insilen.com>, size=65744, nrcpt=1 (queue active)
May 12 03:02:02 pve01 pvescheduler[1495572]: INFO: Starting Backup of VM 150 (qemu)
...
May 12 03:02:34 pve01 smartd[1601]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
May 12 03:02:34 pve01 smartd[1601]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 36 to 37
May 12 03:03:23 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:03:28 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:03:29 pve01 pvestatd[2073]: status update time (10.371 seconds)
May 12 03:03:34 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:03:39 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:03:39 pve01 pvestatd[2073]: status update time (10.362 seconds)
May 12 03:03:41 pve01 kernel: CIFS: VFS: \\192.168.20.50 sends on sock 00000000876a49bc stuck for 15 seconds
May 12 03:03:44 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:03:49 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:03:49 pve01 pvestatd[2073]: status update time (10.365 seconds)
May 12 03:03:54 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:00 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:00 pve01 pvestatd[2073]: status update time (10.365 seconds)
May 12 03:04:05 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:10 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:10 pve01 pvestatd[2073]: status update time (10.368 seconds)
May 12 03:04:15 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:20 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:20 pve01 pvestatd[2073]: status update time (10.364 seconds)
May 12 03:04:26 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:31 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:31 pve01 pvestatd[2073]: status update time (10.355 seconds)
May 12 03:04:36 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:41 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:41 pve01 pvestatd[2073]: status update time (10.377 seconds)
May 12 03:04:42 pve01 kernel: CIFS: VFS: \\192.168.20.50 sends on sock 00000000876a49bc stuck for 15 seconds
May 12 03:04:46 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:04:51 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:51 pve01 pvestatd[2073]: status update time (10.368 seconds)
May 12 03:04:57 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:04:57 pve01 kernel: CIFS: VFS: \\192.168.20.50 sends on sock 00000000876a49bc stuck for 15 seconds
May 12 03:05:02 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:05:02 pve01 pvestatd[2073]: status update time (10.358 seconds)
May 12 03:05:02 pve01 pvescheduler[1495572]: Warning: unable to close filehandle GEN124 properly: Host is down at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 980.
May 12 03:05:07 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 03:05:08 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 03:05:12 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 03:05:12 pve01 pvestatd[2073]: status update time (10.361 seconds)
May 12 03:05:13 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 03:05:13 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 03:05:17 pve01 pvestatd[2073]: storage 'Snapshot' is not online
...
May 12 04:09:10 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 04:09:13 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:13 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:15 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 04:09:15 pve01 pvestatd[2073]: status update time (9.104 seconds)
May 12 04:09:18 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:18 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:21 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 04:09:23 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:23 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:26 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 04:09:26 pve01 pvestatd[2073]: status update time (10.367 seconds)
May 12 04:09:28 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:28 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2000
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2010
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2020
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2030
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2040
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2050
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2060
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2070
May 12 04:09:29 pve01 kernel: gvt: guest page write error, gpa 23a8e2080
...
May 12 05:05:27 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:27 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:32 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 05:05:32 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:32 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:37 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 05:05:37 pve01 pvestatd[2073]: status update time (10.440 seconds)
May 12 05:05:37 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:37 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:42 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:42 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:42 pve01 pvestatd[2073]: storage 'Synology' is not online
May 12 05:05:44 pve01 pvescheduler[1584491]: INFO: Finished Backup of VM 103 (00:05:44)
May 12 05:05:44 pve01 pvescheduler[1584491]: INFO: Starting Backup of VM 104 (qemu)
May 12 05:05:44 pve01 systemd[1]: Started 104.scope.
May 12 05:05:45 pve01 kernel: tap104i0: entered promiscuous mode
May 12 05:05:45 pve01 ovs-vsctl[1589587]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap104i0
May 12 05:05:45 pve01 ovs-vsctl[1589587]: ovs|00002|db_ctl_base|ERR|no port named tap104i0
May 12 05:05:45 pve01 ovs-vsctl[1589588]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln104i0
May 12 05:05:45 pve01 ovs-vsctl[1589588]: ovs|00002|db_ctl_base|ERR|no port named fwln104i0
May 12 05:05:45 pve01 kernel: vmbr1: port 1(tap104i0) entered blocking state
May 12 05:05:45 pve01 kernel: vmbr1: port 1(tap104i0) entered disabled state
May 12 05:05:45 pve01 kernel: tap104i0: entered allmulticast mode
May 12 05:05:45 pve01 kernel: vmbr1: port 1(tap104i0) entered blocking state
May 12 05:05:45 pve01 kernel: vmbr1: port 1(tap104i0) entered forwarding state
May 12 05:05:46 pve01 pveproxy[1514780]: worker exit
May 12 05:05:46 pve01 pveproxy[2135]: worker 1514780 finished
May 12 05:05:46 pve01 pveproxy[2135]: starting 1 worker(s)
May 12 05:05:46 pve01 pveproxy[2135]: worker 1589622 started
May 12 05:05:47 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 05:05:47 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:47 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:47 pve01 pvestatd[2073]: status update time (10.411 seconds)
May 12 05:05:52 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:52 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:53 pve01 pvestatd[2073]: storage 'Snapshot' is not online
May 12 05:05:57 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:57 pve01 kernel: CIFS: VFS: No writable handle in writepages rc=-9
May 12 05:05:58 pve01 pvestatd[2073]: storage 'Synology' is not online
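To get a feel for how long the CIFS errors persisted, the journal excerpt above can be summarized per hour. The following is only a minimal sketch: `count_by_hour` is a hypothetical helper name, and it assumes the default syslog-style timestamp format shown in the log (`May 12 03:05:08 ...`).

```shell
#!/usr/bin/env bash
# Count journal lines containing a fixed string, grouped by hour.
# Reads syslog-style lines ("May 12 03:05:08 host ...") from stdin.
# count_by_hour is a hypothetical helper, not part of Proxmox VE.
count_by_hour() {
    local pattern="$1"
    grep -F -- "$pattern" \
        | awk '{ split($3, t, ":"); n[$1 " " $2 " " t[1] ":00"]++ }
               END { for (k in n) print n[k], k }' \
        | sort -k2,4
}

# Example, to be run on the affected host:
#   journalctl --since "2024-05-12 03:00" --until "2024-05-12 11:00" \
#     | count_by_hour "No writable handle in writepages"
```

Each output line has the form `<count> <month> <day> <hour>:00`. The same helper can be pointed at other messages from the log, such as `is not online` or `gvt: guest page write error`.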
Almost every VPS showed fully saturated disk I/O at the time of the incident
Disk lifespan (approximately 5 TB of writes had been used before the fault occurred)
Attempts to restart failed
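For anyone who wants to check the write counter on their own NVMe disk, the SMART health log reports "Data Units Written", where one unit is 1000 × 512 bytes. The sketch below converts such a counter to decimal terabytes. `data_units_to_tb` is a hypothetical helper, and both the device path and the sample counter value in the comments are assumptions for illustration, not values taken from this host.

```shell
#!/usr/bin/env bash
# Convert an NVMe "Data Units Written" counter to terabytes (10^12 bytes).
# One data unit is 1000 * 512 bytes = 512,000 bytes.
# data_units_to_tb is a hypothetical helper name.
data_units_to_tb() {
    LC_ALL=C awk -v units="$1" 'BEGIN { printf "%.2f\n", units * 512000 / 1e12 }'
}

# Example (assumes the disk is /dev/nvme0; adjust to the actual device):
#   smartctl -A /dev/nvme0 | grep 'Data Units Written'
#   data_units_to_tb 42968750    # an illustrative counter value, about 22 TB
```

Comparing the counter before and after an incident gives the amount written during it.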
Code:
root@pve01:~# reboot
Failed to set wall message, ignoring: Transport endpoint is not connected
Call to Reboot failed: Transport endpoint is not connected
root@pve01:~# systemctl restart rsyslog
Failed to restart rsyslog.service: Unit rsyslog.service not found.
root@pve01:~# shutdown
Failed to set wall message, ignoring: Transport endpoint is not connected
Failed to schedule shutdown: Transport endpoint is not connected
root@pve01:~# init
init: required argument missing.
root@pve01:~# systemctl status
● pve01
State: stopping
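The `Transport endpoint is not connected` errors above suggest that `reboot` and `shutdown` could no longer reach the service manager. In that situation, one possible alternative to cutting power is the kernel's magic SysRq interface, which does not go through systemd. This is only a sketch and was not tried on this host: `emergency_reboot` is a hypothetical wrapper, it must be run as root, and it skips a clean shutdown, so running guests are killed.

```shell
#!/usr/bin/env bash
# Last-resort reboot via /proc/sysrq-trigger, for when systemctl cannot
# reach PID 1. Refuses to do anything unless called with --yes.
# emergency_reboot is a hypothetical wrapper name.
emergency_reboot() {
    if [ "$1" != "--yes" ]; then
        echo "refusing: pass --yes to sync, remount read-only and reboot" >&2
        return 1
    fi
    echo s > /proc/sysrq-trigger    # request an emergency sync of dirty data
    sleep 2
    echo u > /proc/sysrq-trigger    # remount filesystems read-only
    sleep 2
    echo b > /proc/sysrq-trigger    # reboot immediately, no clean shutdown
}

# Another option that may work in this state: systemctl --force --force reboot
```

The `--yes` guard is there so that sourcing the file or calling the function by accident does nothing.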