Blocked tasks in dmesg and kern.log

Petr.114

Member
Jun 25, 2019
28
1
8
30
Hello,
we are having problems with proxmox servers, our server sometimes freeze or something and does not work for like 2-10 minutes until he unfreeze and continue work...our monitoring system is reporting these freezes, because the server does not respond to ping.

In log we found "INFO: task pvesr:18653 blocked for more than 241 seconds", "INFO: task z_wr_iss_h:350 blocked for more than 120 seconds" and some others.

And some of these errors ("info" messages) are corellating with servers not responding to ping.

Code:
root@prox4:~$ ./greplogs.sh blocked
/var/log/kern.log.3.gz:Dec 15 07:18:57 prox4 kernel: [ 2780.154332] INFO: task pvesr:18653 blocked for more than 120 seconds.
/var/log/kern.log.3.gz:Dec 15 07:20:58 prox4 kernel: [ 2900.980928] INFO: task pvesr:18653 blocked for more than 241 seconds.
/var/log/kern.log.3.gz:Dec 16 13:05:15 prox4 kernel: [109953.120197] INFO: task txg_sync:534 blocked for more than 241 seconds.
/var/log/kern.log.3.gz:Dec 16 13:05:15 prox4 kernel: [109953.120524] INFO: task zabbix_agentd:1866 blocked for more than 120 seconds.
/var/log/kern.log.3.gz:Dec 16 13:05:15 prox4 kernel: [109953.120964] INFO: task netdata:4174 blocked for more than 120 seconds.
10

Time​
Recovery time​
Status​
Host​
Problem​
Duration​
2020-12-16 13:01:42​
2020-12-16 13:06:32​
RESOLVED​
PV - Proxmox 4​
Unavailable by ICMP ping​
4m 50s​

Iam attaching dmesg with all the details.

Thanks in advance, will be glad for any advice.
 

Attachments

  • dmesg.txt
    82.5 KB · Views: 6
Last edited:

oguz

Proxmox Staff Member
Staff member
Nov 19, 2018
4,890
615
118
hi,

it sounds like resource starvation. have you checked your memory usage, cpu load etc?
 

mira

Proxmox Staff Member
Staff member
Aug 1, 2018
1,631
156
68
Please provide the output of zpool status and some infos about your disks (model especially).
 

Petr.114

Member
Jun 25, 2019
28
1
8
30
Hi, here are details you asked for.

root@prox4:~$ zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 0 days 00:17:37 with 0 errors on Sun Dec 13 00:41:38 2020
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J312822-part3 ONLINE 0 0 0
ata-SAMSUNG_MZ7LM960HMJP-00005_S2TZNX0J312685-part3 ONLINE 0 0 0

errors: No known data errors

root@prox4:~$ hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: SAMSUNG MZ7LM960HMJP-00005
Serial Number: S2TZNX0J312822
Firmware Revision: GXT5204Q
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SAT A Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x0039)
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1875385008
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 915715 MBytes
device size with M = 1000*1000: 960197 MBytes (960 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Advanced power management level: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
* Advanced Power Management feature set
Power-Up In Standby feature set
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
unknown 119[7]
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Asynchronous notification (eg. media change)
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* SANITIZE_ANTIFREEZE_LOCK_EXT command
* SANITIZE feature set
* CRYPTO_SCRAMBLE_EXT command
* BLOCK_ERASE_EXT command
* reserved 69[4]
* DOWNLOAD MICROCODE DMA command
* SET MAX SETPASSWORD/UNLOCK DMA commands
* WRITE BUFFER DMA command
* READ BUFFER DMA command
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
32min for SECURITY ERASE UNIT. 32min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5002538c4058367f
NAA : 5
IEEE OUI : 002538
Unique ID : c4058367f
Checksum: correct

On another servers we have similar issues with blocked tasks with different Samsung disks
Samsung_SSD_860_PRO_1TB and Samsung_SSD_860_EVO_1TB

About resources, nothing looks unusual...on server with 80%+ RAM and 50-70% CPU wearout there are no errors but on server with 60% RAM and 30% CPU wearout, there are errors...

Code:
root@prox1:~$ ./greplogs.sh blocked
/var/log/kern.log.2.gz:Dec 22 21:15:14 prox1 kernel: [6085698.338877] INFO: task nmbd:26877 blocked for more than 120 seconds. *-souhlasí se záznamem výpadku v zabbixu*
/var/log/kern.log.3.gz:Dec 14 14:23:05 prox1 kernel: [5369765.454110] INFO: task pvesr:21953 blocked for more than 120 seconds.
/var/log/kern.log.3.gz:Dec 15 07:20:05 prox1 kernel: [5430786.484824] INFO: task pvesr:16808 blocked for more than 241 seconds.
0


root@prox5:~$ ./greplogs.sh blocked
/var/log/kern.log.3.gz:Dec 16 12:57:31 prox5 kernel: [109955.118973] INFO: task pvesr:18948 blocked for more than 120 seconds.
/var/log/kern.log.4.gz:Dec 10 00:05:33 prox5 kernel: [1415605.203249] INFO: task z_unlinked_drai:404 blocked for more than 241 seconds.
/var/log/kern.log.4.gz:Dec 10 00:05:33 prox5 kernel: [1415605.203673] INFO: task zfs:30344 blocked for more than 241 seconds.
/var/log/kern.log.4.gz:Dec 10 00:09:35 prox5 kernel: [1415846.856159] INFO: task z_unlinked_drai:404 blocked for more than 483 seconds.
/var/log/kern.log.4.gz:Dec 10 00:09:35 prox5 kernel: [1415846.856556] INFO: task zfs:30344 blocked for more than 483 seconds.
10


root@pve:~$ ./greplogs.sh blocked
/var/log/kern.log.3.gz:Dec 15 07:32:49 pve kernel: [ 3384.501696] INFO: task pvestatd:2477 blocked for more than 120 seconds. *-výpadek v zabbixu je zaznamenán o více jak hodinu dříve*
/var/log/kern.log.3.gz:Dec 15 15:50:15 pve kernel: [33230.005719] INFO: task kworker/5:0:22249 blocked for more than 120 seconds. *-souhlasí se záznamem výpadku v zabbixu*
/var/log/kern.log.3.gz:Dec 16 12:58:59 pve kernel: [109354.165717] INFO: task pvesr:4841 blocked for more than 120 seconds.
/var/log/kern.log.3.gz:Dec 16 13:01:00 pve kernel: [109474.997714] INFO: task pvesr:4841 blocked for more than 241 seconds.
6

[657564.525793] INFO: task zvol:283 blocked for more than 120 seconds.
[657564.525798] Tainted: P W O 5.4.78-2-pve #1
[657564.525799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657564.525801] zvol D 0 283 2 0x80004000
[657564.525803] Call Trace:
[657564.525808] __schedule+0x2e6/0x6f0
[657564.525810] schedule+0x33/0xa0
[657564.525811] schedule_timeout+0x152/0x330
[657564.525814] ? __next_timer_interrupt+0xd0/0xd0
[657564.525816] io_schedule_timeout+0x1e/0x50
[657564.525824] __cv_timedwait_common+0x12f/0x170 [spl]
[657564.525825] ? wait_woken+0x80/0x80
[657564.525830] __cv_timedwait_io+0x19/0x20 [spl]
[657564.525880] zio_wait+0x130/0x270 [zfs]
[657564.525904] ? dbuf_rele+0x38/0x40 [zfs]
[657564.525926] dmu_tx_count_write+0x17a/0x1b0 [zfs]
[657564.525947] dmu_tx_hold_write+0x41/0x60 [zfs]
[657564.525974] zvol_write+0x165/0x620 [zfs]
[657564.525978] taskq_thread+0x2ec/0x4d0 [spl]
[657564.525981] ? wake_up_q+0x80/0x80
[657564.525983] kthread+0x120/0x140
[657564.525986] ? task_done+0xb0/0xb0 [spl]
[657564.525988] ? kthread_park+0x90/0x90
[657564.525990] ret_from_fork+0x35/0x40

[657564.525999] INFO: task txg_sync:566 blocked for more than 120 seconds.
[657564.526001] Tainted: P W O 5.4.78-2-pve #1
[657564.526003] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657564.526004] txg_sync D 0 566 2 0x80004000
[657564.526005] Call Trace:
[657564.526007] __schedule+0x2e6/0x6f0
[657564.526008] ? __switch_to_asm+0x34/0x70
[657564.526009] schedule+0x33/0xa0
[657564.526010] schedule_timeout+0x152/0x330
[657564.526011] ? __switch_to_asm+0x40/0x70
[657564.526013] ? __next_timer_interrupt+0xd0/0xd0
[657564.526014] io_schedule_timeout+0x1e/0x50
[657564.526017] __cv_timedwait_common+0x12f/0x170 [spl]
[657564.526018] ? wait_woken+0x80/0x80
[657564.526020] __cv_timedwait_io+0x19/0x20 [spl]
[657564.526047] zio_wait+0x130/0x270 [zfs]
[657564.526048] ? _cond_resched+0x22/0x30
[657564.526071] dsl_pool_sync+0xdc/0x500 [zfs]
[657564.526097] spa_sync+0x5a7/0xfa0 [zfs]
[657564.526102] ? _cond_resched+0x19/0x30
[657564.526129] ? spa_txg_history_init_io+0x104/0x110 [zfs]
[657564.526156] txg_sync_thread+0x2d6/0x480 [zfs]
[657564.526183] ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[657564.526186] thread_generic_wrapper+0x74/0x90 [spl]
[657564.526188] kthread+0x120/0x140
[657564.526190] ? __thread_exit+0x20/0x20 [spl]
[657564.526192] ? kthread_park+0x90/0x90
[657564.526193] ret_from_fork+0x35/0x40

[657564.526205] INFO: task kvm:28679 blocked for more than 120 seconds.
[657564.526207] Tainted: P W O 5.4.78-2-pve #1
[657564.526208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657564.526209] kvm D 0 28679 1 0x00004000
[657564.526210] Call Trace:
[657564.526211] __schedule+0x2e6/0x6f0
[657564.526212] schedule+0x33/0xa0
[657564.526213] io_schedule+0x16/0x40
[657564.526216] wait_on_page_bit+0x141/0x210
[657564.526218] ? file_fdatawait_range+0x30/0x30
[657564.526219] wait_on_page_writeback+0x43/0x90
[657564.526221] __filemap_fdatawait_range+0xae/0x120
[657564.526223] file_write_and_wait_range+0xa0/0xc0
[657564.526225] blkdev_fsync+0x1b/0x50
[657564.526227] vfs_fsync_range+0x48/0x80
[657564.526229] ? __fget_light+0x59/0x70
[657564.526229] do_fsync+0x3d/0x70
[657564.526230] __x64_sys_fdatasync+0x17/0x20
[657564.526233] do_syscall_64+0x57/0x190
[657564.526234] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657564.526235] RIP: 0033:0x7ff515eca2e7
[657564.526239] Code: Bad RIP value.
[657564.526240] RSP: 002b:00007ff3f8bfa4c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[657564.526241] RAX: ffffffffffffffda RBX: 000000000000001d RCX: 00007ff515eca2e7
[657564.526241] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d
[657564.526242] RBP: 00007ff508e6d030 R08: 0000000000000000 R09: 00000000ffffffff
[657564.526242] R10: 00007ff3f8bfa4b0 R11: 0000000000000293 R12: 0000555ba833945a
[657564.526242] R13: 00007ff508e6d098 R14: 00007ff5088d0f50 R15: 00007ff50891ce10

[657564.526258] INFO: task rs:main Q:Reg:30933 blocked for more than 120 seconds.
[657564.526260] Tainted: P W O 5.4.78-2-pve #1
[657564.526261] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657564.526262] rs:main Q:Reg D 0 30933 29594 0x00000100
[657564.526263] Call Trace:
[657564.526264] __schedule+0x2e6/0x6f0
[657564.526265] schedule+0x33/0xa0
[657564.526266] schedule_timeout+0x152/0x330
[657564.526269] ? taskq_member+0x18/0x30 [spl]
[657564.526270] ? __next_timer_interrupt+0xd0/0xd0
[657564.526271] io_schedule_timeout+0x1e/0x50
[657564.526273] __cv_timedwait_common+0x12f/0x170 [spl]
[657564.526274] ? wait_woken+0x80/0x80
[657564.526276] __cv_timedwait_io+0x19/0x20 [spl]
[657564.526304] zio_wait+0x130/0x270 [zfs]
[657564.526320] dbuf_read+0x4e1/0xb80 [zfs]
[657564.526340] dmu_tx_check_ioerr+0x68/0xd0 [zfs]
[657564.526360] dmu_tx_count_write+0x6f/0x1b0 [zfs]
[657564.526379] dmu_tx_hold_write_by_dnode+0x3a/0x50 [zfs]
[657564.526405] zfs_write+0x43d/0xf20 [zfs]
[657564.526407] ? prepend_path.isra.4+0xfc/0x320
[657564.526434] zpl_write_common_iovec+0xa3/0xf0 [zfs]
[657564.526460] zpl_iter_write+0xee/0x130 [zfs]
[657564.526462] new_sync_write+0x125/0x1c0
[657564.526463] __vfs_write+0x29/0x40
[657564.526464] vfs_write+0xab/0x1b0
[657564.526465] ksys_write+0x61/0xe0
[657564.526466] __x64_sys_write+0x1a/0x20
[657564.526467] do_syscall_64+0x57/0x190
[657564.526468] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657564.526469] RIP: 0033:0x7f9e5d3a01ad
[657564.526471] Code: Bad RIP value.
[657564.526471] RSP: 002b:00007f9e5ad78590 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[657564.526472] RAX: ffffffffffffffda RBX: 00007f9e5000a8c0 RCX: 00007f9e5d3a01ad
[657564.526472] RDX: 000000000000059c RSI: 00007f9e5000a8c0 RDI: 0000000000000008
[657564.526473] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f9e5000a8c0
[657564.526473] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f9e5000e5b0
[657564.526473] R13: 00007f9e5ad785b0 R14: 0000562a6220dce0 R15: 000000000000059c
 
Last edited:

mira

Proxmox Staff Member
Staff member
Aug 1, 2018
1,631
156
68
So other servers have the same problems?
Could you provide some info about the hardware used? Do all of the servers have the same hardware?
How are the disks attached?
 

Petr.114

Member
Jun 25, 2019
28
1
8
30
Some of them do have same problems (problems happens 100% on servers which contain logs about blocked tasks from previous post). And some have even same hardware.

Disks are mostly in hotswap front positions, but some of the servers does not have hotswap positions, so its connected internally via standard sata cable+power

prox1+prox3
Supermicro X10SLM-F
Intel Xeon E3-1220 v3
32GB ECC DDR3 Samsung M391B1G73QH0-CK0

prox4+prox5
Supermicro X9SCL/X9SCM
Intel Xeon E3-1220 v2
prox5 32 GB ECC DDR3 Transcend TS1GLK72V3H; prox4 16GB ECC DDR3 without vendor info

prox2
Lenovo ThinkServer RS140
Intel Xeon E3-1246 v3
32GB ECC DDR3 Micron 18KSF1G72AZ-1G6P1

pve (this one is built from desktop components)
Gigabyte X79-UD3
Intel Core i7-3930K
32GB DDR3 Kingston 99U5471-039.A

If you would like to know any other hardware info, let me know...i will attach just one full "lshw" commant output from prox1.
 

Attachments

  • root@prox1 lshw.txt
    28.5 KB · Views: 1

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!