The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures and/or are going to break completely in the near future.
I/Os of high latency on just a few disks can then appear as latency outliers when accessing volumes build on a large number of disks and overall affect the performance of the entire storage. I write this having in mind the example of a storage system built with (SATA) JBODs using Oracle ASM as volume manager/DB filesystem. However the main ideas and tools described in this post apply to many other volume managers and file systems, including HDFS.
The standard tools: One way to find that one or more disks are serving I/O requests with high latency is with the use of standard Linux tools such as iostat, sar or collectl. Typically you would use those tools to spot anomalous values of average service time, average wait time and also of queue size.
A structured approach on how to do this is described in Brendan Gregg's USE method and the tools that can be used in Linux to implement it.
SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that can be used to identify disks serving I/O with high latency using SystemTap probes to investigate I/O latency of the block devices.
The script blockio_latency_outliers_per_device.stp provides a measurement of some basics latency statistics for block device, including number of I/Os, average and maximum latency. The script also provides details of all the I/O where the latency is above a certain programmable threshold (the default threshold is set at 500 microseconds).
An example of its use (edited output for clarity) is here below. Note the latency warning message and overall the very large maximum value measured for the latency of the /dev/sdy block device:
[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10
Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop
...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....
I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz 200 123 5995 27205
sdq 211 71 6553 31120
sdh 256 103 6643 22663
sds 224 101 6610 29743
sdm 238 92 7550 35571
sde 243 90 8652 52029
sdt 200 105 5997 25180
sdk 200 94 5696 35057
sdi 206 99 7849 30636
sdg 269 74 6746 36746
sdy 197 102 98298 1167977
sdr 200 89 6559 27873
sdl 200 140 8789 31996
sdw 210 99 7009 37118
sdd 217 94 7440 56071
sdn 205 99 6628 41339
....
sdz 200 123 5995 27205
sdq 211 71 6553 31120
sdh 256 103 6643 22663
sds 224 101 6610 29743
sdm 238 92 7550 35571
sde 243 90 8652 52029
sdt 200 105 5997 25180
sdk 200 94 5696 35057
sdi 206 99 7849 30636
sdg 269 74 6746 36746
sdy 197 102 98298 1167977
sdr 200 89 6559 27873
sdl 200 140 8789 31996
sdw 210 99 7009 37118
sdd 217 94 7440 56071
sdn 205 99 6628 41339
....
# SystemTap variables used to define filters, edit as needed
global IO_size = 8192 # this will be used as a filter for the I/O request size
# the value 8192 targets 8KB operations for Oracle single-block I/O
# use the value -1 to disable this filter
global IO_operation = 0 # this will be used as a filter: only read operations
# a value of 0 considers only read operations (the value 1 is for write)
# use the value -1 to disable this filter
global IO_devmaj = -1 # this will be used as a filter: device major number (-1 means no filter)
# example: use the value 253 for device mapper block devices
global IO_devmin = -1 # device minor number (or -1 if no filter)
Example:
stap -v blockio_rq_issue_filter_latencyhistogram.stp 10
Block I/O latency histograms from kernel trace points
Filters:
IO_size = 8192
IO_operation = 0 (0=read, 1=write, -1=disable filter)
IO_devmaj = 65 (-1=disable filter)
IO_devmin = 128 (-1=disable filter)
lock I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
16 | 0
32 | 0
64 |@ 2
128 |@@@@@@@ 14
256 |@@ 4
512 |@ 2
1024 |@@@ 7
2048 |@@@@@@@@@ 19
4096 |@@@@@@@@@@@@@@@@@@ 37
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 59
16384 |@@@@@@@ 14
32768 | 0
65536 |@@@@@ HIGH 10
131072 |@@@@@ LATENCY 11
262144 |@@@@@ I/O 10
524288 |@@@ OPERATIONS 6
1048576 |@@ 4
2097152 | 0
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
16 | 0
32 | 0
64 |@ 2
128 |@@@@@@@ 14
256 |@@ 4
512 |@ 2
1024 |@@@ 7
2048 |@@@@@@@@@ 19
4096 |@@@@@@@@@@@@@@@@@@ 37
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 59
16384 |@@@@@@@ 14
32768 | 0
65536 |@@@@@ HIGH 10
131072 |@@@@@ LATENCY 11
262144 |@@@@@ I/O 10
524288 |@@@ OPERATIONS 6
1048576 |@@ 4
2097152 | 0
For comparison here below is the latency histogram measured on another disk while the same workload was running. We can see that in this case high latency points found in the previous example are no more present. Most of the I/O operations are around 8 ms latency and some operations served from cache at sub-millisecond latency. The I/O reported for the /dev/sdy disks with latency of 64 ms and above are not present in this case.
stap -v blockio_rq_issue_filter_latencyhistogram.stp 10
Block I/O latency histograms from kernel trace points
Filters:
IO_size = 8192
IO_operation = 0 (0=read, 1=write, -1=disable filter)
IO_devmaj = 65 (-1=disable filter)
IO_devmin = 48 (-1=disable filter)
Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
32 | 0
64 | 0
128 |@@@@@@@@@ 19
256 |@@@@@@@@ 16
512 | 1
1024 |@@@@ 9
2048 |@@@@@@@@@@@@@@ 28
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 56
8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 49
16384 |@@@@@@@@@ 18
32768 | 0
65536 | 0
The tests reported here have been performed on a old system running RHEL 5 (kernel 2.6.18-371) and SystemTap v. 1.8, however the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.
Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers and in particular on how to drill down on I/Os served at high latency because of one or more misbehaving disk in the storage volume/filesystem. The investigation has been done using SystemTap scripts used first to discover which disks were serving some of their I/O s with high latency and then drilling down on specific devices with the use of latency histograms. The fact of identifying and later replacing the badly performing disks can be beneficial for the performance of the entire storage systems.
Download: The tools discussed in this post can be downloaded from this webpage and from Github.
For more information and examples on how to use SystemTap see the SystemTap wiki
Additional tools for Oracle ASM I/O investigations also see Bertrand Drouvot's asm_metrics.pl
Kevin Closson's SLOB was used as workload generator for the examples discussed here.
Additional links on tools and techniques on Oracle I/O troubleshooting in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

