
Wednesday, July 29, 2015

Diagnose High-Latency I/O Operations Using SystemTap

Topic: This post is about some simple tools and techniques that can be used to drill down into high-latency I/O events using SystemTap probes.

The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures or are about to fail completely.
High-latency I/Os on just a few disks can then appear as latency outliers when accessing volumes built on a large number of disks, and can affect the performance of the entire storage system. I write this with the example in mind of a storage system built with (SATA) JBODs and using Oracle ASM as the volume manager/DB filesystem. However, the main ideas and tools described in this post apply to many other volume managers and filesystems, including HDFS.

The standard tools: One way to find that one or more disks are serving I/O requests with high latency is to use standard Linux tools such as iostat, sar or collectl. Typically you would use those tools to spot anomalous values of average service time, average wait time and queue size.
A structured approach on how to do this is described in Brendan Gregg's USE method, together with the Linux tools that can be used to implement it.

SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that can be used to identify disks serving I/O with high latency, using SystemTap probes to investigate the I/O latency of the block devices.
The script blockio_latency_outliers_per_device.stp measures basic latency statistics per block device, including the number of I/Os and the minimum, average and maximum latency. The script also prints the details of every I/O whose latency is above a configurable threshold (the default threshold is 500000 microseconds, i.e. 500 milliseconds, as in the example below). A minimal sketch of the underlying technique is shown after the example output.
An example of its use is shown below (output edited for clarity). Note the latency warning messages and, in particular, the very large maximum latency measured for the /dev/sdy block device:

[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10

Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop

...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....

I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz                     200            123           5995          27205
sdq                     211             71           6553          31120
sdh                     256            103           6643          22663
sds                     224            101           6610          29743
sdm                     238             92           7550          35571
sde                     243             90           8652          52029
sdt                     200            105           5997          25180
sdk                     200             94           5696          35057
sdi                     206             99           7849          30636
sdg                     269             74           6746          36746
sdy                     197            102          98298        1167977
sdr                     200             89           6559          27873
sdl                     200            140           8789          31996
sdw                     210             99           7009          37118
sdd                     217             94           7440          56071
sdn                     205             99           6628          41339
....
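To illustrate the underlying technique, here is a minimal sketch (not the full script): it records a timestamp at the block_rq_issue kernel tracepoint, computes the latency at block_rq_complete and aggregates it per device. The availability of these tracepoints and the request field accesses (notably $rq->rq_disk->disk_name) depend on the kernel version, so treat them as assumptions; the downloadable script is the reference.

# Minimal sketch: per-device block I/O latency from kernel tracepoints
# Assumptions: the block_rq_issue/block_rq_complete tracepoints are available
# and $rq->rq_disk->disk_name is accessible on the kernel in use.
global issue_time, latency
global threshold = 500000              # warning threshold, microseconds

probe kernel.trace("block_rq_issue") {
    issue_time[$rq] = gettimeofday_us()    # timestamp when the request is issued
}

probe kernel.trace("block_rq_complete") {
    t_issue = issue_time[$rq]
    if (t_issue != 0) {
        delete issue_time[$rq]
        delta = gettimeofday_us() - t_issue
        devname = ($rq->rq_disk != 0) ? kernel_string($rq->rq_disk->disk_name) : "unknown"
        latency[devname] <<< delta         # aggregate latency per device
        if (delta > threshold)
            printf("latency warning, >%d microsec: device=%s, latency=%d\n",
                   threshold, devname, delta)
    }
}

probe timer.s(10) {                        # reporting interval hard-coded here
    printf("\nI/O latency statistics per device, microseconds\n")
    foreach (devname in latency)
        printf("%-12s num=%d min=%d avg=%d max=%d\n", devname,
               @count(latency[devname]), @min(latency[devname]),
               @avg(latency[devname]), @max(latency[devname]))
    delete latency
}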


When candidate disks for high latency have been identified, the second step is to drill down further using the script blockio_rq_issue_filter_latencyhistogram.stp. This script gathers and displays I/O latency histograms for a subset of block devices, which can be specified using filters in the script header. The default filters are listed here, followed by a minimal sketch of how such filters can be combined with a latency histogram:

# SystemTap variables used to define filters, edit as needed
global IO_size = 8192    # this will be used as a filter for the I/O request size
                         # the value 8192 targets 8KB operations for Oracle single-block I/O
                         # use the value -1 to disable this filter
global IO_operation = 0  # this will be used as a filter: only read operations
                         # a value of 0 considers only read operations (the value 1 is for write)
                         # use the value -1 to disable this filter
global IO_devmaj = -1    # this will be used as a filter: device major number (-1 means no filter)
                         # example: use the value 253 for device mapper block devices
global IO_devmin = -1    # device minor number (or -1 if no filter)
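As an illustration of the filtering approach, here is a minimal sketch (not the author's script) that restricts a log2 latency histogram to a single block device selected by major/minor number. The field accesses $rq->rq_disk->major and $rq->rq_disk->first_minor are assumptions that may need adjusting for the kernel in use.

# Minimal sketch: latency histogram restricted to one block device
# Assumptions: the block tracepoints are available and $rq->rq_disk->major
# and $rq->rq_disk->first_minor are accessible on the kernel in use.
global IO_devmaj = 65      # device major number filter (-1 = no filter)
global IO_devmin = 128     # device minor number filter (-1 = no filter)
global issue_time, latency, io_count

probe kernel.trace("block_rq_issue") {
    if ($rq->rq_disk == 0) next
    if (IO_devmaj != -1 && $rq->rq_disk->major != IO_devmaj) next
    if (IO_devmin != -1 && $rq->rq_disk->first_minor != IO_devmin) next
    issue_time[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
    t_issue = issue_time[$rq]
    if (t_issue != 0) {
        delete issue_time[$rq]
        latency <<< gettimeofday_us() - t_issue
        io_count++
    }
}

probe timer.s(10) {
    printf("I/O count in the last 10 seconds: %d\n", io_count)
    if (io_count > 0)
        print(@hist_log(latency))      # latency histogram, microseconds
    delete latency
    io_count = 0
}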

You can use blockio_rq_issue_filter_latencyhistogram.stp to drill down on the latency histogram for the disks that have shown high-latency I/O and also to compare them with "good" disks. In the example above the candidate "trouble disk" is /dev/sdy, with major number 65 and minor number 128 (you can find the major and minor device numbers for "sdy" simply by using ls -l /dev/sdy).
Example:

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 128 (-1=disable filter)

Block I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |@                                                   2
    128 |@@@@@@@                                            14
    256 |@@                                                  4
    512 |@                                                   2
   1024 |@@@                                                 7
   2048 |@@@@@@@@@                                          19
   4096 |@@@@@@@@@@@@@@@@@@                                 37
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      59
  16384 |@@@@@@@                                            14
  32768 |                                                    0
  65536 |@@@@@          HIGH                                10
 131072 |@@@@@          LATENCY                             11
 262144 |@@@@@          I/O                                 10
 524288 |@@@            OPERATIONS                           6
1048576 |@@                                                  4
2097152 |                                                    0


Note the presence of several I/O operations at high latency together with I/Os at more normal latency: for example I/Os served from the SATA disks at around 8 ms and I/Os served from the controller cache at sub-millisecond latency.
For comparison, here below is the latency histogram measured on another disk while the same workload was running. In this case the high-latency points found in the previous example are no longer present: most of the I/O operations take around 8 ms and some operations are served from cache at sub-millisecond latency. The I/Os reported for the /dev/sdy disk with latency of 64 ms and above do not appear here.

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 48 (-1=disable filter)


Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
   32 |                                                    0
   64 |                                                    0
  128 |@@@@@@@@@                                          19
  256 |@@@@@@@@                                           16
  512 |                                                    1
 1024 |@@@@                                                9
 2048 |@@@@@@@@@@@@@@                                     28
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       56
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           49
16384 |@@@@@@@@@                                          18
32768 |                                                    0
65536 |                                                    0

The tests reported here have been performed on an old system running RHEL 5 (kernel 2.6.18-371) and SystemTap v. 1.8; however, the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.

Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers and, in particular, to drill down on I/Os served at high latency because of one or more misbehaving disks in the storage volume/filesystem. The investigation has been done with SystemTap scripts, used first to discover which disks were serving some of their I/Os with high latency and then to drill down on specific devices with latency histograms. Identifying and eventually replacing the badly performing disks can be beneficial for the performance of the entire storage system.

Download: The tools discussed in this post can be downloaded from this webpage and from GitHub.

Acknowledgements and additional links: Brendan Gregg has published an extensive set of articles and tools on performance tuning, including storage latency investigations.
For more information and examples on how to use SystemTap see the SystemTap wiki.
For additional tools for Oracle ASM I/O investigations see also Bertrand Drouvot's asm_metrics.pl.
Kevin Closson's SLOB was used as the workload generator for the examples discussed here.
Additional links on Oracle I/O troubleshooting tools and techniques in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Tuesday, July 14, 2015

Heat Map Visualization of Latency Histograms for NetApp C-Mode

Topic: This post is about collecting and visualizing I/O latency histograms for NetApp filers in C-mode.

Motivations: Drilling down into I/O latency is an important technique for troubleshooting and benchmarking storage. Average latency values can hide details of what is happening on the storage: think for example of storage systems with flash and spindles, each serving I/O at a different latency. Moreover, averaging the measured values over time can hide details in the case of varying workloads or of issues that appear sporadically.

Latency heat maps: Representing latency histograms over time is a 3-D visualization problem and can be nicely solved using heat maps. This has been discussed in detail by Brendan Gregg in ...

PyLatencyMap is a simple visualization tool that works on the command line: it takes latency histograms as input and outputs two latency heat maps using ANSI color codes.
The heat map at the top of the screen uses a blue palette and represents the number of operations per second as a function of the latency bucket and of time. The second map, in a yellow-red palette, represents the time waited as a function of the latency bucket and of time.

How to collect and display I/O latency histograms for NetApp filers

For a quick start you can use Example10_NetApp_Cmode_reads.sh. Before running it you will need to edit the parameters: the password to connect to the storage, the junction point you want to measure, the cluster management node and, optionally, the time interval between data points. Then run the example script. Similarly, Example10b_NetApp_Cmode_writes.sh captures and displays heat maps for write operations, if you prefer those.

This is an example of the output

Figure 1: Example output of PyLatencyMap measuring and displaying I/O read latency histograms using performance counters from the NetApp filer (C-mode). The top heat map (blue palette) represents I/O operation latency over time. The lower heat map (yellow-red palette) represents the time waited per latency bucket.

Additional details

The 'glue' between NetApp and PyLatencyMap is provided by two Perl scripts written by Ruben Gaspar: (1) NetApp_histogram_Cmode.pl connects to the filer and collects the latency histograms; (2) NetApp_latency_connector.pl transforms the histograms from the NetApp filer format into a format understood by PyLatencyMap (i.e. power-of-two histograms).

Note that data aggregation into buckets is done by NetApp_latency_connector.pl using the same convention as Oracle's wait event histograms (v$event_histogram and v$event_histogram_micro), that is, a value of N for bucket B means that there have been N I/O operations with latency between B/2 and B. For example, 100 operations in the 1024-microsecond bucket means 100 I/Os with latency between 512 and 1024 microseconds. For those of you who use DTrace 'quantize' or SystemTap 'hist_log', please note the difference, as in those tools the convention is that a value of N for bucket B means N operations with latency between B and 2*B (a quick SystemTap check of this is shown below).
Note also the following dependencies: the scripts discussed here for the integration between NetApp performance counters and PyLatencyMap have been developed for NetApp Clustered Data ONTAP and in particular have been tested on versions 8.2 and 8.3 (notably they will not work against ONTAP version 7). Perl is also needed, including the modules Net::OpenSSH and IO::Pty. PyLatencyMap is written in Python.
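As a quick way to see the SystemTap convention in practice, the following minimal sketch feeds a single value of 1500 (microseconds, say) into a statistic and prints its log2 histogram: the count appears in the 1024 bucket, i.e. bucket B covers values from B up to (but not including) 2*B.

# Quick check of the SystemTap @hist_log bucket convention: the value 1500
# is counted in the 1024 bucket, i.e. bucket B holds values in [B, 2*B).
global s
probe begin {
    s <<< 1500
    print(@hist_log(s))
    exit()
}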

Conclusions and additional links

PyLatencyMap is a tool for collecting and displaying latency heat maps on the command line, intended as an aid for performance troubleshooting and benchmarking activities. PyLatencyMap has now been integrated with NetApp performance counters, which adds to the existing data sources: Oracle wait events, SystemTap and DTrace probes, among others.

Download PyLatencyMap from:
https://github.com/LucaCanali/PyLatencyMap or http://canali.web.cern.ch/canali/resources.htm 

Previous blog entries on PyLatencyMap:
http://externaltable.blogspot.com/2015/03/heat-map-visualization-for-systemtap.html
http://externaltable.blogspot.com/2013/09/getting-started-with-pylatencymap.html
http://externaltable.blogspot.com/2013/08/pylatencymap-performance-tool-to-drill.html