
Tuesday, March 29, 2016

SystemTap Guru Mode and Oracle SQL Parsing

Introduction and motivations

SystemTap and dynamic tracing tools in general give administrators great control over their systems, with relatively little additional effort needed to learn the new tools. In this post you will see how SystemTap can be used to modify data on the fly at runtime. The outcome is a form of "live patching". Examples are provided on how to apply these ideas to Oracle SQL parsing functionality. This type of "guru mode" use of SystemTap is a corner case, but I believe it is important to know that such techniques exist and how they can be deployed, also because they can be implemented with just a few lines of code.

SystemTap has been successfully used for emergency security band aid of Linux systems for many years, see this presentation by Frank Ch. Eigler for full details. See also an example of how these techniques have been used in practice, described in the CERN openlab 2013 summer student lecture "SystemTap: Patching the Linux kernel on the fly".

This post is about applying the techniques and ideas of "live patching on the fly by data modification" to closed-source applications, when debuginfo is not available, and in particular to Oracle. The post is structured around three examples of increasing complexity on how to hook into and change the behavior of Oracle SQL hard parsing. Some of the topics addressed in the examples are:
  • how to find the relevant function(s) to hook SystemTap to
  • how to write into userspace memory with SystemTap probes
  • how to modify CPU registers with SystemTap probes

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox as they are unsupported and can potentially put at risk systems stability and integrity. Administrator privileges are needed to run SystemTap probes.


Programmable SQL filter

In this example you will see a method for selectively blocking execution of SQL based on programmable filter rules implemented with SystemTap.

As a first step you need to identify a relevant function for SQL parsing in the Oracle binary. Functions in the Oracle binary are not documented, but luckily the function opiprs has been discussed previously in this blog and turns out to be a good choice for use with SystemTap probes. There are also other options but the details are outside the scope of this post. What you need to know about opiprs for this blog post is summarized in the table below:

Function name: opiprs (Oracle Program Interface - Parse)
Purpose: this function is called when Oracle performs hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Selected parameters (notable function arguments):
  • register rdx -> SQL statement length
  • register rsi -> pointer to the SQL text string

Note two important arguments of opiprs that are passed using the CPU registers rdx and rsi: they contain, respectively, the length of the SQL statement and the pointer to the SQL statement text. The SQL text string is stored in memory, more precisely in the stack of the Oracle process; this can be confirmed by comparing the SQL text address with the process memory map from /proc/<pid>/maps.

A simple mechanism to implement the original goal of selectively blocking SQL execution is the following: write a SystemTap probe on the Oracle function opiprs that examines the SQL and if it matches some programmable rules block further parsing by forcing the SQL parsing to exit with an error.

SQL parsing can be forced to exit with an error by writing a 0 (the string terminator) into the first memory location of the buffer that contains the SQL text, effectively signaling a zero-length string. The effect of this change is that Oracle will throw the error ORA-00900, invalid SQL statement.

SystemTap probes can write into userspace memory using embedded C functions. This requires running SystemTap in "guru mode" and requires some additional syntax as detailed in the SystemTap documentation.

The code implementing the ideas described so far is summarized in the example script filterSQL_opiprs.stp. It consists of two main parts: a probe on the Oracle function opiprs and an auxiliary C function called block_parse, which performs the task of writing into memory, specifically into the memory location that contains the SQL text string. A copy of the main text of the script for convenience:

function block_parse(pointersql:long) %{
    char *sqltext;
    sqltext = (char *) STAP_ARG_pointersql;
    /* Overwrite the first character of the SQL text with a string terminator:
       this will throw ORA-00900: invalid SQL statement */
    sqltext[0] = 0;
%}

probe process("oracle").function("opiprs") {
   sqltext = user_string2(register("rsi"),"error")
   # debug code
   # sqllength = register("rdx")
   # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
   if (isinstr(sqltext, "UNWANTED SQL")) {
      printf("FOUND!\n")
      block_parse(register("rsi"))
   }
}

Test the example:

1. Run the SystemTap script as root (note the oracle executable needs to be in the path) with:
# stap -g -v filterSQL_opiprs.stp

2. On a different session using SQL*Plus:
SQL> select 'Hello world' from dual; -- this runs normally

'HELLOWORLD
-----------
Hello world

SQL> select /* UNWANTED SQL */ 'Hello world' from dual;
select /* UNWANTED SQL */ 'Hello world' from dual
 *
ERROR at line 1:
ORA-00900: invalid SQL statement


This illustrates how filterSQL_opiprs.stp blocks any SQL that contains the string "UNWANTED SQL". The example can be generalized to filter generic SQL statements based on keywords or other complex rules.
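
As a hypothetical extension (not part of the original script), the filter rules can be kept in a small table of keywords maintained directly in the SystemTap script. The sketch below reuses the block_parse embedded C function shown above and only standard SystemTap constructs (global arrays, isinstr); the keyword list is illustrative and meant to be merged into filterSQL_opiprs.stp and run in guru mode:

global banned_keywords

probe begin {
   # hypothetical list of keywords to block, edit as needed
   banned_keywords[0] = "UNWANTED SQL"
   banned_keywords[1] = "/* BLOCK ME */"
}

probe process("oracle").function("opiprs") {
   sqltext = user_string2(register("rsi"), "error")
   foreach (i in banned_keywords) {
      if (isinstr(sqltext, banned_keywords[i])) {
         printf("Blocking SQL matching rule %d\n", i)
         block_parse(register("rsi"))
         break
      }
   }
}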


Modify SQL on the fly

The code below shows an example of altering the SQL statement on the fly. It is an artificial example for demo purposes. The C function "replace_SQL" (see code) is used to write the new SQL text into Oracle userspace memory, effectively modifying the statement that is being parsed. The SystemTap script livepatch_basic_opiprs.stp is available at this link. Here is a copy of the main text:

%{
/* SQL that will replace TARGET_SQL */
#define REPLACEMENT_SQL "select power(count(*),3) from dba_objects"
%}

global TARGET_SQL = "select count(*) from dba_objects, dba_objects, dba_objects"

function replace_SQL(pointersql:long) %{
    char *sqltext;

    sqltext = (char *) STAP_ARG_pointersql;
    /* This changes in memory (stack) the SQL text that will be parsed */
    strcpy(sqltext, REPLACEMENT_SQL);
%}

probe process("oracle").function("opiprs") {
    sqltext = user_string2(register("rsi"),"error")  
    # debug code
    # sqllength = register("rdx")
    # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
    if (sqltext == TARGET_SQL) {
        printf("FOUND!\n") # debug code
        replace_SQL(register("rsi"))
    }
}

Test the example:

1. Consider this SQL. It may take days of CPU time to execute, as it has been built on purpose with cartesian joins:

SQL> select count(*) from dba_objects, dba_objects, dba_objects;


2. Run the SystemTap script livepatch_basic_opiprs.stp as root:

# stap -g -v livepatch_basic_opiprs.stp

3. Run the SQL again (flushing the shared pool is used to force hard parsing).

SQL> set timing on
SQL> alter system flush shared_pool;
SQL> select count(*) from dba_objects, dba_objects, dba_objects;

POWER(COUNT(*),3)
-----------------
       7.5931E+25

Elapsed: 00:00:19.26

The SQL now runs in a few seconds because the statement with cartesian joins has been replaced "on the fly" by the SystemTap probe with an equivalent statement that executes much faster and without joins.
There is an important limitation to this implementation: the length of the "new" SQL statement must not exceed the length of the original SQL. A more general case is addressed in the next example.


Modify SQL on the fly, a more complex experiment

This example addresses the case of replacing SQL statements on the fly, removing the limitation of the previous example on the length of the SQL statement. The main point is that you also have to update the register rdx with the length of the new SQL statement. If the new statement is shorter than the original one this step can be omitted (as was the case in the previous example).
How do you modify the content of a CPU register with SystemTap? This is done by updating the CPU register value in CONTEXT->uregs. SystemTap takes care of restoring the register values when returning to Oracle userspace execution.

Another important point is where to write the new SQL text, as we need a longer buffer than the one holding the original SQL. Where can the extra memory be allocated?
The example code referenced below writes the new SQL text in the process stack, using the value of the %rsp pointer and subtracting 0x2000 from it. This is an educated guess that the target memory location is allocated to the process (in the memory chunk reserved for the stack), yet far enough "down in the stack" that it is free and will not be used by subsequent calls made by opiprs. From a few basic tests this approach seems to work; however please note that the use of this script is intended mainly for reference and education purposes and can potentially be dangerous for system stability.
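
For illustration only, here is a hypothetical fragment combining the two ideas (writing the replacement SQL lower in the stack and updating the rsi and rdx registers). It assumes an x86-64 kernel where struct pt_regs names the relevant fields sp, si and dx; the reference implementation is the livepatch_opiprs.stp script linked below:

%{
/* hypothetical replacement text, for illustration */
#define NEW_SQL "select sysdate -1 from dual"
%}

function write_new_SQL_and_registers() %{
    char *newbuf;
    if (CONTEXT->uregs) {
        /* educated guess: use stack space 0x2000 bytes below the current stack pointer */
        newbuf = (char *) (CONTEXT->uregs->sp - 0x2000);
        strcpy(newbuf, NEW_SQL);
        /* rsi -> pointer to the SQL text, rdx -> its length */
        CONTEXT->uregs->si = (unsigned long) newbuf;
        CONTEXT->uregs->dx = strlen(NEW_SQL);
    }
%}

probe process("oracle").function("opiprs") {
    if (user_string2(register("rsi"), "error") == "select sysdate from dual")
        write_new_SQL_and_registers()
}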

The SystemTap script livepatch_opiprs.stp is available at this link.

The proposed example script livepatch_opiprs.stp replaces the SQL "select sysdate from dual" with "select sysdate -1 from dual". This is inspired by a hypothetical situation where you want to replay a workload with time-dependent SQL. Another example of SQL replacement you may want to test is adding SQL hints. Customize the SQL replacement as you wish by editing REPLACEMENT_SQL and TARGET_SQL in the script.

Test the example:

SQL> alter session set nls_date_format='YYYY-MM-DD HH24:MI';
SQL> select sysdate from dual; -- all normal up to this point

SYSDATE
--------------
2016-02-22 12:00

Run the SystemTap script as root
# stap -g -v livepatch_opiprs.stp

Now the same SQL will return a different result (that is sysdate -1 instead of sysdate):

SQL> alter system flush shared_pool;
SQL> select sysdate from dual;

SYSDATE
--------------
2016-02-21 12:00


Cleaning up after testing

Once a SQL statement is hard parsed, all subsequent executions will also run with the modified text. If you want to revert to normal Oracle behavior you need to flush the statement out of the library cache and re-parse it (after having stopped the SystemTap script). For flushing statements out of the shared pool you can use "alter system flush shared_pool" or dbms_shared_pool.purge (see also this post on dbms_shared_pool).
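
For example, a targeted purge of a single cursor can be done along these lines (the address and hash_value are placeholders to be taken from your own V$SQLAREA):

SQL> select address, hash_value from v$sqlarea where sql_text = 'select sysdate from dual';
SQL> exec sys.dbms_shared_pool.purge('<address>,<hash_value>', 'C')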


Pointers to reproduce the results using gdb

In the examples discussed in this post SystemTap has been used almost as an automated debugger, notably with the advantage that SystemTap has low overhead compared to many debuggers and provides a powerful programmable interface for defining the SQL text search and replacement actions. Another advantage of SystemTap is that it can attach to all running processes (of the Oracle executable in this case) if desired.
For completeness and as a reference, this is a short list of relevant gdb commands that can be used to reproduce some of the actions described in the examples of this post:

gdb -p <pid> -> start gdb against an existing Oracle session
break opiprs -> define a breakpoint on opiprs
continue -> continue program execution until the breakpoint
info reg -> show registers
bt -> backtrace
x/1s $rsi -> visualize the SQL statement string using $rsi as pointer

write the new SQL length into the CPU register rdx:
  set $rdx=24

copy a SQL string into memory and update $rsi (examples):
  set $rsi=$rsp-0x2000
  p strcpy($rsi, "select sysdate from dual")
  set $rdx=25


Conclusions

This post illustrates SystemTap techniques that can be used to modify userspace data on the fly at runtime. Examples are provided on how to apply these techniques to Oracle SQL parsing. The techniques discussed here can be generalized and used on other functions of the Oracle kernel as well as be extended to "live-patch" data at runtime for other applications in the Linux environment. In particular the provided example SystemTap probes show techniques for writing into userspace memory and into CPU registers, useful to address cases where debuginfo is not available.

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox as they are unsupported and can potentially put at risk systems stability and integrity. Administrator privileges are needed to run SystemTap probes.


Acknowledgements and references

A shout-out to Frank Ch. Eigler, for his work on SystemTap and in particular for his presentation "Applying band-aids over security wounds with systemtap" and related blog post and also for the tip on how to write into CPU registers with SystemTap.
Many thanks to Frits Hoogland for comments and improvements to this post and for collaboration on the investigation of Oracle internals and the use of Linux dynamic tracing tools for Oracle troubleshooting.


Link to the example code in Github.
Links to previous work on this blog on using SystemTap for Oracle tracing: SystemTap into Oracle for Fun and Profit, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap and Diagnose High-Latency I/O Operations Using SystemTap.


Wednesday, July 29, 2015

Diagnose High-Latency I/O Operations Using SystemTap

Topic: this post is about some simple tools and techniques that can be used to drill down high-latency I/O events using SystemTap probes.

The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures and/or are going to break completely in the near future.
I/Os of high latency on just a few disks can then appear as latency outliers when accessing volumes built on a large number of disks, and overall affect the performance of the entire storage. I write this having in mind the example of a storage system built with (SATA) JBODs using Oracle ASM as volume manager/DB filesystem. However the main ideas and tools described in this post apply to many other volume managers and file systems, including HDFS.

The standard tools: One way to find that one or more disks are serving I/O requests with high latency is with the use of standard Linux tools such as iostat, sar or collectl. Typically you would use those tools to spot anomalous values of average service time, average wait time and also of queue size.
A structured approach on how to do this is described in Brendan Gregg's USE method and the tools that can be used in Linux to implement it.
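
For instance, a minimal first look could be taken with extended per-device statistics from the standard sysstat iostat (watch the await, svctm and avgqu-sz columns):

# extended per-device statistics in MB/s, refreshed every 10 seconds
iostat -dxm 10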

SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that can be used to identify disks serving I/O with high latency using SystemTap probes to investigate I/O latency of the block devices.
The script blockio_latency_outliers_per_device.stp provides a measurement of some basic latency statistics per block device, including the number of I/Os, the average and the maximum latency. The script also provides details of all the I/Os where the latency is above a certain programmable threshold (the default threshold is set at 500 microseconds).
An example of its use (output edited for clarity) is here below. Note the latency warning messages and, overall, the very large maximum value measured for the latency of the /dev/sdy block device:

[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10

Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop

...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....

I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz                     200            123           5995          27205
sdq                     211             71           6553          31120
sdh                     256            103           6643          22663
sds                     224            101           6610          29743
sdm                     238             92           7550          35571
sde                     243             90           8652          52029
sdt                     200            105           5997          25180
sdk                     200             94           5696          35057
sdi                     206             99           7849          30636
sdg                     269             74           6746          36746
sdy                     197            102          98298        1167977
sdr                     200             89           6559          27873
sdl                     200            140           8789          31996
sdw                     210             99           7009          37118
sdd                     217             94           7440          56071
sdn                     205             99           6628          41339
....


When candidate disks for high latency have been identified, the second step is to further drill down using the script blockio_rq_issue_filter_latencyhistogram.stp. This script gathers and displays I/O latency histograms for a subset of block devices that can be specified using filters in the script header. The default filters are:

# SystemTap variables used to define filters, edit as needed
global IO_size = 8192    # this will be used as a filter for the I/O request size
                         # the value 8192 targets 8KB operations for Oracle single-block I/O
                         # use the value -1 to disable this filter
global IO_operation = 0  # this will be used as a filter: only read operations
                         # a value of 0 considers only read operations (the value 1 is for write)
                         # use the value -1 to disable this filter
global IO_devmaj = -1    # this will be used as a filter: device major number (-1 means no filter)
                         # example: use the value 253 for device mapper block devices
global IO_devmin = -1    # device minor number (or -1 if no filter)

You can use blockio_rq_issue_filter_latencyhistogram.stp to drill down on the latency histogram for those disks that have shown high-latency I/O and also compare them with "good" disks. In the example above the candidate "trouble disk" is /dev/sdy, with major number 65 and minor number 128 (you can find the major and minor device numbers for "sdy" simply using ls -l /dev/sdy).
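As a quick check, the major and minor numbers appear in the ls output just before the timestamp, for example:

# ls -l /dev/sdy
brw-rw---- 1 root disk 65, 128 ... /dev/sdy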
Example:

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 128 (-1=disable filter)

Block I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |@                                                   2
    128 |@@@@@@@                                            14
    256 |@@                                                  4
    512 |@                                                   2
   1024 |@@@                                                 7
   2048 |@@@@@@@@@                                          19
   4096 |@@@@@@@@@@@@@@@@@@                                 37
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      59
  16384 |@@@@@@@                                            14
  32768 |                                                    0
  65536 |@@@@@          HIGH                                10
 131072 |@@@@@          LATENCY                             11
 262144 |@@@@@          I/O                                 10
 524288 |@@@            OPERATIONS                           6
1048576 |@@                                                  4
2097152 |                                                    0


Note the presence of several I/O operations at high latency together with more normal I/O latency, for example I/Os served from SATA disks at around 8 ms and also I/Os served from the controller cache at sub-millisecond latency.
For comparison, here below is the latency histogram measured on another disk while the same workload was running. We can see that in this case the high-latency points found in the previous example are no longer present. Most of the I/O operations are around 8 ms latency and some operations are served from cache at sub-millisecond latency. The I/Os reported for the /dev/sdy disk with latency of 64 ms and above are not present in this case.

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 48 (-1=disable filter)


Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
   32 |                                                    0
   64 |                                                    0
  128 |@@@@@@@@@                                          19
  256 |@@@@@@@@                                           16
  512 |                                                    1
 1024 |@@@@                                                9
 2048 |@@@@@@@@@@@@@@                                     28
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       56
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           49
16384 |@@@@@@@@@                                          18
32768 |                                                    0
65536 |                                                    0

The tests reported here have been performed on an old system running RHEL 5 (kernel 2.6.18-371) and SystemTap v. 1.8; however the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.

Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers and in particular how to drill down on I/Os served at high latency because of one or more misbehaving disks in the storage volume/filesystem. The investigation has been done using SystemTap scripts, used first to discover which disks were serving some of their I/Os with high latency and then to drill down on specific devices with the use of latency histograms. Identifying and later replacing the badly performing disks can be beneficial for the performance of the entire storage system.

Download: The tools discussed in this post can be downloaded from this webpage and from Github.

Acknowledgements and additional links: Brendan Gregg has published an extensive set of articles and tools on performance tuning, including storage latency investigations.
For more information and examples on how to use SystemTap see the SystemTap wiki
For additional tools for Oracle ASM I/O investigations see also Bertrand Drouvot's asm_metrics.pl
Kevin Closson's SLOB was used as workload generator for the examples discussed here.
Additional links on tools and techniques on Oracle I/O troubleshooting in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Wednesday, March 18, 2015

Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap

Topic: PyLatencyMap v1.2, a tool for collecting and visualizing I/O latency data, and its integration with SystemTap.

Introduction: When studying storage performance, latency drill-down can be very important. Measuring the average I/O latency is often not enough; latency histograms are more suitable for investigating modern storage systems. This is because the storage service time can present multiple "modes". A common case is with storage systems where I/O requests can be served by low-latency SSDs or by spindles with much higher service time. For such systems latency histograms provide an additional dimension to the performance investigation.
Latency heat maps come into play when we want to visualize a series of latency histograms over time. This is a 3-D visualization problem and heat maps provide a way to represent the third dimension as color (see as an example Figure 1 later in this article).
The main ideas behind the use of latency histograms for investigating storage performance and their visualization with heat maps have been previously detailed in the excellent paper by Brendan Gregg on ACM Queue and also in blog articles and presentations.

PyLatencyMap is a tool and set of scripts for gathering and displaying heat maps using the command line interface (supporting ANSI terminal graphics). The main use for which PyLatencyMap has been developed is to measure and display storage latency data. PyLatencyMap can also be used to display heat maps from a generic source of latency data.
The latest update to PyLatencyMap, version 1.2, provides integration with SystemTap and a few example scripts for getting started. SystemTap can be used to collect storage latency data and PyLatencyMap to produce the associated heat maps. Many more data sources have already been integrated with PyLatencyMap and examples are available with the tool: Oracle wait event histograms, DTrace histograms and custom aggregations from trace files.
PyLatencyMap has a modular architecture, which follows the typical use of Unix/Linux CLI tools. Data is generated by a collector script (for example a SystemTap script), then it is piped through an optional filter and finally piped into the visualization engine:

latency data_source | <optional connector script> | python LatencyMap.py <options>

A series of examples and get-started scripts are provided with the tool. In this article we will focus on the scripts based on SystemTap. For discussions on how to use PyLatencyMap with SQL*Plus and DTrace data sources see also this other blog entry. To get started, see the following CLI example to run data collection with SystemTap and visualization with PyLatencyMap:

stap -v SystemTap/blockio_rq_issue_basic_pylatencymap.stp |python SystemTap/systemtap_connector.py |python LatencyMap.py

The script blockio_rq_issue_basic_pylatencymap.stp measures the latency of block I/O operations and can be used to collect latency histograms and display them as heat maps. The probes hook into kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete").
There are three probes in the script: one that gathers information at each block I/O operation, another that collects stats at the end of the I/O and a third probe that periodically prints the latency histogram. Note the use of the aggregation operator '<<<' that SystemTap uses to populate the histograms, which are later printed with the use of '@hist_log'.

probe kernel.trace("block_rq_issue") {
        requestTime[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
   t = gettimeofday_us()
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (t-s)
       delete requestTime[$rq]
   }
}

probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")

}

SystemTap scripts for PyLatencyMap:

For production usage there are more advanced scripts than the one just discussed. The wrapper script Example9_SystemTap_blockIO_req.sh is based on blockio_rq_issue_pylatencymap.stp, which provides tracing based on block_rq_issue and block_rq_complete as in the example above but contains additional filters to remove 'false positives'; for ease of use the whole pipeline is packaged into the wrapper script.

Another example script, Example9b_SystemTap_blockIO_filter_read_8KB.sh, is based on blockio_rq_issue_filter_pylatencymap.stp, which provides the possibility to further filter on the block I/O being issued; in particular the default filters select 8KB reads, which are of interest when testing OLTP applications in Oracle.

Example9c_SystemTap_pread.sh is a script that packages a SystemTap probe on pread calls. These are typically the underlying OS calls for single-block reads that show up as 'db file sequential read' wait events in Oracle. Note however that Oracle can also perform single-block reads using asynchronous I/O, therefore the probes discussed above on block I/O trace points are more general than this one.

Examples 9e, 9f: the probes in these scripts measure block I/O by hooking on different probe points and are provided for reference. Note: these scripts require kernel debuginfo symbols.

oracle_event_latency.stp: this script measures the latency of Oracle wait events and produces histograms by hooking directly into the Oracle executable. More details on how this works can be found at the blog entry "SystemTap into Oracle".

Example:

Figure 1 here below is an example of the output from PyLatencyMap run using the script Example9b_SystemTap_blockIO_filter_read_8KB.sh. The measured workload has been produced with SLOB (at increasing load). The top heat map represents the number of I/O operations per latency bucket as a function of time. On the horizontal axis you can find the time, on the vertical axis the latency buckets. The tool assumes a power-of-2 scale. Color is used to represent the number of I/O operations per second. A blue palette is used for the frequency heat map.
The heat map on the lower half of the figure uses a yellow-red palette to represent the amount of time waited in each bucket. I refer to it as an intensity/importance heat map. It highlights the relative weight of latency time spent in each bucket and helps in comparing the impact of a large number of low-latency operations vs. a smaller number of high-latency operations.


Figure 1:  The output of PyLatencyMap contains the latency histogram of the latest measured point and  two heat maps: on top the heat map of the IOPS, on the lower graph the heat map of the waited time.

Note: for SystemTap (and DTrace) data sources, the latency bucket has to be interpreted as follows: a value of N in bucket B means N I/O operations with latency between B and 2*B.
Other histogram latency data sources (notably Oracle) use a different convention: with the same example, a value of N in a latency bucket B for Oracle histograms means N waits of latency between B/2 and B. PyLatencyMap takes care of this difference when calculating the time-waited histograms.
Note: the scripts discussed here for block I/O tracing use the trace point block_rq_issue; it is also possible to use the trace point block_rq_insert to measure I/O latency from the point of insertion into the queue instead.

References and download links:

PyLatencyMap can be downloaded from this link or can be forked from Github. Additional information on PyLatencyMap is also available at this blog entry and also at this other blog entry.

Conclusions and acknowledgements

SystemTap is a very powerful tool for measuring I/O and producing latency histograms for storage performance investigations. PyLatencyMap provides an easy way to visualize histograms produced by SystemTap as latency heat maps. PyLatencyMap collects data and displays heat maps using the command line interface on character terminals with ANSI escape codes. This provides the combined benefit of a graphical output and the power of the command line. PyLatencyMap can also be used to visualize heat maps from latency histograms produced with DTrace, Oracle's SQL*Plus and/or from latency data aggregated from text and trace files.

Many thanks go to Brendan Gregg for sharing in his blog many awesome ideas and tools on the topic of heat map visualization and on using DTrace and SystemTap that have provided a great source of inspiration for this work.

Note: this entry was originally posted in March 2015, with major updates in July 2015.

Tuesday, November 4, 2014

Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap

Topic: This post is about tracing logical and physical reads in Oracle using SystemTap. You find here a few examples illustrating the main mechanisms used by Oracle to perform physical and logical I/O and you learn how to build SystemTap scripts to further explore and troubleshoot Oracle I/O.

Introduction: Performing input/output from and to the storage media, i.e. physical I/O, is one of the critical tasks underlying all database engines. When accessing data stored in RAM from the database cache we can talk of logical I/O. Physical I/O is performed by the database using system calls to the O/S kernel. The type of system calls used depends on database settings, the O/S and the type of storage. Logical I/O is performed by the database processes using Oracle kernel functions. Oracle instruments the timing of physical I/O calls with the wait event interface, while logical I/O time is generically accounted for as CPU time. Tracing I/O operations performed by Oracle processes together with the details provided by the Oracle wait event interface is a powerful technique to investigate Oracle I/O internals. With the help of a few real-life examples we investigate what the mechanisms are that Oracle uses to access storage, how the I/O operations correlate to the wait event interface and what the meaning of the wait event time is for the most common wait events associated with I/O.

Building the tracing tool: Investigations of what Oracle and the O/S do when an I/O operation is performed require specialized tools. Systemtap is a dynamic tracing tool specific to Linux that allows tracing of both the O/S and the Oracle functions (Linux kernel functions, system calls and Oracle database kernel functions). In the following you can find a description of a systemtap script that can be used to trace Oracle logical and physical I/O: it is composed of probes that can attach to the function calls of interest and can read function parameters, memory and CPU registers.
Systemtap requires relatively recent kernels to be able to probe userspace: either the utrace or uprobe_events functionality needs to be available. This is the case, for example, for RHEL/OL 6.5 and higher, but not for RHEL 5. Moreover, for tracing system calls the kernel debuginfo package needs to be installed.
More details on this and on how to build a lab environment can be found in this article. The script that we use in this blog post to trace Oracle logical and physical I/O and wait event interface is: trace_oracle_logicalio_wait_events_physicalio_12102.stp (for Oracle 12.1.0.2). A version for 11.2.0.4 is also available: trace_oracle_logicalio_wait_events_physicalio_11204.stp Before moving to the examples and studies of Oracle I/O I would like to review the three main parts of the script: tracing physical I/O, logical I/O and wait events.

Tracing physical I/O: Oracle ultimately issues system calls to execute I/O operations. The type of calls depends on database settings, the O/S type and the type of storage. In this article you can find examples for Oracle on Linux using ASM storage on block devices. Notably, direct NFS, local storage and ASM with asmlib are left out and possibly addressed in future studies. System calls can be traced using the strace(1) utility in Linux too. However, in this article you can see how SystemTap probes can be built to gather information on the system calls of interest: pread, pwrite, io_submit and io_getevents. Additional useful SystemTap probes to the underlying block I/O layer are: ioblock.request, ioblock_trace.request and ioblock.end (note: these are not system calls, but rather tracepoints for function calls inside the Linux kernel).

Probe name: syscall.pread/pwrite
Purpose: synchronous I/O: read/write from/to a file descriptor
Selected parameters: fd (file descriptor number), offset, count; return value: bytes read/written

Probe name: syscall.io_submit
Purpose: asynchronous I/O: submit blocks for processing
Selected parameters: nr (number of I/Os); for each I/O: file descriptor, offset, bytes, opcode (0 for read, 1 for write); return value: number of I/O operations performed

Probe name: syscall.io_getevents
Purpose: asynchronous I/O: read events from the completion queue
Selected parameters: min_nr (min number of I/Os to read), timeout for event waiting; return value: number of I/Os reaped; for each I/O: file descriptor, offset, bytes

Probe name: ioblock.request, ioblock_trace.request
Purpose: I/O requests sent to the block device interface kernel layer
Selected parameters: devname, sector, size, rw, block_io structure address

Probe name: ioblock.end
Purpose: return from the block device interface kernel layer
Selected parameters: devname, sector, rw, block_io structure address
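
As a quick illustration, a minimal sketch using only the parameters listed above for the ioblock.request probe could look like this (it requires kernel debuginfo, as noted below):

probe ioblock.request {
   # print each request sent to the block I/O layer: device, start sector, size, direction
   printf("ioblock.request: dev=%s, sector=%d, size=%d, rw=%d\n", devname, sector, size, rw)
}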

Note added, August 2015: Systemtap probes on kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete") can also be used to investigate the block I/O interface with the added advantage of not requiring kernel debuginfo. You can find examples of such probes in the page with download material or on GitHub.

Tracing logical I/O: This is about providing the history and details of the logical I/O operations, which is when Oracle reads a data block from memory (buffer cache). Logical I/O operations are a super-set of physical I/O: if Oracle does not find data in memory (buffer cache) it will issue a physical read. You can find a few examples of this mechanism in action later in this post.
Oracle instrumentation provides extensive information on logical reads in several V$ views, such as V$SYSSTAT, V$SESSTAT, V$SYSMETRIC. Also logical read data is available from sql*plus with "set autotrace on" and with tkprof reports of 10046 traces. Logical reads come in two main "flavors": consistent reads and current reads. Oracle will use consistent reads when reading data at a given system change number SCN, current reads are used when performing DML operations on the block. Consistent reads can be traced in Oracle using the events 10200 and 10201. More details can be found in the presentation by Julian Dyke on logical I/O.
In this article we show how to trace logical I/O by attaching SystemTap probes to a set of Oracle kernel functions. The role of the functions kcbgtcr and kcbgcur for tracing consistent and current reads has already been discussed by Alexander Anokhin in his excellent blog articles on DTrace LIO. More recently I have investigated this topic and found it useful to also trace the functions kcbzib, kcbzgb and kcbzvb.
A summary of the main findings on these functions is in the table here below. Note: the expressions %rdi, %rsi, %r8 indicate the values of the corresponding CPU registers. With SystemTap the register value can be read with the function register, for example: register("rdi"). The CPU registers are used to extract the function call arguments: SystemTap does not provide the function call arguments when there are no debug symbols available for the executable. Notably, the calling conventions for x86-64 as documented in www.x86-64.org/documentation/abi.pdf state that the integer parameters for a function call are passed in the following registers (in order): %rdi, %rsi, %rdx, %rcx, %r8 and %r9.


Function name: kcbgtcr
Purpose: Kernel Cache Buffers Get Consistent Read. Note: key function for consistent reads.
Selected parameters:
  tbs#=user_int32(%rdi)
  rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
  block#=user_int32(%rdi+4) & 0x003FFFFF
  data_object_id#=user_int32(%rdi+8)
  object_id#=user_int32(%rdi+12)
  Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbgcur
Purpose: Kernel Cache Buffers Current. Note: key function for current reads.
Selected parameters:
  tbs#=user_int32(%rdi)
  rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
  block#=user_int32(%rdi+4) & 0x003FFFFF
  data_object_id#=user_int32(%rdi+8)
  object_id#=user_int32(%rdi+12)
  Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbzib
Purpose: kcbZIB should stand for: Kernel Cache Buffers, Z (kcbz.o is a module for physical I/O helper functions), IB: Input Buffer. Note: it indicates that Oracle will perform physical read(s) into the buffer cache.

Function name: kcbzgb
Purpose: the suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
Selected parameters:
  tbs n#=%rsi
  rel file n#=%rdx >> 22 & 0x003FFFFF
  block#=%rdx & 0x003FFFFF
  data_object_id#=%r8
  object_id#=%r9
  Note: for bigfile tablespaces: block#=%rdx

Function name: kcbzvb
Purpose: invoked after Oracle has read a given block. It is part of the call chain for checking blocks governed by the db_block_checking parameter. Name guess: Kernel Cache Buffers Verify Block (thanks to @FritsHoogland for pointing this out). This function is used both for reads via the buffer cache and for direct reads.
Selected parameters:
  tbs n#=%rsi
  rel file n#=%rdx >> 22 & 0x003FFFFF
  block#=%rdx & 0x003FFFFF
  data_object_id=user_int32(%rdi+24)
  Note: for bigfile tablespaces: block#=%rdx
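
As an illustration of how the table above translates into a probe, here is a minimal, hypothetical sketch that prints the decoded arguments of kcbgtcr; it assumes the x86-64 calling convention and the offsets listed above, while the full scripts linked in this post remain the reference implementation:

probe process("oracle").function("kcbgtcr") {
   # decode the first argument (pointer passed in rdi) using the offsets listed above
   tbs   = user_int32(register("rdi"))
   rdba  = user_int32(register("rdi")+4)
   file  = (rdba >> 22) & 0x003FFFFF    # relative file number (smallfile tablespaces)
   block = rdba & 0x003FFFFF            # block number (smallfile tablespaces)
   printf("kcbgtcr: tbs#=%d, file#=%d, block#=%d, data_object_id#=%d\n", tbs, file, block, user_int32(register("rdi")+8))
}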

Note: as I write this, return probes on userspace functions are supported only with uprobes, that is on kernel versions 3.10 or higher. This means that this functionality is not available with RHEL/OL 6.x, but it is available on RHEL 7.x (3.10.x kernel) for example.

Tracing wait events: The wait event interface is probably one of the best features in Oracle performance tuning as it provides a quantitative method to identify performance issues and bottlenecks. Cary Millsap has published inspiring work on the subject.
For our purposes it suffices to say that wait events are instrumentation points where selected Oracle kernel actions are timed. In particular Oracle will time the start and end calls to most physical I/O operations, therefore providing a good way to understand how much time is spent during physical I/O (some limitations of this mechanism are discussed later in this post). There are also limitations to using the wait event interface to study I/O latency, in particular in the examples section you can find some interesting challenges related to timing asynchronous I/O events.
Wait event tracing is typically activated in Oracle using the 10046 event or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article we get the wait event details by attaching SystemTap probes to the relevant Oracle kernel functions, in particular to kskthbwt and kskthewt, to gather information respectively at the start and at the end of each wait event. Enabling wait event tracing using event 10046 at level 8 (or with other methods) is a way to externalize the wait event timing information (as well as other SQL processing details), not to enable the wait timing itself: that is controlled by the database parameter TIMED_STATISTICS, which is (and should be) set to TRUE in modern Oracle databases. One key piece of information that the Oracle kernel functions kskthbwt and kskthewt give us is a pointer into the X$KSUSE fixed table (or rather into the underlying memory structure in the SGA), which is the underlying table of V$SESSION and therefore opens the road to correlating the wait event with much useful information on the session executing the SQL. Resolving the pointer in register r13 to the base value of X$KSUSE and computing the offset for the fields of interest requires additional work; this is described in a previous article (see also the script: ksuse_find_offsets.sql).


Function name: kskthbwt
Purpose: Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event. The suffix "bwt" stands for "begin wait".
Selected parameters (useful information when probing the function):
  register r13 or r14 (behavior is version-dependent) -> pointer into the X$KSUSE (V$SESSION) array in the SGA, with a fixed offset (version-dependent)
  register rsi -> timestamp of the beginning of the wait (in microseconds)
  register rdx -> wait event number

Function name: kskthewt
Purpose: Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event. The suffix "ewt" stands for "end wait".
Selected parameters (useful information when probing the function):
  register r13 -> pointer into the X$KSUSE (V$SESSION) array in the SGA, with a fixed offset (version-dependent)
  register rdi -> timestamp of the beginning of the wait (in microseconds)
  register rsi -> wait event number
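
Putting the table above to work, a minimal, hypothetical probe on kskthewt could look like the sketch below; the register usage is as described above, while resolving the event number to a name and decoding the X$KSUSE offsets are left to the full script trace_oracle_logicalio_wait_events_physicalio_12102.stp:

probe process("oracle").function("kskthewt") {
   wait_begin = register("rdi")   # start-of-wait timestamp, in microseconds
   event_num  = register("rsi")   # wait event number (map to a name via V$EVENT_NAME)
   printf("end of wait: event#=%d, elapsed=%d us\n", event_num, gettimeofday_us() - wait_begin)
}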


Running the script:
Below you find a few examples of traces for common Oracle I/O access patterns. The traces have been obtained by running stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x <pid>, where <pid> is the process id of the session being traced. The output is piped into sed -f eventsname.sed, which takes care of translating the event number into the associated event name. The file eventsname.sed is generated by running the script eventsname.sql. It’s important to rerun the script to generate the event names before testing, because the wait event numbers can change without any warning, for example after migrations to newer versions, but potentially also after applying a patch. The system used to collect the traces in the following examples is Oracle 12.1.0.2 running on OEL 7.0 with UEK3 under VirtualBox.


Example 1: single-block random reads

This example is about single block (random) reads. This access path to data is very important for many OLTP workloads as it is used for table access via an index key. In Figure 1 here below you can see a snippet of a trace file for a query of the type: select <col_list> from table <tab> where <indexed_col>=<val>.

Figure 1: Oracle single-block random read and db file sequential read wait event

The key points are:
  • Oracle issues a logical read on one of the blocks needed to retrieve data for the SQL of this example
    • additional details are provided in the logical read call, where we can see that Oracle wants to read block#=2505675 from tbs#=7, belonging to obj#=32174 (which is the table segment in this example)
  • The block is not found in the cache, so a new block is allocated in the buffer cache and a physical read is initiated
  • The database marks the start of a wait event for 'db file sequential read'
  • The physical read request for a block becomes a pread call to the O/S
    • The I/O request enters the kernel via a system call; any function encountered after the system call is a kernel function (such as ioblock.request) until the system call returns and control gets back to userland, which means the Oracle executable in this case.
  • This read call is passed on to the block I/O interface to read 8192 bytes (1 database block) from the block device (/dev/sdl in the example of Figure 1)
  • The block interface and then the OS call return the read data to the Oracle kernel
  • The Oracle wait interface marks the end of the wait and computes the elapsed time
    • the wait event is 'db file sequential read' and the parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks read (1 block).
    • from the timestamp values in the trace you can see that the duration of the db file sequential read wait is dominated by the time it took to do the pread call, with a small overhead (note also that the overhead could have been more significant if we had high CPU utilization on the DB server, for example because of workload from other Oracle sessions)
  • We can conclude that in this example the measured wait time for db file sequential read corresponds to physical read latency.

Example 2: sequential I/O into the buffer cache with synchronous I/O

This example is about multi-block reads into the cache. This is an access path that is used for example when Oracle performs a full table scan. There are multiple ways in which Oracle can access the storage to perform full scans; in this example you see the case when Oracle uses synchronous I/O (i.e. the pread call) to perform O/S reads, while Examples 3 and 4 cover sequential reads with asynchronous I/O. In Figure 2 here below you can see a snippet from tracing the workload of a query of the type: select <col_list> from table <tab>, where a full scan of <tab> is performed by Oracle.

Figure 2: Oracle sequential I/O in a case where reads are performed using pread O/S calls

The key points of the trace in Figure 2:
  • Oracle issues a consistent read for tbs#=7 block#=2621443 and finds it in the cache
  • Oracle issues a consistent read for the next block, does not find it in the cache, so it prepares to read a list of blocks from storage in one physical read. The list is 124 blocks long (it has been shortened in this example for clarity)
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests become a single pread call to the OS for 1015808 bytes (124 blocks)
  • This request moves down the I/O chain to the block I/O interfaces which splits it into 2 reads from the underlying block device, as in the system used for tests the maximum size of I/O is 524288 bytes.
    • As investigated by Frits Hoogland: the submitted I/O request lands in the IO scheduler scattered/gather list, where it is broken down to the IO size as advertised by the block device.
  • The I/Os are completed by the block I/O interface
  • The pread call returns
  • The wait event interface times the end of the wait
    • The wait event is db file scattered read and the parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • From the timestamp values in the trace you can see that the duration of the db file scattered read wait is mostly accounted for by the time it took to do the pread call, with a small overhead
  • From the trace of calls to the Oracle kernel function kcbzvb we get additional confirmation of the blocks that have been read
  • We can conclude that in this example the wait time for db file scattered read corresponds to I/O latency.
Example 3: sequential I/O into the buffer cache with asynchronous I/O O/S calls

In Figure 3 you can see a snippet from the trace of the SQL: select <col_list> from table <tab>, where Oracle performs a full scan of table <tab>. This is the same query that was used for Example 2 above; the difference is that the trace in Figure 3 has been taken at a later time in the execution of the query, when the Oracle process switched from using synchronous I/O with pread calls to using the asynchronous I/O interface. Asynchronous I/O is implemented in Linux with io_submit and io_getevents calls, to send the I/O requests and then to reap the results. Although Oracle uses the asynchronous interface to perform read operations in this example, the final result is very similar to what is described in Example 2 (i.e. with synchronous I/O): the Oracle wait event is the same, "db file scattered read", the reads are also performed into the buffer cache and the I/Os are reaped with blocking calls, therefore making the process effectively synchronous. A more aggressive use of the asynchronous I/O interface by Oracle is found with direct reads, which are discussed in Example 4 below.

Figure 3: Oracle performing sequential I/O reads into the buffer cache using asynchronous I/O calls

The key points are:
  • Oracle issues a consistent read for tbs#=7 block#=2522760, does not find it in the cache, so it prepares to read a list of blocks into the buffer cache. The list is 128 blocks long and has been shortened in this example
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests are passed on to the OS as a call to io_submit with 2 requests for a total of 1MB in size (that is 128 blocks): the first request is for a read (opcode 0) of 65536 bytes, the second for a read operation of 983040 bytes (that is 1MB -64 KB)
  • These requests are passed to the block I/O interface which splits the largest request into 2 reads from the underlying block device (/dev/sdi in this example). The read of 64KB is passed on unchanged in size to /dev/sdr
  • Oracle calls io_getevents to reap the results of the asynchronous I/O requests that have just been submitted and waits. The timeout.tv_sec field is set to 600, which indicates this is a blocking wait (io_getevents waits up to 600 seconds (tv_sec) and 0 nanoseconds (tv_nsec)) for min_nr=2 I/Os to finish.
  • The block device interface returns the data upstream.
  • The io_getevents call returns the 2 I/O reads.
  • The wait event interface marks the end of the wait. The wait event in this case is db file scattered read. 
    • The wait event parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • The wait event duration is basically the time it took to do the io_submit and io_getevents calls plus a small overhead.
    • Note that although Oracle has used the asynchronous interface, it has also waited for all the I/Os to complete, therefore making the I/O essentially synchronous
  • From the calls to kcbzvb we get additional confirmation of the blocks that have been read.
  • Conclusion: the wait time for db file scattered read in this example is not a precise measure of the I/O latency, as two reads on two different devices were submitted in parallel and the total time was reported as if it were a single read.

Example 4: full scans with direct reads and asynchronous I/O

In this paragraph you can find two examples of Oracle I/O for the case of direct reads on block storage configured with ASM. The same query that was used in Examples 2 and 3 above is also used here: it is a query that performs a full table scan. In this example, differently from Examples 2 and 3, Oracle uses serial direct reads to access data. Serial direct read is a feature introduced in 11gR2 for optimizing full segment scan operations, bypassing the buffer cache layer, similarly to what happens with parallel queries. Oracle chooses when to use serial direct reads vs. cached reads based on a few parameters. To a first approximation, direct reads are preferred when the table is "big" compared to the size of the buffer cache; the actual mechanism is more complex and can potentially change between Oracle versions. You can read more on this topic in this blog post by Tanel Poder.
For the particular setup I used to run these tests I had to force the use of serial direct reads before running the query; this was done with: alter session set "_serial_direct_read"=always;
Note that alter session set "_serial_direct_read"=never; can be used to disable serial direct reads and force Oracle to use cached reads (that is the type of I/O you have seen in Examples 2 and 3 above). The default value for the underscore parameter "_serial_direct_read" is "auto".
For the investigations in this example we use a slightly different SystemTap probe from the one used in the previous examples. I have removed the tracing of logical I/O and block I/O to reduce clutter and added a new probe on the userspace function io_getevents_0_4 of libaio (libaio is the Linux library that implements asynchronous I/O). This is because serial direct reads use non-blocking calls to reap I/Os and these calls are best traced at the libaio interface, for reasons described by Frits Hoogland in this blog post. The SystemTap script represented in Figures 4a and 4b is: trace_oracle_wait_events_asyncio_libaio_12102.stp (a version for 11.2.0.4 is available at trace_oracle_wait_events_asyncio_libaio_11204.stp).

Figure 4a: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O, this is what happens in the first phases of the full scan

The key points of Figure 4a are:
  • Oracle requests I/O in chunks of contiguous storage space of 1MB. This is to reduce the overhead of seek time and potentially make use of readahead when available. 
    • The value of 1MB for the read request comes from the parameter db_file_multiblock_read_count=128 multiplied by the block size of 8 KB. 
  • Oracle submits one I/O of 1 MB with a call to io_submit and then calls io_getevents_0_4 in the libaio library in non-blocking mode to try to reap the I/O. This happens twice, with no results. Oracle then submits another I/O of 1 MB and tries four more times to reap I/Os in non-blocking mode, again with no results (the I/O operations have not yet finished). The contents of the io_submit calls can be inspected with the sketch shown after this list.
  • The wait event timing starts just before the calls to io_getevents, which are used to reap the I/O results, this time in blocking mode (with a timeout of 600 seconds)
    • The wait event in this case is: "direct path read"
    • Oracle reaps the two I/Os in two subsequent calls to io_getevents
  • After reaping the I/Os, Oracle marks the end of the wait for the event. 
  • The result is that 2 reads of 1 MB each have been performed 
  • The wait event interface does not report the correct number of I/Os performed:
    • the "direct path read" wait in this example reports one read of 128 blocks (1 MB), while 2 MB were read
  • The elapsed time for the wait event "direct path read" does not measure the I/O latency
    • Oracle is only timing the blocking calls to io_getevents, which start after the I/Os are submitted. 
    • Oracle is timing 2 reads together.
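The size of the read requests passed to io_submit can be confirmed by dumping the iocb structures it receives, as mentioned in the list above. The following is a minimal sketch, assuming the standard syscall tapset variables (nr, iocbpp_uaddr) and the struct iocb field offsets of linux/aio_abi.h on x86-64 (aio_fildes at offset 20, aio_nbytes at 32, aio_offset at 40); the offsets may differ on other platforms, and for large batches the MAXACTION limit may need to be raised (stap -DMAXACTION=10000):

    probe syscall.io_submit {
       if (pid() == target()) {
          printf("io_submit: nr=%d iocbs\n", nr)
          for (i = 0; i < nr; i++) {
             iocb = user_long(iocbpp_uaddr + 8 * i)   # iocbpp is an array of pointers to struct iocb
             printf("   iocb[%d]: fd=%d, nbytes=%d, offset=%d\n", i,
                    user_int32(iocb + 20), user_int64(iocb + 32), user_int64(iocb + 40))
          }
       }
    }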
The use of serial direct reads in Oracle is more complex than the simple example of Figure 4a. Frits Hoogland has discussed the internals of direct reads in Oracle in more detail in this presentation. In particular, Oracle uses an adaptive mechanism for direct reads with the goal of driving the storage system to high throughput when possible. This is implemented by increasing the number of outstanding I/Os when calling the asynchronous I/O interface. Internally this makes use of I/O slots. Note, from a comment on Oracle diagnostic event 10353 we have additional indirect information on this topic: "Slots are a unit of I/O and this factor controls the number of outstanding I/Os".
Another finding of Frits' is that there are I/O operations performed for direct reads that are not timed by the wait event interface. This happens because Oracle can issue io_getevents with timeout=0, that is, reaping the I/Os without having to block. You can see these mechanisms in action in the trace of Figure 4b, which is for the same full scan as reported above in Figure 4a, just at a later time, when the mechanism of adaptive serial direct reads has tuned Oracle I/O requests to be "more aggressive" in the use of storage resources.
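A simple way to quantify how many reaps happen outside the timed waits is to count the io_getevents calls issued with timeout=0 separately from the blocking ones. The following is a minimal sketch under the same assumptions as the previous probes (standard syscall tapset, stap -x <pid>); note that non-blocking reaps served entirely in userspace by io_getevents_0_4 do not reach the system call layer and are therefore not counted here:

    global nonblocking, blocking
    probe syscall.io_getevents {
       if (pid() == target() && timeout_uaddr) {
          if (user_long(timeout_uaddr) == 0)
             nonblocking++   # reap only, Oracle does not wait
          else
             blocking++      # blocking wait, timed by the wait event interface
       }
    }
    probe end {
       printf("io_getevents calls: %d non-blocking (timeout=0), %d blocking\n",
              nonblocking, blocking)
    }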

Figure 4b: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O; this is an example of what happens when the adaptive mechanism increases the number of outstanding I/Os for optimizing the throughput

The key points of Figure 4b are:
  • This is the same full scan as shown in Figure 4a, the difference being that the trace in Figure 4b has been taken at a later time of the full scan operation (the table has 1 million blocks), when adaptive serial direct reads have increased the number of I/O slots (outstanding asynchronous I/O operations).
  • You can see that several io_submit operations are performed before the wait event timing is started. 
  • Oracle is trying to keep several outstanding I/Os in the queue: up to 9 outstanding asynchronous I/O operations can be seen in Figure 4b, while only 2 were present in the simpler case of Figure 4a, that is at the beginning of the full scan.
  • Several attempts to reap I/Os with io_getevents operations are run by the Oracle process before the wait event timing is started
    • 2 reads of 1 MB each are reaped by one io_getevents call before the wait event timing has started, therefore these reads will not be accounted for in the wait event interface.
    • It is important to note that the timeout for this particular io_getevents call (as well as for the io_getevents_0_4 calls in libaio which have not returned any I/O) is set to 0 by the Oracle process, which means that Oracle does not wait for I/O results to become available; it only reaps them if they are already available.
  • Between the moment at which Oracle starts timing the wait event "direct path read" and the end of the wait event, 3 io_getevents operations are performed (with the timeout set to 600 seconds, therefore blocking until I/O results are available), reaping a total of 9 reads of 1 MB each
    • The wait event interface does not report all the I/O that has been performed:
      • The wait event parameters only record one read of 128 blocks (1 MB), while in reality 9 MB have been read.
      • This is in addition to the 2 MB read outside of the wait event instrumentation.
    • The elapsed time for "direct path read" does not accurately reflect I/O latency but rather the wait time for reaping I/Os, as seen in Figure 4b.

Example 5: asynchronous I/O for random reads

Oracle will optimize single-block reads when it can do prefetching and/or batching of I/O. This is an optimization for random I/O using the asynchronous I/O interface: Oracle can group multiple I/O requests into a single io_submit call, as opposed to sending them one after the other as in the case of Example 1, where the OS call pread was used. The example of Figure 5 comes from running Kevin Closson's SLOB on a 12c database. In the execution plan, note step 3, where Oracle uses the operation "TABLE ACCESS BY INDEX ROWID BATCHED" to access table data.

Figure 5: execution plan for a SLOB query illustrating batch table access

Here below, in Figure 6, you can see a snippet of the Oracle I/O tracing when running the SLOB workload. You can see that the asynchronous I/O interface is being used to perform random I/O. In the case of Figure 6, 126 requests are submitted in a single io_submit operation and subsequently reaped by io_getevents, which waits for all the requests to finish. The wait event in this case is "db file parallel read" and the calculated wait time is the time for the whole batch of 126 requests to finish. Therefore the wait time of the db file parallel read event does not provide an accurate measurement of the I/O latency for single-block random reads.
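To see the batching effect across a whole workload, the number of iocbs per io_submit call can be aggregated into a histogram. This is a minimal sketch under the same assumptions as the previous probes:

    global batchsize
    probe syscall.io_submit {
       if (pid() == target())
          batchsize <<< nr   # number of iocbs submitted in this call
    }
    probe end {
       if (@count(batchsize) > 0) {
          printf("Distribution of the number of iocbs per io_submit call:\n")
          print(@hist_log(batchsize))
       }
    }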

Figure 6: Oracle performing I/O for random reads with asynchronous I/O, for the case of batched reads. The trace has been edited by replacing a long list of similar trace lines (typically 126 lines varying only in the block#) with "..."

Example 6: tracing DML

In Figure 7 you can see a snippet of tracing Oracle logical I/O and wait events when inserting a row into a table, followed by a commit. No physical I/O was performed in this operation (the table had been previously cached). You can see that most of the block reads have been done in current read mode, as expected for DML. Moreover the blocks involved belong both to the table being inserted into (you can identify them in Figure 7 as belonging to tbs#=11) and to rollback segments (you can identify them in Figure 7 as belonging to tbs#=2). Note that for simplicity the table in this example has no indexes. A minimal sketch for counting consistent and current mode gets is reported after Figure 7.

Figure 7: Logical I/O trace of Oracle performing an insert and commit operation
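For reference, a rough counter of consistent vs. current mode gets can be obtained by probing the Oracle logical I/O functions directly. The sketch below is only an illustration: the function names kcbgtcr (consistent gets) and kcbgcur (current gets) come from public references on Oracle internals, are undocumented and may change between versions, and the path to the oracle binary is a placeholder to adjust for your installation:

    global cr_gets, cur_gets
    probe process("/u01/app/oracle/product/12.1.0/db_1/bin/oracle").function("kcbgtcr") {
       if (pid() == target()) cr_gets++    # consistent mode gets
    }
    probe process("/u01/app/oracle/product/12.1.0/db_1/bin/oracle").function("kcbgcur") {
       if (pid() == target()) cur_gets++   # current mode gets
    }
    probe end {
       printf("consistent gets: %d, current gets: %d\n", cr_gets, cur_gets)
    }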

Example 7: tracing log writer
In Figure 8 you can see the trace of the log writer during a commit operation. This is the same commit operation traced in Example 6 (see also Figure 7 above). The wait event is "log file parallel write" and its duration is timed by Oracle from the beginning to the end of the asynchronous I/O operations performed. Therefore the measured wait event time for "log file parallel write" is not an accurate measure of block write latency, as its duration depends on the number of operations performed. In the case of Figure 8 you can see that the log writer writes to two distinct block devices, consistent with the use of 1 logfile member per redo log group allocated on an ASM diskgroup with normal redundancy: the log writer sends 2 write I/O requests simultaneously, to write the primary and mirror extents of the logfile, and waits for both of them to finish. A sketch for measuring this submit-to-reap time is reported after Figure 8.

Figure 8: log writer tracing during a commit operation
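The submit-to-reap time discussed above can be measured directly and compared with the reported wait. The following is a minimal sketch, assuming the same syscall tapset probes as in the previous examples and attaching to the log writer with stap -x <lgwr_pid>; it times the span from the first io_submit to the return of the reaping io_getevents:

    global t_submit
    probe syscall.io_submit {
       if (pid() == target() && !([tid()] in t_submit))
          t_submit[tid()] = gettimeofday_us()   # remember the first submit of the batch
    }
    probe syscall.io_getevents.return {
       if (pid() == target() && [tid()] in t_submit) {
          printf("submit-to-reap time: %d us\n", gettimeofday_us() - t_submit[tid()])
          delete t_submit[tid()]
       }
    }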

Conclusions
Dynamic tracing and in particular SystemTap provide the opportunity to investigate Oracle physical and logical I/O and open the way to drill-down investigations of Oracle's I/O across the various layers of the Oracle engine down to the Linux kernel. This post describes some of the methods, tools and examples of how these investigations can be performed.
The examples from real-world Oracle workloads highlight some of the most common cases of Oracle logical and physical I/O and the connections between the wait event interface and the OS calls issued by Oracle processes. The Oracle wait event interface can provide useful data for measuring I/O latency and in general for troubleshooting I/O problems. This is true for random reads instrumented by the wait event "db file sequential read". The information provided by wait events related to asynchronous I/O, however, requires additional attention: such events typically do not provide an accurate measure of the I/O latency. Moreover this post provides examples where some of the asynchronous I/Os for Oracle direct reads are not instrumented at all by the wait event interface.

Acknowledgements: I would like to thank Tanel Poder and Frits Hoogland for original ideas that have inspired this work. Special thanks go to Frits Hoogland for providing comments and additional ideas from his deep expertise on the subject.

Download the scripts discussed in this post from the webpage with download material or from GitHub.