Wednesday, November 18, 2015

Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs

Topic: this post is about investigating Oracle wait events using stack profiles and flame graphs extended with OS-process state and Oracle wait event details.

Context: The case of the DB Time > CPU Time + Wait Time

Oracle instrumentation provides wait event and CPU time accounting, a powerful and readily accessible data source for performance troubleshooting. An Oracle session at a given point in time is either on CPU, for example when processing data from cache, or it is in a waiting state, for example waiting for an I/O operation to complete.
A recent talk by John Beresniewicz (JB) at the Oak Table World 2015 goes into the details of an interesting case, where the CPU time and wait time measurements from Oracle's V$ tables (and AWR reports) appear to violate the equation "CPU time + Wait time = DB Time".
In particular JB discusses in details a case of I/O testing, originally raised by Kevin Closson, where this discrepancy is apparent and originates from the fact of using modern "fast" storage.
To further set the context for the following discussion, you can watch JB's presentation and the slides. In short the root cause of the issue is that a significant amount of CPU cycles spent serving I/Os by the OS can be double counted both into Oracle I/O wait events and CPU counters. In this post you will learn of a technique and related tools for extended stack tracing that can be used to investigate this and similar cases and provide direct evidence of what is happening "inside Oracle wait events".

A test system reproducing high CPU usage inside I/O wait events

This is about reproducing a case where AWR reports  %Wait + %CPU > 100% (of DB Time). Following the original case reported by JB, the test cases are generated using Kevin Closson's SLOB. One crucial point for the following is to have a system that serves random I/O with low latency. For this you don't necessarily need expensive HW, you can emulate (for testing) a fast I/O subsystem  in various ways, typically using RAM-based devices.
The test system used in the following is a Virtual box VM with 4 CPU/cores and 4GB of RAM. The test DB is installed with OL 6.7 and the test database is Oracle version 12.1.0.2. Storage has been configured as three block devices and handled with ASM on the database in the VM, but it is actually served by 3 files on the host machine. The buffer cache of the test database is set to a very low size,  64MB.
The important point is that the host where Virtualbox runs has enough RAM to be able to cache the files used as storage for the test DB. A host with 128 GB of RAM has been used and the active set for the test DB is 40 GB.

Running the test workload, I/O bound with low latency I/O and high number of IOPS

This first test consists in running SLOB for a few minutes to drive the (I/O bound) workload. It is a simple read-only workload (random reads of single Oracle blocks, 8 KB) and no updates. Only one SLOB schema is needed, this has also the advantage of keeping the overall server CPU utilization low. Some relevant slob.conf parameters are: SCALE=40000M, UPDATE_PCT=0, RUN_TIME=1000. The command to run the test, after SLOB schema setup, is simply ./runit.sh 1.
Additional checks and boundary conditions are:
  • The system is not CPU-bound. This should be the case if the only workload on the database is the one-schema SLOB run. However it is important to double check with OS commands, for example "vmstat" or "top". In a CPU-bound server interpreting CPU and wait time becomes more complicated, due to the additional time spent waiting in the run queue.
  • I/Os are served at low latency and high IOPS. If you build your test machine using a VM created with Virtualbox, as described earlier in this post, this can be achieved by using a host with enough RAM to cache the files acting as back-end for the VM storage. The caching on the host filesystem would normally happen during SLOB schema creation, if the host server has enough RAM and no concurrent activity is competing for the filesystem cache. However if needed, caching can be triggered by running a few full scans of the USER1.CF1 table.

The test reproduces the case "%Wait + %CPU > 100%"

The AWR report of the SLOB workload shows that main activity on the database is random reads, instrumented by Oracle with the wait events: "db file parallel read", "db file sequential read" and by CPU time. The AWR report also shows that the system did not have any other significant activity besides the SLOB workload. The notable point (see also Figure 1) is that the sum of the percentage of DB time spent on non-idle wait events and CPU adds up to a value significantly higher than 100%. This is an anomaly, as it seems to violate the concept that an Oracle process at a given point in time is either on CPU or waiting for a wait event.

Figure 1: Extract from the AWR report of an I/O bound workload (generated with SLOB) showing %Wait + %CPU > 100% (DB Time). This is an apparent anomaly of the instrumentation as DB time is expected to be constituted of DB CPU and non-idle wait time.

Another way to measure the workload is with Tanel Poder's snapper. Snapper has the advantage of drilling down to session level rather than aggregating at the system level. The output of a 10-second sample is reported in Figure 2. From the metrics reported there, you can see that the workload is mostly single-block reads at a rate of 10K IOPS and that the wait time for I/O wait and CPU adds up to a value larger than DB Time. Note that DB Time for this workload is equal to the elapsed time, as there is only 1 session active and it is active all the time serving DB calls. An extract of relevant snapper measurements from Figure 2 is:

TIME, DB CPU: 71.5%
WAIT, db file sequential read: 9.1%
WAIT, db file parallel read: 78.7%
STAT, physical read IO requests: 10.31k


Figure 2: Snapper output of the SLOB workload in the case of low-latency I/O. The sum of DB CPU and non-idle Wait Time is larger than DB Time, which is an anomaly of the instrumentation. ASH data, however, reports what looks like a correct accounting of the percentage of DB Time spent into DB CPU and Wait events.


Something is wrong: the sum of CPU time and wait time should not be larger than DB time. One first check that you can do is to investigate if the CPU time measurement reported by Oracle is accurate. It turns out that the problem is not here, you can verify this using OS commands, for example with "pidstat". Pidstat output (not reported here for brevity) is also interesting because it adds two additional pieces of information: it confirms that the process spends most of its time on CPU and that the CPU cycles are accounted as "sys" CPU time, i.e. time spent in kernel operations.
The wait events are timed by Oracle, at the beginning of the wait and measured again at the end. The Oracle functions kskthbwt (Kernel service Kompile thread begin wait) and kskthewt (Kernel service Kompile thread end wait) appear to take care of measuring the wait time and update the instrumentation with the relevant metrics. In particular the wait time measurement for "db file sequential read" and "db file parallel read" seems to correctly account for the time spent in OS calls from the tracing tests performed by the author and reported in the post "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap".

Moving into the active session sampling data of snapper reported at the bottom of Figure 2, you can find a more correct picture of where DB Time is spent: most of the DB time is spent in wait and a small amount on CPU. Most importantly the sum of DB CPU and Wait events adds up to 100%, as expected.
Following JB's analysis, the collected evidence points to the fact that many CPU cycles spent performing kernel-related activities for serving I/O are accounted into Oracle wait events. Therefore the sum of %CPU and %Wait is higher than 100% because of the double counting that happens in  these events, especially with the I/O interface is working at a high rate of IOPS. In the following you will see how profiling of OS-process state together with wait events and kernel stacks provides direct observation of this hypothesis.

Direct investigation with extended stack traces and flame graphs

The idea is to sample the Oracle process running the test workload (random I/O driven by SLOB) using kernel stack profiles extended with additional sampling of the Oracle OS-process state and the Oracle wait event interface. This is an extension of the work discussed in the blog article "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations". A custom tool ORA_KStackProfiler is introduced to perform the following actions:
  • sampling of /proc/<pid>/stat to measure the process state (Running, Stopped, Disk, etc),
  • sampling /proc/<pid>/stack to sample the kernel stack trace if any
  • sampling the status of the Oracle session (on CPU of waiting for a wait event). The sampling is done directly from SGA by reading the memory locations corresponding to X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM.
Note that ORA_KStackProfiler is an experimental tool with several limitations, including the fact that sampling is not done atomically (the stack profile, process state and wait event are gathered with subsequent calls), which can include errors. See also the README file.
These are the steps used for testing and the resulting flame graph can be found in Figure 3:

1. Run "make" to build the executable: ora_kstackprofiler

2. Run the scripts in the oracle_scripts directory 
     - run eventsname.sql using SQL*plus to generate the file eventsname.sed translating event# into event names 
     - run find_pid_ksustim_ksuseopc.sql using SQL*plus to find the pid and details of X$KSUSE of the Oracle session to trace
Example: 

SQL> @find_pid_ksustim_ksuseopc.sql
This script finds the OS pid and address of X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM for a given Oracle session. To be used together with ora_kstackprofiler. Run as user SYS.

Enter Oracle SID to be investigated: 111
OS pid = 1234, ksuseopc = 5796061762, ksusetim = 5796061792

3. Stack traces sampling extended with Oracle wait event data as in the following example:

./ora_kstackprofiler --pid 1234 --delay 100 --count 1000 --ksuseopc 5796061762 --ksusetim 5796061792 > ora_stack_slob_fastio.txt

4. Process the output with additional filtering and with flame graphs for visualization using https://github.com/brendangregg/FlameGraph

cat ora_stack_slob_fastio.txt| grep -v 0xffffffffffffffff | sed -f oracle_scripts/eventsname.sed | ../FlameGraph/stackcollapse-stap.pl| ../FlameGraph/flamegraph.pl --title "Kernel stack, OS State and Wait event profiling of Oracle random I/O (SLOB workload, low latency I/O)" > ora_stack_slob_fastio.svg


Figure 3: Flame graph visualization of Kernel stack traces extended with process state sampling and wait event details from SGA show a large amount of CPU cycles consumed inside the wait events "db file parallel read" and "db file sequential read" for an I/O bound workload (low-latency random reads generated with SLOB). Note that the wait event details from SGA is reported at the bottom of the graph. The line above that is the process state from /proc/<pid>/stat. On top, the kernel stack trace from /proc/<pid>/stack. Click here for a svg version of the graph.


Figure 3 shows a flame graph of the Kernel stack traces extended with process state sampling and wait event details from SGA. The wait event data is sampled from SGA (and in particular from the memory locations for X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM) and displayed at the bottom line of the graph. Just on top of that, the OS process state from /proc/<pid>/stat, finally the kernel stack data from /proc/<pid>/stack.
The flame graph of Figure 3 provides direct evidence of the fact that a significant amount of the time accounted by Oracle as wait events for I/O, is actually spent consuming CPU cycles for serving I/O.
This is consistent with the fact that I/O in this test is served at high rate. The percentage of samples marked as "On CPU (Oracle state)" is about 10%, which is consistent with what measured with ASH (see snapper output in Figure 2).
There are some important limitations in the method of extended stack sampling: sampling of the stack profile, process state and Oracle wait event details is not atomic, so we can expect mismatches. Moreover in Figure 3 you can see that the percentage of samples in OS state "running" is higher that the amount of CPU time measured with pidstat, in this example by about 20%. Note that the state "running" reported from /proc/pid/status, means that the process is either actually running on CPU or that it is runnable state (in the run queue). When reading the flame graph you should also remember that it is based on sampling, therefore it shows the stack traces that are mostly active. For example in Figure 3 you can see io_submit as one of the most active functions/code paths for asynchronous I/O (in the context of the wait event "db file parallel read"). The corresponding call io_getevents does not appear in the graph as the low latency and high rate of  I/O makes this operation very fast. Tracing with strace or SystemTap can be used to further drill down on this observation with measurements of the I/O calls and their timing.

For additional understanding of what the system is doing it is useful to measure also the userspace stack. For this an extended tool is introduced: Ptrace_Profiler.
The output of Ptrace_Profiler is reported in Figure 4, with some annotations of relevant functions in the userspace and their meaning. You can see that most of the time is spent in operations related to asynchronous I/O, both on the kernel level and on the userspace libaio function. This is consistent with the workload executing random reads under the wait event "db file parallel read" at high speed.


Figure 4: Flame graph visualization of kernel and userspace stack traces for Oracle I/O workload (SLOB) extended with OS process state and wait event details. Data is collected with a custom tool: Ptrace_Profiler. Some of the relevant functions are annotated with their role. Most of the time in the workload is spent executing I/O and in particular io_submit in libaio or in the kernel. Click here for a svg version of the graph.


I/O latency

The nature of the workload and in particular the high number of IOPS and low latency of the storage play an important role in amplifying the amount of CPU consumed in Oracle wait events per unit of time (more work requires more CPU cycles). A direct confirmation that the storage is serving I/O at low latency can be found with OS-based tools reading from block I/O tracepoints. The following tools can be used based on SystemTap and Ftrace: blockio_rq_issue_filter_latencyhistogram.stp and iolatency_micro. Some sample measurements are reported here below. Note that a large number of I/O requests are served in the bucket 64-to-128 microseconds.

# ./iolatency_micro 10
Tracing block I/O. Output every DeltaT = 10 seconds. Ctrl-C to end.
 >=(mus) .. <(mus)  : IOPS       IO_latency/DeltaT |IOPS Distribution                     |
      16 -> 32      : 0          0                 |                                      |
      32 -> 64      : 623        29947             |####                                  |
      64 -> 128     : 6302       605049            |######################################|
     128 -> 256     : 3482       668659            |#####################                 |
     256 -> 512     : 664        255206            |#####                                 |
     512 -> 1024    : 47         36249             |#                                     |
    1024 -> 2048    : 3          5529              |#                                     |
    2048 -> 4096    : 4          13209             |#                                     |



No double counting of CPU usage into wait events with slow I/O

You may not have found a situation like the one described above in your production DBs. Here we want to investigate what made the test "special". The key element of the example above is the low latency and high IOPS of the test workload. This causes the system to be very busy serving I/O and this amplifies the effect of double counting CPU in wait events. In this paragraph an additional test will hopefully make the matter more clear. The idea is to run the same SLOB workload against a storage system serving I/Os with much higher latency. In the test system used for this blog post, based on a VM, the I/O can be slowed down simply by flushing the filesystem cache on the host system (that is the system running Virtualbox):

# free && sync && echo 3 > /proc/sys/vm/drop_caches && free

The next step is to start the SLOB workload using the same test parameters as in the previous test, with the main difference that the I/O workload now runs against a much slower storage system: I/Os are not cached in the Virtualbox host anymore and are served by slow SATA disks. Using OS instrumentation and the AWR report one can quickly see that the system is I/O bound, serving only 300 IOPS (it was 10K IOPS in the previous test with low-latency I/O). Note also from Figure 5 that in this test %DB CPU + %Wait = 102% (of DB Time). This number is quite close to the ideal value of 100% and could easily be interpreted as measurement noise, if you did not have the above analysis to explain the effect double counting of CPU in I/O wait events. Here below an example of the latency histogram taken during this test, note the predominance of very high I/O latency.

# ./iolatency_micro 10
Tracing block I/O. Output every DeltaT = 10 seconds. Ctrl-C to end.
 >=(mus) .. <(mus)  : IOPS       IO_latency/DeltaT |IOPS Distribution                     |
      64 -> 128     : 0          48                |#                                     |
     128 -> 256     : 2          480               |#                                     |
     256 -> 512     : 5          2150              |##                                    |
     512 -> 1024    : 3          2918              |##                                    |
    1024 -> 2048    : 3          5222              |#                                     |
    2048 -> 4096    : 1          3686              |#                                     |
    4096 -> 8192    : 13         80486             |####                                  |
    8192 -> 16384   : 27         333004            |########                              |
   16384 -> 32768   : 44         1103462           |#############                         |
   32768 -> 65536   : 75         3706060           |######################                |
   65536 -> 131072  : 136        13389004          |######################################|
  131072 -> 262144  : 38         7569408           |###########                           |
  262144 -> 524288  : 0          157286            |#                                     |



Figure 5: Extract of the AWR report showing the top wait time for a test using SLOB workload against "slow I/O". Note that DB CPU and no-idle Wait sums up to 102% of DB Time. Which is close to the ideal case of a sum of 100%.

Using ORA_KStackProfiler you can get a more detailed view of what the system is doing while running the SLOB workload: most of the time is spent inside an I/O wait event, the most important being "db file parallel read". However contrary to the case of Figure 3, here the process is mostly in Sleep or Disk Sleep state, waiting for the io_getevents call reap I/Os. The flame graphs of Figure 6 (slow I/O) and Figure 3 (fast I/O) appear to be quite different but they just reflect two aspects of asynchronous I/O underlying the "db file parallel read" wait event. In the case of low latency I/O (Figure 3) most of the I/O wait time is spent submitting I/O while getting the results is very fast, in the opposite example of slow I/O (Figure 6), most of the execution time is spent waiting to reap I/Os (io_getevents). The code path is the same, stack profiling and flame graphs highlight in each case the code path where most of the time is spent.


Figure 6: Flame graph of the kernel stack profile extended with OS process state and wait event details for an I/O-bound Oracle workload generated with SLOB. The I/O system is serving I/Os at about 300 IOPS. Most of the time is spent waiting for I/O. During the I/O wait time the Oracle process is mostly in Sleep or Disk sleep state, although also on running/runnable state for a fraction of the time. Click here for a svg version of the graph.


Conclusions

Extended stack profiling and flame graphs combine visualization of kernel stack trace profiles, userspace stack, OS process state and Oracle wait event data. This technique can be used for advanced troubleshooting and root cause analysis to find the code paths that are most frequently executed and correlate this information with the application context. For Oracle troubleshooting this leads to additional data beyond what is available with the wait interface, to be used in special cases where low-level investigations are needed. An example is provided of using extended stack profiling for drilling down on a case of high utilization of CPU inside I/O wait events where the one observes the anomaly DB CPU Time + Wait Time >> DB time. Extended stack profiles and flame graphs provide direct evidence of what happens in the system and why CPU is consumed inside Oracle wait events.

Download the tools and previous work

The tool discussed in this post can be downloaded from this webpage or from Github.
Relevant previous work on stack profiling, flame graphs and investigations of Oracle workload and wait events can be found at: "Oracle Optimizer Investigated with Flame Graphs", "Flame Graphs for Oracle", "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap".

Acknowledgmeents and references

Brendan Gregg is the inventor of flame graphs and has published very interesting material on the topic. John Beresniewicz (JB)'s talk at the Oak Table World 2015 has discussed the original test case for this blog post. Kevin Closson is the author of  SLOB. Tanel Poder has covered the topic of profiling /proc/<pid>/stack in his blog. Frits Hoogland has many details on Oracle internals and I/O investigations in his blog.


Monday, October 26, 2015

Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations

Topic: This blog post is about kernel stack profiling and visualization with flame graphs.

Context: Stack profiling and flame graphs are very useful tools and techniques for troubleshooting and investigating workloads at the OS-level and understand which code path take most of the execution time. You can find extensive material and examples of flame graphs in Brendan Gregg's website and blog. A few additional examples of using stack tracing and flame graphs for investigating Oracle are: "Flame Graphs for Oracle" and "Oracle Optimizer Investigated with Flame Graphs".
Flame graphs are often used to visualize and analyze stack traces for CPU-bound processes. Perf has proven to be a very good tool to capture such traces in Linux. With perf you can collect stack backtraces of the running processes with little overhead. However this does not cover the cases when the process under investigation is off-CPU (for example sleeping and/or waiting for an I/O operation to finish). This topic has also been addressed by Brendan Gregg in his blog, with off-CPU flame graphs. However data collection for the case of off-CPU stack traces seems to be still an area of active investigation, especially for Linux. This post addresses the case of collecting kernel stack data for flame graph visualization in the case of processes that spend a large fraction of their time off-CPU and/or in system calls.

What's new: In this blog you can find a description of  a basic technique for sampling kernel stacks and a script to automate the process. The script gathers kernel stack traces and process status by sampling the /proc filesystem. This is intended for investigations of processes in Linux that  spend a significant part of their elapsed time off CPU (for example in status S, sleep, or D, disk sleep). 
The main engine of the script is a simple loop reading /proc/pid/stack. I have first heard of this idea in Tanel Poder's blog. A similar concept is also used in the poor man's profiler. Just to illustrate how simple the method is, here is the main data collection part of kstacksampler.sh:

for x in $(seq 1 $iterations); do
   cat /proc/$pid/stack    # get kernel stack trace
done

Download kstacksampler.sh: from this webpage or from Github.

Notes and limitations: The script kstacksampler.sh is quite basic, to be taken as a proof of concept and study material. The overhead and potential side effects of using it are expected to be minimal for most of the systems, so most likely it is safe to use it in production. The use of shell for profiling does not allow for high frequency sampling, typically you should start experimenting with 10-20 Hz. Processes that are busy with CPU usage should rather be traced using perf and on-CPU flame graph techniques. The script currently only traces one process at a time (although it could be adapted to trace multiple pids). The script does not trace threads (although it can be modified to trace tids from /proc/pid/task/tid/stack). The script does not provide userspace traces. The script samples both the process kernel stack and status (running, sleeping, etc). A simple editing of the script can disable this feature if not needed. Note that in Linux the process status "running" refers to both running and runnable (in run queue) processes.


Example 1: Investigation of a workload showing a network bottleneck. 

Preliminary info:
The case is of a database session  retrieving data over a slow network. A simple statement for reading a database table "SELECT * FROM LARGETABLE;" is used to generate the workload. When examining the database process (on the DB server side) it shows little usage of CPU. This can be investigated for example using pidstat (RHEL/EL 6.7 was used for this test).

[root@MYDB]# pidstat -p 9124 1 10
Linux 2.6.32-573.7.1.el6.x86_64 (MYDB)          10/22/2015      _x86_64_        (4 CPU)

10:10:28 PM       PID    %usr %system  %guest    %CPU   CPU  Command
10:10:29 PM      9124    0.00    4.00    0.00    4.00     0  oracle_9124_orc
10:10:30 PM      9124    4.85    4.85    0.00    9.71     2  oracle_9124_orc
10:10:31 PM      9124    1.00    2.00    0.00    3.00     2  oracle_9124_orc

You can also use Oracle instrumentation to investigate the process and find that it is mostly idle. In particular V$ views such as V$SESSION, V$SESSION_LONGOPS, V$ACTIVE_SESSION_HISTORY are some of the views that can be used, or rather their graphical counterpart using OEM.

Stack Tracing:
You can now try the script kstacksampler.sh and build a flame graph with its output to further the investigation. In the following the task is divided in two steps. The first step  is data collection with kstacksampler.sh, with some additional data filtering with grep and sed. The second step integrates with the FlameGraph toolset. In particular stackcollapse-stap.pl folds the stack and flamegraph.pl builds the flame graph (see Figure 1).

Step1: stack sampling
$ ./kstacksampler.sh -p 9124 -n 100 -i .05 | grep -v 0xffffffffffffffff | sed 's/State:\t//g'| sed 's/\[<.*>] //g' >stack_sample_example1.txt

Step2: flame graph
$ cat stack_sample_example1.txt | ../FlameGraph/stackcollapse-stap.pl | ../FlameGraph/flamegraph.pl --title "Kernel stack profiling - Example 1, network-related bottleneck"> kstacksampler_Example1_network_bottleneck_Fig1.svg

The resulting flame graph is:

Figure 1: Flame graph of a process that is network-bound and spending the bulk of its time on TCP stack waiting for messages to arrive, as detailed in Example 1 in the text. Click here for a svg version of the graph.

From the flame graph of Figure 1 you can see that the process is sleeping for a large fraction of its time and that the large majority of the traced operations are related to network/TCP communication. This confirms what found earlier with pidstat and Oracle instrumentation. Kernel stack sampling has helped to find/confirm the root cause of the performance issue.

Additional investigations:
Without going into too many details, which is not the scope of this example, you can find interesting to take one additional step and use the results of the analysis to apply and measure some changes to make the process go faster. The general idea is to use larger "data transfers" and reduce the number of round trips, therefore limiting the impact of the network latency. A simple change in the context of this Oracle-based example is to increase the number of rows fetched at each step by changing the arraysize parameter in SQL*plus and set it to the maximum allowed value: "SET ARRAYSIZE 5000" (arraysize was set to 100 in the previous case and in the tracing data of  Figure 1). Other and similar options are also available at the TCP/kernel level, for those interested in further tuning. The outcome of the change of arraysize is a measurable reduction of the percentage of the elapsed time that is time spent waiting on TCP operations. More importantly this drives the throughput up (from 6 MB/s in the case of Figure 1 to about 20 MB/sec). Also pidstat shows that the CPU utilization increases (from 6% to about 28%). Similar observations can be done using Oracle instrumentation with V$ views. Finally the flame graph for the tuned process can be seen in Figure 1b. Notably the percentage of time spent with process status=running has considerably increased (more useful work is being done per unit of time).

Figure 1b: Flame graph of the process described in Example 1 after changing the parameter arraysize from 100 to 5000. An increase in CPU usage can be seen in this flame graph compared to what is represented in Figure 1 (taken with arraysize=100). This change correlates with higher network throughput and faster query execution. Click here for a svg version of the graph.

Additional test case details:
I have created the case of Example 1 using Oracle running in a VM and connecting to it via a sshd tunnel. The tunnel was set up using putty: putty -P <forwarded_port> -L 1111:DB_host:1521 oracle@VM_host). 1111 is the locally forwarded port, 1521 is the Oracle listener port on DB_host, <forwarded port> is a port on the VM_host forwarded into port 22 (sshd) of the DB_host. SQL*plus on a client machine was used to connect to the DB and run the query: "SELECT * FROM LARGETABLE;", where LARGETABLE is any table big enough so that the query runs in a few minutes. Another tip is to "set autotrace traceonly" before running the query to suppress the output to the screen.


Example 2: Investigation of an Oracle process doing random I/O. 

This example is about investigating I/O-bound processes. In particular Figure 2 shows the output of kernel stack profiling and flame graph for an Oracle process executing Kevin Closson's SLOB workload. The same two steps as detailed above in Example 1 have been used to collect data and produce the flame graph. The workload is mainly random read from block devices. The DB server used for this example runs in a VM and Oracle is configured to use ASM. The stack traces show that the process time is mostly spent either doing I/O or waiting for I/O calls to complete. Two different type of I/O system calls are visible in Figure 2: calls for blocking I/O (pread64) and calls for asynchronous I/O (io_submit and io_getevents). At the Oracle level this is reported using the wait events "db file sequential read" (blocking I/O) and "db file parallel read" (asynchronous I/O). You can read more on the topic of tracing Oracle I/O at "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap"

Figure 2: Flame graph visualization of the stack profile for an I/O bound Oracle process running SLOB. The process is spending the majority of the time executing I/O or waiting for I/O calls to return. Click here for a svg version of the graph.


Conclusions

Stack profiling and flame graphs are useful techniques and tools for troubleshooting performance investigations at the OS-level. Flame graphs are often used for profiling CPU-bound processes, for example in Linux with the help of perf. This post describes how stack profiling of kernel traces from the /proc filesystem can be used to investigate workloads that spend a significant amount of their time off-CPU and/or in kernel code (for example sleeping for I/O or network system calls to complete). A simple tool for stack sampling, kstacksampler.sh, complements this blog post together with two examples of its use for investigating network and I/O-bound Oracle processes.

References and acknowledgements

Brendan Gregg's flame graphs are the starting ground for this work. Tanel Poder has also published several original investigations of stack profiling for Oracle troubleshooting, notably this included the blog post "Peeking into Linux kernel-land using /proc filesystem for quick’n’dirty troubleshooting". Kevin Closson has also published work on stack profiling with perf and OProfile and he is the author of SLOB, used in Example 2 of this post.


Saturday, August 29, 2015

Add Color to Your SQL

Topic: this post is about some simple techniques to add color to SQL scripts and their terminal output using ANSI escape codes.

Colors can be used to improve the output of command line tools. This is common practice, for example with the (bash) shell. Colors can also be very useful to improve the quality of the output of SQL scripts. In my experience this is not used frequently, probably because of the the need of specialized techniques and also because the results depend on the terminal emulator (or tool) used to display the output. In this post you will find some pointers and examples that you can use to add color to your SQL output.

Notable previous work

Tanel Poder has published a couple of years ago his very cool logo to the great snapper v4 script using SQL and color-rich terminal output.
Sayan Malakshinov has published a blog article and a script color.sql that provide some ready-to-use and simple techniques for coloring the output of Oracle SQL*plus scripts.
ANSI escape codes are the main underlying technique to add color to the terminal output, see more details on how this works at: Wikipedia article on ANSI escape codes.
Putty is a widely usedterminal emulator that support ANSI escape calls. If you are a Windows user, note that CMD.EXE does not support ANSI escape codes, therefore it will not be suitable to run the scripts described in this post.
Heat map visualization is a powerful technique to explore 3D data, by providing the third dimension as color. I have integrated heat map visualization and command line/ terminal output with two tools OraLatencyMap and PyLatencyMap aimed at the study of I/O latency. I will share in the next paragraph some of the tips and lessons learned from developing those tools regarding the use of color on the terminal output.

Color palettes by example

Color palettes are very useful for heat map visualization. I have identified two simple palettes for displaying I/O latency histograms on terminal output: the first one is composed of shades of blue, the other is yellow-to-red. See an example of their usage as heat maps at this link. The scripts Color_palette_blue.sql and Color_palette_yellow-red.sql show two basic examples of how to generate color palettes using ANSI escape codes. The SQL, also pasted here below, works simply by changing the character background color, printing a white space and finally resetting the background back to normal:

define ANSICODE_PREFIX="chr(27)||'[48;5;'"
define ANSICODE_BACKTONORMAL="chr(27)||'[0m'"

select 0 ID, &ANSICODE_PREFIX|| '0m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Black
select 1 ID, &ANSICODE_PREFIX|| '15m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- White
select 2 ID, &ANSICODE_PREFIX|| '51m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- Light blue
select 3 ID, &ANSICODE_PREFIX|| '45m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 4 ID, &ANSICODE_PREFIX|| '39m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL 
select 5 ID, &ANSICODE_PREFIX|| '33m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 6 ID, &ANSICODE_PREFIX|| '27m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Dark blue 
select 7 ID, &ANSICODE_PREFIX|| '21m '|| &ANSICODE_BACKTONORMAL COLOR from dual; 

The generation of the ANSI escape codes has little complexity and can be ported with little effort to many other language of interest. Here an example in PL/SQL (from OraLatencyMap):

create or replace function asciiescape_color (p_token pls_integer, p_palette_type varchar2) 
return varchar2 
   is
      type t_palette is varray(7) of pls_integer;        -- a palette of 7 colors
      v_palette_blue t_palette := t_palette(15,51,45,39,33,27,21);      -- shades of blue
      v_palette_red t_palette := t_palette(15,226,220,214,208,202,196); -- white-yellow-red
      v_colornum pls_integer;
   begin
      if ((p_token < 0) or (p_token > 6)) then
          raise_application_error(-20001,'The color palette has 7 colors, 0<=p_token<=6, found instead:'||to_char(p_token));
      end if; 

      if (p_palette_type = 'blue') then
          v_colornum := v_palette_blue(p_token+1);                               
      else
          v_colornum := v_palette_red(p_token+1);                                 
      end if;
      return(chr(27)||'[48;5;'||to_char(v_colornum)||'m');   
      --return the ascii escape sequence to change background color 
end;
/

An example in Python (from PyLatencyMap):

def asciiescape_color(token, palette):
    blue_palette = {0:15, 1:51, 2:45, 3:39, 4:33, 5:27, 6:21}        # palette, shades of blue 
    red_palette  = {0:15, 1:226, 2:220, 3:214, 4:208, 5:202, 6:196}  # white-yellow-red palette
    if palette == 'blue':
       color_asciival = blue_palette[token]
    elif palette == 'red':
       color_asciival = red_palette[token]
    else:
       raise Exception('Wrong or missing palette name.')
       exit(1)
    return(chr(27) + '[48;5;' + str(color_asciival) + 'm')


Other ANSI escape codes of interest from OraLatencyMap and PyLatencyMap are the codes to restore the cursor back to the normal value and to clear the screen. Here is an example from PyLatencyMap (Python):

#reset the background color back to normal
asciiescape_backtonormal = chr(27) + '[0m'

# clear screen and move cursor to top line
line += chr(27) + '[0m' + chr(27) + '[2J' + chr(27) + '[H' 

An example of colorizing SQL output

You will see in this paragraph an example that I hope is both instructive and fun: how to add colors to a script that computes an image of the Mandelbrot set. The starting script is quite interesting by itself as it uses just SQL for computation and output display. The code is not original, I have ported it to Oracle from code on the PostgreSQL wiki, with some minor modifications. Mandelbrot_SQL_Oracle_text.sql is the "black and white" starting starting script before adding color.
Adding colors to the output for this example is just a matter of combining the original "black and white" script with the SQL scripts for generating color palettes. The results are the the following two scripts (see also the figure below for an example of their output):

The figure illustrates how to add color to some classes of SQL output. The examples use a test script which generates an image of the Mandelbrot set using SQL. Note that adding color to the plain text version is achieved here just by adding an extra table join (with the PALETTE common table expression). Other methods to add color to Oracle scripts include using PL/SQL functions or the use of other languages, languages, such as Python, as discussed in the text of this post.

A version of the script for PostgreSQL that uses the ideas discussed in this post for colorizing SQL is: Mandelbrot_SQL_PostgreSQL_color_blue.sql

Conclusions

Colors can improve the effectiveness of command line scripts and terminal output, including SQL scripts. ANSI escape codes provide a powerful tool for many terminal operations. Heat map visualization is a powerful data visualization technique that can be implemented also on the terminal output using ANSI escape codes. In this post you can find simple tips on how to add colors to the terminal output both for SQL and other languages, notably Python. Adding color to "black and white" script output can be fun and useful at the same time, as illustrated with the Mandelbrot SQL example. Happy coloring!

Download: The tools discussed in this post can be downloaded from this webpage and from Github

Additional links and references

Tanel's colored fished logo to the snapper v4 script
Sayan Malakshinov's blog article and script color.sql
Wikipedia on ANSI escape codes
Latency heat maps for I/O latency measurements on the CLI with OraLatencyMap and PyLatencyMap
Fun SQL snippets from the PostgreSQL wiki
Additional examples of recursive common table expression (recursive subquery factoring) and non-standard uses of SQL: how to find numeric solutions to basic physics equations using SQL



Wednesday, July 29, 2015

Diagnose High-Latency I/O Operations Using SystemTap

Topic: this post is about some simple tools and techniques that can be used to drill down high-latency I/O events using SystemTap probes.

The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures and/or are going to break completely in the near future.
I/Os of high latency on just a few disks can then appear as latency outliers when accessing volumes build on a large number of disks and overall affect the performance of the entire storage. I write this having in mind the example of a storage system built with (SATA) JBODs using Oracle ASM as volume manager/DB filesystem. However the main ideas and tools described in this post apply to many other volume managers and file systems, including HDFS.

The standard tools: One way to find that one or more disks are serving I/O requests with high latency is with the use of standard Linux tools such as iostat, sar or collectl. Typically you would use those tools to spot anomalous values of average service time, average wait time and also of queue size.
A structured approach on how to do this is described in Brendan Gregg's USE method and the tools that can be used in Linux to implement it.

SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that can be used to identify disks serving I/O with high latency using SystemTap probes to investigate I/O latency of the block devices.
The script blockio_latency_outliers_per_device.stp provides a measurement of some basics latency statistics for block device, including number of I/Os, average and maximum latency. The script also provides details of all the I/O where the latency is above a certain programmable threshold (the default threshold is set at 500 microseconds).
An example of its use (edited output for clarity) is here below. Note the latency warning message and overall the very large maximum value measured for the latency of the /dev/sdy block device:

[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10

Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop

...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....

I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz                     200            123           5995          27205
sdq                     211             71           6553          31120
sdh                     256            103           6643          22663
sds                     224            101           6610          29743
sdm                     238             92           7550          35571
sde                     243             90           8652          52029
sdt                     200            105           5997          25180
sdk                     200             94           5696          35057
sdi                     206             99           7849          30636
sdg                     269             74           6746          36746
sdy                     197            102          98298        1167977
sdr                     200             89           6559          27873
sdl                     200            140           8789          31996
sdw                     210             99           7009          37118
sdd                     217             94           7440          56071
sdn                     205             99           6628          41339
....


When candidate disks for high latency have been identified, the second step is to further drill down using the script blockio_rq_issue_filter_latencyhistogram.stp. This script gathers and displays I/O latency histograms for a subset of block devices that can be specified using filters in the script header. The default filters are:

# SystemTap variables used to define filters, edit as needed
global IO_size = 8192    # this will be used as a filter for the I/O request size
                         # the value 8192 targets 8KB operations for Oracle single-block I/O
                         # use the value -1 to disable this filter
global IO_operation = 0  # this will be used as a filter: only read operations
                         # a value of 0 considers only read operations (the value 1 is for write)
                         # use the value -1 to disable this filter
global IO_devmaj = -1    # this will be used as a filter: device major number (-1 means no filter)
                         # example: use the value 253 for device mapper block devices
global IO_devmin = -1    # device minor number (or -1 if no filter)

You can use  blockio_rq_issue_filter_latencyhistogram.stp to drill down on the latency histogram for those disks that have shown high latency I/O and also compare them with "good" disks. In the example above the candidate "trouble disk" is /dev/sdy with major number 65 and minor 128 (you can major and minor device numbers for "sdy" simply using ls -l /dev/sdy).
Example:

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 128 (-1=disable filter)

lock I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |@                                                   2
    128 |@@@@@@@                                            14
    256 |@@                                                  4
    512 |@                                                   2
   1024 |@@@                                                 7
   2048 |@@@@@@@@@                                          19
   4096 |@@@@@@@@@@@@@@@@@@                                 37
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      59
  16384 |@@@@@@@                                            14
  32768 |                                                    0
  65536 |@@@@@          HIGH                                10
 131072 |@@@@@          LATENCY                             11
 262144 |@@@@@          I/O                                 10
 524288 |@@@            OPERATIONS                           6
1048576 |@@                                                  4
2097152 |                                                    0


Note the presence of several I/O operations at high latency together with more normal I/O latecy, for example I/Os served from SATA disks at around 8 ms and also I/O served from the controller cache at sub-millisecond latency.
For comparison here below is the latency histogram measured on another disk while the same workload was running. We can see that in this case high latency points found in the previous example are no more present. Most of the I/O operations are around 8 ms latency and some operations served from cache at sub-millisecond latency. The I/O reported for the /dev/sdy disks with latency of 64 ms and above are not present in this case.

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 48 (-1=disable filter)


Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
   32 |                                                    0
   64 |                                                    0
  128 |@@@@@@@@@                                          19
  256 |@@@@@@@@                                           16
  512 |                                                    1
 1024 |@@@@                                                9
 2048 |@@@@@@@@@@@@@@                                     28
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       56
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           49
16384 |@@@@@@@@@                                          18
32768 |                                                    0
65536 |                                                    0

The tests reported here have been performed on a old system running RHEL 5 (kernel 2.6.18-371)  and SystemTap v. 1.8, however the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.

Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers and in particular on how to drill down on I/Os served at high latency because of one or more misbehaving disk in the storage volume/filesystem. The investigation has been done using SystemTap scripts used first to discover which disks were serving some of their I/O s with high latency and then drilling down on specific devices with the use of latency histograms. The fact of identifying and later replacing the badly performing disks can be beneficial for the performance of the entire storage systems.

Download: The tools discussed in this post can be downloaded from this webpage and from Github.

Acknowledgements and additional links: Brendan Gregg has published an extensive set of articles and tools on performance tuning, including storage latency investigations.
For more information and examples on how to use SystemTap see the SystemTap wiki
Additional tools for Oracle ASM I/O investigations also see Bertrand Drouvot's asm_metrics.pl
Kevin Closson's SLOB was used as workload generator for the examples discussed here.
Additional links on tools and techniques on Oracle I/O troubleshooting in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Tuesday, July 14, 2015

Heat Map Visualization of Latency Histograms for NetApp C-Mode

Topic: This post is about collecting and visualizing I/O latency histograms for NetApp filers in C-mode.

Motivations: The drill down of I/O latency is an important technique for troubleshooting and benchmarking storage. Average latency values can hide details of what is happening on the storage. Think for example of storage systems with flash and spindles, each serving I/O at different latency. Moreover averaging the measured values over time can hide details in case of varying workload or with issues that appear sporadically.

Latency heat maps: Representing latency histograms over time is a 3-D visualization problem and can be nicely solved using heat maps. This has been discussed in details by Brendan Gregg in ...

PyLatencyMap is a simple visualization tool that works on the command line and takes as input latency histograms and outputs two latency maps using ANSI codes for colors.
The latency heat map on the top of the screen uses a blue palette and represents the number of operations per second as a function of the latency bucket and time. The second map in yellow-red represents the time waited as a function of the latency bucket and time.

How to collect and display I/O latency histograms for NetApp filers

For a quick you can use Example10_NetApp_Cmode_reads.sh. Before running it you will need to edit the parameters: password to connect to the storage, the junction point you want to measure, the cluster management node and optionally the time interval between data points. Then run the example script. Similarly Example10b_NetApp_Cmode_writes.sh will capture and display heat maps for write operations if you prefer.

This is an example of the output

Figure1: Example output of PyLatencyMap measuring and displaying I/O read latency histograms using performance counters from the NetApp filer (C-mode). The top heat map (blue color) represents I/O operations latency over time. The lower heat map (yellow-red color) represents the time waited per bucket.

Additional details

The 'glue' between NetApp and PyLatencyMap is provided by two Perl scripts written by Ruben Gaspar. The scripts are: (1) NetApp_histogram_Cmode.pl connects to the filer and gets the latency histograms (2) NetApp_latency_connector.pl transforms the histograms from the format of the NetApp filer into a format that is understood by PyLatencyMap (i.e. a power-of-two histograms).

Note that data aggregation in buckets is done by NetApp_latency_connector.pl using the same conventions as Oracle's wait event histograms (v$event_histogram and v$event_histogram_micro), that is a value of N for bucket B means that there have been 100 I/O operations with latency between B and B/2. For those of you who use DTrace 'quantize' or SystemTap 'hist_log', please note the difference as in those tools the convention is that a value of N for bucket B means N operations with latency between B and 2*B.
Note also the following dependencies: the scripts discussed here for the integration between NetApp performance counters and PyLatencyMap have been developed for NetApp Clustered Data ONTAP and in particular have been tested on version 8.2 an 8.3 (notably they will not work against OnTap version 7). Also Perl is needed including the modules Net::OpenSSH and IO::Pty. PyLatencyMap is written in Python.

Conclusions and additional links

PyLatencyMap is a tool for collecting and displaying latency heat maps on the command line intended as a aid for performance troubleshooting and benchmarking activities. PyLatencyMap has now been integrated with NetApp performance counters which adds to the existing data sources of Oracle wait events, SystemTap and DTrace probes among others.

Download PyLatencyMap from:
https://github.com/LucaCanali/PyLatencyMap or http://canali.web.cern.ch/canali/resources.htm 

Previous blog entries on PyLatencyMap:
http://externaltable.blogspot.com/2015/03/heat-map-visualization-for-systemtap.html
http://externaltable.blogspot.com/2013/09/getting-started-with-pylatencymap.html
http://externaltable.blogspot.com/2013/08/pylatencymap-performance-tool-to-drill.html

Monday, June 15, 2015

Event Histogram Metric and Oracle 12c

Topic: event histogram metric, a script and some related discussions on collecting and displaying wait event latency histograms for Oracle performance troubleshooting. Also discussed is the new 12c feature of event histrograms with microsecond buckets.

Why: Latency histograms (and by extension wait event histograms) provide very useful information when troubleshooting performance for systems exhibiting response time with multi-mode distribution. In such cases average wait values are often not sufficient to understand the behavior of the system under study and histograms provide a finer level of details. A common case is with I/O performance investigations for many modern storage systems, which  can serve I/O from HDD or flash, with completely different latency characteristics.

What: Event histogram metric is a command-line script for "near real-time" monitoring, based on the simple idea of taking two snapshots of the relevant V$ views and print the wait event histogram of the delta values, see also Figure 1 for an example. You can find an example of a case when event histogram metric has helped for performance troubleshooting in a previous blog entry at this link.

What's new: The latest version of event histogram metric (v 1.1, June 2015) adds the support for a little but very useful  new feature of Oracle version 12.1.0.2, where histogram buckets go down to the microsecond level with V$EVENT_HISTOGRAM_MICRO. Another feature I have added event histogram metric v 1.1 is the calculation of the estimated time waited per bucket. This gives an indication of the relative weight/importance of the time waited in each latency bucket and allows, for example, to compare the relative weight of many low-latency waits vs. a few events with long latency.

Download the tool from http://cern.ch/canali/resources.htm or from https://github.com/lucacanali The relevant scripts for this post are:
  • ehm.sql - a script to collect and visualize wait event histograms for Oracle RAC, 11g and 12c.
  • ehm_local.sql - same as ehm.sql but collects data only for the local instance (V$ view data), 11g and 12c.
  • ehm_micro.sql - collects and displays wait event histograms with microsecond buckets for RAC, requires Oracle 12.1.0.2.
  • ehm_micro_local - same as ehm_micro but for the local instance (V$ view data), requires Oracle 12.1.0.2.

Example of usage

Figure 1: Example of usage of the event histogram metric script on Oracle 12c. You can see low latency and high latency I/O modes. Sub millisecond I/Os come from the storage SSD/cache, high latency come from I/Os served by reading SATA disks. Note also that the average latency value, reported on the bottom of the figure, does not capture the details of the multiple modes and therefore does not represent the additional complexity of this workload.


Wait events of interest

Wait event 'db file sequential read' is used by the Oracle engine to provide a measure of the time spent on single-block read operations. This is an important source of data when troubleshooting the performance of many OLTP workloads, as it accounts for the time spent on storage access for random I/O (additional details and comments on the limitations of this approach are detailed in the next paragraph). This is the typical data access path when using indexes, for operations like primary key look-ups, index range scans, nested loops joins, etc, that are quite common in many transactional workloads. A short investigation of the internals of Oracle and its interaction with the OS (see also this blog post for details) shows that in most cases the time accounted by 'db file sequential read' is essentially the time taken to complete the underlying system call call to read data from the storage. Therefore this event provides a good measure for the storage response time as seen from the database server. Moreover in RAC it is easy to aggregate wait event data for all instances in one histogram using GV$ views and this provide an easy source of data for the shared storage.
Note: in Exadata deployments instead of 'db file sequential read' you should expect to see the wait event 'cell single block physical read'.

Another important wait event for troubleshooting the performance of many Oracle databases with significant transactional workloads is 'log file sync'. This wait event measures the time a session will wait on commit operations. There are several components to the wait time accounted in log file sync (see paragraph below for some additional hints and known pitfalls), one of the most important for the scope of this article is the time necessary to perform I/O on the redo log files. Studying the latency of this event can be a very good source of information for performance troubleshooting as it may allow you to understand if writes to redo log files are performed at low latency, so most likely into the storage DRAM or SSD cache, or rather with high latency, which may indicate the lack of caching, or the cache going into write through mode or other issues worth further investigations.

Limitations and additional comments

When investigating latency and storage response time for random I/O, the wait event 'db file sequential read' is a very important source of data, however not all random I/O performed by Oracle is accounted under this wait event. When Oracle see the opportunity of optimizing storage access by using prefetching or batching I/O, it will use asynchronous I/O (when available). Asynchronous I/O for random reads will be accounted with the wait event 'db file parallel read'. Asynchronous I/O improves the utilization of the I/O subsystem by the Oracle processes, as it allows to submit and reap many I/O operations in groups, as opposed for example to reading blocks one by one as in the case of pread system calls. However this makes the task of correlating the wait event latency with storage response time much more difficult. It is the case of 'db file parallel read' but also the case of wait events associated to asynchronous I/O used for sequential I/O operations such as 'direct path read' More on this topic and the pitfalls when using the wait event interface for asynchronous I/O are detailed in this other blog post (see also Oracle Support Doc ID 7448407.8  "Bug 7448407  Resource Accounting and Tracing change for Direct Path Waits").

One known pitfall when interpreting the measurements of the 'log file sync' wait event is that the wait time includes also what can be a significant CPU workload. This has been discussed in details by Kevin Closson. There are also many other details, including a new adaptive mechanism for the communication between the server process and log writer introduced in 11.2.0.3 (see: adaptive log file sync ) and more changes introduced in 12c with redo writer worker processes. See also Oracle Support Doc ID 34592.1 "WAITEVENT: log file sync Reference Note " for a list of known issues and bugs in this area.

As an addendum to the discussion: this is the formula used in the script event histogram metric to compute the the estimated value for the wait time in each bucket: T_wait = 0.75 * bucket_max_val * wait_count. As an example of the usage of the formula, consider 100 wait event in the Oracle bucket 2 ms. This means that the wait time has been between 1 ms * 100 events = 100 ms and 2 ms * 100 event = 200 ms. According to the formula above the estimated value is 0.75 * 2 ms * 100, that is the mean value between the maximum was of 2 ms * 100 and minimim wait of 1 ms * 100.

Related work and links

The study of latency histograms as a function of time leads to a 3D representation problem. A natural solution for it is to represent latency values in a heat map. If you are interested in more details see this blog post on OraLatencyMap, a tool to collect and display heat maps of wait events using SQL*plus and this series of posts on PyLatencyMap, a python-based tool integrated with many data sources, including Oracle wait events, SystemTap and DTrace.
Event histogram metric is part of a group of short command line scripts I use for Oracle troubleshooting, including scripts for near-real time measurement of Oracle performance. The scripts are available for download, you can read more details following this link.
When troubleshooting Oracle performance and investigating I/O latency details it can be useful to get graphs of historical trends from the AWR repository. PerfSheet4 is a tool that I use to extract and visualize AWR data.

Conclusions

Latency histograms of Oracle wait events are a powerful drill-down tool for performance investigations. Histograms are necessary when average values do not provide an accurate picture, for example when studying I/O latency of storage with multi-mode response time. Two wait events are of particular interest for many Oracle workloads: db file sequential read and log file sync. In Oracle 12.1.0.2 a new view V$EVENT_HISTOGRAM_MICRO provides latency details with microsecond buckets, which is very useful when investigating SSDs and low-latency storage in general. A set of simple scripts for near real-time data collection and display of the histograms have been discussed in this post and are available for download.

Wednesday, March 18, 2015

Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap

Topic: PyLatencyMap v1.2 a tool for collecting and visualizing I/O latency data collected and about its integration with SystemTap.

Introduction: When studying storage performance latency drill-down can be very important. Measuring the average I/O latency is often not enough, latency histograms are proven to be more suitable for investigating modern storage systems. This is because the storage service time can present multiple "modes". A common case is with storage systems where I/O requests can be served by low-latency SSD or by spindles with much higher service time. For such systems latency histograms provide an additional dimension to the performance investigation.
Latency heat maps come into play when we want to visualize a series of latency histograms over time. This is a 3-D visualization problem and heat maps provide a way to represent the third dimension as color (see as an example Figure 1 later in this article).
The main ideas behind the use of latency histograms for investigating storage performance and their visualization heat maps have been previously detailed in the excellent paper by Brendan Gregg on ACM queue and also on blog articles and presentations.

PyLatencyMap is a tool and set of scripts for gathering  and displaying heat maps using the command line interface (supporting ANSI terminal graphics). The main use for which PyLatencyMap has been developed is to measure and display storage latency data. PyLatencyMap can also be used to display heat maps from a generic source of latency data.
The latest update to PylatencyMap, version 1.2, provides integration with SystemTap and a few example scripts for getting started. SystemTap can be used to collect storage latency and PylatencyMap for producing the associated heat maps. Many more data sources have already been integrated with PyLatencyMap and examples are available with the tool: Oracle wait event histograms, DTrace histograms and custom aggregations from trace files.
PyLatencyMap has a modular architecture, which follows the typical use of Unix/Linux CLI tools. Data is generated by a collector script (for example a SystemTap script), then it is piped through an optional filter and finally piped into to the visualization engine:

latency data_source | <optional connector script> | python LatencyMap.py <options>

A series of examples and get-started scripts are provided with the tool. In this article we will focus on the scripts based on SystemTap. For discussions on how to use PyLatencyMap for SQL*plus and DTrace see also this other blog entry. Just to get started see the following CLI to run data collection with SystemTap and visualization with PyLatencyMap:

stap -v SystemTap/blockio_rq_issue_basic_pylatencymap.stp |python SystemTap/systemtap_connector.py |python LatencyMap.py

The script blockio_rq_issue_basic_pylatencymap.stp measures the latency of block I/O operations and can be used to collect latency histograms and display them as heat maps. The probe hooks top kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete").
There are three probes in the script: one that gathers information at each block io operation, another that collects stats at the end of the I/O and a third probe that periodically prints the latency histogram. Note the use of the aggregation operator '<<<'  that SystemTap uses to populate the histograms, which are later printed with the use 'hist_log'.

probe kernel.trace("block_rq_issue") {
        requestTime[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
   t = gettimeofday_us()
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (t-s)
       delete requestTime[$rq]
   }
}

probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")

}

SystemTap scripts for PyLatencyMap:

For production usage there are more advanced scripts than the one just discussed. The script  Example9_SystemTap_blockIO_req.sh is based on blockio_rq_issue_pylatencymap.stp and provides tracing based on block_rq_issue and block_rq_complete as the example above but contains additional filters to remove 'false positives'. For ease of use this is packed into

Another example script is Example9b_SystemTap_blockIO_filter_read_8KB.sh is based on blockio_rq_issue_filter_pylatencymap.stp which provides the possibility to further filter on the block I/O being issues, in particular the defaults filter for 8KB-reads, which are of interested when testing OLTP applications in Oracle.

Example9c_SystemTap_pread.sh is a scripts that packages a SystemTap probe on pread calls. These are typically the underlying OS calls for single block reads that show up as 'db file sequential read' wait events in Oracle. Note however that Oracle can do single-block reads also using asynchronous, therefore the probes discussed above on block_io trace points are more general than this one.

Examples 9e, 9f: the probes in these scripts measure block_io hooking on different probe points and are provided for reference. Note, these scripts require to have kernel debuginfo symbols.

oracle_event_latency.stp: this script measures the latency of Oracle wait events and produces histograms by hooking directly into the Oracle executable. More details on how this works can be found at the blog entry "SystemTap into Oracle".

Example:

Figure 1 here below is an example the output from PyLatencyMap run using the script Example9b_SystemTap_blockIO_filter_read_8KB.sh. The measured workload has been produced with SLOB (at increasing load). The top heat map represents the number of I/O operations per latency bucket and as a function of time. On the horizontal axis you can fine the time, on the vertical axis the latency buckets. The tool assumes a power-of-2 scale. Color is used to represent the number of I/O operations per second. A blue palette is used for the frequency heat map.
The  heat map on the lower half of the figure uses a yellow-red palette to represent the amount of time waited in each bucket. I refer to it as a intensity/importance heat map. It highlights the relative weight of latency time spent on each bucket and helps in comparing the impact of a large number of low-latency operations vs. a smaller number of high latency operations.


Figure 1:  The output of PyLatencyMap contains the latency histogram of the latest measured point and  two heat maps: on top the heat map of the IOPS, on the lower graph the heat map of the waited time.

Note: for SystemTap (and DTrace) data sources, the latency bucket has to be interpreted as follows: a value of N in bucket B, means N I/O operations with latency between B and 2*B.
Other histogram latency data sources (notably Oracle) use a different convention. With the same example a value of N in a latency bucket B for Oracle histograms means N waits of latency between B and B/2. PyLatencyMap takes care of this difference when calculating the time waited histograms (i.e. the
Note: The scripts discussed here for block I/O tracing use the trace point block_rq_issue, it is also possible to use the trace point block_rq_insert to measure I/O latency from the point of insertion into the queue instead.

References and download links:

PyLatencyMap can be downloaded from this link or can be forked from Github. Additional information on PyLatencyMap is also available at this blog entry and also at this other blog entry.

Conclusions and acknowledgements

SystemTap is a very powerful tool for measuring I/O and producing latency histograms for storage performance investigations. PyLatencyMap provides an easy way to visualize histograms produced by SystemTap as latency heat maps. PyLatencyMap collects data and displays heat maps using the command line interface on character terminals with ANSI escape codes. This provides the combined benefit of a graphical output and the power of the command line. PyLatencyMap can also be used to visualize heat maps from latency histograms produced with DTrace, Oracle's SQL*plus and/or a from latency data aggregated from text and trace files.

Many thanks go to Brendan Gregg for sharing in his blog many awesome ideas and tools on the topic of heat map visualization and on using DTrace and SystemTap that have provided a great source of inspiration for this work.

Note: this entry was originally posted in March 2015, with major updates in July 2015.