
Tuesday, May 31, 2016

Linux BPF/bcc for Oracle Tracing

Topic: In this post you will find a short discussion and pointers to the code of a few sample scripts that I have written using Linux BPF/bcc and uprobes for Oracle tracing.

Previous work and motivations

Tools for dynamic tracing are very useful for troubleshooting and internals investigations of Oracle workloads. Dynamic tracing probes on the OS/kernel can be used, for example, to measure the details of I/O latency. Moreover, probes on the Oracle userspace functions can be used to complement Oracle instrumentation and probe deeper into the internals of the Oracle engine, when needed. For example, in work done in collaboration with Frits Hoogland we have investigated how to link the Oracle wait event interface with tools able to probe the userspace, such as gdb and dynamic tracing tools like DTrace, SystemTap and Perf. More details and examples on this topic can be found in previous work: Modern Linux Tools for Oracle Troubleshooting (presentation at UKOUG Tech14), Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, Linux Perf Probes for Oracle Tracing, and Flame Graphs for Oracle.


What's new with BPF/bcc

BPF together with its frontend bcc are new and exciting technologies in the Linux dynamic tracing ecosystem. If you are not familiar with them, you can get up to speed using the excellent material in Brendan Gregg's blog. While the tools are still in development as I write this, a large number of important features are already available starting from kernel 4.5, notably the support for uprobes that I have used in the scripts discussed later in this post.

I have started experimenting with porting a few probes for the Oracle userspace that I had written using SystemTap and discussed in this blog. While the language/syntax is completely different between SystemTap and BPF/bcc, porting the probes has turned out to be relatively straightforward. The work has been made substantially easier by the availability of a large selection of well-written and documented scripts in the tools directory of bcc. At present, writing BPF/bcc scripts is a mixture of coding in Python and C, which feels to me both familiar and at the same time strangely low-level compared to more mature environments such as DTrace and SystemTap.
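To give an idea of what this mixture of Python and C looks like, here is a minimal sketch of a bcc uprobe on the Oracle hard-parse function "opiprs". It is not one of the scripts in the repository, just a stripped-down illustration: the Oracle binary path is an assumption for my test system, and it relies on the fact (discussed later on this page) that on x86_64 opiprs receives the pointer to the SQL text in register rsi, i.e. the second function argument:

#!/usr/bin/python
# Minimal bcc sketch: print the SQL text at every Oracle hard parse.
# Assumptions: oracle binary path below, SQL text pointer in rsi (2nd argument).
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

int trace_parse(struct pt_regs *ctx) {
    char sql[80] = {};      // truncated copy of the SQL text
    // rsi (PT_REGS_PARM2) -> pointer to the SQL text in the process address space
    bpf_probe_read(&sql, sizeof(sql), (void *)PT_REGS_PARM2(ctx));
    bpf_trace_printk("opiprs: %s\\n", &sql);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_uprobe(name="/u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle",
                sym="opiprs", fn_name="trace_parse")
print("Tracing Oracle hard parsing... Hit Ctrl-C to end.")
b.trace_print()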

The example scripts that I have written using BPF/bcc are on Github. Here is a list with comments:

  • ora_sqlparse_trace.py: tracing of Oracle SQL parsing. This script traces SQL hard parsing on Oracle binaries by hooking on the Oracle function "opiprs" and reads from the function arguments (CPU registers) and from process memory.
  • ora_wait_trace.py: tracing of Oracle wait events. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reads from the function arguments (CPU registers).
  • ora_logicalIO_histogram.py: logical IO latency histograms. This script measures the latency between call and return time for the Oracle function "kcbgtcr", which is an important part of the logical IO processing for consistent reads.
  • ora_wait_histogram.py: wait event latency histograms. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reads from the function arguments (CPU registers). BPF computes the latency histogram for the wait events and the script prints the values on stdout.

An example of the usage of ora_wait_histogram.py to measure and display wait event latency:

# stdbuf -oL ./ora_wait_histogram.py 10 10|sed -e 's/event# = /event#=/g' -f eventsname.sed
Start tracing oracle wait events... Hit Ctrl-C to end.

event=db file sequential read
     wait time, microsec : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 25       |                                        |
       128 -> 255        : 24521    |********************                    |
       256 -> 511        : 46788    |****************************************|
       512 -> 1023       : 12169    |**********                              |
      1024 -> 2047       : 1132     |                                        |
      2048 -> 4095       : 660      |                                        |
      4096 -> 8191       : 248      |                                        |
      8192 -> 16383      : 29       |                                        |
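For reference, this is a condensed sketch of the core technique used by ora_wait_histogram.py; it is not the actual script (which is on Github and also breaks the histogram down by wait event). This stripped-down version aggregates all events together, assumes the Oracle binary path shown, and assumes that kews_update_wait_time receives the wait time in register rsi, as noted in the wait event function descriptions in the post on Linux Perf Probes for Oracle Tracing further down this page:

#!/usr/bin/python
# Sketch: log2 histogram of Oracle wait times, aggregated over all wait events.
# Assumptions: oracle binary path below, wait time passed in rsi (2nd argument)
# of kews_update_wait_time.
from time import sleep
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HISTOGRAM(dist);

int trace_wait_time(struct pt_regs *ctx) {
    u64 wait_time = PT_REGS_PARM2(ctx);   // rsi -> wait time in microseconds
    dist.increment(bpf_log2l(wait_time));
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_uprobe(name="/u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle",
                sym="kews_update_wait_time", fn_name="trace_wait_time")
print("Tracing Oracle wait events... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("wait time, microsec")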


Latency heat maps to display histograms collected with BPF/bcc

PyLatencyMap is a command-line tool for visualizing latency histograms as heat maps using terminal graphics and ANSI escape codes. PyLatencyMap can be used to investigate I/O performance for random I/O and is especially suited to cases with multiple modes of response time from the storage (SSD cache, HDD, latency outliers). The original idea for PyLatencyMap comes from Brendan Gregg's work on latency heat maps. I have added to the PyLatencyMap repository a modified version of the biolatency.py script to measure I/O latency histograms: this is the link to the script pylatencymap_biolatency.py and an example of heat maps generated with PyLatencyMap with the script Example11_BPF-bcc_blockIO_latency.sh:



The test workload has been generated using Kevin Closson's SLOB. Additional references with a more detailed discussion of the topic of testing Oracle I/O with SLOB and measuring latency heat maps are:
Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap and OraLatencyMap v1.1 and Testing I/O with SLOB 2.


Notes on the test environment

The scripts discussed in this post have been developed on Fedora 24 (alpha) running Linux kernel version 4.6 (using the rawhide kernel) and have been tested on workloads generated using Oracle version 11.2.0.4. This is not an Oracle-supported configuration and the scripts are intended mainly as a demonstration of the technology and for learning purposes.
Here are some pointers on the steps I used to set up a lab environment for testing:

Note in particular the step for configuring the rawhide kernel, probably a good choice when testing BPF, as new features are being added on a regular basis as I write this:
# sudo dnf config-manager --add-repo=http://alt.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/fedora-rawhide-kernel-nodebug.repo

# sudo dnf update

Additional pointers and recipes on how to install Oracle 11.2 on Fedora can be found on Tim Hall's website at: https://oracle-base.com/articles/11g/articles-11g


Conclusions

BPF and its bcc frontend are new and powerful tools for dynamic tracing on Linux. A discussion of the merits and shortfalls of BPF/bcc vs other existing solutions in the Linux dynamic tracing ecosystem is beyond the scope of this post. What you can find in this post are a few example scripts that I have written for tracing Oracle using BPF/bcc and uprobes, and an additional script for integrating BPF/bcc with PyLatencyMap, which provides heat map visualization of the histograms generated with BPF/bcc.
Although the BPF/bcc environment is still evolving, it already appears to be a very useful addition to the toolbox for troubleshooting and performance investigations of Linux workloads. The set of example scripts/tools that comes with the bcc repository is of great help for getting started, both with using BPF/bcc and with writing new scripts. BPF/bcc can only run on relatively new kernels (as I write this, I have tested the scripts discussed in this post on kernel version 4.6) and this is an obstacle to its adoption in many environments, at least in the short term.

Note added, February 2019: Red Hat and Oracle Linux 7.6 have backported BPF and can run bcc scripts (yum install bcc*). I have updated the script repository accordingly, see https://github.com/LucaCanali/Linux_tracing_scripts/tree/master/BPF-bcc_Userspace_Oracle


Credits and acknowledgements

Many of the original ideas and tools discussed here are inspired or directly derived from the awesome published work of Brendan Gregg.
Many thanks also to the development teams of BPF and bcc for providing and supporting these new and powerful tools. In particular, thanks to Brenden Blanco for his work on uprobes for bcc and for his support on issue #478.
The work of investigating Oracle userspace with dynamic tracing tools has been done in collaboration with Frits Hoogland.

Tuesday, March 29, 2016

SystemTap Guru Mode and Oracle SQL Parsing

Introduction and motivations

SystemTap and dynamic tracing tools in general give administrators great control over their systems, with relatively little additional effort needed to learn the new tools. In this post you will see how SystemTap can be used to modify data on the fly at runtime. The outcome is a form of "live patching". Examples are provided on how to apply these ideas to Oracle SQL parsing functionality. This type of "guru mode" use of SystemTap is a corner case, but I believe it is important to know that such techniques exist and how they can be deployed, also because they can be implemented with just a few lines of code.

SystemTap has been successfully used as an emergency security band-aid for Linux systems for many years; see this presentation by Frank Ch. Eigler for full details. See also an example of how these techniques have been used in practice, described in the CERN openlab 2013 summer student lecture "SystemTap: Patching the Linux kernel on the fly".

This post is about applying the techniques and ideas of "live patching on the fly by data modification" to closed-source applications, when debuginfo is not available, and in particular to Oracle. The post is structured around three examples of increasing complexity on how to hook into and change the behavior of Oracle SQL hard parsing. Some of the topics addressed in the examples are:
  • how to find the relevant function(s) to hook SystemTap to
  • how to write into userspace memory with SystemTap probes
  • how to modify CPU registers with SystemTap probes

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox, as they are unsupported and can potentially put system stability and integrity at risk. Administrator privileges are needed to run SystemTap probes.


Programmable SQL filter

In this example you will see a method for selectively blocking execution of SQL based on programmable filter rules implemented with SystemTap.

As a first step you need to identify a relevant function for SQL parsing in the Oracle binary. Functions in the Oracle binary are not documented, but luckily the function opiprs has been discussed previously in this blog and turns out to be a good choice to use with SystemTap probes. There are also other options, but the details are outside the scope of this post. What you need to know about opiprs for this blog post is summarized in the table below:

Function name: opiprs
Purpose: Oracle Program Interface - Parse. This function is called when Oracle performs hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Notable function arguments:
  • register rdx -> SQL statement length
  • register rsi -> pointer to the SQL text string


Note two important arguments of opiprs that are passed using the CPU registers rdx and rsi: respectively, the length of the SQL statement and the pointer to the SQL statement text. The SQL text string is stored in memory, more precisely in the stack of the Oracle process; this can be confirmed by comparing the SQL text address with the process memory map from /proc/<pid>/maps, as in the sketch below.
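For illustration, this is a small Python helper (not part of the original scripts) that checks whether a given address, for example the SQL text pointer read from register rsi, falls inside the [stack] mapping of an Oracle process; the pid and address in the usage comment are placeholders:

#!/usr/bin/python
# Check whether an address falls within the [stack] region of a process.
# Usage example (address in hex): ./check_stack.py 9124 0x7ffd7e1b2a40
import sys

def in_stack(pid, addr):
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            fields = line.split()
            if fields[-1] == "[stack]":
                start, end = (int(x, 16) for x in fields[0].split("-"))
                return start <= addr < end
    return False

if __name__ == "__main__":
    pid, addr = int(sys.argv[1]), int(sys.argv[2], 16)
    print("address %#x is %sin the stack of pid %d"
          % (addr, "" if in_stack(pid, addr) else "not ", pid))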

A simple mechanism to implement the original goal of selectively blocking SQL execution is the following: write a SystemTap probe on the Oracle function opiprs that examines the SQL and, if it matches some programmable rules, blocks further parsing by forcing the SQL parsing to exit with an error.

SQL parsing can be forced to exit with an error by writing a 0 (a null byte, i.e. an end-of-string marker) into the first memory location of the buffer that contains the SQL text, effectively signaling a zero-length string. The effect of such a change is that Oracle will throw the error ORA-00900: invalid SQL statement.

SystemTap probes can write into userspace memory using embedded C functions. This requires running SystemTap in "guru mode" and requires some additional syntax as detailed in the SystemTap documentation.

The code to implement the ideas described so far is summarized in the example script filterSQL_opiprs.stp. It consists of two main parts: one is a probe on the Oracle function opiprs, the other is an auxiliary C function called block_parse, which performs the task of writing into memory, specifically into the memory location that contains the SQL text string. A copy of the main text of the script follows for convenience:

function block_parse(pointersql:long) %{
    char *sqltext;
    sqltext = (char *) STAP_ARG_pointersql;
    /* Overwrite the first byte of the SQL text with 0 (end of string): this will throw ORA-00900: invalid SQL statement */
    sqltext[0] = 0;
%}

probe process("oracle").function("opiprs") {
   sqltext = user_string2(register("rsi"),"error")
   # debug code
   # sqllength = register("rdx")
   # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
   if (isinstr(sqltext, "UNWANTED SQL")) {
      printf("FOUND!\n")
      block_parse(register("rsi"))
   }
}

Test the example:

1. Run the SystemTap script as root (note the oracle executable needs to be in the path) with:
# stap -g -v filterSQL_opiprs.stp

2. On a different session using SQL*Plus:
SQL> select 'Hello world' from dual; -- this runs normally

'HELLOWORLD
-----------
Hello world

SQL> select /* UNWANTED SQL */ 'Hello world' from dual;
select /* UNWANTED SQL */ 'Hello world' from dual
 *
ERROR at line 1:
ORA-00900: invalid SQL statement


This illustrates how filterSQL_opiprs.stp blocks any SQL that contains the string "UNWANTED SQL". The example can be generalized to filter generic SQL statements based on keywords or other complex rules.


Modify SQL on the fly

The code below shows an example of altering the SQL statement on the fly. It is an artificial example for demo purposes. The C function "replace_SQL" (see code) is used to write the new SQL text into the Oracle userspace, effectively modifying the statement that is being parsed. The SystemTap script livepatch_basic_opiprs.stp is available at this link. Here is a copy of the main text:

%{
/* SQL that will replace TARGET_SQL */
#define REPLACEMENT_SQL "select power(count(*),3) from dba_objects"
%}

global TARGET_SQL = "select count(*) from dba_objects, dba_objects, dba_objects"

function replace_SQL(pointersql:long) %{
    char *sqltext;

    sqltext = (char *) STAP_ARG_pointersql;
    /* This changes in memory (on the stack) the SQL text that will be parsed */
    strcpy(sqltext, REPLACEMENT_SQL);
%}

probe process("oracle").function("opiprs") {
    sqltext = user_string2(register("rsi"),"error")  
    # debug code
    # sqllength = register("rdx")
    # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
    if (sqltext == TARGET_SQL) {
        printf("FOUND!\n") # debug code
        replace_SQL(register("rsi"))
    }
}

Test the example:

1. Consider this SQL. It may take days of CPU time to execute, as it has been built on purpose with cartesian joins:

SQL> select count(*) from dba_objects, dba_objects, dba_objects;


2. Run the SystemTap script livepatch_basic_opiprs.stp as root:

# stap -g -v livepatch_basic_opiprs.stp

3. Run the SQL again (flushing the shared pool is used to cause hard parsing):

SQL> set timing on
SQL> alter system flush shared_pool;
SQL> select count(*) from dba_objects, dba_objects, dba_objects;

POWER(COUNT(*),3)
-----------------
       7.5931E+25

Elapsed: 00:00:19.26

The SQL now runs in a few seconds because the statement with cartesian joins has been replaced "on the fly" by the SystemTap probe with an equivalent statement that executes much faster and without joins.
There is an important limitation to this implementation: the length of the "new" SQL statement must not exceed the length of the original SQL. A more general case is addressed in the next example.


Modify SQL on the fly, a more complex experiment

This example addresses the case of replacing SQL statements on the fly, removing the limitation on the length of the SQL statement seen in the example above. The main point is that you also have to update the register rdx with the length of the new SQL statement. If the new statement is shorter than the original one, this step can be omitted (as was the case in the previous example).
How to modify the content of a CPU register with SystemTap? This is done by updating the CPU register value in CONTEXT->uregs. SystemTap takes care of restoring the register values when returning to Oracle userspace execution.

Another important point is about where to write the new SQL text, as we need a longer buffer than with the original SQL. Where to allocate the extra memory?
The example code referenced below writes the new SQL text into the process stack, using the value of the %rsp pointer and subtracting 0x2000 from it. This is an educated guess that the target memory location is allocated to the process (in the memory chunk allocated for the stack), yet far enough "down in the stack" that it is free and will not be used by subsequent branches or leaf functions called by opiprs. From a few basic tests this approach seems to work; however, please note that this script is intended mainly for reference and education purposes and can be potentially dangerous for system stability.

The SystemTap script livepatch_opiprs.stp is available at this link.

The proposed example script livepatch_opiprs.stp replaces the SQL "select sysdate from dual" with "select sysdate -1 from dual". This is inspired by a hypothetical situation where you want to replay a workload with time-dependent SQL. Another example of SQL replacement you may want to test is adding SQL hints. Customize the SQL replacement as you wish by editing REPLACEMENT_SQL and TARGET_SQL in the script.

Test the example:

SQL> alter session set nls_date_format='YYYY-MM-DD HH24:MI';
SQL> select sysdate from dual; -- all normal up to this point

SYSDATE
--------------
2016-02-22 12:00

Run the SystemTap script as root
# stap -g -v livepatch_opiprs.stp

Now the same SQL will return a different result (that is sysdate -1 instead of sysdate):

SQL> alter system flush shared_pool;
SQL> select sysdate from dual;

SYSDATE
--------------
2016-02-21 12:00


Cleaning up after testing

Once a SQL statement is hard parsed, all subsequent executions will also run with the modified text. If you want to revert to normal Oracle behavior, you need to flush the statement out of the library cache and re-parse it (after having stopped the SystemTap script). For flushing statements out of the shared pool you can use "alter system flush shared_pool" or dbms_shared_pool.purge (see also this post on dbms_shared_pool).


Pointers to reproduce the results using gdb

In the examples discussed in this post SystemTap has been used almost as an automated debugger, with the notable addition that SystemTap has low overhead compared to many debuggers and provides a powerful programmable interface for defining the SQL text search and replacement actions. Another advantage of SystemTap is that it can attach to all running processes (of the Oracle executable in this case) if desired.
For completeness and as a reference, this is a short list of relevant gdb commands that can be used to reproduce some of the actions described in the examples of this post:

gdb -p <pid> -> start gdb against an existing Oracle session
break opiprs -> define a breakpoint on opiprs
continue -> continue program execution until the breakpoint
info reg -> show registers
bt -> backtrace
x/1s $rsi -> visualize the SQL statement string using $rsi as pointer

write the new SQL length into the CPU register rdx:
  set $rdx=24

copy a SQL string into memory and update $rsi (examples):
  set $rsi=$rsp-0x2000
  p strcpy($rsi, "select sysdate from dual")
  set $rdx=25


Conclusions

This post illustrates SystemTap techniques that can be used to modify userspace data on the fly at runtime. Examples are provided of how to apply these techniques to Oracle SQL parsing. The techniques discussed here can be generalized and used on other functions of the Oracle kernel, as well as extended to "live-patch" data at runtime for other applications in the Linux environment. In particular, the provided example SystemTap probes show techniques for writing into userspace memory and into CPU registers, useful for addressing cases where debuginfo is not available.

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox, as they are unsupported and can potentially put system stability and integrity at risk. Administrator privileges are needed to run SystemTap probes.


Acknowledgements and references

A shout-out to Frank Ch. Eigler, for his work on SystemTap and in particular for his presentation "Applying band-aids over security wounds with systemtap" and related blog post and also for the tip on how to write into CPU registers with SystemTap.
Many thanks to Frits Hoogland for comments and improvements to this post and for collaboration on the investigation of Oracle internals and the use of Linux dynamic tracing tools for Oracle troubleshooting.


Link to the example code in Github.
Links to previous work on this blog on using SystemTap for Oracle tracing: SystemTap into Oracle for Fun and Profit, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, and Diagnose High-Latency I/O Operations Using SystemTap.


Monday, January 25, 2016

Linux Perf Probes for Oracle Tracing

Topic: this post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.

Context

The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg's web pages for a summary and for future developments of the dynamic tracers for Linux. Some recent work on applying these tools and techniques to Oracle can be found in Tanel Poder's posts on "RAM is the new disk", Frits Hoogland's post on "Stapflame" and in this blog at "Extended stack profiling".

Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning is probes for measuring I/O latency with microsecond precision. Another topic, more complex but also more exciting, is tracing the Oracle userspace. This topic is of growing interest, with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.

DTrace is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link.

SystemTap is a tool that offers a mature interface for tracing the Linux kernel and the userspace. Notably, SystemTap offers the possibility to write probes with advanced aggregations, similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog, with work done in collaboration with Frits Hoogland: see this link and this link and the scripts at this Github repository. An advantage of SystemTap is that it can also trace userspace on RHEL/OL 6.x, using utrace.

Linux Perf is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent tool for gathering data for on-CPU flame graphs (see this link and the references below). Perf has gained a reputation for being stable and having relatively low overhead, which makes it useful for troubleshooting production workloads (mileage may vary).

In this post you will find examples of how Perf can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracle instrumentation that can be used for tracing Oracle, for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.


Perf probes for Oracle

If you are not familiar with Linux Perf, you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html. The focus of this post is on perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5, which means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with a 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested with Oracle 11.2.0.4 running on OL7.2 with UEK4.

Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog (see also the example scripts on Github). Here is a short description of three functions of interest that will be used later in the example scripts:


Function name: kskthbwt
Purpose: Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event. The suffix "bwt" most likely stands for "begin wait".
Parameters and registers:
  • 11.2.0.4: register r14 -> pointer into the X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)
  • Note for 12.1.0.2: register r13 -> pointer into X$KSUSE
  • register rsi -> timestamp of the beginning of the wait (in microseconds)
  • register rdx -> wait event number

Function name: kskthewt
Purpose: Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event. The suffix "ewt" most likely stands for "end wait".
Parameters and registers:
  • register r13 -> pointer into the X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)
  • register rdi -> timestamp of the beginning of the wait (in microseconds)
  • register rsi -> wait event number

Function name: kews_update_wait_time
Purpose: Kernel event wait statistics update wait time. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event.
Parameters and registers:
  • register rsi -> wait time
  • register r13 -> event number (relevant for 11.2.0.4, it does not work on 12.1.0.2)
  • register r15 -> pointer into the X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)


You can get started with Perf tracing of Oracle userspace using the following:

# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si

If all worked well, you will have created a new probe for Oracle userspace tracing and will see the following output:

Added new event:
  probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)

You can now use it in all perf tools, such as:

        perf record -e probe_oracle:kskthewt -aR sleep 1

Note: in case of errors, check the OS and Oracle version compatibility discussed later in this post. Go ahead and create two more probes:

# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13

You should now have three probes on Oracle; you can verify this by listing them with "perf probe -l". If needed, you can delete probes with the syntax "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session, for example). Find the OS pid of the Oracle process to probe and run the following:

# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>

Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here is an example from "select count(*) from <testtable>" run on a test database:

# perf script

oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...

The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, and probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example the perf script output is filtered by piping it through a series of commands to: (1) reduce the number of fields displayed, (2) convert hexadecimal numbers into decimal, (3) convert wait event numbers into wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script (a sketch of how such a file can be built follows at the end of this section).

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...

What the examples show is that trace data that you would normally find in a 10046 trace file (or a trace from EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE), if you prefer) can be collected using OS commands alone. Perf with uprobes can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above is information on the parameters of the wait events (usually called p1, p2, p3). This is the subject of the next paragraph.
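As a side note, eventsname.sed can be produced in several ways; the following is just a hedged sketch of the idea. It assumes you have spooled the output of "select event#, name from v$event_name;" from sqlplus into a plain text file with one "event# name" pair per line; the exact spool layout and the sed rule format used below are assumptions, adapt them to your output:

#!/usr/bin/python
# Sketch: turn a spooled list of "event# name" pairs into sed rules that
# rewrite "event#=<n>" into "event=<name>" (GNU sed, \b = word boundary).
import sys

with open(sys.argv[1]) as spool, open("eventsname.sed", "w") as sedfile:
    for line in spool:
        fields = line.strip().split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip headers, separators and blank lines
        event_number, event_name = fields
        event_name = event_name.replace("/", r"\/")  # minimal escaping
        sedfile.write("s/event#=%s\\b/event=%s/g\n" % (event_number, event_name))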


It only gets better from here: perf with uprobes can read Oracle SGA

Uprobes received a major improvement with kernel 3.14: the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt). As I write this, Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.

Many details and metrics of the Oracle instrumentation coming from V$ views and their underlying X$ tables can be accessed directly by reading memory structures in the SGA (this is not the case for all X$ tables, as many of them are implemented as internal functions instead). X$KSUSE exposes many fields of V$SESSION. The memory location of the fields of interest can be identified using SQL as in this example.
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE is found by experimentation. This has been investigated together with Frits. We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version dependent (see the small sketch below). This work had already been done for SystemTap probes and a working example can be found at this link for 11.2.0.4 and at this link for 12.1.0.2.
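Just to make the derivation explicit, the offsets used in the probe below are simply the difference between the address of the X$KSUSE field of interest (obtained with SQL for your own session) and the value of r13 captured by a probe fired by that same session. The addresses in this sketch are hypothetical and only illustrate the arithmetic:

#!/usr/bin/python
# Hypothetical values: addr_ksusep1 from querying the X$KSUSE metadata for a
# session, r13 captured by a probe on kskthewt fired by the same session.
addr_ksusep1 = 0x9d7afc20
r13          = 0x9d7b0440

offset = addr_ksusep1 - r13
print("perf probe fetcharg for p1: '%d(%%r13)'" % offset)   # prints '-2080(%r13)'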

This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:

# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1='-2080(%r13)' p2='-2072(%r13)' p3='-2064(%r13)' sql_hash='-1828(%r13)'


# perf record -e probe_oracle:kskthewt -p <pid>

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...

More fields of interest from X$KSUSE or other areas of the SGA can be extracted as arguments of perf by extending the example above.

Example of tracing Oracle parse functions:

Function name: opiprs
Purpose: Oracle Program Interface - Parse. Use it to trace hard parsing: this function is called when doing hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Useful information when probing the function:
  • register rdx -> SQL statement length
  • register rsi -> pointer to the SQL text string

Function name: kksParseCursor
Purpose: use it to trace soft parsing. This is one of the functions in the call stack when Oracle parses a cursor that is already in the library cache.
Useful information when probing the function:
  • 11.2.0.4: register rsi -> SQL statement length; register rbx -> pointer to the SQL text string
  • 12.1.0.2: register r9 -> SQL statement length; register r8 -> pointer to the SQL text string


This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):

perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql='+0(%si)':"string"

# perf record -e probe_oracle:opiprs -p <pid> 
# perf script

oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"

Similarly for tracing soft parsing (11.2.0.4 version) you can use the following probe:

perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql='+0(%bx)':"string"

Many more Oracle functions can be traced:

From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: there are on the order of 10^5 symbols in the text (code) section. These are obviously too many to describe here, and they are also undocumented. Some pointers to understanding the function names can be found in MOS (see especially the bug-related documents), from sqlplus by running oradebug doc, and from various blogs and presentations (see references).
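If you want to reproduce the symbol count, this is a quick sketch; it assumes nm is installed and that ORACLE_HOME is set in the environment:

#!/usr/bin/python
# Count the dynamic symbols in the text (code) section of the oracle binary.
import os
import subprocess

oracle_bin = os.path.join(os.environ["ORACLE_HOME"], "bin", "oracle")
nm_output = subprocess.check_output(["nm", "-D", oracle_bin]).decode()

# nm -D lines look like "<address> <type> <name>"; type T/t marks text symbols
text_symbols = [l for l in nm_output.splitlines()
                if len(l.split()) == 3 and l.split()[1] in ("T", "t")]
print("text (code) symbols in %s: %d" % (oracle_bin, len(text_symbols)))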
Below is a short table of functions of interest for tracing logical and physical I/O, taken from this blog post. The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O; a small sketch decoding the block address parameters follows the table. Note that tracing logical I/O on a busy system can have a significant overhead because of the large number of events per second.


Function name: kcbgtcr
Purpose: Kernel Cache Buffers Get Consistent Read. Note: key function for consistent reads.
Selected parameters:
  • tbs#=user_int32(%rdi)
  • rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
  • block#=user_int32(%rdi+4) & 0x003FFFFF
  • data_object_id#=user_int32(%rdi+8)
  • object_id#=user_int32(%rdi+12)
  • Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbgcur
Purpose: Kernel Cache Buffers Current. Note: key function for current reads.
Selected parameters:
  • tbs#=user_int32(%rdi)
  • rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
  • block#=user_int32(%rdi+4) & 0x003FFFFF
  • data_object_id#=user_int32(%rdi+8)
  • object_id#=user_int32(%rdi+12)
  • Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbzib
Purpose: kcbZIB should stand for Kernel Cache Buffers, Z (kcbz.o is a module for physical I/O helper functions), IB: Input Buffer. Note: this function is part of the physical I/O stack: it indicates that Oracle will perform physical read(s) into the buffer cache.

Function name: kcbzgb
Purpose: the suffix GB in kcbZGB should stand for Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
Selected parameters:
  • tbs n#=%rsi
  • rel file n#=%rdx >> 22 & 0x003FFFFF
  • block#=%rdx & 0x003FFFFF
  • data_object_id#=%r8
  • object_id#=%r9
  • Note: for bigfile tablespaces: block#=%rdx

Function name: kcbzvb
Purpose: invoked after Oracle has read a given block. Note: this function is used both for reads into the buffer cache and for direct reads.
Selected parameters:
  • tbs n#=%rsi
  • rel file n#=%rdx >> 22 & 0x003FFFFF
  • block#=%rdx & 0x003FFFFF
  • data_object_id=user_int32(%rdi+24)
  • Note: for bigfile tablespaces: block#=%rdx
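As a worked example of the bit arithmetic in the table above, here is a small Python sketch that decodes a 32-bit data block address (the value read at %rdi+4 for kcbgtcr/kcbgcur, or from %rdx for kcbzgb/kcbzvb) into relative file number and block number. The sample value is made up and corresponds to relative file 4, block 147586:

#!/usr/bin/python
# Decode a data block address into (relative file#, block#), per the table above.
def decode_dba(dba, bigfile=False):
    if bigfile:
        # for bigfile tablespaces the whole value is the block number
        return None, dba
    rel_file_no = (dba >> 22) & 0x003FFFFF
    block_no = dba & 0x003FFFFF
    return rel_file_no, block_no

print(decode_dba(0x1024082))   # sample value -> (4, 147586)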



Combining Perf probes and Oracle traces

What you have seen so far can also be described as a way to extract, using perf and uprobes, information normally available with Oracle diagnostic events. Additional value comes when you combine these methods with perf probes for tracing kernel and hardware activity. There are many Perf events available; see also the output of perf list and the references in this post.

Study of Oracle I/O and wait events:
The example below combines perf events on tracepoints for system calls to the I/O interface in Linux with probes on the Oracle wait event interface. It records a trace of the Oracle userspace functions marking the beginning and end of a wait and of the system calls to the I/O interface:

# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

The output is processed for improved readability:

# perf script |  \
  awk '{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}' |  \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' |  \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721

What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See also http://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html for additional details.

Oracle wait events and perf scheduler events

Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs with perf.

This is an example of tracing the CPU scheduler together with I/O-related events and Oracle wait events:

# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

After capturing the workload you can extract a trace file by running "perf script". In the table below you can see a snippet of the trace, filtered with a simple script to improve readability (a sketch of the idea follows the table). You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events.

EVENT NAME                      BEGIN TS        END TS          ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226   4413050521261   35
sched_stat_runtime              4413050521179   4413050521291   112
sched_stat_runtime              4413050521293   4413050521530   237
syscall:pread64                 4413050521392   4413050535684   14292
event=db file sequential read   4413050521375   4413050535704   14329
syscall:io_submit               4413050536073   4413050536249   176
sched_stat_runtime              4413050535593   4413050536551   958
syscall:io_submit               4413050536343   4413050536623   280
sched_stat_runtime              4413050536548   4413050536686   138
syscall:io_getevents            4413050536678   4413050537650   972
event=direct path read          4413050536652   4413050537677   1025
syscall:io_submit               4413050537807   4413050537924   117
sched_stat_runtime              4413050537563   4413050537963   400
syscall:io_getevents            4413050537956   4413050539289   1333
event=direct path read          4413050537943   4413050539315   1372
sched_stat_runtime              4413050539134   4413050539553   419
..
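For reference, this is a hedged sketch of the kind of post-processing used to build the table above; it is not the exact script used for the post, as it only pairs begin/end probes and enter/exit syscalls and ignores sched_stat_runtime, which carries its own runtime value. It reads the default "perf script" output from stdin, for example "perf script | ./pair_events.py":

#!/usr/bin/python
# Pair begin/end events from perf script output and print elapsed microseconds.
# Assumes the default perf script layout: comm pid [cpu] timestamp: probe: ...
import sys

BEGIN_MARKERS = ("probe_oracle:kskthbwt", "sys_enter_")
END_MARKERS = ("probe_oracle:kskthewt", "sys_exit_")

pending = {}   # pairing key -> begin timestamp in microseconds

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 5:
        continue
    ts_usec = int(float(fields[3].rstrip(":")) * 1e6)
    probe = fields[4].rstrip(":")
    if any(m in probe for m in BEGIN_MARKERS):
        key = probe.replace("sys_enter_", "").replace("kskthbwt", "wait_event")
        pending[key] = ts_usec
    elif any(m in probe for m in END_MARKERS):
        key = probe.replace("sys_exit_", "").replace("kskthewt", "wait_event")
        if key in pending:
            begin = pending.pop(key)
            print("%-32s begin=%d end=%d elapsed=%d"
                  % (probe, begin, ts_usec, ts_usec - begin))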


Ideas for future work

Stack profiling and flame graphs: an additional step to get more value out of perf traces is to visualize their output. Flame graphs and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle at link1 and link2, on extended stack profiling with stapflame by Frits at this link, and with stack profilers at this link.
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.

Perf hardware events: combining Oracle traces with hardware and PMU counters is another area of interest for future investigations. This could be useful, for example, for studying logical reads and their dependency on events at the hardware level, in memory and/or in the processor caches.
Adding and documenting more probes for Oracle userspace function tracing: for example, it would be useful to develop a set of perf probes replicating the 10046 trace event.

Linux and Oracle version compatibility:

The definition of probes on the Oracle executable needs uprobes, available from kernel version 3.5. When testing the examples discussed in this post on RHEL/OL, you can use kernel UEK4 or UEK3, or RHEL 7.x. Notably, the examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works fine for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works fine on RHEL 7.0 (kernel 3.10.0-123) and UEK3 (kernel 3.8.x), but does not work with the kernels that ship with RHEL 7.1, 7.2 and anything higher (including UEK4). When testing, the easiest option is to use Oracle 11.2; if you want to test Oracle 12.1, use a UEK3 or RHEL 7.0 kernel. More investigations are needed on this topic.


Conclusions

Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They make it possible to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular, Perf with uprobes can be used:
  • As an alternative to standard instrumentation for Oracle tracing with diagnostic events: see the examples of using perf probes for tracing wait events, SQL parsing, and physical and logical I/O.
  • As a tool for combining OS-related probes (tracepoints for I/O, network, scheduler or hardware events) and Oracle diagnostic information, for advanced troubleshooting.
  • As a data collection tool for extended profiling of Oracle workloads, which combines stack traces, scheduler information, wait events or other Oracle diagnostic data for flame graph visualization.

Credits and references

Kudos to Brendan Gregg for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland for his investigations of Oracle internals, for the collaboration on tracing tools and for proofreading this post. Many thanks to Tanel Poder for his work on Oracle internals and performance.
Additional links on topics related to this post:

http://www.brendangregg.com/perf.html
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/
https://alexanderanokhin.wordpress.com/tools/digger/
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe.txt
http://linux.die.net/man/1/perf-probe
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt

Monday, October 26, 2015

Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations

Topic: This blog post is about kernel stack profiling and visualization with flame graphs.

Context: Stack profiling and flame graphs are very useful tools and techniques for troubleshooting and investigating workloads at the OS level and for understanding which code paths take most of the execution time. You can find extensive material and examples of flame graphs on Brendan Gregg's website and blog. A few additional examples of using stack tracing and flame graphs for investigating Oracle are: "Flame Graphs for Oracle" and "Oracle Optimizer Investigated with Flame Graphs".
Flame graphs are often used to visualize and analyze stack traces for CPU-bound processes. Perf has proven to be a very good tool to capture such traces in Linux: with perf you can collect stack backtraces of the running processes with little overhead. However, this does not cover the cases when the process under investigation is off-CPU (for example sleeping and/or waiting for an I/O operation to finish). This topic has also been addressed by Brendan Gregg in his blog with off-CPU flame graphs; however, data collection for off-CPU stack traces still seems to be an area of active investigation, especially on Linux. This post addresses the case of collecting kernel stack data for flame graph visualization for processes that spend a large fraction of their time off-CPU and/or in system calls.

What's new: In this blog post you can find a description of a basic technique for sampling kernel stacks and a script to automate the process. The script gathers kernel stack traces and the process status by sampling the /proc filesystem. This is intended for investigations of processes in Linux that spend a significant part of their elapsed time off CPU (for example in status S, sleep, or D, disk sleep).
The main engine of the script is a simple loop reading /proc/pid/stack. I first heard of this idea on Tanel Poder's blog. A similar concept is also used in the poor man's profiler. Just to illustrate how simple the method is, here is the main data collection part of kstacksampler.sh:

for x in $(seq 1 $iterations); do
   cat /proc/$pid/stack    # get kernel stack trace
done
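For illustration only, here is a rough Python equivalent of the loop above (kstacksampler.sh remains the reference implementation). Note that reading /proc/<pid>/stack requires root and that the output would still need the same filtering shown in Example 1 below before feeding it to the FlameGraph tools:

#!/usr/bin/python
# Sample the process state and kernel stack of a pid at a fixed interval.
# Usage example: ./kstacksampler.py <pid> <iterations> <interval_seconds>
import sys
import time

pid, iterations, interval = int(sys.argv[1]), int(sys.argv[2]), float(sys.argv[3])

for _ in range(iterations):
    with open("/proc/%d/status" % pid) as f:
        print([l.strip() for l in f if l.startswith("State:")][0])
    with open("/proc/%d/stack" % pid) as f:   # needs root privileges
        print(f.read())
    time.sleep(interval)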

Download kstacksampler.sh: from this webpage or from Github.

Notes and limitations: The script kstacksampler.sh is quite basic and should be taken as a proof of concept and study material.
  • The overhead and potential side effects of using it are expected to be minimal for most systems, so it is most likely safe to use in production.
  • The use of shell for profiling does not allow for high-frequency sampling; typically you should start experimenting with 10-20 Hz.
  • Processes that are busy on CPU should rather be traced using perf and on-CPU flame graph techniques.
  • The script currently only traces one process at a time (although it could be adapted to trace multiple pids).
  • The script does not trace threads (although it can be modified to trace tids from /proc/pid/task/tid/stack).
  • The script does not provide userspace traces.
  • The script samples both the process kernel stack and its status (running, sleeping, etc.); a simple edit of the script can disable this feature if not needed. Note that in Linux the process status "running" refers to both running and runnable (in the run queue) processes.


Example 1: Investigation of a workload showing a network bottleneck. 

Preliminary info:
The case is that of a database session retrieving data over a slow network. A simple statement reading a database table, "SELECT * FROM LARGETABLE;", is used to generate the workload. When examining the database process (on the DB server side) it shows little CPU usage. This can be investigated, for example, using pidstat (RHEL/EL 6.7 was used for this test).

[root@MYDB]# pidstat -p 9124 1 10
Linux 2.6.32-573.7.1.el6.x86_64 (MYDB)          10/22/2015      _x86_64_        (4 CPU)

10:10:28 PM       PID    %usr %system  %guest    %CPU   CPU  Command
10:10:29 PM      9124    0.00    4.00    0.00    4.00     0  oracle_9124_orc
10:10:30 PM      9124    4.85    4.85    0.00    9.71     2  oracle_9124_orc
10:10:31 PM      9124    1.00    2.00    0.00    3.00     2  oracle_9124_orc

You can also use Oracle instrumentation to investigate the process and find that it is mostly idle. In particular, V$ views such as V$SESSION, V$SESSION_LONGOPS and V$ACTIVE_SESSION_HISTORY can be used, or rather their graphical counterparts in OEM.

Stack Tracing:
You can now try the script kstacksampler.sh and build a flame graph from its output to take the investigation further. In the following, the task is divided into two steps. The first step is data collection with kstacksampler.sh, with some additional data filtering with grep and sed. The second step integrates with the FlameGraph toolset: in particular, stackcollapse-stap.pl folds the stacks and flamegraph.pl builds the flame graph (see Figure 1).

Step1: stack sampling
$ ./kstacksampler.sh -p 9124 -n 100 -i .05 | grep -v 0xffffffffffffffff | sed 's/State:\t//g'| sed 's/\[<.*>] //g' >stack_sample_example1.txt

Step2: flame graph
$ cat stack_sample_example1.txt | ../FlameGraph/stackcollapse-stap.pl | ../FlameGraph/flamegraph.pl --title "Kernel stack profiling - Example 1, network-related bottleneck"> kstacksampler_Example1_network_bottleneck_Fig1.svg

The resulting flame graph is:

Figure 1: Flame graph of a process that is network-bound and spending the bulk of its time on TCP stack waiting for messages to arrive, as detailed in Example 1 in the text. Click here for a svg version of the graph.

From the flame graph of Figure 1 you can see that the process is sleeping for a large fraction of its time and that the large majority of the traced operations are related to network/TCP communication. This confirms what was found earlier with pidstat and Oracle instrumentation. Kernel stack sampling has helped to find/confirm the root cause of the performance issue.

Additional investigations:
Without going into too many details, which is not the scope of this example, it is interesting to take one additional step and use the results of the analysis to apply and measure some changes that make the process go faster. The general idea is to use larger "data transfers" and reduce the number of round trips, therefore limiting the impact of the network latency. A simple change in the context of this Oracle-based example is to increase the number of rows fetched at each step by changing the arraysize parameter in SQL*Plus and setting it to the maximum allowed value: "SET ARRAYSIZE 5000" (arraysize was set to 100 in the previous case and in the tracing data of Figure 1). Other and similar options are also available at the TCP/kernel level, for those interested in further tuning. The outcome of the arraysize change is a measurable reduction of the percentage of the elapsed time spent waiting on TCP operations. More importantly, this drives the throughput up (from 6 MB/s in the case of Figure 1 to about 20 MB/s). Also, pidstat shows that the CPU utilization increases (from 6% to about 28%). Similar observations can be made using Oracle instrumentation with V$ views. Finally, the flame graph for the tuned process can be seen in Figure 1b: notably, the percentage of time spent with process status=running has considerably increased (more useful work is being done per unit of time).

Figure 1b: Flame graph of the process described in Example 1 after changing the parameter arraysize from 100 to 5000. An increase in CPU usage can be seen in this flame graph compared to what is represented in Figure 1 (taken with arraysize=100). This change correlates with higher network throughput and faster query execution. Click here for a svg version of the graph.

Additional test case details:
I have created the case of Example 1 using Oracle running in a VM and connecting to it via an sshd tunnel. The tunnel was set up using putty: "putty -P <forwarded_port> -L 1111:DB_host:1521 oracle@VM_host". 1111 is the locally forwarded port, 1521 is the Oracle listener port on DB_host, and <forwarded_port> is a port on the VM_host forwarded to port 22 (sshd) of the DB_host. SQL*Plus on a client machine was used to connect to the DB and run the query "SELECT * FROM LARGETABLE;", where LARGETABLE is any table big enough so that the query runs for a few minutes. Another tip is to "set autotrace traceonly" before running the query to suppress the output to the screen.


Example 2: Investigation of an Oracle process doing random I/O. 

This example is about investigating I/O-bound processes. In particular, Figure 2 shows the output of kernel stack profiling and the flame graph for an Oracle process executing Kevin Closson's SLOB workload. The same two steps detailed above in Example 1 have been used to collect data and produce the flame graph. The workload is mainly random reads from block devices. The DB server used for this example runs in a VM and Oracle is configured to use ASM. The stack traces show that the process time is mostly spent either doing I/O or waiting for I/O calls to complete. Two different types of I/O system calls are visible in Figure 2: calls for blocking I/O (pread64) and calls for asynchronous I/O (io_submit and io_getevents). At the Oracle level this is reported using the wait events "db file sequential read" (blocking I/O) and "db file parallel read" (asynchronous I/O). You can read more on the topic of tracing Oracle I/O in "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap".

Figure 2: Flame graph visualization of the stack profile for an I/O bound Oracle process running SLOB. The process is spending the majority of the time executing I/O or waiting for I/O calls to return. Click here for a svg version of the graph.


Conclusions

Stack profiling and flame graphs are useful techniques and tools for troubleshooting and performance investigations at the OS level. Flame graphs are often used for profiling CPU-bound processes, for example on Linux with the help of perf. This post describes how stack profiling of kernel traces from the /proc filesystem can be used to investigate workloads that spend a significant amount of their time off-CPU and/or in kernel code (for example sleeping while waiting for I/O or network system calls to complete). A simple tool for stack sampling, kstacksampler.sh, complements this blog post, together with two examples of its use for investigating network-bound and I/O-bound Oracle processes.

References and acknowledgements

Brendan Gregg's flame graphs are the starting point for this work. Tanel Poder has also published several original investigations of stack profiling for Oracle troubleshooting, notably the blog post "Peeking into Linux kernel-land using /proc filesystem for quick’n’dirty troubleshooting". Kevin Closson has also published work on stack profiling with perf and OProfile, and he is the author of SLOB, used in Example 2 of this post.