Monday, January 25, 2016

Linux Perf Probes for Oracle Tracing

Topic: this post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.

Context

The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg's web pages for summary and future developments on dynamic traces for Linux. Some recent work on applying these tools and techniques to Oracle can be found in: Tanel Poder's posts on "RAM is the new disk", Frits Hoogland's post on "Stapflame" and in this blog at "Extended stack profiling".

Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning, are probes for measuring I/O latency at microsecond precision. Another topic, more complex but also more exciting, is tracing Oracle userspace. This topic is of growing interest with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.

DTrace is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link.

SystemTap is a tool that offers a mature interface to tracing Linux kernel and the userspace. Notably SystemTap offers the possibility to write probes with advanced aggregations similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog with work done in collaboration with Frits Hoogland: see this link and this link and the scripts at this Github repository. An advantage of SystemTap can trace userspace for RHEL/OL 6.x using the utrace.

Linux Perf is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent tool for data gathering for on-CPU flame graphs (see this link and references below). Perf has gained a reputation to be stable and relatively low overhead, which makes it useful for troubleshooting production workloads (mileage may vary).

In this post you will find examples, of how Perf  can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracle instrumentation and can be used for tracing Oracle and for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.


Perf probes for Oracle

If you are not familiar with Linux Perf, you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html. The focus of this post is with perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5, this means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested for Oracle 11.2.0.4 running on OL7.2 with UEK4.

Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog (see also the example scripts in github). A short description of three functions of interest before that will be used later in the example scripts:


Function namePurposeSelected parameters
kskthbwtKernel service Kompile thread begin wait.
This function is called at the start of an Oracle wait event.
kslwtbctx is its parent function call and marks the start of a wait event. 
The suffix "bwt" most likely stands for "begin wait".
Parameters and registers:
11.2.0.4 - register r14  -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)
Note for 12.1.0.2 - register r13  -> pointer into X$KSUSE 

register rsi -> timestamp of the beginning of the wait (in microseconds)
register rdx -> wait event number
kskthewtKernel service Kompile thread end wait.
This function is called at the end of an Oracle wait event.
kslwtectx is its parent function call marking the end of a wait event.
The suffix "ewt" most likely stands for "end wait".
Parameters and registers:
register r13 -> pointer into X$KSUSE (V$SESSION) array in the SGA  with a fixed offset (version-dependent)
register rdi -> timestamp of the beginning of the wait (in microseconds)
register rsi -> wait event number
kews_update_wait_timeKernel event wait statistics update wait time.
This function is called at the end of an Oracle wait event.
kslwtectx is its parent function call marking the end of a wait event.
Parameters and registers:

register rsi -> wait time

register r13 (relevant for 11.2.0.4, it does not work on 12.1.0.2) -> event number
register r15 -> pointer into X$KSUSE (V$SESSION) array in the SGA  with a fixed offset (version-dependent)


You can get started with Perf tracing of Oracle userspace using the following:

# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si

If all worked well  you will have created a new probe for Oracle userspace tracing and will see the following output.

Added new event:
  probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)

You can now use it in all perf tools, such as:

        perf record -e probe_oracle:kskthewt -aR sleep 1

Note in case of errors, check the OS and Oracle version compatibility discussed elsewhere in this post. Go ahead and create two more probes:

# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13

You should have now three probes on Oracle, you verify this by listing them with:  "perf probe -l". If needed, you can delete probes with this syntax: "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session for example). Find the OS pid of the Oracle process to probe and run the following:

# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>

Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here an  example from "select count(*) from <testtable>" tun on a test databas:

# perf script

oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...

The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example perf script output is filtered by piping it to a series of commands: (1) to reduce the number of fields displayed, (2) convert hexadecimal numbers into decimal, (3) convert wait event numbers in wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script.

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...

What the examples show is that trace data that you would normally find in a 10046 trace file (or  trace for EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE) if you prefer) can be collected using OS commands alone. Perf with utrace can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above are information on the parameters of the wait event (usually called p1, p2, p3). This is the subject of the next paragraph.


It only gets better from here: perf with uprobes can read Oracle SGA

Uprobes has received a major improvement with kernel 3.14 to include the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt). As I write this Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.

Many details and metrics of the Oracle instrumentation coming from V$ views and their underlying X$ tables can be directly accessed by reading memory structures in the SGA (this is not the case for all X$, as many of them are  rather implemented as internal functions). X$KSUSE exposes many fields of V$SESSION. The memory location of the fields of interest can be identified using SQL as in this example.
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE is found by experimentation. This has been investigated together with Frits. We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version depended. This work has already been done for SystemTap probes and a working example can be found at this link for 11.2.0,4 and this link for 12.1.0.2.

This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:

# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1='-2080(%r13)' p2='-2072(%r13)' p3='-2064(%r13)' sql_hash='-1828(%r13)'


# perf record -e probe_oracle:kskthewt -p <pid>

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...

More fields of interest from X$KSUSE or other areas of the SGA can be extracted as arguments of perf by extending the example above.

Example of tracing Oracle parse functions:

Function namePurposeSelected parameters
opiprsOracle Program Interface - Parse

Use to trace hard parsing.

This function is called when doing hard parsing
(i.e. when a SQL statement that is not in the library cache needs to be parsed).


Useful information when probing the function:

register rdx -> sql statement length        
register rsi -> pointer to the SQL text string                                                           
kksParseCursorUse to trace soft parsing.
This is one of the functions in the call stack when Oracle parses a cursors already in the library cache
Useful information when probing the function: 

11.2.0.4:
   register rsi -> sql statement lenght

   register rbx -> pointer to the SQL text string

12.1.0.2:
   
   register r9 ->sql statement lenght
   register r8 -> pointer to the SQL text string 



This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):

perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql='+0(%si)':"string"

# perf record -e probe_oracle:opiprs -p <pid> 
# perf script

oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"

Similarly for tracing soft parsing (11.2.0.4 version) you can use the following probe:

perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql='+0(%bx)':"string"

Many more Oracle functions can be traced:

From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: it is of the order of 10^5 text (code) sections. Obviously too many details and also undocumented. Some pointers to understand the function names can be found in MOS (see in especially the bug-related documents), from sqlplus running oradebug doc and from various blogs and presentations (see references).
Here below a short table of functions of interest for tracing logical and physical I/O taken from this blog post. The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O for example. Note that tracing logical I/O on a busy system can have an important overhead because of the large number of events per second.


Function namePurposeSelected parameters
kcbgtcrKernel Cache Buffers Get Consistent Read

Note: key function for consistent reads
tbs#=user_int32(%rdi) 
rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF                                                      
block#=user_int32(%rdi+4) & 0x003FFFFF
data_object_id#=user_int32(%rdi+8)
object_id#=user_int32(%rdi+12)

Note: for bigfile tablespaces:
block#=user_int32(%rdi+4)
kcbgcurKernel Cache Buffers Current

Note: key function for current reads
tbs#=user_int32(%rdi)
rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF
block#=user_int32(%rdi+4) & 0x003FFFFF
data_object_id#=user_int32(%rdi+8)
object_id#=user_int32(%rdi+12)

Note: for bigfile tablespaces:
block#=user_int32(%rdi+4)
kcbzibkcbZIB should stand for: Kernel Cache Buffers, Z
(kcbz.o is a module for physical IO helper functions), IB: Input Buffer


Note: this function is part of the physical I/O stack:
it indicates that Oracle will perform physical read(s) into the buffer cache

kcbzgbThe suffix GB in kcbZGB should stand for:
Get (space for) Buffer


Oracle allocates space in the buffer cache for a given block (typically before I/O operations). 

tbs n#=%rsi,
rel file n#=%rdx >> 22 & 0x003FFFFF
block#=%rdx & 0x003FFFFF
data_object_id#=%r8
object_id#=%r9

Note: for bigfile tablespaces:
block#=%rdx
kcbzvbInvoked after Oracle has read a given block

Note: this function is used both for reads in the buffer cache and for direct reads

tbs n#=%rsi, 
rel file n#=%rdx >> 22 & 0x003FFFFF,
block#=%rdx & 0x003FFFFF
data_object_id=user_int32(%rdi+24)

Note: for bigfile tablespaces:
block#=%rdx



Combining Perf probes and Oracle traces

What you have seen so far can also be described as a way to extract information normally available with Oracle diagnostic events, using perf and uprobes instead. Additional value to these methods comes when you combine them with perf probes for tracing kernel and hardware activity. There are many Perf events available. See also the output of perf list and the references in this post.

Study of Oracle I/O and wait events:
The example below combines perf events on tracepoints for system calls to the I/O interface in Linux and probes on the Oracle wait event interface. The example below records a trace for the Oracle userspace functions marking the beginning and end of a wait and for system calls to the I/O interface:

# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

The output is processed for improved readability:

# perf script |  \
  awk '{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}' |  \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' |  \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721

What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See also http://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html for additional details.

Oracle wait events and perf scheduler events

Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs with perf.

This is an example of tracing the CPU scheduler together with I/O related events with Oracle wait events:

# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

After capturing the workload you can extract atrace file running "perf script". In the table below you can see a snippet of the trace filtered with a simple scripts to improve the readability. You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events.

EVENT NAME                      BEGIN TS        END TS          ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226   4413050521261   35
sched_stat_runtime              4413050521179   4413050521291   112
sched_stat_runtime              4413050521293   4413050521530   237
syscall:pread64                 4413050521392   4413050535684   14292
event=db file sequential read   4413050521375   4413050535704   14329
syscall:io_submit               4413050536073   4413050536249   176
sched_stat_runtime              4413050535593   4413050536551   958
syscall:io_submit               4413050536343   4413050536623   280
sched_stat_runtime              4413050536548   4413050536686   138
syscall:io_getevents            4413050536678   4413050537650   972
event=direct path read          4413050536652   4413050537677   1025
syscall:io_submit               4413050537807   4413050537924   117
sched_stat_runtime              4413050537563   4413050537963   400
syscall:io_getevents            4413050537956   4413050539289   1333
event=direct path read          4413050537943   4413050539315   1372
sched_stat_runtime              4413050539134   4413050539553   419
..


Ideas for future work

Stack profiling and flame graphs: An additional step to get more value out of perf traces is to visualize their output. Flame graphs and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle at link1 and link2 on extended stack profiling with stapflame by Frits at this link and with stack profilers at this link.
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.

Perf hardware events: Combining Oracle traces traces with hardware and PMU counters is another area of interest for future investigations. This could be useful for example for studying logical reads and their dependencies on events at HW level on memory and/or processor cache level. 
Add and documenting more probes for Oracle userspace function tracing: For example it would be useful to develop a set of perf probes for replicating the 10046 trace event.

Linux and Oracle version compatibility:

The definition of probes on Oracle executable needs uprobes, available from kernel version 3.5. When testing the examples discussed in post on RHEL/OL, you can use Kernel UEK4 or UEK3 or use RHEL 7.x.  Notably, examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works OK for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works OK on  RHEL7.0 (kernel 3.10.0-123) and UEK (kernel 3.8.x), but does not work for kernels that ship with RHEL 7.1,7,2 and anything higher (including UEK4). When testing the easiest is to use Oracle 11.2 or if you want to test Oracle 12.1 use UEK3 or RHEL 7.0 kernel. More investigations are needed on this topic.


Conclusions

Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They allow to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular Perf with uprobes can be used:
  • As an alternative to standard instrumentation for Oracle tracing with diagnostic events: see the examples of using perf probes for tracing wait event, sql parsing, physical and logical I/O.
  • As a tool for combining OS-related probes (tracepoints for I/O, network, scheduler or hardware events) and Oracle diagnostic information, for advanced troubleshooting.
  • As a data collection tool for extended profiling of Oracle workloads, which combines stack traces, scheduler information, wait events or other Oracle diagnostic data for flame graph visualization.

Credits and references

Kudos to Brendan Gregg for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland for his investigations of Oracle internals, for collaboration on Oracle internals investigations and tracing tools and for proof reading this post. Many thanks to Tanel Poder for his work on Oracle internals and performance.
Additional links on topics related to this post:

http://www.brendangregg.com/perf.html
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/
https://alexanderanokhin.wordpress.com/tools/digger/
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe.txt
http://linux.die.net/man/1/perf-probe
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt