
Tuesday, May 31, 2016

Linux BPF/bcc for Oracle Tracing

Topic: In this post you will find a short discussion and pointers to the code of a few sample scripts that I have written using Linux BPF/bcc and uprobes for Oracle tracing.

Previous work and motivations

Tools for dynamic tracing are very useful for troubleshooting and internals investigations of Oracle workloads. Dynamic tracing probes on the OS/kernel can be used to measure the details of I/O latency, for example. Moreover, probes on Oracle userspace functions can be used to complement Oracle instrumentation and probe deeper into the internals of the Oracle engine, when needed. For example, in work done in collaboration with Frits Hoogland, we have investigated how to link the Oracle wait event interface with tools able to probe the userspace, such as gdb and dynamic tracing tools like DTrace, SystemTap and Perf. More details and examples on this topic can be found in previous work: Modern Linux Tools for Oracle Troubleshooting (presentation at UKOUG Tech14), Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, Linux Perf Probes for Oracle Tracing, Flame Graphs for Oracle.


What's new with BPF/bcc

BPF together with its frontend bcc are new and exciting technologies in the Linux dynamic tracing ecosystem. If you are not familiar with them, you can get up to speed using the excellent material in Brendan Gregg's blog. While the tools are still in development as I write this, many important features are already available starting from kernel 4.5, notably the support for uprobes that I have used in the scripts discussed later in this post.

I have started experimenting with porting a few probes for the Oracle userspace that I had written using SystemTap and discussed in this blog. While the language/syntax is completely different between SystemTap and BPF/bcc, porting the probes has turned out to be relatively straightforward. The work has been made substantially easier by the availability of a large selection of well-written and documented scripts in the tools directory of bcc. At present, writing BPF/bcc scripts is a mixture of coding in Python and C, which feels to me both familiar and at the same time strangely low-level compared to more mature environments such as DTrace and SystemTap.
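To give a flavor of the programming model, here is a minimal sketch (for illustration only, not one of the scripts discussed below) that counts wait events per process by attaching a uprobe to the Oracle function kskthewt; the Oracle binary path is an assumption for my test environment:

#!/usr/bin/python
# Minimal BPF/bcc uprobe sketch: count per-pid calls to an Oracle function.
# The Oracle binary path below is an assumption for a test environment.
from __future__ import print_function
from bcc import BPF
from time import sleep

prog = """
BPF_HASH(calls, u32, u64);

int trace_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *cnt = calls.lookup_or_init(&pid, &zero);
    if (cnt)
        (*cnt)++;
    return 0;
}
"""

b = BPF(text=prog)
# kskthewt is called by Oracle at the end of each wait event
b.attach_uprobe(name="/u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle",
                sym="kskthewt", fn_name="trace_entry")

print("Counting wait events per pid... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
for pid, cnt in b["calls"].items():
    print("pid %d: %d wait events" % (pid.value, cnt.value))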

The example scripts that I have written using BPF/bcc are on Github. Here is a list with comments:

Script name - Short description

ora_sqlparse_trace.py - Tracing of Oracle SQL parsing. This script traces SQL hard parsing on Oracle binaries, hooking on the Oracle function "opiprs" and reading from function arguments (CPU registers) and from process memory.

ora_wait_trace.py - Tracing of Oracle wait events. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reading from function arguments (CPU registers).

ora_logicalIO_histogram.py - Logical IO latency histograms. This script measures the latency between call and return time for the Oracle function "kcbgtcr", which is an important part of the logical IO processing for consistent reads.

ora_wait_histogram.py - Wait event latency histograms. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reading from function arguments (CPU registers). BPF computes the latency histogram for the wait events and the script prints the values on stdout.
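The technique behind ora_logicalIO_histogram.py is to store a timestamp at function entry with a uprobe, compute the elapsed time at function return with a uretprobe, and aggregate into a power-of-two histogram entirely in kernel space. A simplified sketch of this mechanism follows (for illustration; the full scripts are in the Github repository and the binary path is an assumption):

#!/usr/bin/python
# Sketch of the call/return latency technique used by the histogram scripts.
# Illustration only; see the Github repository for the full scripts.
from __future__ import print_function
from bcc import BPF
from time import sleep

prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int func_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int func_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;   /* missed the entry probe */
    u64 delta = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
oracle_bin = "/u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle"  # assumption
b.attach_uprobe(name=oracle_bin, sym="kcbgtcr", fn_name="func_entry")
b.attach_uretprobe(name=oracle_bin, sym="kcbgtcr", fn_name="func_return")

print("Measuring kcbgtcr latency... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()
b["dist"].print_log2_hist("latency, microsec")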

An example of the usage of ora_wait_histogram.py to measure and display wait event latency:

# stdbuf -oL ./ora_wait_histogram.py 10 10|sed -e 's/event# = /event#=/g' -f eventsname.sed
Start tracing oracle wait events... Hit Ctrl-C to end.

event=db file sequential read
     wait time, microsec : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 25       |                                        |
       128 -> 255        : 24521    |********************                    |
       256 -> 511        : 46788    |****************************************|
       512 -> 1023       : 12169    |**********                              |
      1024 -> 2047       : 1132     |                                        |
      2048 -> 4095       : 660      |                                        |
      4096 -> 8191       : 248      |                                        |
      8192 -> 16383      : 29       |                                        |


Latency heat maps to display histograms collected with BPF/bcc

PyLatencyMap is a command-line tool for visualizing latency histograms as heat maps on terminal graphics, using ANSI escape codes. PyLatencyMap can be used to investigate I/O performance for random I/O and is especially suited for cases with multiple modes of response time from the storage (SSD cache, HDD, latency outliers). The original idea for PyLatencyMap comes from the work of Brendan Gregg on latency heat maps. I have added to the PyLatencyMap repository a modified version of the biolatency.py script to measure I/O latency histograms: this is the link to the script pylatencymap_biolatency.py and an example of heat maps generated with PyLatencyMap with the script Example11_BPF-bcc_blockIO_latency.sh:



The test workload has been generated using Kevin Closson's SLOB. Additional references with a more detailed discussion of the topic of testing Oracle I/O with SLOB and measuring latency heat maps are:
Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap and OraLatencyMap v1.1 and Testing I/O with SLOB 2.


Notes on the test environment

The scripts discussed in this post have been developed on Fedora 24 (alpha) running Linux kernel version 4.6 (using the rawhide kernel) and have been tested on workloads generated using Oracle version 11.2.0.4. This is not an Oracle-supported configuration and the scripts are intended mainly as a demonstration of the technology and for learning purposes.
Here are some pointers on the steps I used to set up a lab environment for testing:

Note in particular the step for configuring the rawhide kernel, probably a good choice when testing BPF, as new features are being added on a regular basis as I write this:
# sudo dnf config-manager --add-repo=http://alt.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/fedora-rawhide-kernel-nodebug.repo

# sudo dnf update

Additional pointers and recipes on how to install Oracle 11.2 on Fedora can be found on Tim Hall's website at: https://oracle-base.com/articles/11g/articles-11g


Conclusions

BPF and its bcc frontend are new and powerful tools for dynamic tracing on Linux. A discussion of the merits and shortfalls of BPF/bcc vs other existing solutions in the Linux dynamic tracing ecosystem is beyond the scope of this post. What you can find in this post are a few example scripts that I have written for tracing Oracle using BPF/bcc and uprobes, and an additional script for integrating BPF/bcc with PyLatencyMap, which provides heat-map visualization of the histograms generated using BPF/bcc.
Although the BPF/bcc environment is still evolving, it already appears to be a very useful addition to the toolbox for troubleshooting and performance investigations of Linux workloads. The set of example scripts/tools that comes with the bcc repository is of great help for getting started, both with using BPF/bcc and with writing new scripts. BPF/bcc can only run on relatively new kernels (as I write this, I have tested the scripts discussed in this post on kernel version 4.6) and this is an obstacle for its adoption in many environments, at least in the short term.

Note added, February 2019: Red Hat and Oracle Linux 7.6 have backported BPF and can run bcc scripts (yum install bcc*). I have updated the script repository accordingly, see https://github.com/LucaCanali/Linux_tracing_scripts/tree/master/BPF-bcc_Userspace_Oracle


Credits and acknowledgements

Many of the original ideas and tools discussed here are inspired by or directly derived from the awesome published work of Brendan Gregg.
Many thanks also to the development teams of BPF and bcc for providing and supporting these powerful new tools. In particular thanks to Brenden Blanco for his work on uprobes for bcc and for his support on issue #478.
The work of investigating Oracle userspace with dynamic tracing tools has been done in collaboration with Frits Hoogland.

Monday, January 25, 2016

Linux Perf Probes for Oracle Tracing

Topic: This post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.

Context

The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg's web pages for a summary of current and future developments in dynamic tracing for Linux. Some recent work on applying these tools and techniques to Oracle can be found in: Tanel Poder's posts on "RAM is the new disk", Frits Hoogland's post on "Stapflame" and in this blog at "Extended stack profiling".

Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning is probing I/O latency at microsecond precision. Another topic, more complex but also more exciting, is tracing the Oracle userspace. This topic is of growing interest with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.

DTrace is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link.

SystemTap is a tool that offers a mature interface for tracing the Linux kernel and the userspace. Notably, SystemTap offers the possibility to write probes with advanced aggregations, similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog with work done in collaboration with Frits Hoogland: see this link and this link and the scripts at this Github repository. An advantage of SystemTap is that it can trace userspace also on RHEL/OL 6.x, using the utrace interface.

Linux Perf is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent data-gathering tool for on-CPU flame graphs (see this link and references below). Perf has gained a reputation for being stable and relatively low-overhead, which makes it useful for troubleshooting production workloads (mileage may vary).

In this post you will find examples of how Perf can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracle instrumentation that can be used for tracing Oracle, for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.


Perf probes for Oracle

If you are not familiar with Linux Perf, you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html. The focus of this post is on perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5, which means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested with Oracle 11.2.0.4 running on OL7.2 with UEK4.

Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog (see also the example scripts in github). Here is a short description of three functions of interest that will be used later in the example scripts:


Function name: kskthbwt
Purpose: Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event. The suffix "bwt" most likely stands for "begin wait".
Parameters and registers:
- register r14 (11.2.0.4) -> pointer into the X$KSUSE (V$SESSION) array in the SGA, with a fixed offset (version-dependent); note for 12.1.0.2: this is register r13
- register rsi -> timestamp of the beginning of the wait (in microseconds)
- register rdx -> wait event number

Function name: kskthewt
Purpose: Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event. The suffix "ewt" most likely stands for "end wait".
Parameters and registers:
- register r13 -> pointer into the X$KSUSE (V$SESSION) array in the SGA, with a fixed offset (version-dependent)
- register rdi -> timestamp of the beginning of the wait (in microseconds)
- register rsi -> wait event number

Function name: kews_update_wait_time
Purpose: Kernel event wait statistics update wait time. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call, marking the end of a wait event.
Parameters and registers:
- register rsi -> wait time
- register r13 -> event number (relevant for 11.2.0.4, it does not work on 12.1.0.2)
- register r15 -> pointer into the X$KSUSE (V$SESSION) array in the SGA, with a fixed offset (version-dependent)


You can get started with Perf tracing of Oracle userspace using the following:

# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si

If all worked well, you have created a new probe for Oracle userspace tracing and will see the following output:

Added new event:
  probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)

You can now use it in all perf tools, such as:

        perf record -e probe_oracle:kskthewt -aR sleep 1

Note: in case of errors, check the OS and Oracle version compatibility notes discussed later in this post. Go ahead and create two more probes:

# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13

You should now have three probes on the Oracle executable; you can verify this by listing them with "perf probe -l". If needed, you can delete probes with this syntax: "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session, for example). Find the OS pid of the Oracle process to probe and run the following:

# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>

Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here is an example from "select count(*) from <testtable>" run on a test database:

# perf script

oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...

The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example the perf script output is filtered by piping it to a series of commands that (1) reduce the number of fields displayed, (2) convert hexadecimal numbers into decimal, (3) convert wait event numbers into wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script.

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...
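As an aside, eventsname.sed is simply a list of sed substitutions mapping event numbers to event names, generated from v$event_name. The original is produced with a sqlplus script; the following Python sketch (using the cx_Oracle driver, an assumption of my lab setup, with simplified escaping) illustrates the idea:

# Illustrative sketch: generate eventsname.sed from v$event_name.
# The original file is produced with a sqlplus script; cx_Oracle and the
# OS-authenticated sysdba connection are assumptions of a lab setup.
import cx_Oracle

conn = cx_Oracle.connect("/", mode=cx_Oracle.SYSDBA)
with open("eventsname.sed", "w") as f:
    for num, name in conn.cursor().execute("select event#, name from v$event_name"):
        # rewrite "event#=<N>" into the human-readable event name
        # (escaping of special characters is simplified here)
        f.write("s/event#={0}\\b/event={1}/g\n".format(num, name.replace("/", r"\/")))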

What the examples show is that trace data that you would normally find in a 10046 trace file (or a trace from EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE), if you prefer) can be collected using OS commands alone. Perf with uprobes can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above is information on the parameters of the wait events (usually called p1, p2, p3). This is the subject of the next paragraph.


It only gets better from here: perf with uprobes can read Oracle SGA

Uprobes received a major improvement in kernel 3.14 with the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt). As I write this, Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.

Many details and metrics of the Oracle instrumentation coming from V$ views and their underlying X$ tables can be directly accessed by reading memory structures in the SGA (this is not the case for all X$ tables, as many of them are rather implemented as internal functions). X$KSUSE exposes many fields of V$SESSION. The memory location of the fields of interest can be identified using SQL as in this example.
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE has been found by experimentation, investigated together with Frits. We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version-dependent. This work has already been done for SystemTap probes and a working example can be found at this link for 11.2.0.4 and this link for 12.1.0.2.
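For reference, the column offsets inside an X$KSUSE record can be listed from the fixed-table dictionary (X$KQFTA and X$KQFCO, accessible as SYS). A sketch of such a query, wrapped in Python for consistency with the other examples here (the fixed-table column names follow the standard layout; the connection details are an assumption):

# Sketch: list the column offsets of X$KSUSE from the fixed-table dictionary.
# Run as SYS; connection method and formatting are illustrative assumptions.
import cx_Oracle

conn = cx_Oracle.connect("/", mode=cx_Oracle.SYSDBA)
query = """
    select c.kqfconam column_name, c.kqfcooff offset
    from x$kqfco c, x$kqfta t
    where c.kqfcotab = t.indx
      and t.kqftanam = 'X$KSUSE'
      and c.kqfcooff > 0
    order by c.kqfcooff"""
for column_name, offset in conn.cursor().execute(query):
    print("{0:<30} offset={1}".format(column_name, offset))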

This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:

# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1='-2080(%r13)' p2='-2072(%r13)' p3='-2064(%r13)' sql_hash='-1828(%r13)'


# perf record -e probe_oracle:kskthewt -p <pid>

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...

More fields of interest from X$KSUSE or other areas of the SGA can be extracted as arguments of perf by extending the example above.

Example of tracing Oracle parse functions:

Function name: opiprs
Purpose: Oracle Program Interface - Parse. Use to trace hard parsing: this function is called when doing hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Useful information when probing the function:
- register rdx -> SQL statement length
- register rsi -> pointer to the SQL text string

Function name: kksParseCursor
Purpose: Use to trace soft parsing. This is one of the functions in the call stack when Oracle parses a cursor that is already in the library cache.
Useful information when probing the function:
- 11.2.0.4: register rsi -> SQL statement length; register rbx -> pointer to the SQL text string
- 12.1.0.2: register r9 -> SQL statement length; register r8 -> pointer to the SQL text string



This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):

# perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql='+0(%si)':"string"

# perf record -e probe_oracle:opiprs -p <pid> 
# perf script

oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"

Similarly for tracing soft parsing (11.2.0.4 version) you can use the following probe:

# perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql='+0(%bx)':"string"

Many more Oracle functions can be traced:

From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: the text (code) section contains on the order of 10^5 symbols. This is obviously too much detail, and the functions are undocumented. Some pointers for understanding the function names can be found in MOS (see especially the bug-related documents), from running "oradebug doc" in sqlplus, and from various blogs and presentations (see references).
Below is a short table of functions of interest for tracing logical and physical I/O, taken from this blog post. The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O for example. Note that tracing logical I/O on a busy system can have a significant overhead because of the large number of events per second.


Function name: kcbgtcr
Purpose: Kernel Cache Buffers Get Consistent Read. Note: key function for consistent reads.
Selected parameters:
- tbs#=user_int32(%rdi)
- rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
- block#=user_int32(%rdi+4) & 0x003FFFFF
- data_object_id#=user_int32(%rdi+8)
- object_id#=user_int32(%rdi+12)
- Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbgcur
Purpose: Kernel Cache Buffers Current. Note: key function for current reads.
Selected parameters:
- tbs#=user_int32(%rdi)
- rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
- block#=user_int32(%rdi+4) & 0x003FFFFF
- data_object_id#=user_int32(%rdi+8)
- object_id#=user_int32(%rdi+12)
- Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbzib
Purpose: kcbZIB should stand for: Kernel Cache Buffers, Z (kcbz.o is a module for physical IO helper functions), IB: Input Buffer. Note: this function is part of the physical I/O stack: it indicates that Oracle will perform physical read(s) into the buffer cache.

Function name: kcbzgb
Purpose: the suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
Selected parameters:
- tbs n#=%rsi
- rel file n#=%rdx >> 22 & 0x003FFFFF
- block#=%rdx & 0x003FFFFF
- data_object_id#=%r8
- object_id#=%r9
- Note: for bigfile tablespaces: block#=%rdx

Function name: kcbzvb
Purpose: invoked after Oracle has read a given block. Note: this function is used both for reads into the buffer cache and for direct reads.
Selected parameters:
- tbs n#=%rsi
- rel file n#=%rdx >> 22 & 0x003FFFFF
- block#=%rdx & 0x003FFFFF
- data_object_id=user_int32(%rdi+24)
- Note: for bigfile tablespaces: block#=%rdx
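The bit manipulations in the table decode Oracle's relative data block address (rdba): the top 10 bits hold the relative file number and the low 22 bits the block number (for smallfile tablespaces). A quick sanity check of the arithmetic in Python, using the p1/p2 values of the "db file sequential read" trace shown later in this post:

# Decode a 32-bit relative data block address (rdba) as in the probe
# expressions above (smallfile tablespaces).
def decode_rdba(rdba):
    # top 10 bits: relative file number; low 22 bits: block number
    rel_file_no = (rdba >> 22) & 0x003FFFFF
    block_no = rdba & 0x003FFFFF
    return rel_file_no, block_no

# rdba 0x01024082 corresponds to file 4, block 147586 (see the
# "db file sequential read" example with p1=4 p2=147586 below)
assert decode_rdba(0x01024082) == (4, 147586)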



Combining Perf probes and Oracle traces

What you have seen so far can also be described as a way to extract information normally available with Oracle diagnostic events, using perf and uprobes instead. Additional value to these methods comes when you combine them with perf probes for tracing kernel and hardware activity. There are many Perf events available. See also the output of perf list and the references in this post.

Study of Oracle I/O and wait events:
The example below combines perf tracepoint events for system calls of the Linux I/O interface with probes on the Oracle wait event interface: it records a trace of the Oracle userspace functions marking the beginning and end of a wait, together with the system calls to the I/O interface:

# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

The output is processed for improved readability:

# perf script |  \
  awk '{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}' |  \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' |  \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721

What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See also http://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html for additional details.

Oracle wait events and perf scheduler events

Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs with perf.

This is an example of tracing the CPU scheduler together with I/O-related events and Oracle wait events:

# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

After capturing the workload you can extract a trace file by running "perf script". In the table below you can see a snippet of the trace, filtered with a simple script to improve readability. You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events.

EVENT NAME                      BEGIN TS        END TS          ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226   4413050521261   35
sched_stat_runtime              4413050521179   4413050521291   112
sched_stat_runtime              4413050521293   4413050521530   237
syscall:pread64                 4413050521392   4413050535684   14292
event=db file sequential read   4413050521375   4413050535704   14329
syscall:io_submit               4413050536073   4413050536249   176
sched_stat_runtime              4413050535593   4413050536551   958
syscall:io_submit               4413050536343   4413050536623   280
sched_stat_runtime              4413050536548   4413050536686   138
syscall:io_getevents            4413050536678   4413050537650   972
event=direct path read          4413050536652   4413050537677   1025
syscall:io_submit               4413050537807   4413050537924   117
sched_stat_runtime              4413050537563   4413050537963   400
syscall:io_getevents            4413050537956   4413050539289   1333
event=direct path read          4413050537943   4413050539315   1372
sched_stat_runtime              4413050539134   4413050539553   419
...
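The filtering for the table above pairs each begin/enter event with its matching end/exit event and prints the elapsed time. A simplified sketch of such a post-processing script (field positions assume the default perf script output format shown earlier; only two event pairs are handled, for illustration, and the script name is hypothetical):

#!/usr/bin/python
# pair_events.py - simplified sketch: pair begin/end events from
# "perf script" output and print elapsed microseconds.
import sys

open_events = {}  # pending end-probe name -> begin timestamp (microseconds)

# map each "begin" probe to the probe that marks its end
PAIRS = {'probe_oracle:kskthbwt': 'probe_oracle:kskthewt',
         'syscalls:sys_enter_pread64': 'syscalls:sys_exit_pread64'}

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 5:
        continue
    ts_usec = int(float(fields[3].rstrip(':')) * 1e6)  # timestamp field
    probe = fields[4].rstrip(':')                      # probe name field
    if probe in PAIRS:
        open_events[PAIRS[probe]] = ts_usec
    elif probe in open_events:
        begin = open_events.pop(probe)
        print('%-35s BEGIN=%d END=%d ELAPSED=%d' %
              (probe, begin, ts_usec, ts_usec - begin))

It would be run as: perf script | ./pair_events.py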


Ideas for future work

Stack profiling and flame graphs: an additional step to get more value out of perf traces is to visualize their output. Flame graphs and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle at link1 and link2, on extended stack profiling with stapflame by Frits at this link, and on stack profilers at this link.
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.

Perf hardware events: combining Oracle traces with hardware and PMU counters is another area of interest for future investigations. This could be useful for example for studying logical reads and their dependencies on events at the HW level, on memory and/or processor cache level.
Adding and documenting more probes for Oracle userspace function tracing: for example it would be useful to develop a set of perf probes for replicating the 10046 trace event.

Linux and Oracle version compatibility:

The definition of probes on the Oracle executable needs uprobes, available from kernel version 3.5. When testing the examples discussed in this post on RHEL/OL, you can use kernel UEK3 or UEK4, or RHEL 7.x. Notably, the examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works OK for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works OK on RHEL 7.0 (kernel 3.10.0-123) and UEK3 (kernel 3.8.x), but does not work with the kernels that ship with RHEL 7.1, 7.2 and anything higher (including UEK4). When testing, the easiest option is to use Oracle 11.2; if you want to test Oracle 12.1, use UEK3 or the RHEL 7.0 kernel. More investigations are needed on this topic.


Conclusions

Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They make it possible to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular, Perf with uprobes can be used:
  • As an alternative to standard instrumentation for Oracle tracing with diagnostic events: see the examples of using perf probes for tracing wait events, SQL parsing, physical and logical I/O.
  • As a tool for combining OS-related probes (tracepoints for I/O, network, scheduler or hardware events) and Oracle diagnostic information, for advanced troubleshooting.
  • As a data collection tool for extended profiling of Oracle workloads, which combines stack traces, scheduler information, wait events or other Oracle diagnostic data for flame graph visualization.

Credits and references

Kudos to Brendan Gregg for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland for his investigations of Oracle internals, for the collaboration on tracing tools, and for proofreading this post. Many thanks to Tanel Poder for his work on Oracle internals and performance.
Additional links on topics related to this post:

http://www.brendangregg.com/perf.html
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/
https://alexanderanokhin.wordpress.com/tools/digger/
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe.txt
http://linux.die.net/man/1/perf-probe
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt