Monday, January 25, 2016

Linux Perf Probes for Oracle Tracing

Topic: this post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.

Context

The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg's web pages for summary and future developments on dynamic traces for Linux. Some recent work on applying these tools and techniques to Oracle can be found in: Tanel Poder's posts on "RAM is the new disk", Frits Hoogland's post on "Stapflame" and in this blog at "Extended stack profiling".

Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning, are probes for measuring I/O latency at microsecond precision. Another topic, more complex but also more exciting, is tracing Oracle userspace. This topic is of growing interest with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.

DTrace is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link.

SystemTap is a tool that offers a mature interface to tracing Linux kernel and the userspace. Notably SystemTap offers the possibility to write probes with advanced aggregations similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog with work done in collaboration with Frits Hoogland: see this link and this link and the scripts at this Github repository. An advantage of SystemTap can trace userspace for RHEL/OL 6.x using the utrace.

Linux Perf is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent tool for data gathering for on-CPU flame graphs (see this link and references below). Perf has gained a reputation to be stable and relatively low overhead, which makes it useful for troubleshooting production workloads (mileage may vary).

In this post you will find examples, of how Perf  can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracle instrumentation and can be used for tracing Oracle and for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.


Perf probes for Oracle

If you are not familiar with Linux Perf, you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html. The focus of this post is with perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5, this means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested for Oracle 11.2.0.4 running on OL7.2 with UEK4.

Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog (see also the example scripts in github). A short description of three functions of interest before that will be used later in the example scripts:


Function namePurposeSelected parameters
kskthbwtKernel service Kompile thread begin wait.
This function is called at the start of an Oracle wait event.
kslwtbctx is its parent function call and marks the start of a wait event. 
The suffix "bwt" most likely stands for "begin wait".
Parameters and registers:
11.2.0.4 - register r14  -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)
Note for 12.1.0.2 - register r13  -> pointer into X$KSUSE 

register rsi -> timestamp of the beginning of the wait (in microseconds)
register rdx -> wait event number
kskthewtKernel service Kompile thread end wait.
This function is called at the end of an Oracle wait event.
kslwtectx is its parent function call marking the end of a wait event.
The suffix "ewt" most likely stands for "end wait".
Parameters and registers:
register r13 -> pointer into X$KSUSE (V$SESSION) array in the SGA  with a fixed offset (version-dependent)
register rdi -> timestamp of the beginning of the wait (in microseconds)
register rsi -> wait event number
kews_update_wait_timeKernel event wait statistics update wait time.
This function is called at the end of an Oracle wait event.
kslwtectx is its parent function call marking the end of a wait event.
Parameters and registers:

register rsi -> wait time

register r13 (relevant for 11.2.0.4, it does not work on 12.1.0.2) -> event number
register r15 -> pointer into X$KSUSE (V$SESSION) array in the SGA  with a fixed offset (version-dependent)


You can get started with Perf tracing of Oracle userspace using the following:

# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si

If all worked well  you will have created a new probe for Oracle userspace tracing and will see the following output.

Added new event:
  probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)

You can now use it in all perf tools, such as:

        perf record -e probe_oracle:kskthewt -aR sleep 1

Note in case of errors, check the OS and Oracle version compatibility discussed elsewhere in this post. Go ahead and create two more probes:

# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13

You should have now three probes on Oracle, you verify this by listing them with:  "perf probe -l". If needed, you can delete probes with this syntax: "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session for example). Find the OS pid of the Oracle process to probe and run the following:

# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>

Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here an  example from "select count(*) from <testtable>" tun on a test databas:

# perf script

oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...

The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example perf script output is filtered by piping it to a series of commands: (1) to reduce the number of fields displayed, (2) convert hexadecimal numbers into decimal, (3) convert wait event numbers in wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script.

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...

What the examples show is that trace data that you would normally find in a 10046 trace file (or  trace for EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE) if you prefer) can be collected using OS commands alone. Perf with utrace can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above are information on the parameters of the wait event (usually called p1, p2, p3). This is the subject of the next paragraph.


It only gets better from here: perf with uprobes can read Oracle SGA

Uprobes has received a major improvement with kernel 3.14 to include the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt). As I write this Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.

Many details and metrics of the Oracle instrumentation coming from V$ views and their underlying X$ tables can be directly accessed by reading memory structures in the SGA (this is not the case for all X$, as many of them are  rather implemented as internal functions). X$KSUSE exposes many fields of V$SESSION. The memory location of the fields of interest can be identified using SQL as in this example.
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE is found by experimentation. This has been investigated together with Frits. We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version depended. This work has already been done for SystemTap probes and a working example can be found at this link for 11.2.0,4 and this link for 12.1.0.2.

This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:

# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1='-2080(%r13)' p2='-2072(%r13)' p3='-2064(%r13)' sql_hash='-1828(%r13)'


# perf record -e probe_oracle:kskthewt -p <pid>

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...

More fields of interest from X$KSUSE or other areas of the SGA can be extracted as arguments of perf by extending the example above.

Example of tracing Oracle parse functions:

Function namePurposeSelected parameters
opiprsOracle Program Interface - Parse

Use to trace hard parsing.

This function is called when doing hard parsing
(i.e. when a SQL statement that is not in the library cache needs to be parsed).


Useful information when probing the function:

register rdx -> sql statement length        
register rsi -> pointer to the SQL text string                                                           
kksParseCursorUse to trace soft parsing.
This is one of the functions in the call stack when Oracle parses a cursors already in the library cache
Useful information when probing the function: 

11.2.0.4:
   register rsi -> sql statement lenght

   register rbx -> pointer to the SQL text string

12.1.0.2:
   
   register r9 ->sql statement lenght
   register r8 -> pointer to the SQL text string 



This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):

perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql='+0(%si)':"string"

# perf record -e probe_oracle:opiprs -p <pid> 
# perf script

oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"

Similarly for tracing soft parsing (11.2.0.4 version) you can use the following probe:

perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql='+0(%bx)':"string"

Many more Oracle functions can be traced:

From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: it is of the order of 10^5 text (code) sections. Obviously too many details and also undocumented. Some pointers to understand the function names can be found in MOS (see in especially the bug-related documents), from sqlplus running oradebug doc and from various blogs and presentations (see references).
Here below a short table of functions of interest for tracing logical and physical I/O taken from this blog post. The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O for example. Note that tracing logical I/O on a busy system can have an important overhead because of the large number of events per second.


Function namePurposeSelected parameters
kcbgtcrKernel Cache Buffers Get Consistent Read

Note: key function for consistent reads
tbs#=user_int32(%rdi) 
rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF                                                      
block#=user_int32(%rdi+4) & 0x003FFFFF
data_object_id#=user_int32(%rdi+8)
object_id#=user_int32(%rdi+12)

Note: for bigfile tablespaces:
block#=user_int32(%rdi+4)
kcbgcurKernel Cache Buffers Current

Note: key function for current reads
tbs#=user_int32(%rdi)
rel file n#= user_int32(%rdi+4) >> 22 & 0x003FFFFF
block#=user_int32(%rdi+4) & 0x003FFFFF
data_object_id#=user_int32(%rdi+8)
object_id#=user_int32(%rdi+12)

Note: for bigfile tablespaces:
block#=user_int32(%rdi+4)
kcbzibkcbZIB should stand for: Kernel Cache Buffers, Z
(kcbz.o is a module for physical IO helper functions), IB: Input Buffer


Note: this function is part of the physical I/O stack:
it indicates that Oracle will perform physical read(s) into the buffer cache

kcbzgbThe suffix GB in kcbZGB should stand for:
Get (space for) Buffer


Oracle allocates space in the buffer cache for a given block (typically before I/O operations). 

tbs n#=%rsi,
rel file n#=%rdx >> 22 & 0x003FFFFF
block#=%rdx & 0x003FFFFF
data_object_id#=%r8
object_id#=%r9

Note: for bigfile tablespaces:
block#=%rdx
kcbzvbInvoked after Oracle has read a given block

Note: this function is used both for reads in the buffer cache and for direct reads

tbs n#=%rsi, 
rel file n#=%rdx >> 22 & 0x003FFFFF,
block#=%rdx & 0x003FFFFF
data_object_id=user_int32(%rdi+24)

Note: for bigfile tablespaces:
block#=%rdx



Combining Perf probes and Oracle traces

What you have seen so far can also be described as a way to extract information normally available with Oracle diagnostic events, using perf and uprobes instead. Additional value to these methods comes when you combine them with perf probes for tracing kernel and hardware activity. There are many Perf events available. See also the output of perf list and the references in this post.

Study of Oracle I/O and wait events:
The example below combines perf events on tracepoints for system calls to the I/O interface in Linux and probes on the Oracle wait event interface. The example below records a trace for the Oracle userspace functions marking the beginning and end of a wait and for system calls to the I/O interface:

# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

The output is processed for improved readability:

# perf script |  \
  awk '{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}' |  \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' |  \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721

What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See also http://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html for additional details.

Oracle wait events and perf scheduler events

Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs with perf.

This is an example of tracing the CPU scheduler together with I/O related events with Oracle wait events:

# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

After capturing the workload you can extract atrace file running "perf script". In the table below you can see a snippet of the trace filtered with a simple scripts to improve the readability. You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events.

EVENT NAME                      BEGIN TS        END TS          ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226   4413050521261   35
sched_stat_runtime              4413050521179   4413050521291   112
sched_stat_runtime              4413050521293   4413050521530   237
syscall:pread64                 4413050521392   4413050535684   14292
event=db file sequential read   4413050521375   4413050535704   14329
syscall:io_submit               4413050536073   4413050536249   176
sched_stat_runtime              4413050535593   4413050536551   958
syscall:io_submit               4413050536343   4413050536623   280
sched_stat_runtime              4413050536548   4413050536686   138
syscall:io_getevents            4413050536678   4413050537650   972
event=direct path read          4413050536652   4413050537677   1025
syscall:io_submit               4413050537807   4413050537924   117
sched_stat_runtime              4413050537563   4413050537963   400
syscall:io_getevents            4413050537956   4413050539289   1333
event=direct path read          4413050537943   4413050539315   1372
sched_stat_runtime              4413050539134   4413050539553   419
..


Ideas for future work

Stack profiling and flame graphs: An additional step to get more value out of perf traces is to visualize their output. Flame graphs and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle at link1 and link2 on extended stack profiling with stapflame by Frits at this link and with stack profilers at this link.
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.

Perf hardware events: Combining Oracle traces traces with hardware and PMU counters is another area of interest for future investigations. This could be useful for example for studying logical reads and their dependencies on events at HW level on memory and/or processor cache level. 
Add and documenting more probes for Oracle userspace function tracing: For example it would be useful to develop a set of perf probes for replicating the 10046 trace event.

Linux and Oracle version compatibility:

The definition of probes on Oracle executable needs uprobes, available from kernel version 3.5. When testing the examples discussed in post on RHEL/OL, you can use Kernel UEK4 or UEK3 or use RHEL 7.x.  Notably, examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works OK for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works OK on  RHEL7.0 (kernel 3.10.0-123) and UEK (kernel 3.8.x), but does not work for kernels that ship with RHEL 7.1,7,2 and anything higher (including UEK4). When testing the easiest is to use Oracle 11.2 or if you want to test Oracle 12.1 use UEK3 or RHEL 7.0 kernel. More investigations are needed on this topic.


Conclusions

Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They allow to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular Perf with uprobes can be used:
  • As an alternative to standard instrumentation for Oracle tracing with diagnostic events: see the examples of using perf probes for tracing wait event, sql parsing, physical and logical I/O.
  • As a tool for combining OS-related probes (tracepoints for I/O, network, scheduler or hardware events) and Oracle diagnostic information, for advanced troubleshooting.
  • As a data collection tool for extended profiling of Oracle workloads, which combines stack traces, scheduler information, wait events or other Oracle diagnostic data for flame graph visualization.

Credits and references

Kudos to Brendan Gregg for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland for his investigations of Oracle internals, for collaboration on Oracle internals investigations and tracing tools and for proof reading this post. Many thanks to Tanel Poder for his work on Oracle internals and performance.
Additional links on topics related to this post:

http://www.brendangregg.com/perf.html
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/
https://alexanderanokhin.wordpress.com/tools/digger/
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe.txt
http://linux.die.net/man/1/perf-probe
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt

Tuesday, December 29, 2015

Extended Stack Profiling - Ideas, Tools and Comments

Topic: This post provides a short summary and pointers to previous work on Extended Stack Profiling for troubleshooting and performance investigations.

Understanding the workload is an important part of troubleshooting activities. We seek answers to questions like: what is the system doing, where is the time spent, which code paths are most used, what are the wait events, etc. Sometimes the relevant diagnostic data is easy to find, other times we need to dig deeper. Stack profiling and flame graph visualizations are very useful techniques for advanced troubleshooting. In particular in the Linux environment stack traces gathered with perf provide a quick and powerful window into CPU-bound workloads, as detailed in the work of Brendan Gregg.

Extended Stack Profiling techniques stem from the experience of using on-CPU stack profiles and flame graphs. The basic idea is to pull together stack traces with OS- and application-specific metrics. This has the advantage of exposing the inner workings of the process under investigation, together with the context of its execution, such as its parent functions and the execution of kernel functions, when relevant. Application metrics that can also be added. Notably, when examining Oracle workloads, data from the wait event interface can be added as an additional dimension to the stack traces.

Examples of using extended stack tracing tools for troubleshooting and investigating Oracle workloads can be found in these blog articles: "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs" and in the presentation "Stack Traces & Flame Graphs for Oracle Troubleshooting".
Three tools and code examples complement the blog articles:
  • KStackSampler: a tool written in shell script which gathers kernel stack profiling together and process status. 
  • ORA_KStackProfiler: a simple kernel stack profiler written in C, extended with process status information and the option of sampling Oracle wait event information from SGA.
  • Ptrace_Profiler: an extension of ORA_KStackProfiler with userspace stack sampling implemented using ptrace and libunwind.

Pros: techniques based on extended stack profiling allow to investigate Oracle workloads beyond what is available with the wait event interface. In particular when combined with flame graph visualization they provide insights into the inner workings of complex workloads, which in turns can be used to complement or extend what is available with Oracle instrumentation. The techniques are general and can be used or to troubleshoot many and diverse Linux workloads.

Cons: Limitations of the code discussed here are many and the tools have to to be considered as experimental. One of the main issues is the error that is intrinsic when pasting together different data sources, such as stack traces and process status, that are collected sequentially by the tools. This is particularly evident with the ptrace-based implementation for userspace stack traces which requires to stop the process for a relatively long time (can be 100s of milliseconds). The fact that data collection is based on sampling introduces another potential source of error. The sampling frequency has to be adapted to the workload: not too high to limit the overhead, not too low to avoid losing the details of rapidly varying workloads. Some other 'gotchas' come from the use of flame graphs, in particular it is worth reminding that the horizontal axis of those graphs does not represent the time evolution.

Ideas for future work: the tools can be extended by developing the interfaces to sample more data sources. For Oracle this could be adding more fields from V$SESSION (X$KSUSE) or other V$/X$ structures. More generally, probes can be developed for sampling and investigating a larger range of applications, similarly and extending the the work done for Oracle. Extension for multi-threaded processes would also make the tools more generic. A user interface to simplify the selection of the data sources and integration with flame graph visualization would also be beneficial.
Another area for improvement is with userspace stack tracing. It is currently implemented in Ptrace_Profiler by stopping the process while unwinding the stack, a simple method with a high overhead. There are better methods: for example with techniques for asynchronous stack unwinding that allow to stop the process for a much shorter time. This would reduce the footprint of the measurement and also allow for higher sampling frequency.
Additional work is also needed to better understand the reliability of the measurements and the errors incurred when sampling and pasting together (on-the-fly) the various data sources, as discussed above.

Credits and additional references: Brendan Gregg is the inventor of flame graphs and has published excellent material on this and other related topics of interest for troubleshooting performance. Tanel Poder has covered the topic of stack profiling and many others of interest in his blog. Additional and related investigations of Oracle internals can be found in the blog of Frits Hoogland and in the blog of Stefan Koehler.
Links to previous work: "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs", "Stack Traces & Flame Graphs for Oracle Troubleshooting", "Oracle Optimizer Investigated with Flame Graphs", "Flame Graphs for Oracle". Tools referenced in this post are available on Github and at this web page.


Friday, December 11, 2015

Slides of the CERN Talks at UKOUG Tech15

This is a short blog post with the links to download the slides of the presentations with CERN speakers at UKOUG Tech15:


Many thanks to UKOUG for delivering once more a technical conference of very high quality and to all the participants who make the conference a great place to be, both for knowledge sharing and for meeting old and new friends in the Oracle arena.
Many thanks also for the all the good reviews of the UKOUG Tech14 presentation Modern Linux Tools for Oracle Troubleshooting (pdf), which have led to a speaker award.


Wednesday, November 18, 2015

Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs

Topic: this post is about investigating Oracle wait events using stack profiles and flame graphs extended with OS-process state and Oracle wait event details.

Context: The case of the DB Time > CPU Time + Wait Time

Oracle instrumentation provides wait event and CPU time accounting, a powerful and readily accessible data source for performance troubleshooting. An Oracle session at a given point in time is either on CPU, for example when processing data from cache, or it is in a waiting state, for example waiting for an I/O operation to complete.
A recent talk by John Beresniewicz (JB) at the Oak Table World 2015 goes into the details of an interesting case, where the CPU time and wait time measurements from Oracle's V$ tables (and AWR reports) appear to violate the equation "CPU time + Wait time = DB Time".
In particular JB discusses in details a case of I/O testing, originally raised by Kevin Closson, where this discrepancy is apparent and originates from the fact of using modern "fast" storage.
To further set the context for the following discussion, you can watch JB's presentation and the slides. In short the root cause of the issue is that a significant amount of CPU cycles spent serving I/Os by the OS can be double counted both into Oracle I/O wait events and CPU counters. In this post you will learn of a technique and related tools for extended stack tracing that can be used to investigate this and similar cases and provide direct evidence of what is happening "inside Oracle wait events".

A test system reproducing high CPU usage inside I/O wait events

This is about reproducing a case where AWR reports  %Wait + %CPU > 100% (of DB Time). Following the original case reported by JB, the test cases are generated using Kevin Closson's SLOB. One crucial point for the following is to have a system that serves random I/O with low latency. For this you don't necessarily need expensive HW, you can emulate (for testing) a fast I/O subsystem  in various ways, typically using RAM-based devices.
The test system used in the following is a Virtual box VM with 4 CPU/cores and 4GB of RAM. The test DB is installed with OL 6.7 and the test database is Oracle version 12.1.0.2. Storage has been configured as three block devices and handled with ASM on the database in the VM, but it is actually served by 3 files on the host machine. The buffer cache of the test database is set to a very low size,  64MB.
The important point is that the host where Virtualbox runs has enough RAM to be able to cache the files used as storage for the test DB. A host with 128 GB of RAM has been used and the active set for the test DB is 40 GB.

Running the test workload, I/O bound with low latency I/O and high number of IOPS

This first test consists in running SLOB for a few minutes to drive the (I/O bound) workload. It is a simple read-only workload (random reads of single Oracle blocks, 8 KB) and no updates. Only one SLOB schema is needed, this has also the advantage of keeping the overall server CPU utilization low. Some relevant slob.conf parameters are: SCALE=40000M, UPDATE_PCT=0, RUN_TIME=1000. The command to run the test, after SLOB schema setup, is simply ./runit.sh 1.
Additional checks and boundary conditions are:
  • The system is not CPU-bound. This should be the case if the only workload on the database is the one-schema SLOB run. However it is important to double check with OS commands, for example "vmstat" or "top". In a CPU-bound server interpreting CPU and wait time becomes more complicated, due to the additional time spent waiting in the run queue.
  • I/Os are served at low latency and high IOPS. If you build your test machine using a VM created with Virtualbox, as described earlier in this post, this can be achieved by using a host with enough RAM to cache the files acting as back-end for the VM storage. The caching on the host filesystem would normally happen during SLOB schema creation, if the host server has enough RAM and no concurrent activity is competing for the filesystem cache. However if needed, caching can be triggered by running a few full scans of the USER1.CF1 table.

The test reproduces the case "%Wait + %CPU > 100%"

The AWR report of the SLOB workload shows that main activity on the database is random reads, instrumented by Oracle with the wait events: "db file parallel read", "db file sequential read" and by CPU time. The AWR report also shows that the system did not have any other significant activity besides the SLOB workload. The notable point (see also Figure 1) is that the sum of the percentage of DB time spent on non-idle wait events and CPU adds up to a value significantly higher than 100%. This is an anomaly, as it seems to violate the concept that an Oracle process at a given point in time is either on CPU or waiting for a wait event.

Figure 1: Extract from the AWR report of an I/O bound workload (generated with SLOB) showing %Wait + %CPU > 100% (DB Time). This is an apparent anomaly of the instrumentation as DB time is expected to be constituted of DB CPU and non-idle wait time.

Another way to measure the workload is with Tanel Poder's snapper. Snapper has the advantage of drilling down to session level rather than aggregating at the system level. The output of a 10-second sample is reported in Figure 2. From the metrics reported there, you can see that the workload is mostly single-block reads at a rate of 10K IOPS and that the wait time for I/O wait and CPU adds up to a value larger than DB Time. Note that DB Time for this workload is equal to the elapsed time, as there is only 1 session active and it is active all the time serving DB calls. An extract of relevant snapper measurements from Figure 2 is:

TIME, DB CPU: 71.5%
WAIT, db file sequential read: 9.1%
WAIT, db file parallel read: 78.7%
STAT, physical read IO requests: 10.31k


Figure 2: Snapper output of the SLOB workload in the case of low-latency I/O. The sum of DB CPU and non-idle Wait Time is larger than DB Time, which is an anomaly of the instrumentation. ASH data, however, reports what looks like a correct accounting of the percentage of DB Time spent into DB CPU and Wait events.


Something is wrong: the sum of CPU time and wait time should not be larger than DB time. One first check that you can do is to investigate if the CPU time measurement reported by Oracle is accurate. It turns out that the problem is not here, you can verify this using OS commands, for example with "pidstat". Pidstat output (not reported here for brevity) is also interesting because it adds two additional pieces of information: it confirms that the process spends most of its time on CPU and that the CPU cycles are accounted as "sys" CPU time, i.e. time spent in kernel operations.
The wait events are timed by Oracle, at the beginning of the wait and measured again at the end. The Oracle functions kskthbwt (Kernel service Kompile thread begin wait) and kskthewt (Kernel service Kompile thread end wait) appear to take care of measuring the wait time and update the instrumentation with the relevant metrics. In particular the wait time measurement for "db file sequential read" and "db file parallel read" seems to correctly account for the time spent in OS calls from the tracing tests performed by the author and reported in the post "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap".

Moving into the active session sampling data of snapper reported at the bottom of Figure 2, you can find a more correct picture of where DB Time is spent: most of the DB time is spent in wait and a small amount on CPU. Most importantly the sum of DB CPU and Wait events adds up to 100%, as expected.
Following JB's analysis, the collected evidence points to the fact that many CPU cycles spent performing kernel-related activities for serving I/O are accounted into Oracle wait events. Therefore the sum of %CPU and %Wait is higher than 100% because of the double counting that happens in  these events, especially with the I/O interface is working at a high rate of IOPS. In the following you will see how profiling of OS-process state together with wait events and kernel stacks provides direct observation of this hypothesis.

Direct investigation with extended stack traces and flame graphs

The idea is to sample the Oracle process running the test workload (random I/O driven by SLOB) using kernel stack profiles extended with additional sampling of the Oracle OS-process state and the Oracle wait event interface. This is an extension of the work discussed in the blog article "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations". A custom tool ORA_KStackProfiler is introduced to perform the following actions:
  • sampling of /proc/<pid>/stat to measure the process state (Running, Stopped, Disk, etc),
  • sampling /proc/<pid>/stack to sample the kernel stack trace if any
  • sampling the status of the Oracle session (on CPU of waiting for a wait event). The sampling is done directly from SGA by reading the memory locations corresponding to X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM.
Note that ORA_KStackProfiler is an experimental tool with several limitations, including the fact that sampling is not done atomically (the stack profile, process state and wait event are gathered with subsequent calls), which can include errors. See also the README file.
These are the steps used for testing and the resulting flame graph can be found in Figure 3:

1. Run "make" to build the executable: ora_kstackprofiler

2. Run the scripts in the oracle_scripts directory 
     - run eventsname.sql using SQL*plus to generate the file eventsname.sed translating event# into event names 
     - run find_pid_ksustim_ksuseopc.sql using SQL*plus to find the pid and details of X$KSUSE of the Oracle session to trace
Example: 

SQL> @find_pid_ksustim_ksuseopc.sql
This script finds the OS pid and address of X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM for a given Oracle session. To be used together with ora_kstackprofiler. Run as user SYS.

Enter Oracle SID to be investigated: 111
OS pid = 1234, ksuseopc = 5796061762, ksusetim = 5796061792

3. Stack traces sampling extended with Oracle wait event data as in the following example:

./ora_kstackprofiler --pid 1234 --delay 100 --count 1000 --ksuseopc 5796061762 --ksusetim 5796061792 > ora_stack_slob_fastio.txt

4. Process the output with additional filtering and with flame graphs for visualization using https://github.com/brendangregg/FlameGraph

cat ora_stack_slob_fastio.txt| grep -v 0xffffffffffffffff | sed -f oracle_scripts/eventsname.sed | ../FlameGraph/stackcollapse-stap.pl| ../FlameGraph/flamegraph.pl --title "Kernel stack, OS State and Wait event profiling of Oracle random I/O (SLOB workload, low latency I/O)" > ora_stack_slob_fastio.svg


Figure 3: Flame graph visualization of Kernel stack traces extended with process state sampling and wait event details from SGA show a large amount of CPU cycles consumed inside the wait events "db file parallel read" and "db file sequential read" for an I/O bound workload (low-latency random reads generated with SLOB). Note that the wait event details from SGA is reported at the bottom of the graph. The line above that is the process state from /proc/<pid>/stat. On top, the kernel stack trace from /proc/<pid>/stack. Click here for a svg version of the graph.


Figure 3 shows a flame graph of the Kernel stack traces extended with process state sampling and wait event details from SGA. The wait event data is sampled from SGA (and in particular from the memory locations for X$KSUSE.KSUSEOPC and X$KSUSE.KSUSETIM) and displayed at the bottom line of the graph. Just on top of that, the OS process state from /proc/<pid>/stat, finally the kernel stack data from /proc/<pid>/stack.
The flame graph of Figure 3 provides direct evidence of the fact that a significant amount of the time accounted by Oracle as wait events for I/O, is actually spent consuming CPU cycles for serving I/O.
This is consistent with the fact that I/O in this test is served at high rate. The percentage of samples marked as "On CPU (Oracle state)" is about 10%, which is consistent with what measured with ASH (see snapper output in Figure 2).
There are some important limitations in the method of extended stack sampling: sampling of the stack profile, process state and Oracle wait event details is not atomic, so we can expect mismatches. Moreover in Figure 3 you can see that the percentage of samples in OS state "running" is higher that the amount of CPU time measured with pidstat, in this example by about 20%. Note that the state "running" reported from /proc/pid/status, means that the process is either actually running on CPU or that it is runnable state (in the run queue). When reading the flame graph you should also remember that it is based on sampling, therefore it shows the stack traces that are mostly active. For example in Figure 3 you can see io_submit as one of the most active functions/code paths for asynchronous I/O (in the context of the wait event "db file parallel read"). The corresponding call io_getevents does not appear in the graph as the low latency and high rate of  I/O makes this operation very fast. Tracing with strace or SystemTap can be used to further drill down on this observation with measurements of the I/O calls and their timing.

For additional understanding of what the system is doing it is useful to measure also the userspace stack. For this an extended tool is introduced: Ptrace_Profiler.
The output of Ptrace_Profiler is reported in Figure 4, with some annotations of relevant functions in the userspace and their meaning. You can see that most of the time is spent in operations related to asynchronous I/O, both on the kernel level and on the userspace libaio function. This is consistent with the workload executing random reads under the wait event "db file parallel read" at high speed.


Figure 4: Flame graph visualization of kernel and userspace stack traces for Oracle I/O workload (SLOB) extended with OS process state and wait event details. Data is collected with a custom tool: Ptrace_Profiler. Some of the relevant functions are annotated with their role. Most of the time in the workload is spent executing I/O and in particular io_submit in libaio or in the kernel. Click here for a svg version of the graph.


I/O latency

The nature of the workload and in particular the high number of IOPS and low latency of the storage play an important role in amplifying the amount of CPU consumed in Oracle wait events per unit of time (more work requires more CPU cycles). A direct confirmation that the storage is serving I/O at low latency can be found with OS-based tools reading from block I/O tracepoints. The following tools can be used based on SystemTap and Ftrace: blockio_rq_issue_filter_latencyhistogram.stp and iolatency_micro. Some sample measurements are reported here below. Note that a large number of I/O requests are served in the bucket 64-to-128 microseconds.

# ./iolatency_micro 10
Tracing block I/O. Output every DeltaT = 10 seconds. Ctrl-C to end.
 >=(mus) .. <(mus)  : IOPS       IO_latency/DeltaT |IOPS Distribution                     |
      16 -> 32      : 0          0                 |                                      |
      32 -> 64      : 623        29947             |####                                  |
      64 -> 128     : 6302       605049            |######################################|
     128 -> 256     : 3482       668659            |#####################                 |
     256 -> 512     : 664        255206            |#####                                 |
     512 -> 1024    : 47         36249             |#                                     |
    1024 -> 2048    : 3          5529              |#                                     |
    2048 -> 4096    : 4          13209             |#                                     |



No double counting of CPU usage into wait events with slow I/O

You may not have found a situation like the one described above in your production DBs. Here we want to investigate what made the test "special". The key element of the example above is the low latency and high IOPS of the test workload. This causes the system to be very busy serving I/O and this amplifies the effect of double counting CPU in wait events. In this paragraph an additional test will hopefully make the matter more clear. The idea is to run the same SLOB workload against a storage system serving I/Os with much higher latency. In the test system used for this blog post, based on a VM, the I/O can be slowed down simply by flushing the filesystem cache on the host system (that is the system running Virtualbox):

# free && sync && echo 3 > /proc/sys/vm/drop_caches && free

The next step is to start the SLOB workload using the same test parameters as in the previous test, with the main difference that the I/O workload now runs against a much slower storage system: I/Os are not cached in the Virtualbox host anymore and are served by slow SATA disks. Using OS instrumentation and the AWR report one can quickly see that the system is I/O bound, serving only 300 IOPS (it was 10K IOPS in the previous test with low-latency I/O). Note also from Figure 5 that in this test %DB CPU + %Wait = 102% (of DB Time). This number is quite close to the ideal value of 100% and could easily be interpreted as measurement noise, if you did not have the above analysis to explain the effect double counting of CPU in I/O wait events. Here below an example of the latency histogram taken during this test, note the predominance of very high I/O latency.

# ./iolatency_micro 10
Tracing block I/O. Output every DeltaT = 10 seconds. Ctrl-C to end.
 >=(mus) .. <(mus)  : IOPS       IO_latency/DeltaT |IOPS Distribution                     |
      64 -> 128     : 0          48                |#                                     |
     128 -> 256     : 2          480               |#                                     |
     256 -> 512     : 5          2150              |##                                    |
     512 -> 1024    : 3          2918              |##                                    |
    1024 -> 2048    : 3          5222              |#                                     |
    2048 -> 4096    : 1          3686              |#                                     |
    4096 -> 8192    : 13         80486             |####                                  |
    8192 -> 16384   : 27         333004            |########                              |
   16384 -> 32768   : 44         1103462           |#############                         |
   32768 -> 65536   : 75         3706060           |######################                |
   65536 -> 131072  : 136        13389004          |######################################|
  131072 -> 262144  : 38         7569408           |###########                           |
  262144 -> 524288  : 0          157286            |#                                     |



Figure 5: Extract of the AWR report showing the top wait time for a test using SLOB workload against "slow I/O". Note that DB CPU and no-idle Wait sums up to 102% of DB Time. Which is close to the ideal case of a sum of 100%.

Using ORA_KStackProfiler you can get a more detailed view of what the system is doing while running the SLOB workload: most of the time is spent inside an I/O wait event, the most important being "db file parallel read". However contrary to the case of Figure 3, here the process is mostly in Sleep or Disk Sleep state, waiting for the io_getevents call reap I/Os. The flame graphs of Figure 6 (slow I/O) and Figure 3 (fast I/O) appear to be quite different but they just reflect two aspects of asynchronous I/O underlying the "db file parallel read" wait event. In the case of low latency I/O (Figure 3) most of the I/O wait time is spent submitting I/O while getting the results is very fast, in the opposite example of slow I/O (Figure 6), most of the execution time is spent waiting to reap I/Os (io_getevents). The code path is the same, stack profiling and flame graphs highlight in each case the code path where most of the time is spent.


Figure 6: Flame graph of the kernel stack profile extended with OS process state and wait event details for an I/O-bound Oracle workload generated with SLOB. The I/O system is serving I/Os at about 300 IOPS. Most of the time is spent waiting for I/O. During the I/O wait time the Oracle process is mostly in Sleep or Disk sleep state, although also on running/runnable state for a fraction of the time. Click here for a svg version of the graph.


Conclusions

Extended stack profiling and flame graphs combine visualization of kernel stack trace profiles, userspace stack, OS process state and Oracle wait event data. This technique can be used for advanced troubleshooting and root cause analysis to find the code paths that are most frequently executed and correlate this information with the application context. For Oracle troubleshooting this leads to additional data beyond what is available with the wait interface, to be used in special cases where low-level investigations are needed. An example is provided of using extended stack profiling for drilling down on a case of high utilization of CPU inside I/O wait events where the one observes the anomaly DB CPU Time + Wait Time >> DB time. Extended stack profiles and flame graphs provide direct evidence of what happens in the system and why CPU is consumed inside Oracle wait events.

Download the tools and previous work

The tool discussed in this post can be downloaded from this webpage or from Github.
Relevant previous work on stack profiling, flame graphs and investigations of Oracle workload and wait events can be found at: "Oracle Optimizer Investigated with Flame Graphs", "Flame Graphs for Oracle", "Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations", "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap".

Acknowledgmeents and references

Brendan Gregg is the inventor of flame graphs and has published very interesting material on the topic. John Beresniewicz (JB)'s talk at the Oak Table World 2015 has discussed the original test case for this blog post. Kevin Closson is the author of  SLOB. Tanel Poder has covered the topic of profiling /proc/<pid>/stack in his blog. Frits Hoogland has many details on Oracle internals and I/O investigations in his blog.


Monday, October 26, 2015

Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations

Topic: This blog post is about kernel stack profiling and visualization with flame graphs.

Context: Stack profiling and flame graphs are very useful tools and techniques for troubleshooting and investigating workloads at the OS-level and understand which code path take most of the execution time. You can find extensive material and examples of flame graphs in Brendan Gregg's website and blog. A few additional examples of using stack tracing and flame graphs for investigating Oracle are: "Flame Graphs for Oracle" and "Oracle Optimizer Investigated with Flame Graphs".
Flame graphs are often used to visualize and analyze stack traces for CPU-bound processes. Perf has proven to be a very good tool to capture such traces in Linux. With perf you can collect stack backtraces of the running processes with little overhead. However this does not cover the cases when the process under investigation is off-CPU (for example sleeping and/or waiting for an I/O operation to finish). This topic has also been addressed by Brendan Gregg in his blog, with off-CPU flame graphs. However data collection for the case of off-CPU stack traces seems to be still an area of active investigation, especially for Linux. This post addresses the case of collecting kernel stack data for flame graph visualization in the case of processes that spend a large fraction of their time off-CPU and/or in system calls.

What's new: In this blog you can find a description of  a basic technique for sampling kernel stacks and a script to automate the process. The script gathers kernel stack traces and process status by sampling the /proc filesystem. This is intended for investigations of processes in Linux that  spend a significant part of their elapsed time off CPU (for example in status S, sleep, or D, disk sleep). 
The main engine of the script is a simple loop reading /proc/pid/stack. I have first heard of this idea in Tanel Poder's blog. A similar concept is also used in the poor man's profiler. Just to illustrate how simple the method is, here is the main data collection part of kstacksampler.sh:

for x in $(seq 1 $iterations); do
   cat /proc/$pid/stack    # get kernel stack trace
done

Download kstacksampler.sh: from this webpage or from Github.

Notes and limitations: The script kstacksampler.sh is quite basic, to be taken as a proof of concept and study material. The overhead and potential side effects of using it are expected to be minimal for most of the systems, so most likely it is safe to use it in production. The use of shell for profiling does not allow for high frequency sampling, typically you should start experimenting with 10-20 Hz. Processes that are busy with CPU usage should rather be traced using perf and on-CPU flame graph techniques. The script currently only traces one process at a time (although it could be adapted to trace multiple pids). The script does not trace threads (although it can be modified to trace tids from /proc/pid/task/tid/stack). The script does not provide userspace traces. The script samples both the process kernel stack and status (running, sleeping, etc). A simple editing of the script can disable this feature if not needed. Note that in Linux the process status "running" refers to both running and runnable (in run queue) processes.


Example 1: Investigation of a workload showing a network bottleneck. 

Preliminary info:
The case is of a database session  retrieving data over a slow network. A simple statement for reading a database table "SELECT * FROM LARGETABLE;" is used to generate the workload. When examining the database process (on the DB server side) it shows little usage of CPU. This can be investigated for example using pidstat (RHEL/EL 6.7 was used for this test).

[root@MYDB]# pidstat -p 9124 1 10
Linux 2.6.32-573.7.1.el6.x86_64 (MYDB)          10/22/2015      _x86_64_        (4 CPU)

10:10:28 PM       PID    %usr %system  %guest    %CPU   CPU  Command
10:10:29 PM      9124    0.00    4.00    0.00    4.00     0  oracle_9124_orc
10:10:30 PM      9124    4.85    4.85    0.00    9.71     2  oracle_9124_orc
10:10:31 PM      9124    1.00    2.00    0.00    3.00     2  oracle_9124_orc

You can also use Oracle instrumentation to investigate the process and find that it is mostly idle. In particular V$ views such as V$SESSION, V$SESSION_LONGOPS, V$ACTIVE_SESSION_HISTORY are some of the views that can be used, or rather their graphical counterpart using OEM.

Stack Tracing:
You can now try the script kstacksampler.sh and build a flame graph with its output to further the investigation. In the following the task is divided in two steps. The first step  is data collection with kstacksampler.sh, with some additional data filtering with grep and sed. The second step integrates with the FlameGraph toolset. In particular stackcollapse-stap.pl folds the stack and flamegraph.pl builds the flame graph (see Figure 1).

Step1: stack sampling
$ ./kstacksampler.sh -p 9124 -n 100 -i .05 | grep -v 0xffffffffffffffff | sed 's/State:\t//g'| sed 's/\[<.*>] //g' >stack_sample_example1.txt

Step2: flame graph
$ cat stack_sample_example1.txt | ../FlameGraph/stackcollapse-stap.pl | ../FlameGraph/flamegraph.pl --title "Kernel stack profiling - Example 1, network-related bottleneck"> kstacksampler_Example1_network_bottleneck_Fig1.svg

The resulting flame graph is:

Figure 1: Flame graph of a process that is network-bound and spending the bulk of its time on TCP stack waiting for messages to arrive, as detailed in Example 1 in the text. Click here for a svg version of the graph.

From the flame graph of Figure 1 you can see that the process is sleeping for a large fraction of its time and that the large majority of the traced operations are related to network/TCP communication. This confirms what found earlier with pidstat and Oracle instrumentation. Kernel stack sampling has helped to find/confirm the root cause of the performance issue.

Additional investigations:
Without going into too many details, which is not the scope of this example, you can find interesting to take one additional step and use the results of the analysis to apply and measure some changes to make the process go faster. The general idea is to use larger "data transfers" and reduce the number of round trips, therefore limiting the impact of the network latency. A simple change in the context of this Oracle-based example is to increase the number of rows fetched at each step by changing the arraysize parameter in SQL*plus and set it to the maximum allowed value: "SET ARRAYSIZE 5000" (arraysize was set to 100 in the previous case and in the tracing data of  Figure 1). Other and similar options are also available at the TCP/kernel level, for those interested in further tuning. The outcome of the change of arraysize is a measurable reduction of the percentage of the elapsed time that is time spent waiting on TCP operations. More importantly this drives the throughput up (from 6 MB/s in the case of Figure 1 to about 20 MB/sec). Also pidstat shows that the CPU utilization increases (from 6% to about 28%). Similar observations can be done using Oracle instrumentation with V$ views. Finally the flame graph for the tuned process can be seen in Figure 1b. Notably the percentage of time spent with process status=running has considerably increased (more useful work is being done per unit of time).

Figure 1b: Flame graph of the process described in Example 1 after changing the parameter arraysize from 100 to 5000. An increase in CPU usage can be seen in this flame graph compared to what is represented in Figure 1 (taken with arraysize=100). This change correlates with higher network throughput and faster query execution. Click here for a svg version of the graph.

Additional test case details:
I have created the case of Example 1 using Oracle running in a VM and connecting to it via a sshd tunnel. The tunnel was set up using putty: putty -P <forwarded_port> -L 1111:DB_host:1521 oracle@VM_host). 1111 is the locally forwarded port, 1521 is the Oracle listener port on DB_host, <forwarded port> is a port on the VM_host forwarded into port 22 (sshd) of the DB_host. SQL*plus on a client machine was used to connect to the DB and run the query: "SELECT * FROM LARGETABLE;", where LARGETABLE is any table big enough so that the query runs in a few minutes. Another tip is to "set autotrace traceonly" before running the query to suppress the output to the screen.


Example 2: Investigation of an Oracle process doing random I/O. 

This example is about investigating I/O-bound processes. In particular Figure 2 shows the output of kernel stack profiling and flame graph for an Oracle process executing Kevin Closson's SLOB workload. The same two steps as detailed above in Example 1 have been used to collect data and produce the flame graph. The workload is mainly random read from block devices. The DB server used for this example runs in a VM and Oracle is configured to use ASM. The stack traces show that the process time is mostly spent either doing I/O or waiting for I/O calls to complete. Two different type of I/O system calls are visible in Figure 2: calls for blocking I/O (pread64) and calls for asynchronous I/O (io_submit and io_getevents). At the Oracle level this is reported using the wait events "db file sequential read" (blocking I/O) and "db file parallel read" (asynchronous I/O). You can read more on the topic of tracing Oracle I/O at "Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap"

Figure 2: Flame graph visualization of the stack profile for an I/O bound Oracle process running SLOB. The process is spending the majority of the time executing I/O or waiting for I/O calls to return. Click here for a svg version of the graph.


Conclusions

Stack profiling and flame graphs are useful techniques and tools for troubleshooting performance investigations at the OS-level. Flame graphs are often used for profiling CPU-bound processes, for example in Linux with the help of perf. This post describes how stack profiling of kernel traces from the /proc filesystem can be used to investigate workloads that spend a significant amount of their time off-CPU and/or in kernel code (for example sleeping for I/O or network system calls to complete). A simple tool for stack sampling, kstacksampler.sh, complements this blog post together with two examples of its use for investigating network and I/O-bound Oracle processes.

References and acknowledgements

Brendan Gregg's flame graphs are the starting ground for this work. Tanel Poder has also published several original investigations of stack profiling for Oracle troubleshooting, notably this included the blog post "Peeking into Linux kernel-land using /proc filesystem for quick’n’dirty troubleshooting". Kevin Closson has also published work on stack profiling with perf and OProfile and he is the author of SLOB, used in Example 2 of this post.