Monday, June 24, 2013

DTrace Explorations of Oracle Wait Events on Linux and Solaris

DTrace is a great tool to measure and investigate latency for performance troubleshooting. DTrace is now coming to the Linux platform too and I would like to share a few tests I did with it. The following is meant to be a short technology exploration but hopefully it can also highlight what I believe is an area of great potential for performance investigations and troubleshooting of Oracle workloads, namely the integration of DTrace scripts and Oracle's wait events interface.

1 - Warmup: DTrace can probe OS calls coming from Oracle workload. The following is a standard DTrace script to measure the latency of I/O (synchronous) calls. The code here is modified for Linux x86-64 and it works by tracing pread64 system calls. This makes sense when using ASM (without asmlib) and performing single block reads (e.g. waiting for db file sequential read events in Oracle). In Solaris we would trace  pread calls  instead. The use of the DTrace function quantize allows to aggregate data in histograms, similar to Oracle's v$event_histogram. Finally the values are printed every 10 seconds (and counters reset to zero). Simple, elegant, effective!

dtrace -n '
syscall::pread64:entry { self->s = timestamp; } 
syscall::pread64:return /self->s/ { @pread["ns"] = quantize(timestamp -self->s); self->s = 0; }
tick-10s {
printa(@pread);
trunc(@pread);
}'


2 - DTrace meets Oracle. DTrace can also trace user processes, using with the pid Provider. It can trace function calls, see which arguments are passed to them and what is returned. There is an enormous potential to use this to investigate performance issues with Oracle. Let's see here how we can "tap" into Oracle's wait event instrumentation to link DTrace to wait events.
With a little experimentation with the tools described below in the article we find that in 11.2.0.3 Oracle calls kews_update_wait_time with the elapsed time in micro seconds as arg1. Moreover when kskthewt is called, arg1 is the wait event number (to be resolved queriying v$event_name). We can write the following as a test to print all wait events for a given Oracle process (OS process id is 1737 in this example):

# this is for Oracle 11.2.0.3, Linux and Solaris: trace os process pid=1737 (edit os pid as relevant)
dtrace -n '
pid1737:oracle:kews_update_wait_time:entry {
     self->ela_time = arg1;
}
pid1737:oracle:kskthewt:entry {
     printf("event#= %d, time(mu sec) = %d", arg1, self->ela_time);
}'

In the following example we use the same technique to produce event histograms for 2 Oracle processes (2349 and 2350 in the example below). The measured time is in micro seconds, therefore improving accuracy compared to Oracle's v$event_histogram data:

# this is for Oracle 11.2.0.3, Linux and Solaris (edit os pids as relevant)
dtrace  -n '
pid2349:oracle:kews_update_wait_time:entry,pid2350:oracle:kews_update_wait_time:entry {
     self->ela_time = arg1;
}
pid2349:oracle:kskthewt:entry,pid2350:oracle:kskthewt:entry {
     @event_num[arg1] = quantize(self->ela_time);
}
tick-10s {
printa(@event_num);
trunc(@event_num);
}'

Sample output:

 1 294099                        :tick-10s        
             146 (edited: db file sequential read)
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@                                        14
             512 |@@@@@                                    120
            1024 |@@@@@@@@                                 182
            2048 |@@@@@@@@@@@@@@                           320
            4096 |@@@@@@@@@@@@@                            302
            8192 |                                         5
           16384 |                                         4
           32768 |                                         0


3 - DTrace can query V$ views. DTrace ability to snoop into user process, and therefore Oracle's engine, extends also to internal data structures. In the following example we use DTrace to read v$session data from the memory structure underlying the X$KSUSE fixed table.
The address for the relevant 'row' of X$KSUSE is taken from arg1 of kslwt_end_snapshot. After that we go and read memory directly triggered by the execution of kskthewt (other choices are possible). Memory offset values for the X$KSUSE fields of interest are port-specific and can be found with a debugger. Here below an example for Linux and Solaris.

# this is for Oracle 11.2.0.3, Linux, tested on OEL 6.4 x86-64 (edit os pid as relevant)
dtrace  -n '
pid1737:oracle:kslwt_end_snapshot:return {
     self->ksuse =arg1;
}
pid1737:oracle:kskthewt:entry {
     KSUSESEQ = *(uint16_t *) copyin(self->ksuse+5536,2);
     KSUSEOPC = *(uint16_t *) copyin(self->ksuse+5538,2);
     KSUSEP1  = *(uint64_t *) copyin(self->ksuse+5544,8);
     KSUSEP2  = *(uint64_t *) copyin(self->ksuse+5552,8);
     KSUSEP3  = *(uint64_t *) copyin(self->ksuse+5560,8);
     KSUSTIM  = *(uint32_t *) copyin(self->ksuse+5568,4);
     SQL_HASH = *(uint32_t *) copyin(self->ksuse+5796,4);
     printf("seq=%u, event#=%u, p1=%u, p2=%u, p3=%u, sql_hash=%u, time=%u", KSUSESEQ, KSUSEOPC, KSUSEP1, KSUSEP2, KSUSEP3, SQL_HASH, KSUSTIM);
}'

Sample output:

pid1737:oracle:kslwt_end_snapshot:return ' matched 2 probes
FUNCTION:NAME
kskthewt:entry seq=6634, event#=146, p1=4, p2=1345795, p3=1, sql_hash=334560939, time=10290
kskthewt:entry seq=6635, event#=445, p1=0, p2=215, p3=245, sql_hash=334560939, time=12
kskthewt:entry seq=6636, event#=197, p1=4, p2=1345796, p3=124, sql_hash=334560939, time=8499
kskthewt:entry seq=6637, event#=197, p1=4, p2=606850, p3=126, sql_hash=334560939, time=9898
kskthewt:entry seq=6640, event#=352, p1=1650815232, p2=1, p3=0, sql_hash=334560939, time=481
kskthewt:entry seq=6641, event#=348, p1=1650815232, p2=1, p3=0, sql_hash=1029988163, time=12

From the sample output here above we can see that DTrace is able to read X$KSUSE (or at least several fields there). This opens the possibility to collect data with extended filters, performs aggregations, etc. For example we can can find this techniques to filter all db file sequential read events for a given file number, or for a given I/O wait time threshold and/or for a given sql_hash value. The possibilities are many and worth further explorations.
Note for db file sequential read: p1 = file number, p2 = block number, p3 = number of blocks (see also v$event_name).

Same script, but this time for Solaris:

# this is for Oracle 11.2.0.3, tested on Solaris 11 x86-64
dtrace -p 1737 -n '
pid$target:oracle:kslwt_end_snapshot:return {
     self->ksuse =arg1;
}
pid$target:oracle:kskthewt:entry {
     KSUSESEQ = *(uint16_t *) copyin(self->ksuse+5592,2);
     KSUSEOPC = *(uint16_t *) copyin(self->ksuse+5594,2);
     KSUSEP1  = *(uint64_t *) copyin(self->ksuse+5600,8);
     KSUSEP2  = *(uint64_t *) copyin(self->ksuse+5608,8);
     KSUSEP3  = *(uint64_t *) copyin(self->ksuse+5616,8);
     KSUSEPOBJ  = *(uint64_t *) copyin(self->ksuse+6232,8);     
     KSUSTIM  = *(uint32_t *) copyin(self->ksuse+5624,4);
     printf("seq=%u, event#=%u, p1=%u, p2=%u, p3=%u, obj#=%u, time=%u", KSUSESEQ, KSUSEOPC, KSUSEP1, KSUSEP2, KSUSEP3, KSUSEPOBJ, KSUSTIM);
}'

Tools, resources and credits: A great tool for investigating Oracle calls with DTrace is digger by Alexander Anokhin and Brendan Gregg. It works under Solaris (I could not make it work under Linux port yet) and allows to trace all function calls and their arguments. Digger makes quite easy to see for example where Oracle instruments wait events by tracing: ./digger.sh -F -p <os pid> -f '*wt*'
Another great resource to get ideas on how to trace Oracle function is Frits Hoogland's blog and the article Profiling of Oracle function calls
Tanel Poder has made a great video on DTracing Oracle query execution engine, available on Enkitec TV.
Brendan Gregg is a reference when talking about DTrace and systems performance tuning, see his website and blog. See also Oracle's DTrace User Guide.
Paul Fox is the author of a port of DTrace for Linux (the other being Oracle's of course):  gihub repository.

DTrace lab environment for Oracle: I have used a test environment built with VirtualBox and installed with Oracle Linux 6.4. As DTrace port for these tests I have used dtrace4linux.
Note: in my Linux-based DTrace test environment I often get the error message "dtrace: processing aborted: Abort due to systemic unresponsiveness". This can be worked around by adding the -w flag when running DTrace scripts, that is by allowing destructive actions.
It's also beneficial to have a Solaris test environment, where DTrace is more mature. In my tests this was installed under VirtualBox too.

Conclusions
DTrace is a powerful tool to explore systems performance. In addition DTrace can explore and snoop inside the workings of user-space programs. This opens the door to many uses for Oracle database investigations, troubleshooting and performance tuning. This short article just scratches the surface of what is possible when combining DTrace with Oracle's wait events interface. The author believes this is an area that deserves further exploration. The Linux port used for these tests is maintained by Paul Fox (Oracle's port does not have the PID provider at the time of this writing). However it is worth noting that this port is not ready for production usage.