Tuesday, September 10, 2013

Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources

Topic: this is a getting-started article for PyLatencyMap, a performance tool for real-time investigation and troubleshooting of latency data. Contains examples relevant for the Oracle RDBMS and for DTrace.

Get the software

Download the latest version of PyLatencyMap (also available on GitHub). At the time of this writing PyLatencyMap has been tested on Linux and Solaris, with Oracle 11.2 and 12.1, with python 2.4 and 2.6. If you don't have python available similar functionality is available in a tool implemented in SQL*plus (OraLatencyMap).
If you have access to the DB server a quick way to get going with PyLatencyMap is to run it directly from there, using local authentication to the DB ('/ as sysdba'), but this is by no means a strict requirement.

Getting Started with PyLatencyMap for Oracle

The easiest way to get started with PyLatencyMap for Oracle wait events is to run Example1_oracle_random_read.sh this is a wrapper script for:
sqlplus -S / as sysdba @event_histograms_oracle/ora_latency.sql "db file sequential read" 3 | python LatencyMap.py

There are 2 components: the first is a SQL*plus script collects latency data with a sampling interval of 3 seconds and prints to standard output in a custom format. The second part is the visualization engine itself written in python. The data source is gv$event_histogram filtered for the "db file sequential read" wait event, which is the main event associated with single-block reads in Oracle (there are many more interesting details here, but outside the scope of this article).

Investigating commit latency

If you are using PyLatencyMap to investigate an OTLP-like workload chances are you want to check also the latency details of "log file sync" (again many details here are outside the scope of this article, see also Kevin Closson's blog). The wrapper script is Example2_oracle_commit_time.sh which is a straightforward modification of the script for Example 1 discussed above.
It is a straightforward modification extend this example to produce latency maps for arbitrary events.

Making sense of the tool's output: frequency-intensity latency heat maps

On the top half of the screen, the "blue graph" shows the number of operations/wait events for each latency bucket. We can use this to answer questions such as: how many operations are served from the storage (SSD) cache? How many operations are  served by the "spindles"? Is the latency of the IO served by spindles reflecting a healthy IO subsystem or shows signs of saturation?
The "red graph" is a measure of the wait time and highlights where large fractions of the wait time are spent. This can help to spot latency outliers (rare event but with large latency).
More discussions on latency measurement and visualization can be found in Brendan Gregg's blog.

Demo (video)




Getting Started with PyLatencyMap for DTrace

Example5_DTrace_pread.sh (see contents here below) is intended to show how to use PyLatencyMap to display heat maps for DTrace latency data and in particular for the pread (pread64) system call.
It is a straightforward modification to extend this example to produce latency maps for generic DTrace aggregation variables (quantize function). See also this blog entry.
Example5 is a wrapper script and contains 3 parts. The first is the data source: a DTrace script used to measure latency of pread system calls (similarly to the examples above this is for troubleshooting single block IO). The second part is a 'connector' used to convert the data source in a format readable by PyLatencyMap, which is the third and final step.

dtrace -s DTrace/pread_tracedata.d |python DTrace/dtrace_connector.py |python LatencyMap.py

Demo (video)




Some advanced commands

We can capture a data source for later replay: we just need to add the Unix tee command to redirect a copy of the data source to a file. We can later replay the file. Example:

sqlplus -S / as sysdba @event_histograms_oracle/ora_latency.sql "db file sequential read" 3 | tee my_recorded_data_source | python LatencyMap.py

Later we can replay with: cat my_recorded_data_source  | python LatencyMap.py

See python LatencyMap.py -h for the available options. For example we can replay the recorded data with a fixed minimum and maximum values and a reduced number of records in the display (80 instead of the default values of 90):

cat my_recorded_data_source | python LatencyMap.py --frequency_maxval=1000 --intensity_maxva=10000 --num_records=80

Is it safe to run PyLatencyMap in production?

This is the DBA/SysAdmin's call of course. On the test and production environments where I have used PyLatencyMap I have noticed it being lightweight and overall not interfering with DB activity. For review, please note that the query used to extract event histogram data from Oracle is in ora_latency.sql. The main python engine for visualization is LatencyMap.py.

A tip for test environments is to generate lots of random read operations in a controlled way using Kevin Closson's SLOB2. See also this link.

Conclusions

I hope you will be successful in investigating and troubleshooting latency issues with PyLatencyMap (or with its little brother OraLatencyMap). Feel free to let me know what you think of it, for example if you find bugs and/or have ideas for improvements.