Topic: This post is about a technique for Oracle performance tuning, the use of heat maps to investigate wait event latency (and in particular I/O-related latency). This post also discusses a SQL*plus-based script/tool I have developed to help with this type of monitoring and performance drill-down (OraLatencyMap).
Context: Oracle exposes latency data for the wait event interface in  V$EVENT_HISTOGRAM. This gives an additional dimension to drill down performance data for analysis and troubleshooting. In an older blog post I described an example of troubleshooting a storage issue for an OLTP (RAC) database by investigating the histogram data of the 'db file sequential read' wait event. In that context I had also developed and discussed ehm.sql, a simple PL/SQL script to collect and display data from GV$EVENT_HISTOGRAM.
What's new: An excellent article by Brendan Gregg, "Visualizing System Latency", Communications of the ACM, July 2010 has inspired me to develop an enhanced version of ehm.sql. The idea is to display in real time data of current and historical values of selected wait event histograms. In particular I/O-related events such as db file sequential read and log file sync make excellent candidates for this type of analysis. Moreover those events are relevant in a context that is familiar to me, that is drilling down issues with OLTP performance and access to storage.
As Brendan shows in his article, I/O latency data fits naturally to heat map representation, where time is on the horizontal axis, latency buckets are on the vertical axis and the quantity to display (for example number of waits or time waited) is displayed as color (hence the name heat map).
The tool: OraLatencyMap is a tool I have developed to help extract and represent event histogram data in a heat map. It is intended to be lightweight and 'DBA-friendly'. It's implemented in PL/SQL, it does not require the creation of any DB objects and runs under SQL*plus. OraLatencyMap requires a terminal supporting ANSI escape codes (for example PuTTY, MobaXterm, xterm, notably it does not run under windows' cmd.exe). Making SQL*plus behave like a monitoring tool, requires jumping through hoops. Credits to the experts who have shared their results in this area and thus made my job much easier here. In particular: Tanel Poder (for moats, sqlplus and color, snapper, etc..), Adrian Billington (moats) and Marcin Przepiorowski (topaas).
Example 1: study of db file sequential read
See here below a screen shot of a putty terminal where I ran SQL*plus and @OraLatencyMap. The script samples GV$EVENT_HISTOGRAM roughly every 3 seconds and displays 2 heat maps. The top heat map gives information on the number of waits per second on each latency bucket. The bottom heat map instead represents the estimated wait time per latency bucket. The two graphs represent the same type of information but with 2 different 'points of view'.
SQL> @OraLatencyMap
This type of data is useful when investigating single block read latency issues in OLTP systems for example. I'll leave for another time a discussion of the details and limitations of this approach.  I'll just point out that among others, it's also important to make sure the system is not starving with CPU to make sense of the data (CPU data not shown here). When reading the heat map I typically focus on 3 areas: one is low-latency (1ms) where I get info on what is most likely reads from storage cache, the second is the area of latency around the 16 and 32 ms buckets, most likely representing physical reads from 'rotating disks'. The third very interesting area to watch is the 'high latency' (>100ms), that is the area of "IO latency outliers" and can be sign of problems with the storage for example. Note that OraLatencyMap is a drill-down tool based on Oracle instrumentation so the interpretation of the results, especially when extended to storage, will often need additional data from the specific context being investigated. 
Example 2: study of log file sync
This example is about visualizing the latency of log file sync. This can be useful when drilling down commit latency issues. This is a critical area for many OLTP systems and the troubleshooting is often not easy: storage performance, server CPU starvation and unexpected behavior by LGWR, among others can all potentially cause problems in this area.
This screen shot was taken from a putty window, this time with a white background.
SQL> @OraLatencyMap_event 3 "log file sync"
Note: See also the blog entry on OraLatencyMap v1.1
Conclusions
Latency investigations of Oracle wait events give an additional and powerful dimension to performance tuning and troubleshooting. This is particularly useful for (synchronous) I/O-related wait events such db file sequential read and log file sync. Latency heat maps are a particularly suited to visualize IO latency (see also Brendan Gregg's article). The first version of a simple SQL*plus script (OraLatencyMap) to collect and visualize event histogram data as heat maps has been discussed.
OraLatencyMap is available for download from http://canali.web.cern.ch/canali/resources.htm and on GitHub


Crazy!! It works really well, just added to my favorite SQL scripts directory! :-)
ReplyDeleteThank you for sharing this...
@ludovicocaldara
wow - works just like just like that - no install, just run the SQL file.
ReplyDeleteHas to be the coolest graphical widget for Oracle performance that I've seen in ages.
Works great with https://fritshoogland.wordpress.com/2012/12/15/throttling-io-with-linux/ for playing around with the output.
- Kyle Hailey
http://dboptimizer.com
Hi Kyle, thanks for the comments and for the idea on how to play around with the output.
DeleteBTW in his presentation at USENIX/LISA 2010 Brendan Gregg showed that he was able to write short sentences in his NFS latency heat maps, I wonder if that's feasible with Oracle workload too :)
yes, I think so. I just linked in your post, Frits' post and Brendan's post on writing "in latency."
Deletehttp://dboptimizer.com/2013/05/10/colored-heat-maps-in-sqlplus/
Don't have time to do it myself today, but certainly could be arranged. With enough devices and cgroups defined should be easily possible for readers. :)
Luca,
ReplyDeleteVery, very good post as usual... although your 16 to 32ms for mechanical drives is really scary ! Maybe lots of over utilized RAID 5 SATA 7200 RPMs :-) ?
Your point about CPU utilization and how it relates to LFS is spot on. If you don't mind I'd like to refer your readers to a post I have on that very topic:
http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/
Hi Kevin, thanks for your comments. You are right about the source of the 16-32 ms latency. From the heat maps you can see that on top of that we also have quite a few I/Os with very high latency too. Those systems are getting soon a H/W upgrade BTW (in the mean time they make good example graphs for I/O troubleshooting tools :)
DeleteI am more than happy to refer to your excellent blog.
That is very cool stuff. Another extension would be to color using how unusual the outliers are relative to the entire GV$EVENT_HISTOGRAM which is taken as representing the latency distribution altogether (after sufficient time under normal workloads) This would be more of a monitoring than diagnostic use case.
ReplyDeleteWell done guy...!
ReplyDeleteIt also works on windows client with ansicon :)
ReplyDeleteLuca, thanks for sharing, this is brilliant work.
ReplyDeleteHi Luca,
ReplyDeleteExcellent stuff....thanks a lot for sharing....
Alberto