
Monday, June 13, 2016

IPython/Jupyter Notebooks for Oracle

Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.

IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.
See also this blog post with examples on how to use Jupyter for querying Apache Impala.

It is of interest to integrate many data sources into Jupyter notebooks to make the platform versatile and to fulfill many different use cases. In this short post you can find examples of how to query data from Oracle using Jupyter notebooks and simple integration with pandas and matplotlib.


Notebook - Short description
Oracle_IPython_sqlplus - Examples of how to use sqlplus inside Jupyter notebooks. It is based on the use of %%bash cell magic and here documents to wrap up sqlplus inside Jupyter cells.
Oracle_IPython_cx_Oracle_pandas - Examples of how to query Oracle from Python using cx_Oracle and how to integrate with pandas and visualization with matplotlib.
Oracle_IPython_SQL_magic - Examples of how to query Oracle using %sql line magic (or %%sql cell magic) and of the integration with cx_Oracle and pandas.
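
To give a flavor of the notebooks' contents, here is a minimal sketch of querying Oracle with cx_Oracle and pulling the result set into pandas (the credentials and connect string are made up for illustration):

import cx_Oracle
import pandas as pd

# edit user, password and connect string to match your test database
conn = cx_Oracle.connect('scott', 'tiger', 'dbserver:1521/orcl.mydomain.com')

# fetch the query result straight into a pandas DataFrame
df = pd.read_sql('select username, created from all_users', con=conn)
df.head()

From there, df.plot() hands the data over to matplotlib for visualization.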


Dependencies and pointers to build a test environment:

  • Install IPython and Jupyter (the following assumes the Anaconda distribution from Continuum Analytics)
  • Install the Oracle client
    • Download the software from OTN: http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html
    • On that same page you can find the installation instructions (scroll down toward the end of the page)
    • Example of the actions: perform ln -s libclntsh.so.12.1 libclntsh.so (12c version) and export LD_LIBRARY_PATH={oracle client home}
    • I have tested this with Oracle client versions 12.1.0.2 and 11.2.0.4 on Linux
  • Post client installation:
    • set environment: export ORACLE_HOME={path to the Oracle client installation}
    • If not already installed, install libaio (yum install libaio)
  • Check that the Oracle client works and all dependencies are set by running sqlplus from the Oracle client home, example:
    • check client connectivity with: sqlplus username/password@dbserver:port/service_name
  • Install cx_Oracle, for example with pip install cx_Oracle
  • Install ipython-sql
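Once ipython-sql is installed, the %sql magic can be tried directly in a notebook cell. A minimal sketch (the connection string is illustrative only; it follows the SQLAlchemy URL format, here with a hypothetical SID):

%load_ext sql
%sql oracle://scott:tiger@dbserver:1521/orcl
%sql select sysdate from dual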
Reference: a web page by Julian Dyke with examples on how to use cx_Oracle


Saturday, August 29, 2015

Add Color to Your SQL

Topic: this post is about some simple techniques to add color to SQL scripts and their terminal output using ANSI escape codes.

Colors can be used to improve the output of command line tools. This is common practice, for example with the (bash) shell. Colors can also be very useful to improve the quality of the output of SQL scripts. In my experience this is not used frequently, probably because of the need for specialized techniques and also because the results depend on the terminal emulator (or tool) used to display the output. In this post you will find some pointers and examples that you can use to add color to your SQL output.

Notable previous work

A couple of years ago Tanel Poder added his very cool colored fish logo to the great snapper v4 script, using SQL and color-rich terminal output.
Sayan Malakshinov has published a blog article and a script color.sql that provide some ready-to-use and simple techniques for coloring the output of Oracle SQL*plus scripts.
ANSI escape codes are the main underlying technique to add color to the terminal output, see more details on how this works at: Wikipedia article on ANSI escape codes.
Putty is a widely used terminal emulator that supports ANSI escape codes. If you are a Windows user, note that CMD.EXE does not support ANSI escape codes and therefore is not suitable for running the scripts described in this post.
Heat map visualization is a powerful technique to explore 3D data, by providing the third dimension as color. I have integrated heat map visualization with command-line/terminal output in two tools, OraLatencyMap and PyLatencyMap, aimed at the study of I/O latency. In the next paragraphs I will share some of the tips and lessons learned from developing those tools regarding the use of color on terminal output.

Color palettes by example

Color palettes are very useful for heat map visualization. I have identified two simple palettes for displaying I/O latency histograms on terminal output: the first one is composed of shades of blue, the other is yellow-to-red. See an example of their usage as heat maps at this link. The scripts Color_palette_blue.sql and Color_palette_yellow-red.sql show two basic examples of how to generate color palettes using ANSI escape codes. The SQL, also pasted here below, works simply by changing the character background color, printing a white space and finally resetting the background back to normal:

define ANSICODE_PREFIX="chr(27)||'[48;5;'"
define ANSICODE_BACKTONORMAL="chr(27)||'[0m'"

select 0 ID, &ANSICODE_PREFIX|| '0m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Black
select 1 ID, &ANSICODE_PREFIX|| '15m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- White
select 2 ID, &ANSICODE_PREFIX|| '51m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL  -- Light blue
select 3 ID, &ANSICODE_PREFIX|| '45m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 4 ID, &ANSICODE_PREFIX|| '39m '|| &ANSICODE_BACKTONORMAL COLOR from dual  
UNION ALL 
select 5 ID, &ANSICODE_PREFIX|| '33m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL 
select 6 ID, &ANSICODE_PREFIX|| '27m '|| &ANSICODE_BACKTONORMAL COLOR from dual
UNION ALL  -- Dark blue 
select 7 ID, &ANSICODE_PREFIX|| '21m '|| &ANSICODE_BACKTONORMAL COLOR from dual; 

The generation of the ANSI escape codes has little complexity and can be ported with little effort to many other languages of interest. Here is an example in PL/SQL (from OraLatencyMap):

create or replace function asciiescape_color (p_token pls_integer, p_palette_type varchar2) 
return varchar2 
   is
      type t_palette is varray(7) of pls_integer;        -- a palette of 7 colors
      v_palette_blue t_palette := t_palette(15,51,45,39,33,27,21);      -- shades of blue
      v_palette_red t_palette := t_palette(15,226,220,214,208,202,196); -- white-yellow-red
      v_colornum pls_integer;
   begin
      if ((p_token < 0) or (p_token > 6)) then
          raise_application_error(-20001,'The color palette has 7 colors, 0<=p_token<=6, found instead:'||to_char(p_token));
      end if; 

      if (p_palette_type = 'blue') then
          v_colornum := v_palette_blue(p_token+1);                               
      else
          v_colornum := v_palette_red(p_token+1);                                 
      end if;
      return(chr(27)||'[48;5;'||to_char(v_colornum)||'m');   
      --return the ascii escape sequence to change background color 
end;
/

An example in Python (from PyLatencyMap):

def asciiescape_color(token, palette):
    blue_palette = {0:15, 1:51, 2:45, 3:39, 4:33, 5:27, 6:21}        # palette, shades of blue
    red_palette  = {0:15, 1:226, 2:220, 3:214, 4:208, 5:202, 6:196}  # white-yellow-red palette
    if palette == 'blue':
        color_asciival = blue_palette[token]
    elif palette == 'red':
        color_asciival = red_palette[token]
    else:
        raise Exception('Wrong or missing palette name.')
    return(chr(27) + '[48;5;' + str(color_asciival) + 'm')


Other ANSI escape codes of interest from OraLatencyMap and PyLatencyMap are the codes to reset the background color back to the normal value and to clear the screen. Here is an example from PyLatencyMap (Python):

#reset the background color back to normal
asciiescape_backtonormal = chr(27) + '[0m'

# clear screen and move cursor to top line
line += chr(27) + '[0m' + chr(27) + '[2J' + chr(27) + '[H' 
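
Putting these pieces together, here is a minimal sketch (not taken from PyLatencyMap) of how such escape codes can drive a self-refreshing terminal display, redrawing a row of colored cells once per second:

import sys
import time

ESC = chr(27)
blue_palette = [15, 51, 45, 39, 33, 27, 21]  # same color numbers as in the examples above

for i in range(10):
    sys.stdout.write(ESC + '[2J' + ESC + '[H')  # clear screen and move cursor to top
    for color in blue_palette:
        # set the background color, print a space, reset back to normal
        sys.stdout.write(ESC + '[48;5;' + str(color) + 'm ' + ESC + '[0m')
    sys.stdout.write('\n')
    sys.stdout.flush()
    time.sleep(1)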

An example of colorizing SQL output

You will see in this paragraph an example that I hope is both instructive and fun: how to add colors to a script that computes an image of the Mandelbrot set. The starting script is quite interesting by itself as it uses just SQL for computation and output display. The code is not original: I have ported it to Oracle from code on the PostgreSQL wiki, with some minor modifications. Mandelbrot_SQL_Oracle_text.sql is the "black and white" starting script, before adding color.
Adding colors to the output for this example is just a matter of combining the original "black and white" script with the SQL scripts for generating color palettes. The results are the following two scripts (see also the figure below for an example of their output):

The figure illustrates how to add color to some classes of SQL output. The examples use a test script which generates an image of the Mandelbrot set using SQL. Note that adding color to the plain text version is achieved here just by adding an extra table join (with the PALETTE common table expression). Other methods to add color to Oracle scripts include using PL/SQL functions or other languages, such as Python, as discussed in the text of this post.

A version of the script for PostgreSQL that uses the ideas discussed in this post for colorizing SQL is: Mandelbrot_SQL_PostgreSQL_color_blue.sql

Conclusions

Colors can improve the effectiveness of command line scripts and terminal output, including SQL scripts. ANSI escape codes provide a powerful tool for many terminal operations. Heat map visualization is a powerful data visualization technique that can be implemented also on the terminal output using ANSI escape codes. In this post you can find simple tips on how to add colors to the terminal output both for SQL and other languages, notably Python. Adding color to "black and white" script output can be fun and useful at the same time, as illustrated with the Mandelbrot SQL example. Happy coloring!

Download: The tools discussed in this post can be downloaded from this webpage and from Github

Additional links and references

Tanel's colored fish logo to the snapper v4 script
Sayan Malakshinov's blog article and script color.sql
Wikipedia on ANSI escape codes
Latency heat maps for I/O latency measurements on the CLI with OraLatencyMap and PyLatencyMap
Fun SQL snippets from the PostgreSQL wiki
Additional examples of recursive common table expression (recursive subquery factoring) and non-standard uses of SQL: how to find numeric solutions to basic physics equations using SQL



Monday, December 30, 2013

How to Recover Files from a Dropped ASM Disk Group

Topic: This post describes a few tips and techniques on how to recover ASM files from a dropped disk group and an example of how to 'undrop' an ASM disk group

Context: This research stems from a recent case I have investigated where the data disk group of a test DB had all its disks with header_status 'FORMER': in other words the disk group had been dropped and hence could not be mounted. The next logical step would have been to recreate the disk group and the test DB contained therein, although I first wanted to see if my 'favorite ASM support tools' (kfed and amdu, see also references) could be of help to recover the lost files. It turns out that it is quite easy to recover files in a dropped diskgroup, as the drop operation seems to just flick a key value in the disk headers. But let's go step by step:

Read files from a dropped ASM diskgroup with amdu

amdu is a diagnostic utility shipped with Oracle installations since 11g (it can also be used for 10g, see references). It allows reading data from unmounted diskgroups and can be used in various support-type operations. For the case in question we can read files from a dropped diskgroup with:

amdu -dis '/dev/mapper/MYLUN*p1' -former -extract ORCL_DATADG.256

This command will extract ASM file number 256 from the unmounted diskgroup ORCL_DATADG, physically accessible from the path /dev/mapper/MYLUN*p1 (edit the LUN path as relevant for your system). File 256 in the ORCL_DATADG diskgroup in this example is the control file of the test database we want to recover.
Note: ASM starts numbering user-visible files from 256 (lower file numbers are used for the ASM system files, often hidden, see also X$KFFIL and more details in the references). It is very likely that file number 256 is a controlfile in 10g and 11g systems, as the controlfile is often the first file created in a disk group used for Oracle DBs. I have just observed on a brand new 12c database that file number 256 is the password file (a new 12c feature); in that system the DB controlfile is file number 257.
The outcome of the amdu command above is to extract a copy of the file into the local file system (in a custom created directory). From the controlfile we can easily get a list of the rest of the DB files if needed. For example we can use the strings command on the restored controlfile and process the output to find the names of the rest of the DB files.
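
As an illustration, the post-processing could look like this in Python (a sketch; the path of the extracted file is hypothetical, and the filter simply picks up strings that look like ASM file names):

import re
import subprocess

# path of the controlfile copy extracted by amdu (hypothetical name)
extracted_file = 'amdu_2013_12_30/ORCL_DATADG_256.f'

output = subprocess.check_output(['strings', extracted_file]).decode('utf-8', 'ignore')
# ASM file names in the controlfile typically look like +DISKGROUP/DB/datafile/...
for line in output.splitlines():
    if re.match(r'\+\w+/', line):
        print(line)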

Where are we with our disk group recovery? We have demonstrated a simple method to extract any file from a dropped disk group. Good news, we could recover the entire 'lost DB' onto a local filesystem.
Can we do better than that, for example recover all the files in one go and into ASM?

Undrop an ASM diskgroup with kfed

kfed is another great ASM diagnostic and repair tool shipped with Oracle. It allows reading and writing the ASM metadata of the disk header structures. Obviously writing into ASM disk headers is an unsupported activity that we do at our own risk (or rather under the guidance of Oracle support if needed).
Block number 0 of ASM allocation unit number 0 (see references for details) of each ASM disk contains, among others, a key called kfdhdb.hdrsts. Dropped disks have kfdhdb.hdrsts=4 (KFDHDR_FORMER), while disks that are members of a diskgroup have kfdhdb.hdrsts=3 (KFDHDR_MEMBER).
The 'trick' here is to read all the disk headers, one by one with kfed, change the value of kfdhdb.hdrsts from 4 to 3 and write the headers back into the disks.
Let's see this with a simple example, where we first create a diskgroup and then drop it to test the undrop procedure. The following uses a mixture of sqlplus commands on the ASM and DB instances and kfed run from the OS prompt, as indicated.

1. Test setup

We create an ASM disk group for testing, add a DB tablespace to it, and then drop the diskgroup to prepare for the next step (the undrop, see point 2 below).

ASM_INSTANCE> create diskgroup ORCL_TESTDG external redundancy disk '/dev/mapper/MYLUN1_p1';


ORCL_DB> create tablespace testdrop datafile '+ORCL_TESTDG' size 100m ;

ORCL_DB> alter tablespace testdrop offline; -- this is needed or else diskgroup drop will fail as you cannot drop a diskgroup with open files

ASM_INSTANCE> drop diskgroup ORCL_TESTDG including contents;

2. Example of how to undrop the disk group and recover its files

We read the block header for each disk of the diskgroup (1 disk only in this example) and copy it to a local file:

$ kfed read /dev/mapper/MYLUN1_p1 aunum=0 blknum=0 text=dumpfile_MYLUN1_p1

Manual edit of the local copy of the header block:

$ vi dumpfile_MYLUN1_p1

replace the line:
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER

with:
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER

We write the modified block header for each disk of the diskgroup (1 disk only in this example):

$ kfed write /dev/mapper/MYLUN1_p1 aunum=0 blknum=0 text=dumpfile_MYLUN1_p1
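
When the diskgroup spans many disks, the edit step can be scripted. Here is a minimal sketch in Python of the edit alone (the dump file name matches the example above; the kfed read and write steps still need to be run once per disk):

import fileinput
import sys

# patch the dumped header: change kfdhdb.hdrsts from 4 (FORMER) to 3 (MEMBER)
for line in fileinput.input('dumpfile_MYLUN1_p1', inplace=True):
    if line.startswith('kfdhdb.hdrsts:'):
        line = line.replace('4 ; 0x027: KFDHDR_FORMER',
                            '3 ; 0x027: KFDHDR_MEMBER')
    sys.stdout.write(line)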


3. We can now check that our disk group and its files are back:

ASM_INSTANCE> alter diskgroup ORCL_TESTDG mount;
ORCL_DB> alter tablespace testdrop online; 

Note, this procedure has been tested on Oracle for Linux, versions 11.2.0.3 and 12.1.0.1.

References

ASM internals and diagnostic utilities are a fascinating topic of investigation. An extensive source of information is the blog: ASM support Guy.
Some older research material that I have published on this topic can be found at wiki page on amdu and kfed and at wiki page on investigations of ASM internals.

Conclusions

We have discussed a few tips on how to read and recover files from a dropped disk group using the amdu utility and went through the main steps of an example procedure showing how to 'undrop' an ASM disk group using kfed.

Monday, October 28, 2013

Daylight Saving Time Change and AWR Data Mining

Topic: this article is about a tip for Oracle 11.2 and higher on how to handle daylight saving time (DST) changes when querying the AWR repository directly. In addition, a few comments on PerfSheet 4, a tool to extract and plot AWR data in Excel.

Introduction: Statspack, and more recently AWR reports, are standard performance data sources for Oracle practitioners. AWR analytics/data mining brings an additional powerful set of techniques to understand database workloads and performance when instance-wide analysis is relevant. If you are new to this topic, see for example this presentation by Jury Velikanov.

SQL techniques
One of the SQL techniques that is commonly used when extracting data from AWR is the use of analytic functions (see also this presentation by Connor McDonald) to calculate "deltas" between data points for consecutive snapshots of V$ views stored in AWR. I mean something like this:

...value - lag(value) over (partition by dbid,instance_number,stat_id order by snap_id nulls first) 

Moreover it is quite useful to calculate ratios of such delta values over the time elapsed between two snapshots. How can we calculate the seconds between two snapshots? With something like this:

(extract(hour from end_interval_time - begin_interval_time)*3600 +
 extract(minute from end_interval_time - begin_interval_time)*60 +
 extract(second from end_interval_time - begin_interval_time))

DST changes can cause trouble
What has all this got to do with daylight saving time changes?
The problem we are trying to solve is, for example, that in the fall, when moving the clock back one hour, we have 2 consecutive snapshots at 2:00 am. The time interval between the snapshots calculated with the simple formula above becomes zero (actually it would normally be within a handful of seconds of zero, and it can also be negative), and this calls for trouble when we want to divide our delta values by the snapshot time interval.

A solution for 11.2 and higher
Oracle 11.2 and higher provide a simple solution by introducing an extra column in the dba_hist_snapshot view (and underlying table WRM$_SNAPSHOT): SNAP_TIMEZONE, the "Snapshot time zone expressed as offset from UTC (Coordinated Universal Time) time zone".
This new column allows us to write a simple 'patch' for our computation of the time difference, to account for daylight saving changes:

... - extract(hour from snap_timezone - lag(snap_timezone) over (partition by dbid, instance_number order by snap_id)) * 3600
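
The same correction can also be applied client side. Here is a minimal sketch in Python with pandas, assuming the snapshot rows have been fetched into a DataFrame ordered by snap_id, with columns named after DBA_HIST_SNAPSHOT:

import pandas as pd

def interval_seconds(snap):
    # snap: DataFrame ordered by snap_id, with columns end_interval_time
    # (timestamp) and snap_timezone (timedelta, the offset from UTC)
    wall_clock_delta = snap['end_interval_time'].diff()  # naive difference
    tz_change = snap['snap_timezone'].diff()             # non-zero across a DST change
    # elapsed time = wall-clock difference minus the change of UTC offset
    return (wall_clock_delta - tz_change).dt.total_seconds()

In the fall example above, wall_clock_delta is close to zero and tz_change is minus one hour, so the computed interval is roughly 3600 seconds, as expected.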

Example
Here below is an example from DBA_HIST_SNAPSHOT for a database server that has changed from CEST (Central European Summer Time) to CET on Sunday October 27th, 2013. Note that the SNAP_TIMEZONE value changes from UTC+2 to UTC+1. Note also that for SNAP_ID=13843 both begin_interval_time and end_interval_time are set to 2am. More precisely, the end interval appears to have happened almost 20 seconds before the begin interval; as explained above, this needs to be corrected using the SNAP_TIMEZONE difference.



Mining AWR data
I find AWR data a very useful source for performance tuning (when instance-wide data is appropriate). It is quite worth the effort to develop a set of scripts for AWR mining to be used when the need comes. Some pointers: my set of scripts can be downloaded from this link (packaged together with PerfSheet4). Another very good set of AWR scripts can be found at Karl Arao's website.

Perfsheet4 
Perfsheet 4 is a tool I have written and published in February 2013 with the goal of providing a simple point-and-click data mining tool for Oracle AWR data. It is a fork of Perfsheet 3.0 by Tanel Poder and comes with a few packaged queries to extract data from AWR and visualize it as line graphs. The power of the tool is that it automates some boring details of writing queries to extract AWR data and import the results in Excel and brings the user straight to Excel's pivot charts: a simple but also quite powerful aggregation and visualization engine.
The queries used in Perfsheet4 can be seen/customized in the Excel spreadsheet and are also available as text files in the zip file. You can see there that I have used the new 11.2 attribute snap_timezone. This has come at a price though: the compatibility with 11.1 and 10.2 databases is broken (only Oracle 11.2 and higher will run the supplied scripts).

A basic example here below of how to use PerfSheet 4 in its latest version as I write this (v3.3):


Tuesday, September 10, 2013

Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources

Topic: this is a getting-started article for PyLatencyMap, a performance tool for real-time investigation and troubleshooting of latency data. Contains examples relevant for the Oracle RDBMS and for DTrace.

Get the software

Download the latest version of PyLatencyMap (also available on GitHub). At the time of this writing PyLatencyMap has been tested on Linux and Solaris, with Oracle 11.2 and 12.1, and with python 2.4 and 2.6. If you don't have python available, similar functionality is provided by OraLatencyMap, a tool implemented in SQL*plus.
If you have access to the DB server a quick way to get going with PyLatencyMap is to run it directly from there, using local authentication to the DB ('/ as sysdba'), but this is by no means a strict requirement.

Getting Started with PyLatencyMap for Oracle

The easiest way to get started with PyLatencyMap for Oracle wait events is to run Example1_oracle_random_read.sh, which is a wrapper script for:
sqlplus -S / as sysdba @event_histograms_oracle/ora_latency.sql "db file sequential read" 3 | python LatencyMap.py

There are 2 components: the first is a SQL*plus script that collects latency data with a sampling interval of 3 seconds and prints it to standard output in a custom format. The second is the visualization engine itself, written in python. The data source is gv$event_histogram filtered for the "db file sequential read" wait event, which is the main event associated with single-block reads in Oracle (there are many more interesting details here, but outside the scope of this article).
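
For reference, the same sampling can be reproduced programmatically. Here is a sketch (credentials and connect string are made up) of reading gv$event_histogram with cx_Oracle and computing the per-bucket deltas between two samples, which is the essence of what the SQL*plus data source feeds to the visualization engine:

import time
import cx_Oracle

conn = cx_Oracle.connect('system', 'manager', 'dbserver:1521/orcl')  # edit for your test DB

def sample_histogram(event_name):
    # one row per latency bucket: upper bound in milliseconds -> cumulative wait count
    cursor = conn.cursor()
    cursor.execute("""select wait_time_milli, sum(wait_count)
                      from gv$event_histogram
                      where event = :event_name
                      group by wait_time_milli""", event_name=event_name)
    return dict(cursor.fetchall())

before = sample_histogram('db file sequential read')
time.sleep(3)
after = sample_histogram('db file sequential read')

# waits that landed in each latency bucket during the 3-second interval
deltas = dict((bucket, after[bucket] - before.get(bucket, 0)) for bucket in after)
print(deltas)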

Investigating commit latency

If you are using PyLatencyMap to investigate an OLTP-like workload, chances are you want to check also the latency details of "log file sync" (again many details here are outside the scope of this article, see also Kevin Closson's blog). The wrapper script is Example2_oracle_commit_time.sh, which is a straightforward modification of the script for Example 1 discussed above.
It is straightforward to extend this example to produce latency maps for arbitrary events.

Making sense of the tool's output: frequency-intensity latency heat maps

On the top half of the screen, the "blue graph" shows the number of operations/wait events for each latency bucket. We can use this to answer questions such as: how many operations are served from the storage (SSD) cache? How many operations are served by the "spindles"? Is the latency of the IO served by spindles reflecting a healthy IO subsystem, or does it show signs of saturation?
The "red graph" is a measure of the wait time and highlights where large fractions of the wait time are spent. This can help to spot latency outliers (rare events with large latency).
More discussions on latency measurement and visualization can be found in Brendan Gregg's blog.

Demo (video)




Getting Started with PyLatencyMap for DTrace

Example5_DTrace_pread.sh (see contents here below) is intended to show how to use PyLatencyMap to display heat maps for DTrace latency data and in particular for the pread (pread64) system call.
It is straightforward to extend this example to produce latency maps for generic DTrace aggregation variables (quantize function). See also this blog entry.
Example5 is a wrapper script and contains 3 parts. The first is the data source: a DTrace script used to measure the latency of pread system calls (similarly to the examples above, this is for troubleshooting single-block IO). The second part is a 'connector' used to convert the data source into a format readable by PyLatencyMap, which is the third and final part.

dtrace -s DTrace/pread_tracedata.d |python DTrace/dtrace_connector.py |python LatencyMap.py

Demo (video)




Some advanced commands

We can capture a data source for later replay: we just need to add the Unix tee command to redirect a copy of the data source to a file. We can later replay the file. Example:

sqlplus -S / as sysdba @event_histograms_oracle/ora_latency.sql "db file sequential read" 3 | tee my_recorded_data_source | python LatencyMap.py

Later we can replay with: cat my_recorded_data_source  | python LatencyMap.py

See python LatencyMap.py -h for the available options. For example we can replay the recorded data with fixed minimum and maximum values and a reduced number of records in the display (80 instead of the default value of 90):

cat my_recorded_data_source | python LatencyMap.py --frequency_maxval=1000 --intensity_maxval=10000 --num_records=80

Is it safe to run PyLatencyMap in production?

This is the DBA/SysAdmin's call of course. On the test and production environments where I have used PyLatencyMap I have noticed it being lightweight and overall not interfering with DB activity. For review, please note that the query used to extract event histogram data from Oracle is in ora_latency.sql. The main python engine for visualization is LatencyMap.py.

A tip for test environments is to generate lots of random read operations in a controlled way using Kevin Closson's SLOB2. See also this link.

Conclusions

I hope you will be successful in investigating and troubleshooting latency issues with PyLatencyMap (or with its little brother OraLatencyMap). Feel free to let me know what you think of it, for example if you find bugs and/or have ideas for improvements.

Monday, March 25, 2013

Testing Lost Writes with Oracle and Data Guard

Topic: This post is about lost writes in Oracle, on techniques for reproducing and investigating the effects of lost writes and on the use of Active Data Guard to protect against lost writes

Motivations. Imagine this scenario: a production system has two standbys to protect against disaster and to load-balance read-only workload (with Active Data Guard (ADG)). A lost write happens in the primary and remains unnoticed for a few days. Finally the block that suffered the lost write is updated again. Both standbys stop applying redo, throwing ORA-600 [3020] (also known as stuck recovery). The primary DB keeps working fine, although it is logically corrupted by the lost write.
You are the DBA in charge of fixing this, what would you do?

The importance of testing: I hope that the example above illustrates that lost writes can generate quite complex recovery scenarios and, overall, a few headaches for support DBAs. In this post I illustrate a few techniques and examples that can be used to test the effects of lost writes in Oracle and therefore prepare in case a real-world issue strikes. Of particular interest will be to test the effect of lost writes in an environment with (Active) Data Guard.
We need to cover some ground first on techniques and how to setup the test. But first a definition.

Lost writes: "A data block lost write occurs when an I/O subsystem acknowledges the completion of the block write, while in fact the write did not occur in the persistent storage" (from support note 1302539.1). Lost writes can be caused by faulty storage, but also by Oracle bugs, in general anything in between our data in RAM and storage can corrupt our data, including controllers and network.

Digression on techniques.
1. A useful technique that we will need in the following is the ability to read and write a single block from Oracle data files (in a test environment). For databases on filesystems (and also DBs on NFS) dd is the tool for this job (I am using the Linux platform as reference). Examples:

read one 8KB block from filesystem (block 134 in this example):
dd if=testlostwrite.dbf bs=8192 count=1 skip=134 of=blk134.dmp 

write one 8KB block to filesystem (block 134 in this example):
dd of=testlostwrite.dbf bs=8192 count=1 seek=134 if=blk134.dmp conv=notrunc

Note when writing to an Oracle datafile we must use conv=notrunc or else we will end up with an unusable (truncated) output file. Note also the syntax for specifying the block offset: skip is used for input files and seek for output files (see the dd manual).
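
The same single-block copy can also be done with a few lines of Python, a handy alternative when scripting tests (a sketch; the file name and block number match the dd examples above):

# read 8KB block number 134 from the datafile and save it to a dump file
block_size = 8192
block_number = 134

datafile = open('testlostwrite.dbf', 'rb')
datafile.seek(block_size * block_number)
block = datafile.read(block_size)
datafile.close()

dump = open('blk134.dmp', 'wb')
dump.write(block)
dump.close()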

How to read and write single blocks on ASM data files. One possibility is to take a backup copy of the datafile with RMAN, edit it with dd (as detailed above), then (with RMAN again) restore the backup copy. With a little knowledge of ASM internals, more direct ways to access files in ASM are available: one can find the position of the block (and its mirror copies if relevant) and then use dd to read/write data directly. Currently my preferred way is slightly different: it exploits the dbms_diskgroup package. This is an undocumented package (see again the link above on ASM internals for some additional details), although it is extensively used by Oracle's asmcmd utility. I have packaged the dbms_diskgroup.read and dbms_diskgroup.write calls into a small utility written in perl (that I called asmblk_edit, follow this link to download asmblk_edit). Similar ideas can be found also in support note 603962.1.
The following example illustrates using the utility asmblk_edit to read and write block number 134 from and to a data files stored in ASM:

read one 8KB block from an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp

write one 8KB block to an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp

2. Another technique that we need is quite straightforward and allows us to find the block number of the Oracle block that we want to read/write in our tests.
The example here below shows how to find the block number where a row of data is stored, using rowid:
SQL> select rowid, dbms_rowid.ROWID_BLOCK_NUMBER(rowid), a.* from testlosttable a;

Incidentally finding the block number for an index leaf block can be done with the following (using the undocumented function sys_op_lbid):
SQL> select rowid rowid_table_from_index_leaf, sys_op_lbid(18297477,'L',t.rowid) index_leaf_rowid from testlosttable t --note: 18297477 in this example is the data_object_id of the index I am examining, edit with the actual number as relevant

3. Last but not least, we need a way to modify data blocks 'under the nose of Oracle', in particular we want to make sure we flush/invalidate the relevant cached data and metadata. The method we will use is:
  • Offline the tablespace where data resides (this flushes dirty blocks and invalidates cache entries)
  • Perform the read/write modifications to the block, with dd or asmblk_edit, as needed
  • Online the tablespace again before the next usage

A basic recipe to reproduce the effects of a lost write. 
We can now put together the ideas and techniques described above into a working example aimed at reproducing the effects of a lost write in Oracle:

SQL> create bigfile tablespace testlostwrite datafile '{edit with datafile directory}/testlostwrite.dbf' size 10m;
SQL> create table testlosttable (id number, payload varchar2(100)) tablespace testlostwrite ;
SQL> create index i_testlosttable on testlosttable (id) tablespace testlostwrite ;
SQL> insert into testlosttable values (10,'aaaaaaaaaaaaaaaaaaaaaaaaa');
SQL> insert into testlosttable values (20,'bbbbbbbbbbbbbbbbbbbbbbbbb');
SQL> commit;
SQL> select rowid, dbms_rowid.ROWID_BLOCK_NUMBER(rowid), a.* from testlosttable a;
--note: this will allow to find the block_id where data resides, let's say it's block 134

SQL> alter tablespace testlostwrite offline;
-- read block either with dd or with asmblk_edit and create backup copy. Example:
-- ./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> insert into testlosttable values (30,'cccccccccccccccccccccccccc');
SQL> commit;
SQL> alter tablespace testlostwrite offline;
-- write block either with dd or with asmblk_edit and from previously created backup copy. Example:
-- ./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> -- our database has now a lost write in the table testlosttable block 134

The effect of the lost write on the table is that the row with id=30 has disappeared from the table. However the entry with id=30 is still visible in the index i_testlosttable. This observation can be confirmed with the 2 queries reported here below. Note that in case of normal operations (i.e. no lost writes) the 2 queries would both return three rows; this is not the case here because of our manual editing of the table block with the asmblk_edit utility.

SQL> select /*+ INDEX_FFS(a)*/ id from testlosttable a where id is not null;
SQL> select  /*+ FULL(a)*/ id from testlosttable a;
SQL> -- these 2 queries return different results in our test with a lost write.. and only one of them is correct!

We can now proceed with this additional test: insert a new row into the table, for example:

SQL> insert into testlosttable values (40,'ddddddddddddddddddddddd');
SQL> commit;

What we should notice at this point is that Oracle keeps working fine and no errors are shown to the user. With the techniques discussed above we can easily show that this new row has been inserted into block 134 (the block with a lost write). Let's postpone further investigations to a later paragraph and for now just note that Oracle has not detected that we have suffered a lost write.

An example of the effects of lost writes with Standby (Data Guard)
A standby database, among others, provides a copy of the database that Oracle can use to detect lost writes (i.e. the standby provides a reference copy of 'good data'). A simple test to see how this works in practice:
  • We start by going through the same steps described above and we create a lost write in the primary. 
    • Note that at this point the standby has no knowledge that something has gone wrong in the primary.
  • Now we can go ahead and run DML against the block that has suffered the lost write 
    • i.e. we insert the row with id=40 as detailed above.
  • When the standby database tries to apply the redo log entry (change vector) to the block with a lost write, it will compare SCNs and find that something is wrong. MRP will stop and throw ORA-600 [3020] (stuck recovery).
An example of the error stack for a stuck recovery:

ORA-00600: internal error code, arguments: [3020], [10], [134], [134], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299147

Additional information is dumped in the trace files of MRP and of the recovery slaves.
It's worth noting that ORA-600 [3020] can be caused by a lost write (as in this example) but also by other issues, in particular with the redo consistency (outside the scope of this discussion).

Oracle 11g and lost write protection 
In the example above Oracle only throws an error alerting that something is wrong (in this case ORA-600 [3020], stuck recovery) when a DML operation is performed on the primary DB against a block that had suffered a lost write. This means that lost writes may also remain silent and unnoticed in the primary DB for a very long period of time. A new feature of 11g can be used to make Oracle more active in detecting lost write issues, in particular to check the blocks that are subject to physical reads into the buffer cache. This is how to activate these checks:
  • set the initialization parameter DB_LOST_WRITE_PROTECT = TYPICAL (or FULL if we prefer) on the primary database. This will cause the generation of additional redo entries when Oracle performs physical reads. 
  • set DB_LOST_WRITE_PROTECT = TYPICAL also on the standby; this will make MRP and its recovery slaves check for lost writes using the extra information in the redo log stream. 
  • Outcome: if we hit a block suffering from lost write, MRP and its slaves will stop and throw ORA-752: recovery detected a lost write of a data block 
An example of the error stack for lost writes with db_lost_write_protect set to typical is pasted here below (note that additional information is dumped in the trace files of MRP and of the recovery slaves):

Hex dump of (file 10, block 134) in trace file {path..}/standby_pr09_26471.trc
Reading datafile '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685' for corruption at rdba: 0x00000086 (file 10, block 134)
Read datafile mirror 'STDBY_DATADG1_0000' (file 10, block 134) found same corrupt data (logically corrupt)
Read datafile mirror 'STDBY_DATADG1_0011' (file 10, block 134) found same corrupt data (logically corrupt)
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 134, FILE 10
NO REDO AT OR AFTER SCN 6367748422450 CAN BE USED FOR RECOVERY.
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299538

Comparing this error message with the ORA-600 [3020] reported in the previous paragraph, the first thing we notice is that we have more information and more details. Moreover Oracle has checked that this is indeed a case of lost write and has even tried to resolve the 'logical corruption' by reading the second mirror copy from the normal-redundancy ASM diskgroup used in this example (an attempt that did not help in this case, as both mirror copies of the block have suffered a lost write, as generated in our example using the asmblk_edit utility).

DB_LOST_WRITE_PROTECT and its impact on primary and standby DBs
From the example here above we see that the extra checks that the Oracle recovery process performs when setting DB_LOST_WRITE_PROTECT in Oracle 11g are definitely an improvement over the 10g behavior, although this mechanism does not provide complete protection against lost writes but only gives us a higher probability that a lost write can be found.
How about the impact of setting this parameter? Extra redo entries are generated on the primary: they are called block read redo (BRR). We can directly investigate BRR entries for example by dumping redo (BRR corresponds to redo at layer 23 and opcode 2), see this example on how to do this type of dump.
SQL> alter system dump logfile '...{path to storage}.../thread_2_seq_1816.510.810661967' layer 23 opcode 2;

Another method is by querying v$mystat or v$sysstat. We will look for stats containing the text 'lost write' or 'physical read' in their name:
SQL> select name, sum(value) from v$mystat se, v$statname n where n.statistic#=se.statistic# and (n.name like '%lost write%' or name like '%physical read%') group by name;

The size of BRR entries in the redo stream varies as Oracle's optimizations can come into play, such as batching several BRR entries in one redo record. Based on a limited number of observations of a production system I estimate that on average we can expect 30 bytes of extra redo generated by BRR for each physical block read performed by the database, although mileage may vary and definitely the impact of the parameter should be tested before applying it to a busy production system! Another observation based on testing is that direct read operations do not generate BRR (for example reads for parallel query).
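
A back-of-the-envelope calculation helps to size the effect (the workload figure is made up for illustration; 30 bytes per read is the rough estimate from above):

# estimated extra redo volume generated by BRR entries
physical_reads_per_sec = 1000  # hypothetical workload
brr_bytes_per_read = 30        # rough average estimated above
bytes_per_day = physical_reads_per_sec * brr_bytes_per_read * 86400
print('%.1f GB/day of extra redo' % (bytes_per_day / 1024.0 ** 3))  # ~2.4 GB/day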

Example of BRR entry from a logfile dump:

CHANGE #5 TYP:0 CLS:4 AFN:10 DBA:0x00000086 OBJ:18299943 SCN:0x05ca.9b6e95f9 SEQ:1 OP:23.2 ENC:0 RBL:1
 Block Read - afn: 10 rdba: 0x00000086 BFT:(1024,134) non-BFT:(0,134)
              scn: 0x05ca.9b6e95f9 seq: 0x01
              flags: 0x00000004 ( ckval )

After setting db_lost_write_protect to typical (or full) and rebooting the instance, Oracle will set _log_committime_block_cleanout to true (based on observations on my test system, 11.2.0.3). The effect is that (on the primary instances) Oracle will write into the redo logs additional entries related to commit-time block cleanout operations (layer 4 opcode 6). Logging block cleanout redo records (and subsequently applying them to the standby database) is a desirable feature, especially in an environment deploying Active Data Guard.

On a standby instance, setting db_lost_write_protect to TYPICAL or FULL will make the MRP slaves do the extra work of checking for lost writes by comparing the SCNs in BRR entries with the SCNs in the block headers. Notably, the extra work implies additional physical reads on the standby. This additional work on the standby can be monitored by querying v$sysstat (statistics name = 'recovery blocks read for lost write detection' and also the related stat 'recovery blocks skipped lost write checks').

Note: checks for lost writes based on SCNs from BRR entries are performed also when doing media recovery (i.e. a recover database command). Therefore even if Data Guard is not available, one can use a simple restore of the database from backup to perform validation of BRR records and search for lost writes. 

Analysis and troubleshooting.
From the steps detailed above we have a test environment with a lost write that we have generated. This will allow us to have some fun investigating the issue at low level and developing techniques for analysis and troubleshooting. All this with the great advantage, compared to a real-life case, that now we know for sure what the root cause of this incident is!
A good starting point for troubleshooting lost writes in a production system is support note 1265884.1 "Resolving ORA-752 or ORA-600 [3020] During Standby Recovery". A few more pointers here below.

- One technique to investigate the affected blocks, both on the primary and the standby, is to dump the block contents to a trace file. The idea is to compare the contents of the primary and standby. Of particular interest will be the SCN of the last change to the block and also the SCNs in the ITL list. Example:
SQL> alter system dump datafile 10 block 134;
SQL> --the following performs a dump of the cached version of the block 
SQL> alter session set events 'immediate trace name set_tsn_p1 level <ts#+1>'; -- where ts# is the tablespace number
SQL> alter session set events 'immediate trace name buffer level <decimal rdba>'; --the rdba is 134 in our example

- We can also investigate the contents of the block suffering a lost write using SQL. This has the advantage of allowing the use of flashback query. The current content of the block on the primary database may not be what we want: we are interested in a consistent image of the block at the SCN equal to the SCN reported in our ORA-752 message, for example (this would also be the current SCN of the standby, as recovery is stuck there at that point in time). Example:
SQL> set num 16
SQL> select ora_rowscn, rowid, dbms_rowid.rowid_row_number(a.rowid) row_number, a.* from testlosttable as of scn 6367748219413 a where rowid like 'ABFzkJAAAAAAACG%'; -- edit values for scn and rowid, use SQL below to find the values to use

- Digression: how to find the rowid of the block with lost write (block 134 of file 10 and object_id=18299145). Actually what we need is just the first 15 characters of the rowid (the last three characters are the row_number inside the block). Example:
SQL> select substr(DBMS_ROWID.ROWID_CREATE(rowid_type =>1, object_number =>18299145, relative_fno =>0, block_number =>134, row_number =>0),1,15) rowid_prefix from dual;

- How to find the current SCN (at the standby):
SQL> select current_scn from v$database;

- Another powerful technique is to dump redo log information from all relevant logfiles, searching for entries related to the block with lost writes (in our example it's block 134 of file 10). The dump will include transaction details and most notably the redo marking the time when DBWR has written the given block (this info is stored in the block written redo, BWR). If the parameter db_lost_write_protect is set to typical or full, the redo dump will also show details of the block read redo (see BRR discussed above). For further info on logfile dumps see also Julian Dyke's website. Example:
SQL> alter system dump logfile '...{path to storage}.../thread_2_seq_1816.510.810661967' DBA MIN 10 134 DBA MAX 10 134; -- edit file number and block number as needed

- Additional investigation on data affected by the lost write can be performed by log mining. The goal is to find the SQL of all the transactions for the affected block. Identify the relevant redo logs to mine first. Example:
SQL> BEGIN
SYS.DBMS_LOGMNR.ADD_LOGFILE(LogFileName=>'...{path to storage}.../thread_2_seq_1816.510.810661967',Options=>SYS.DBMS_LOGMNR.NEW);
SYS.DBMS_LOGMNR.START_LOGMNR(Options=> SYS.DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/
SQL> SELECT scn,sql_redo FROM SYS.V_$LOGMNR_CONTENTS WHERE data_obj#=18299145 and row_id like 'ABFzkJAAAAAAACG%'; -- calculate rowid with dbms_rowid package as detailed above

- Other analysis can be done in case we have indexes on the table: if one or more indexes are present we can read data from the index and compare the results with what we have in the table. Example of how to read from the index:
SQL> select rowid rowid_table_from_index_leaf, id, sys_op_lbid(18299146,'L',t.rowid) index_leaf_rowid from testlosttable t where rowid like 'ABFzkJAAAAAAACG%';
--note: update 18299146 with the data_object_id of the index of interest 
-- in this example this is the data_object_id of I_TESTLOSTTABLE 

- We can also check if there are inconsistencies (corruption) between the table and its indexes with 'analyze table validate', although in my experience this can be quite time consuming and does not necessarily add more information to the analysis. Example:
SQL> analyze table testlosttable validate structure cascade online;

Actions that we can take to restore the services
Suppose that our analysis has confirmed that a lost write happened and also that we have the details of what 'is lost'. We need now to fix the corrupted block on the primary and restore the service on the standby. In particular if we have an Active Data Guard with a SLA, the latter may be quite an urgent action. Hopefully we also have some ideas of what the root cause was and a way to fix it in the future.

One possible action is to failover to the standby. This action plan however may prove to be unacceptable in many circumstances, given the potential for data loss it implies. A failover would likely not be acceptable if the primary database has continued working and accepting user transactions since the time (SCN) of the incident that has generated ORA-600 (or ORA-752).

Another possibility is to use our knowledge of the lost transactions gathered in the analysis phase to run SQL actions to 'fix the primary'. This has to be evaluated case by case. In some circumstances we can also get away with just dropping and recreating the object with a lost write. In our simple example of a lost insert on table testlosttable, the action to perform on the primary is:

SQL> alter index i_testlosttable rebuild online;
SQL> insert into testlosttable values (30,'cccccccccccccccccccccccccc');
SQL> commit;

What about fixing the standby? We can get the recovery unstuck on the standby by allowing it to corrupt the block with a lost write (and fix it later, as detailed below). Example:

SQL> alter system set db_lost_write_protect=none; --temporarily disable lost write check if needed
SQL> alter database recover automatic standby database allow 1 corruption;
SQL> --wait till the redo that caused the error message has been applied
SQL> alter database recover cancel;
SQL> -- restart normal Data Guard operations. An example for ADG:
SQL> alter system set db_lost_write_protect=typical;
SQL> alter database open read only;
SQL> alter database recover managed standby database nodelay using current logfile disconnect;

At this point all is OK except for one corrupted block on the standby. How to restore the corrupted block on the standby? This depends on the case: we may just rebuild the object on the primary, which will fix the problem on the standby. Another option is to copy the datafile over from the primary to the standby.
In our example the corrupted block is block 134 of file 10 and we have an Active Data Guard in real-time apply. We can use automatic block media recovery (ABMR) to fix it. In my tests ABMR is attempted but does not really work against the corrupted block 134; I can work around this by zeroing out the block. This is an example (intended to be used on test databases):

./asmblk_edit -w -s 134 -a +STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685 -f zeroblock
where the zeroblock file is created with: dd if=/dev/zero bs=8192 count=1 of=zeroblock
If my standby was on a filesystem I could have used:
dd if=/dev/zero of=testlostwrite.dbf bs=8192 count=1 seek=134 conv=notrunc

If we now query the table (testlosttable in our example), when Oracle reaches the zeroed block it will fetch a copy from production (see the Oracle documentation for the details of configuring and using the ABMR feature). This will happen in a transparent way for the user issuing the query; the operation is logged in the alert log of the instance (the 2 lines here below appeared repeated twice in my test using 11.2.0.3):
Automatic block media recovery requested for (file# 10, block# 134)
Automatic block media recovery successful for (file# 10, block# 134)

Oracle's OTN demo video on lost write
Short videos with demos on testing lost writes and automatic block media recovery in a Data Guard environment can be found on OTN.
See also support document "Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration" [ID 1302539.1]

Conclusions and future work
This article illustrates a simple example of how to reproduce and investigate the effects of a lost write in a test environment, which is intended as a training exercise for DBAs.
I would like to end with four lessons learned on the topic of how lost writes can affect a high-availability Oracle deployment:
  • Finding the root causes of a lost write can prove to be very hard. Two possible sources of the root causes of these types of issues are the storage and Oracle bugs. How do we find which one (if either) is to blame for a particular incident? If the issue cannot be reproduced we can easily find ourselves in the middle of a finger-pointing discussion between the vendors.
  • Having a standby (Data Guard) is of great help for discovering lost writes. Without a standby there are high chances that we will not realize we had a lost write in the first place. However, once a lost write is found the redo apply stops: the recovery is stuck and throws error ORA-600 [3020] or ORA-752 if we try to restart it. A support DBA most likely will have to analyze the situation and decide what to do to fix the issue (for example perform a failover or rather going through a process similar to what described in this article). Moreover if one or more Active Data Guards are used for critical read-only activity, there is time pressure to restart the redo apply.
  • How can we check that our primary and standby databases are 'in sync', that is that there are no lost writes waiting to 'explode' as time bombs? I am not aware of a utility that could do such check. This is a very interesting topic, probably material for another post. A brief discussion of this problem and possible solutions can be found at this link and also here.
  • Database restores can be affected by lost writes too. For example the restore of a database backup can fail (get stuck on applying redo) because of a lost write that has happened in production after the latest data file backup. This has potential impacts on the disaster and recovery strategy.

Note: the examples reported in this article have been tested against Oracle 11.2.0.3 64 bit for Linux (RHEL5).

Wednesday, September 26, 2012

Command-Line DBA Scripts

Topic: A review of the command-line monitoring scripts that I currently use for Oracle.

Note: last updated in June 2015. Download links: http://canali.web.cern.ch/canali/resources.htm and https://github.com/lucacanali

Command-line is still very useful for Oracle database administration, troubleshooting and performance monitoring. One of the first things that I do when I want to work with a given Oracle database is to connect as a DBA user with SQL*plus and get information on the active sessions with a script. I will run the script a few times to get an idea of what's happening 'right now' and get a possible starting point for more systematic tuning/troubleshooting if needed.
I liken this approach to examining a few pictures from a remote webcam and trying to find out what's happening out there!.. The method can be powerful, although with some limitations. One of its advantages is that it is easy to implement in Oracle! One of the main sources of information for this type of monitoring is GV$SESSION, which has been available in Oracle for ages. More recently GV$ACTIVE_SESSION_HISTORY has proven very useful and I quite like the additional improvements introduced there in 11gR2.

Examples: One of my favorites is a script I have called top.sql (not a very original name, I know :) which queries gv$session (RAC-aware) and prints out what I currently find the most interesting details (for my environment) on active sessions, fitting in roughly 200 characters. Another script that does a similar job taking data from v$active_session_history is ash.sql. The scripts ash2.sql, ash3.sql and ash4.sql are the RAC versions for 2, 3 and 4 node RACs respectively.


One of the typical pain points when writing CLI monitoring scripts is deciding which information to display and how to pack it in a relatively short line. I have also found that with each major Oracle version I want to refresh my scripts with a few changes.

Scripts details: I have packaged some of the scripts that I have written and find useful in a zip file which is available at this link (unzip locally and run using sqlplus). Note that the scripts are tested on 11gR2; many of them will also work on 10g and 12c.

Monitoring Class (Active Session Monitoring)
Script Name - Description
top.sql - List with selected details of the active sessions, from gv$session (RAC aware)
top_10g.sql - Active sessions details, from gv$session, 10g version (RAC aware)
ash.sql - List with selected details of the active sessions, from v$active_session_history (one instance only)
ash_h.sql - Usage: @ash_h <n#_sec>. Active sessions details for the last n#_sec seconds, from v$active_session_history (current instance only)
ash2.sql, ash3.sql, ash4.sql - Active sessions details, from gv$active_session_history (for RAC 2, 3 and 4 nodes).

SQL Monitoring
Script Name - Description
monitor.sql - Display information from gv$sql_monitor for running statements. See also monitor_details.sql
monitor_plan.sql - Usage: @monitor_plan <key>. A drill-down script from monitor.sql. Prints details of the execution plan from gv$sql_plan_monitor
monitor_report_sid.sql, monitor_report_sid_active.sql - Usage: @monitor_report_sid_active <sid>. Generates an active/html sql monitor report for the given sid. See also @monitor_report_sid <sid> for the simpler html version.
monitor_report_sql_id.sql, monitor_report_sql_id_active.sql - Usage: @monitor_report_sql_id_active <sql_id>. Generates an active/html sql monitor report for the given sql_id. See also @monitor_report_sql_id <sql_id> for the simpler html version.

Drill-down Class
Script Name - Description
ash_sess.sql - Usage: @ash_sess <inst_id> <sid> <n#_sec>. Prints details from gv$active_session_history for a given session (RAC aware)
explain.sql - Usage: @explain <sql_id>. Prints detailed explain plan info for a given sql_id from the library cache (current instance only)
explain_awr.sql - Usage: @explain_awr <sql_id>. Prints detailed explain plan info for a given sql_id from AWR (RAC aware)
obj.sql - Usage: @obj <object_id>. Find details for a given object_id or data_object_id
sessinfo.sql - Usage example: @sessinfo sid=<num>. Prints info on sessions that match the search pattern (RAC aware)
sql.sql - Usage: @sql <sql_id>. Prints sql text for a given sql_id (RAC aware)

Workload Characterization Class
Script Name - Description
sysmetric.sql, sysmetric_details.sql - Display values of selected system metrics (RAC aware)
iometric.sql, iometric_details.sql - IO utilization metrics and details (RAC aware)
ehm.sql, ehm_local.sql, ehm_micro, ehm_micro_local - Usage: @ehm 10 "db file sequential read". Event history metric data, helps finding skew in wait times, for example to study random reads latency skew (RAC aware). ehm_micro -> 12c version, latency drill-down at the microsecond level.
eventmetric.sql - Wait event metric details for top wait events (RAC aware)
filemetric.sql - File metric details for top-usage files (RAC aware)
ash_top.sql - Usage: @ash_top <n#_sec>. Aggregates ASH data over n#_sec seconds (current instance only)
jobs_running.sql - List running jobs details (RAC aware)
load.sql - Displays OS-load values (RAC aware)
longops.sql, longops_10g.sql, longops_details.sql - Info on long running operations from gv$session_longops (RAC aware)
transactions.sql - List of open transactions and crash recovery (RAC aware)

ASH and AWR Reports
Script Name - Description
ash_report.sql - Usage example: @ash_report sysdate-1/24 sysdate. Generates an ASH report for the current instance and displays it in a browser.
awr_report.sql - Usage: @awr_report <snap_begin> <snap_end>. Generates an AWR report for the current instance and displays it in a browser.
awr_global_report.sql - Usage: @awr_global_report <snap_begin> <snap_end>. Generates a global AWR report for the current database and displays it in a browser.
awr_snapshots.sql - Lists AWR snapshots for the last 24 hours.

Miscellaneous
Script Name - Description
kill.sql - Usage: @kill <sid> <serial#>. Kills the given session (current instance only)
kill_select.sql - Usage example: @kill_select "sql_id='xxxx'". Helps in preparing scripts for killing a large number of sessions (current instance only)
locks.sql, locks_10g.sql, locks_details.sql - Scripts to help investigating locks and blocking enqueues issues (RAC aware)
binds.sql - Usage: @binds <sql_id>. Helps in finding bind values for a given sql_id (RAC aware)
login.sql - SQL*plus startup/config script for my environment (windows client version 11.2.0.3)


Conclusions
Command-line scripts are still of great help for monitoring and performance troubleshooting of Oracle databases. Of particular interest are queries to print the details of active sessions, the SQL they execute and their wait event details. This makes for a quick and easy entry point to troubleshooting and to get an idea of what's happening on the DB server. More systematic troubleshooting may be needed beyond that. Command-line scripts are easy to modify and customize; many DBAs indeed have their own tool-set. A selection of my own DBA scripts can be downloaded from this link.

A few links to end this post: First of all the excellent session snapper and the rest of the collection of scripts by Tanel Poder. In addition a mention to a few cool tools that are a sort of bridge between command-line and graphical-interface approaches: MOATS by Tanel Poder and Adrian Billington , SQL Dashboard by Jagjeet Singh and topaas by Marcin Przepiorowski.