External Table: IPython

Monday, June 13, 2016

IPython/Jupyter Notebooks for Oracle

Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.

IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.
See also this blog post with examples on how to use Jupyter for querying Apache Impala.

It is of interest to integrate many data sources into Jupyter notebooks to make the platform versatile and to fulfill many different use cases. In this short post you can find examples of how to query data from Oracle using Jupyter notebooks and simple integration with pandas and matplotlib.

Notebook	Short description
Oracle_IPython_sqlplus	Examples of how to use sqlplus inside Jupyter notebooks. It is based on the use of %%bash cell magic and here documents to wrap up sqlplus inside Jupyter cells.
Oracle_IPython_cx_Oracle_pandas	Examples of how to query Oracle from Python using cx_Oracle and how to integrate with pandas and visualization with matplotlib.
Oracle_IPython_SQL_magic	Examples of how to query Oracle using %sql line magic (or %%sql cell magic) and of the integration with cx_Oracle and pandas.

Dependencies and pointers to build a test environment:

Install IPython and Jupyter. The following assumes Anaconda from Continuum Analytics)
Install the Oracle client
- Download the software from OTN: http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html
- On that same link you can find the installation instructions (scroll down by the end of the page)
- Example of the actions: perform ln -s libclntsh.so.12.1 libclntsh.so (12c version) and export LD_LIBRARY_PATH={oracle client home}
- I have tested this with Oracle client versions 12.1.0.2 and 11.2.0.4 on Linux
Post client installation:
- set environment: export ORACLE_HOME={path to the Oracile client installation}
- If not already installed, install libaio (yum install libaio)
Check that the Oracle client works and all dependencies are set by running sqlplus from the Oracle client home, example:
- check client connectivity with: sqlplus username/password@dbserver:port/service_name
Install cx_Oracle, for example with pip install cx_Oracle
Install ipython-sql
- download from https://github.com/LucaCanali/ipython-sql
- cd ipython-sql and (sudo) python setup.py install
- Optionally install the official version on PyPi using pip install ipython-sql

Reference: a web page by Julian Dyke with examples on how to use cx_Oracle

Wednesday, April 27, 2016

IPython/Jupyter Notebooks for Querying Apache Impala

Topic: in this post you can find examples of how to get started with using IPython/Jupyter notebooks for querying Apache Impala.

Apache Impala is an open source massively parallel processing (MPP) SQL Query Engine for Apache Hadoop. This post explores the use of IPython for querying Impala and generates from the notes of a few tests I ran recently on our systems. For completeness please that that several additional options exist to query Impala, some of the most popular are: the command line impala-shell and the web-based interface Hue.
IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.

Installation notes for a test environment:

To test the techniques described here you will need a target Hadoop system with a running instance of Impala.

If you are new to this, you can start with the Cloudera Quickstart VM

You will need an installation of IPython, for example on your laptop.

I have used Anaconda from Continuum Analitycs. Additional details from my test environment: Anaconda version 4.0.0, Linux version, python 2.7 for x86_64

Cloudera impyla, Python client for Impala and Hive

Install with pip install impyla or download from https://github.com/cloudera/impyla

ipython-sql, this is a key ingredient of this post. It implements "SQL magic" for IPython by Catherine Devlin

Install with pip install git+https://github.com/LucaCanali/ipython-sql.git or download from https://github.com/LucaCanali/ipython-sql
as I write this you will need the version I forked and to which I added a couple of changes to integrate it with Impala/impyla and Kerberos authentication

Example notebooks:

The example notebooks described in this post are available on Github.

The first example is about using the impyla client to query Apache Impala from IPython notebooks and basically comes straight out of impyla documentation. The next example adds "SQL magic" for an improved user interface. To run this you need to install ipython-sql by Catherine Devlin. As I write this you will need the version I forked and extended with minor changes to integrate it with Impala/impyla and with Kerberos authentication.
In practice ipython-sql adds the "IPython Magic functions" %%SQL and %SQL which can be used to run SQL directly in the notebooks.

Example 1: Impala_Basic.ipynb

Example 2: Impala_SQL_Magic.ipynb

Kerberos, if you need it:

Kerberos authentication is used in our production environments. This requires additional steps in the configuration and some modifications to the scripts. For a start, support for Kerberos requires installing additional components. In my test environment I had to run:

# yum install cyrus-sasl-devel
# yum install gcc-c++
$ pip install thrift_sasl

You also need to have a valid Kerberos ticket (use kinit to get the ticket if your Kerberos client is configured). Here are the example notebooks described in the previous paragraph, now extended with the steps needed for integrating with Kerberos authentication:

Example 3: Impala_Basic_Kerberos.ipynb

Example 4: Impala_SQL_Magic_Kerberos.ipynb

Conclusions

IPython/Jupyter notebooks can be used to build an interactive environment for data analysis with SQL on Apache Impala. This combines the advantages of using IPython, a well established platform for data analysis, with the ease of use of SQL and the performance of Apache Impala.
This post provides examples of how to integrate Impala and IPython using two python packages: the impyla client and ipython-sql. An additional contribution of this work is a minor change to the ipython-sql package to make it compatible with Impala and with Kerberos authentication. The examples described in this post are available on Github.

Acknowledgements and references

Catherine Devlin for ipython-sql.
Cloudera: the Apache Impala project, Wes McKinney for Pandas and Impyla.

The Jupyter project.
Many thanks to the CERN Hadoop service team.

Links to related technology: Hue (web-based SQL interface) and the Ibis project.