Topic: in this post you can find examples of how to get started with using IPython/Jupyter notebooks for querying Apache Impala.
Apache Impala is an open source massively parallel processing (MPP) SQL Query Engine for Apache Hadoop. This post explores the use of IPython for querying Impala and generates from the notes of a few tests I ran recently on our systems. For completeness please that that several additional options exist to query Impala, some of the most popular are: the command line impala-shell and the web-based interface Hue.
IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.
Installation notes for a test environment:
- To test the techniques described here you will need a target Hadoop system with a running instance of Impala.
- If you are new to this, see the Cloudera Quickstart VM
- You will need an installation of IPython, for example on your laptop.
- I have used Anaconda from Continuum Analitycs. Additional details of my test environment: Anaconda version 4.0.0, Linux version, python 2.7 for x86_64
- Cloudera impyla, Python client for Impala and Hive
- Install with pip install impyla or download from https://github.com/cloudera/impyla
- ipython-sql, this is a key ingredient of this post. It implements "SQL magic" for IPython by Catherine Devlin
- Install with pip install git+https://github.com/LucaCanali/ipython-sql.git or download from https://github.com/LucaCanali/ipython-sql
- as I write this you will need the version I forked and to which I added a couple of changes to integrate it with Impala/impyla and Kerberos authentication
The example notebooks described in this post are available on Github.
The first example is about using the impyla client to query Apache Impala from IPython notebooks and basically comes straight out of impyla documentation. The next example adds "SQL magic" for an improved user interface. To run this you need to install ipython-sql by Catherine Devlin. As I write this you will need the version I forked and extended with minor changes to integrate it with Impala/impyla and with Kerberos authentication.
In practice ipython-sql adds the "IPython Magic functions" %%SQL and %SQL which can be used to run SQL directly in the notebooks.
Example 1: Impala_Basic.ipynb
Example 2: Impala_SQL_Magic.ipynb
Kerberos, if you need it:
Kerberos authentication is used in our production environments. This requires additional steps in the configuration and some modifications to the scripts. For a start, support for Kerberos requires installing additional components. In my test environment I had to run:
# yum install cyrus-sasl-devel
# yum install gcc-c++
$ pip install thrift_sasl
You also need to have a valid Kerberos ticket (use kinit to get the ticket if your Kerberos client is configured). Here are the example notebooks described in the previous paragraph, now extended with the steps needed for integrating with Kerberos authentication:
Example 3: Impala_Basic_Kerberos.ipynb
Example 4: Impala_SQL_Magic_Kerberos.ipynb
IPython/Jupyter notebooks can be used to build an interactive environment for data analysis with SQL on Apache Impala. This combines the advantages of using IPython, a well established platform for data analysis, with the ease of use of SQL and the performance of Apache Impala.
This post provides examples of how to integrate Impala and IPython using two python packages: the impyla client and ipython-sql. An additional contribution of this work is a minor change to the ipython-sql package to make it compatible with Impala and with Kerberos authentication. The examples described in this post are available on Github.
Acknowledgements and references
The Jupyter project.
Many thanks to the CERN Hadoop service team.