At CERN we run multiple Hadoop clusters to satisfy demanding requirements from our experiments and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users are looking at Hadoop to process and archive the vast amounts of data coming out of LHC.
Databases at CERN blog
Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.
Topic: In this post you will find a short discussion and pointers to the code of a few sample scripts that I have written using Linux BPF/bcc and uprobes for Orac
In the part 2 of 'Integrating Hadoop and Elasticsearch' blogpost series we look at bridging Apache Spark and Elasticsearch. I assume that you have access to Hadoop and Elasticsearch clusters and you are faced with the challenge of bridging these two distributed systems. As spark code can be written in scala, python and java, we look at the setup, configuration and code snippets across all these three languages both in batch and interactively.
Topic: in this post you can find examples of how to get started with using IPython/Jupyter notebooks for querying Apache Impala.
Topic: This post is about profiling and performance tuning of distributed workloads and in particular Hadoop applications. You will learn of a profiler application we have developed and how it has successfully been applied to tuning Sqoop to improve the throughput of data transfer from Oracle to Hadoop.
Distributed systems always bring new challenges for administrators and users. This is the case with HDFS, the default distributed file system that Hadoop uses for storing data.
In order to face these challenges, tools to facilitate administration and usage of these systems are developed. At CERN, a Hadoop service is provided and we have developed and deployed on our clusters some tools, today we present one of these tools.
The Problem: database restore fails with ORA-19571: datafile copy RECID xxx STAMP yyy not found in control file
Our typical setup of Oracle databases consists of a primary RAC cluster along with a standby database, also in RAC configuration. We are taking RMAN database backups from standby, while archivelog are backed up from primary database. Typically we are backing up everything to DISK (NAS), and further transferring some backups to TAPE. We are also running regular automated recoveries to test our backups.
As more and more organisations are deploying Hadoop and Elasticsearch in tandem to satisfy batch analytics, real-time analytics and monitoring requirements, the need for tigher integration between Hadoop and Elasticsearch has never been more important. In this series of blogposts we look at how these two distributed systems can be tightly integrated and how each of them can exploit the feaures of the other system to achieve ever demanding analytics and monitoring needs.