While building a central repository to store consolidated audit and log data generated by our databases, we needed to develop several supporting components. This post covers two custom sources for Apache Flume, developed to collect data from database tables and from (alert and listener) log files. Both sources are implemented generically, with no project-specific dependencies, so they can be reused in other projects, and the code is publicly accessible.
I have been wanting to test Apache Kafka for some time now and finally got around to it! In this blog post I give a very short introduction to Kafka, cover the installation and configuration of a Kafka cluster, and finally benchmark a few near-real-world scenarios on OpenStack VMs.
HDFS is part of the core Hadoop ecosystem and serves as the storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture in which the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the namenode, while the data itself resides on other servers called datanodes.
This blog post is about dumping HDFS metadata into an Impala/Hive table for examination and offline analysis using SQL semantics.
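To make the namenode/datanode split concrete, here is a minimal toy sketch in Python. It is not the HDFS API: the `NameNode` class, its methods, and the datanode names are all hypothetical and exist only for illustration. It shows that the namenode holds nothing but metadata (paths, block IDs, replica locations), and that this metadata can be flattened into rows of the kind one might load into a SQL table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NameNode:
    """Toy stand-in for a namenode (hypothetical, not the real HDFS API).

    It stores metadata only: which blocks make up a file and which
    datanodes hold a replica of each block. File contents live elsewhere.
    """
    files: Dict[str, List[int]] = field(default_factory=dict)
    block_locations: Dict[int, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: Dict[int, List[str]]) -> None:
        # blocks maps block ID -> datanodes holding a replica of that block
        self.files[path] = list(blocks)
        self.block_locations.update(blocks)

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        # Answer a metadata query without touching any file data.
        return [(b, self.block_locations[b]) for b in self.files[path]]

    def dump_metadata(self) -> List[Tuple[str, int, str]]:
        # Flatten the namespace into (path, block_id, datanode) rows.
        return [
            (path, block_id, datanode)
            for path, block_ids in sorted(self.files.items())
            for block_id in block_ids
            for datanode in self.block_locations[block_id]
        ]


# Hypothetical file with two blocks, each replicated on two datanodes.
nn = NameNode()
nn.add_file("/data/a.csv", {1: ["dn1", "dn2"], 2: ["dn2", "dn3"]})
rows = nn.dump_metadata()
for row in rows:
    print(row)
```

Each row pairs a file path with one block replica, so questions such as "how many replicas does each datanode hold?" become simple aggregations once the rows are in a table.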
Topic: This post is about performance optimizations introduced in Apache Spark 2.0, in particular whole-stage code generation.
Last week I investigated how the OAuth2 protocol works and developed a Proof of Concept (PoC) in Java. In this post I would like to show you how to effortlessly develop a simple client-server application that uses the OAuth 2.0 standard to authorize access to protected resources on a server.
Before we start developing our first secured web application with OAuth2, let's understand how it works.
What is it and how does it work?
Alluxio refers to itself as an "Open Source Memory Speed Virtual Distributed Storage" platform. It sits between the storage and processing framework layers in the distributed computing ecosystem and claims to heavily improve performance when multiple jobs read from or write to the same data. This post covers some of the basic features of Alluxio and compares its data access performance against caching in Spark.
Topic: In this post, you will find an example of how to build and deploy a basic artificial neural network scoring engine using PL/SQL.
At CERN we run multiple Hadoop clusters to satisfy demanding requirements from our experiments and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users look to Hadoop to process and archive the vast amounts of data coming out of the LHC.
Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.