This is a short post introducing a notebook that you can use to play with a simple analysis of High Energy Physics (HEP) data using CERN open data and Apache Spark. The idea for this work started as a concept for a technology demonstrator of some recent developments in using Spark for data analysis in the context of HEP.
Topic: In this post you can find a few simple examples illustrating importa
Topic: This post is about measuring Apache Spark workload metrics for performance investigations.
This post reports performance tests for a few popular data formats and storage engines available in the Hadoop ecosystem: Apache Avro, Apache Parquet, Apache HBase and Apache Kudu. This exercise evaluates space efficiency, ingestion performance, analytic scans and random data lookup for a workload of interest at the CERN Hadoop service.
In the following blog posts we study the topic of Distributed Deep Learning, or rather, how to parallelize gradient descent using data parallel methods. We start by laying out the theory, while supplying you with some intuition into the techniques we applied. At the end of this blog post, we conduct some experiments to evaluate how different optimization schemes perform in identical situations.
In this entry I would like to share my experiences using Oracle Java Cloud Service, especially securing the application environment. I will show you some issues that I encountered during the standard process of setting up the environment. I will also explain some basic concepts that are fundamental to working with cloud services.
Topic: This post is about a simple implementation with examples of IPython custom magic functions for running SQL in Apache Spark using PyS