In the following blog posts we study Distributed Deep Learning, or more precisely, how to parallelize gradient descent using data-parallel methods. We start by laying out the theory, while giving you some intuition for the techniques we applied. At the end of this blog post, we run experiments to evaluate how different optimization schemes perform in identical situations.
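As a rough illustration of what data-parallel gradient descent means, the sketch below splits a toy dataset across several simulated workers, lets each compute a gradient on its own shard, and averages the results before updating a shared weight. This is a minimal synchronous sketch under our own assumptions (a one-parameter linear model, equal-sized shards, and hypothetical names such as `data_parallel_gd`); it is not the code or the optimization schemes evaluated in the post.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pairs for the toy model y = w * x


def shard(data: List[Sample], n_workers: int) -> List[List[Sample]]:
    """Split the dataset round-robin; assumes len(data) >= n_workers."""
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w: float, samples: List[Sample]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gd(data: List[Sample], n_workers: int = 4,
                     lr: float = 0.01, steps: int = 200) -> float:
    shards = shard(data, n_workers)
    w = 0.0
    for _ in range(steps):
        # In a real cluster each gradient would be computed on a separate machine.
        grads = [local_gradient(w, s) for s in shards]
        # Synchronous update: average the workers' gradients, then take one step.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3 * x, so the learned weight should approach 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = data_parallel_gd(data)
print(round(w, 4))
```

Because the shards here have equal size, averaging the per-worker gradients gives the same update as computing the gradient over the whole dataset; asynchronous schemes relax exactly this synchronization step.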
Databases at CERN blog
In this entry I would like to share my experiences using Oracle Java Cloud Service, especially securing the application environment. I will show you some issues that I encountered during the standard process of setting up the environment. I will also explain some basic concepts that are fundamental to working with cloud services.
Topic: this post is about a simple implementation with examples of IPython custom magic functions for running SQL in Apache Spark using PyS
In this blog entry we introduce evolutionary algorithms and an integration between an evolutionary computation tool, ECJ, and Apache Hadoop. This research aims at speeding up the evaluation of solutions by distributing the workload among a cluster of machines. Finally, we put this integration into practice by showing how it has been used to improve a face recognition algorithm.
Hypothesis is an implementation of property-based testing for Python, similar to QuickCheck in Haskell/Erlang and test.check in Clojure (among others). Basically, it allows programmers to formulate invariants about their programs and have an automated system attempt to generate counterexamples that invalidate them.
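To make the idea of searching for counterexamples concrete, here is a hand-rolled sketch that uses only the Python standard library. It is deliberately not Hypothesis's API: the names `random_int_list` and `find_counterexample` are hypothetical, and the real library does considerably more, such as shrinking a failing input to a simpler one.

```python
import random
from typing import Callable, List, Optional


def random_int_list(rng: random.Random, max_len: int = 20) -> List[int]:
    """Generate a random list of small integers (possibly empty)."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]


def find_counterexample(prop: Callable[[List[int]], bool],
                        trials: int = 500,
                        seed: int = 0) -> Optional[List[int]]:
    """Return an input for which the property fails, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        candidate = random_int_list(rng)
        if not prop(candidate):
            return candidate
    return None


def sorting_is_idempotent(xs: List[int]) -> bool:
    # A true invariant: sorting an already sorted list changes nothing.
    return sorted(sorted(xs)) == sorted(xs)


def reversing_is_identity(xs: List[int]) -> bool:
    # A false "invariant": only palindromic lists equal their own reverse.
    return list(reversed(xs)) == xs


ok = find_counterexample(sorting_is_idempotent)
bad = find_counterexample(reversing_is_identity)
print("idempotent sort counterexample:", ok)
print("reverse identity counterexample:", bad)
```

The true invariant survives every trial, while the false one is refuted by a randomly generated list. Finding no counterexample is evidence for a property, not proof of it.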
On our way to building a central repository that stores consolidated audit and log data generated by the databases, we needed to develop several components to help us achieve this goal. In this post, we describe two custom sources for Apache Flume that have been developed to collect data from database tables and from (alert & listener) log files. Both sources are implemented in a generic way, without any project dependency, so they can be used in any other project, and the code is publicly accessible.
I have been wanting to test Apache Kafka for some time now and finally got around to it! In this blog post I give a very short introduction to what Kafka is, cover the installation and configuration of a Kafka cluster, and finally benchmark a few near real-world scenarios on OpenStack VMs.
HDFS is part of the core Hadoop ecosystem and serves as a storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture where the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the namenode, while the data itself resides on other servers called datanodes.
This blog post is about dumping HDFS metadata into an Impala/Hive table for examination and offline analysis using SQL semantics.
Topic: This post is about performance optimizations introduced in Apache Spark 2.0, in particular whole-stage code gen