Apache Spark is renowned for its speed and efficiency in handling large-scale data processing. However, optimizing Spark to achieve maximum performance requires a precise understanding of its inner workings. This blog post will guide you through establishing a Spark Performance Lab with essential tools and techniques aimed at enhancing Spark performance through detailed metrics analysis.
Tools
Apache Spark 3.0 Memory Monitoring Improvements
TL;DR: Apache Spark 3.0 comes with many improvements, including new features for memory monitoring.
Distributed Deep Learning for Physics with TensorFlow and Kubernetes
Summary: This post details a solution for distributed deep learning training for a High Energy Physics use case, deployed using cloud resources and Kubernetes. You will find the results for training using CPU and GPU nodes. This post also describes an experimental tool that we developed, TF-Spawner, and how we used it to run distributed TensorFlow on a Kubernetes cluster.
Performance Analysis of a CPU-Intensive Workload in Apache Spark
Topic: This post is about techniques and tools for measuring and understanding CPU-bound and memory-bound workloads in Apache Spark. You will find examples applied to studying a simple workload consisting of reading Apache Parquet files into a Spark DataFrame.
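As a rough illustration of the kind of measurement that post refers to, the sketch below compares CPU time against elapsed (wall-clock) time for a single run of a function. It is not taken from the post and does not use Spark: `measure`, `busy_sum`, and `idle_wait` are hypothetical helper names, and the toy workloads stand in for a real job. A CPU fraction close to 1 suggests the work is CPU-bound on one core, while a low fraction suggests the time is spent waiting (I/O, sleep, locks).

```python
import time


def measure(workload, *args):
    """Run workload once and return (elapsed_s, cpu_s, cpu_fraction).

    elapsed_s is wall-clock time, cpu_s is CPU time consumed by this
    process, and cpu_fraction is their ratio.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload(*args)
    cpu_s = time.process_time() - cpu_start
    elapsed_s = time.perf_counter() - wall_start
    cpu_fraction = cpu_s / elapsed_s if elapsed_s > 0 else 0.0
    return elapsed_s, cpu_s, cpu_fraction


def busy_sum(n):
    """Toy CPU-bound workload (hypothetical): sum of squares in a loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def idle_wait(seconds):
    """Toy wait-bound workload (hypothetical): sleeping uses almost no CPU."""
    time.sleep(seconds)


if __name__ == "__main__":
    for name, fn, arg in [("busy_sum", busy_sum, 2_000_000),
                          ("idle_wait", idle_wait, 0.2)]:
        elapsed_s, cpu_s, fraction = measure(fn, arg)
        print(f"{name}: elapsed={elapsed_s:.3f}s cpu={cpu_s:.3f}s "
              f"cpu_fraction={fraction:.2f}")
```

The same comparison carries over, in spirit, to executor-level metrics: a stage whose total CPU time is close to its total task run time is likely CPU-bound.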
IPython/Jupyter Notebooks for Oracle
Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.
Linux BPF/bcc for Oracle Tracing
Topic: In this post you will find a short discussion and pointers to the code of a few sample scripts that I have written using Linux BPF/bcc and uprobes for
IPython Notebooks for Querying Apache Impala
Topic: In this post you can find examples of how to get started with IPython/Jupyter notebooks for querying Apache Impala.
Extended Stack Profiling - Ideas, Tools and Comments
Topic: This post provides a short summary and pointers to previous work on Extended Stack Profiling for troubleshooting and performance investigations.
Disclaimer
The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.