Performance

Unlocking Apache Spark Performance: Three Open-Source Tools We Use at CERN

Apache Spark is incredibly powerful, but anyone who has worked with it long enough knows the feeling:

Why is this job suddenly slower today?
Why are executors running out of memory?
Why is one stage taking 90% of the runtime?
What exactly is Spark doing behind the scenes?

Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

Apache Spark is renowned for its speed and efficiency in handling large-scale data processing. However, optimizing Spark to achieve maximum performance requires a precise understanding of its inner workings. This blog post will guide you through establishing a Spark Performance Lab with essential tools and techniques aimed at enhancing Spark performance through detailed metrics analysis.

Enhancing Apache Spark and Parquet Efficiency: A Deep Dive into Column Indexes and Bloom Filters

In the ever-evolving landscape of big data, Apache Spark and Apache Parquet continue to introduce game-changing features.

Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.

Performance Comparison of 5 JDKs on Apache Spark

Dive into a comprehensive load-testing exploration using Apache Spark with CPU-intensive workloads.

Apache Spark 3.0 Memory Monitoring Improvements

TLDR; Apache Spark 3.0 comes with many improvements, including new features for memory monitoring.

SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads

SparkMeasure

Performance Analysis of a CPU-Intensive Workload in Apache Spark

Topic: This post is about techniques and tools for measuring and understanding CPU-bound and memory-bound workloads in Apache Spark. You will find examples applied to studying a simple workload consisting of reading Apache Parquet files into a Spark DataFrame.

Diving into Spark and Parquet Workloads, by Example

Topic: In this post you can find a few simple examples illustrating im

Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs

Topic: This post is about performance optimizations introduced in Apache Spark 2.0, in particular whole-stage cod

Subscribe to Performance

Disclaimer

The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.

CERN Social Media Guidelines

Blogroll

CERN update, Quantum Diaries, Careers at CERN

Christian Antognini, Karl Arao, Martin Bach, Mark Bobak, Wolfgang Breitling, Doug Burns, Kevin Closson, Cloudera blog, Wim Coekaerts, Bertrand Drouvot, Enkitec blog, Pete Finnigan, Richard Foote, Randolf Geist, Marco Gralike, Brendan Gregg, Kyle Hailey, Tim Hall, Uwe Hesse, Frits Hoogland, Hortonworks blog, Integrity Oracle Security, Tom Kyte, Adam Leventhal, Jonathan Lewis, Cary Millsap, James Morle, Karen Morton, Arup Nanda, Mogens Nørgaard, Oracle The Data Warehouse insider, Oracle Enterprise Manager, Oracle Linux blog, Oracle Multitenant, Oracle Optimizer blog, Oracle R technologies, Oracle Upgrade blog, Oracle Virtualization blog, Kerry Osborne, Tanel Poder, Planet PostgreSQL, Kellyn Pot'Vin, Pythian blog, Greg Rahn, Mark Rittman, Riyaj Shamsudeen, Chen Shapira, Carlos Sierra, Szymon Skorupinski