TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.
The Puzzle of the Slow Query
Set within the framework of data analysis for the ATLAS experiment's Data Control System, our exploration uses data stored in the Parquet format and deploys Apache Spark for queries. The setup: Jupyter notebooks operating on the SWAN service at CERN interfacing with the Hadoop and Spark service.
The Hiccup: A notably slow query during data analysis, in which two tables are joined. Running on 32 cores, this query takes 27 minutes, which is surprisingly long given the amount of data involved.
The tables involved:
- EVENTHISTORY: A log of events for specific sub-detectors. Each row contains a timestamp, the subsystem id, and a value.
- LUMINOSITY: A table containing the details of time intervals called "luminosity blocks" (see "Luminosity block" on Particle Wiki).
Cracking the Performance Case
WebUI: The first point of entry for troubleshooting was the Spark WebUI. There we found the execution time of the query (27 minutes), along with details of the execution plan and SQL metrics under the "SQL / DataFrame" tab. Figure 1 shows a relevant snippet, where we can clearly see that a broadcast nested loop join was used for this query.
Execution Plan: The execution plan is the one we wanted for this query: the small LUMINOSITY table is broadcast to all the executors and then joined with each partition of the larger EVENTHISTORY table.
Figure 1: A relevant snippet of the execution graph from the Spark WebUI. The slow query discussed in this post runs using a broadcast nested loop join. This means that the small table is broadcast to all the nodes and then joined to each partition of the larger table.
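To make the cost of this join strategy concrete, here is a minimal, pure-Python sketch of what a broadcast nested loop join does for one partition of the large table. It is an illustration, not Spark's implementation. The row layouts and the range condition (an event timestamp falling inside a luminosity block's interval) are assumptions made for the example; the post does not spell out the actual join condition.

```python
from typing import Iterable, List, Tuple

# Hypothetical row shapes, for illustration only:
#   Event     = (timestamp, subsystem_id, value)
#   LumiBlock = (block_id, start_time, end_time)
Event = Tuple[int, int, float]
LumiBlock = Tuple[int, int, int]


def broadcast_nested_loop_join(
    event_partition: Iterable[Event],
    broadcast_blocks: List[LumiBlock],
) -> List[Tuple[Event, LumiBlock]]:
    """Join one partition of the large table against the whole small table."""
    joined = []
    for event in event_partition:        # every row of the large table's partition
        ts = event[0]
        for block in broadcast_blocks:   # every row of the broadcast (small) table
            _, start, end = block
            if start <= ts < end:        # assumed range-join condition
                joined.append((event, block))
    return joined


# Tiny made-up data set: 4 events, 3 luminosity blocks.
blocks = [(1, 0, 100), (2, 100, 200), (3, 200, 300)]
events = [(50, 7, 1.5), (150, 7, 2.5), (250, 8, 3.5), (999, 8, 4.5)]

result = broadcast_nested_loop_join(events, blocks)
comparisons = len(events) * len(blocks)
print(len(result), "matches from", comparisons, "comparisons")
```

Every event is compared with every block, so the number of comparisons is the product of the two row counts. With a large EVENTHISTORY table this grows quickly, which is consistent with a query that ends up CPU-bound.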
CPU utilization measured with Spark Dashboard
Spark Dashboard instrumentation provides a way to collect and visualize Spark execution metrics. This makes it easy to plot the CPU used during the SQL execution. From there we could see that the workload was CPU-bound.
The Clue: Profiling with Flame Graphs and Pyroscope
Stack profiling and Flame Graph visualization are powerful techniques for investigating CPU-bound workloads. We use them here to find where the CPU cycles are consumed, and thus what makes the query slow.
First, a short recap of stack profiling with flame graph visualization, and of the tools we can use to apply it to Apache Spark workloads:
- Flame Graphs provide information on the "hot methods" consuming CPU
- Flame Graphs and profiling can also be used to analyze time spent waiting (off-CPU) and memory allocation
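As a rough illustration of the data behind a flame graph, the sketch below aggregates sampled call stacks into "folded" form: each distinct stack becomes a semicolon-separated string with a count of how many samples hit it. The method names and samples are made up for the example, and the helper functions are hypothetical, not part of any profiler's API.

```python
from collections import Counter
from typing import Dict, List


def fold_stacks(samples: List[List[str]]) -> Dict[str, int]:
    """Count identical stacks; each key lists frames root-first, joined by ';'."""
    return dict(Counter(";".join(stack) for stack in samples))


def self_time_by_method(samples: List[List[str]]) -> Counter:
    """Count how often each method is the leaf frame, i.e. actually on CPU."""
    return Counter(stack[-1] for stack in samples if stack)


# Hypothetical stack samples, one per sampling tick (root frame first).
samples = [
    ["run", "processRow", "compare"],
    ["run", "processRow", "compare"],
    ["run", "processRow", "decode"],
    ["run", "writeOutput"],
]

folded = fold_stacks(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)

hot_method, hot_count = self_time_by_method(samples).most_common(1)[0]
print("hottest leaf method:", hot_method, hot_count)
```

In a flame graph, the width of each box is proportional to these sample counts, so the "hot methods" are the widest frames at the top of the stacks.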
Grafana Pyroscope simplifies data collection and visualization, using agents and a custom WebUI. Key motivations for using it with Spark are:
- Streamlined Data Collection & Visualization: Pyroscope offers a simplified approach to data gathering and visualization through its custom WebUI and agent integration.
- Java Integration: The Pyroscope java agent is tailored to work seamlessly with Spark. This integration shines especially when Spark is running on various clusters such as YARN, K8S, or standalone Spark clusters.
- Correlation with Grafana: Grafana’s integration with Pyroscope lets you view profiling data side by side with other metrics, including the Spark metrics dashboard.
- Proven Underlying Technology: For Java and Python, the tech essentials for collecting stack profiling data, async-profiler and py-spy, are time-tested and reliable.
- Functional & Detailed WebUI: Pyroscope’s WebUI stands out with features that allow users to:
  - Select specific time periods of the collected data
  - Store and display data across multiple measurements
  - Compare measurements and view the differences between them
  - View collected data for all Spark executors, or focus on individual executors or machines
- Lightweight Data Acquisition: The Pyroscope java agent is efficient in data gathering. By default, stacks are sampled every 10 milliseconds and uploaded every 10 seconds. We did not observe any measurable performance or stability impact of the instrumentation.
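For a sense of scale, the default intervals quoted above translate into sample counts as in the following back-of-the-envelope sketch. It assumes one sample per interval for each thread that is continuously on CPU, and it assumes all 32 cores are busy for the whole 27-minute query, so the final figure is a rough upper bound rather than a measurement.

```python
# Values taken from the text above.
SAMPLING_INTERVAL_MS = 10   # stacks sampled every 10 milliseconds
UPLOAD_INTERVAL_S = 10      # profiles uploaded every 10 seconds

# Samples produced by one thread that is continuously on CPU (assumption).
samples_per_second = 1000 // SAMPLING_INTERVAL_MS
samples_per_upload = samples_per_second * UPLOAD_INTERVAL_S

# Assumption: 32 busy threads (one per core) for the full 27-minute query.
busy_threads = 32
query_seconds = 27 * 60
max_samples = samples_per_second * busy_threads * query_seconds

print(samples_per_second, samples_per_upload, max_samples)
```

Under these assumptions, each busy thread yields 100 samples per second and 1,000 samples per upload, and the whole query yields at most about 5.2 million stack samples.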
Spark Configuration
- Download Pyroscope from https://github.com/grafana/pyroscope/releases
- CLI start:
./pyroscope -server.http-listen-port 5040
- Or use docker:
docker run -it -p 5040:4040 grafana/pyroscope
- Note: customize the port number as needed. I used port 5040 to avoid a clash with the Spark WebUI, since both Pyroscope and the Spark WebUI default to port 4040.
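With the Pyroscope server running, the Spark executors need to load the Pyroscope java agent and know where to send data. The sketch below shows one possible way to assemble such settings, and it rests on several assumptions: the helper names, the server URL, the application name, and the jar path are hypothetical. The environment variable names PYROSCOPE_APPLICATION_NAME and PYROSCOPE_SERVER_ADDRESS are, to my knowledge, the ones the Pyroscope java agent reads, but check them against the agent version you deploy. How the agent jar reaches the executor hosts depends on your cluster and is not covered here.

```python
from typing import Dict, List


def pyroscope_spark_conf(server_url: str, app_name: str, agent_jar_path: str) -> Dict[str, str]:
    """Build Spark properties that attach a java agent to each executor JVM.

    agent_jar_path must be valid on the executor hosts (assumption: the jar
    is already available there when the executor JVM starts).
    """
    return {
        # Standard Spark property for extra executor JVM flags.
        "spark.executor.extraJavaOptions": f"-javaagent:{agent_jar_path}",
        # spark.executorEnv.<NAME> sets environment variables on executors.
        "spark.executorEnv.PYROSCOPE_APPLICATION_NAME": app_name,
        "spark.executorEnv.PYROSCOPE_SERVER_ADDRESS": server_url,
    }


def to_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Render the properties as --conf arguments for spark-submit or spark-shell."""
    flags: List[str] = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


# Hypothetical values; port 5040 matches the server start-up example above.
conf = pyroscope_spark_conf(
    server_url="http://pyroscope.example.org:5040",
    app_name="my_spark_query",
    agent_jar_path="/opt/pyroscope/pyroscope.jar",
)
flags = to_submit_flags(conf)
print(" ".join(flags))
```

The same dictionary could also be passed entry by entry to `SparkSession.builder.config(key, value)` when the session is created from a notebook.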
The Insight
Related work and links
- Spark Dashboard - tooling and configuration for deploying an Apache Spark Performance Dashboard using container technology.
- Spark Measure - a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
- Spark Plugins - Code and examples of how to write and deploy Apache Spark Plugins.
- Spark Notes and Performance Testing notes