Enhancing Apache Spark and Parquet Efficiency: A Deep Dive into Column Indexes and Bloom Filters
In the ever-evolving landscape of big data, Apache Spark and Apache Parquet continue to introduce game-changing features.
In the ever-evolving landscape of big data, Apache Spark and Apache Parquet continue to introduce game-changing features.
TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.
Dive into a comprehensive load-testing exploration using Apache Spark with CPU-intensive workloads.
We are in a golden age for distributed data processing, with an abundance of tools and solutions emerging from industry and open source. High Energy Physics (HEP) experiments at the LHC stand to profit from all this progress, as they are data-intensive operations with several hundreds of Petabytes of data to collect and process.
TLDR; Apache Spark 3.0 comes with many improvements, including new features for memory monitoring.
Topic: This post dives into the steps for deploying a performance dashboard for Apache Spark, using Spark
Topic: This post is about techniques and tools for measuring and understanding CPU-bound and memory-bound workloads in Apache Spark. You will find examples applied to studying a simple workload consisting of reading Apache Parquet files into a Spark DataFrame.
The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.
CERN update, Quantum Diaries, Careers at CERN
Christian Antognini, Karl Arao, Martin Bach, Mark Bobak, Wolfgang Breitling, Doug Burns, Kevin Closson, Cloudera blog, Wim Coekaerts, Bertrand Drouvot, Enkitec blog, Pete Finnigan, Richard Foote, Randolf Geist, Marco Gralike, Brendan Gregg, Kyle Hailey, Tim Hall, Uwe Hesse, Frits Hoogland, Hortonworks blog, Integrity Oracle Security, Tom Kyte, Adam Leventhal, Jonathan Lewis, Cary Millsap, James Morle, Karen Morton, Arup Nanda, Mogens Nørgaard, Oracle The Data Warehouse insider, Oracle Enterprise Manager, Oracle Linux blog, Oracle Multitenant, Oracle Optimizer blog, Oracle R technologies, Oracle Upgrade blog, Oracle Virtualization blog, Kerry Osborne, Tanel Poder, Planet PostgreSQL, Kellyn Pot'Vin, Pythian blog, Greg Rahn, Mark Rittman, Riyaj Shamsudeen, Chen Shapira, Carlos Sierra, Szymon Skorupinski