Databases at CERN blog - Powering particle physics

ATLAS DCS Analysis with Apache Spark and Jupyter Notebooks

canali — Mon, 17 Mar 2025 15:48:25 +0000

The ATLAS Detector Control System (DCS) at CERN is essential for ensuring optimal detector performance. Each year, the system generates tens of billions of time-stamped sensor readings, presenting considerable challenges for large-scale data analysis. Although these data are stored in Oracle databases that excel in real-time transactional processing, the configuration—optimized with limited CPU resources to manage licensing costs—makes them less suited for extensive historical time-series analysis.

To overcome these challenges, a modern data pipeline has been developed that leverages Apache Spark, CERN’s Hadoop service, and the Service for Web-based Analysis (SWAN) platform. This scalable, high-performance framework enables researchers to efficiently process and analyze DCS data over extended periods, unlocking valuable insights into detector operations. By integrating advanced big data technologies, the new system enhances performance monitoring, aids in troubleshooting Data Acquisition (DAQ) link failures, and supports predictive maintenance, thereby ensuring the continued reliability of the ATLAS detector systems.

Note: this blog post is a reduced version of the article Advancing ATLAS DCS Data Analysis with a Modern Data Platform by Luca Canali, Andrea Formica and Michelle Solis.

The Data Pipeline: From Storage to Analysis

Figure 1: Overview of the Big Data architecture for Detector Control System (DCS) data analysis. The system integrates data from Oracle databases (including DCS, luminosity, and run information) and file-based metadata and mappings into the Hadoop ecosystem using Parquet files. Apache Spark serves as the core processing engine, enabling scalable analysis within an interactive environment powered by Jupyter notebooks on CERN SWAN. Reproduced with permission from Advancing ATLAS DCS Data Analysis with a Modern Data Platform.

Data Storage in Oracle Databases

The ATLAS Detector Control System (DCS) data is primarily stored in Oracle databases using a commercial product, the WinCC OA system, optimized for real-time monitoring and transactional operations. Each detector’s data is managed within dedicated database schemas, ensuring structured organization and efficient access.

At the core of this storage model is the EVENTHISTORY table, a high-volume repository that records sensor IDs, timestamps, and measurement values across thousands of monitoring channels. This table grows rapidly, exceeding one billion rows annually, requiring advanced partitioning strategies to facilitate efficient data access. To improve performance, range partitioning is implemented, segmenting the table into smaller, manageable partitions based on predefined time intervals, such as monthly partitions.

Since querying this vast dataset directly for large-scale analysis can impose a heavy load on the production Oracle systems, a read-only replica is used as the data source for many querying use cases and for data extraction into CERN’s Hadoop-based analytics platform. This approach ensures that the primary database remains unaffected by analytical workloads, allowing detector experts to access and process historical data efficiently without impacting real-time operations.

Leveraging CERN’s Hadoop Service

To address the challenges of handling large-scale DCS data analysis, CERN’s Hadoop cluster, Analytix, provides a scalable and high-performance infrastructure tailored for parallelized computation and distributed storage. With over 1,400 physical cores and 20 PB of distributed storage, it enables efficient ingestion, processing, and querying of massive datasets.

Currently, approximately 3 TB of DCS data—representing 30% of the total available records—has been migrated into the Hadoop ecosystem, covering data from 2022 onward. Data extraction is performed via Apache Spark, leveraging the Spark JDBC connector to read from the read-only Oracle replica. Daily import jobs incrementally update the core EVENTHISTORY table, appending new records without reprocessing the entire dataset. Smaller, less dynamic tables undergo full replacements to maintain consistency.
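
The daily incremental import described above can be sketched roughly as follows. This is a minimal illustration, not the production job: the timestamp column name `TS`, the helper names, and the choice to write each day into its own `year=/month=/day=` directory are assumptions made for the example. The Spark session is passed in as a parameter, so the query-building helpers can be tried without a cluster.

```python
from datetime import date, timedelta


def daily_extract_query(table: str, ts_column: str, day: date) -> str:
    """Build an Oracle query selecting one day of rows as a half-open interval."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} >= TIMESTAMP '{start} 00:00:00' "
        f"AND {ts_column} < TIMESTAMP '{end} 00:00:00'"
    )


def day_partition_path(base_path: str, day: date) -> str:
    """Directory for one day, in the year=/month=/day= layout Spark can discover."""
    return f"{base_path}/year={day.year}/month={day.month}/day={day.day}"


def import_day(spark, jdbc_url, user, password, base_path, day,
               table="EVENTHISTORY", ts_column="TS"):
    """Copy one day of rows from Oracle into its own Parquet directory.

    ts_column="TS" is a placeholder: use the real timestamp column name.
    """
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("driver", "oracle.jdbc.driver.OracleDriver")
          .option("query", daily_extract_query(table, ts_column, day))
          .option("user", user)
          .option("password", password)
          .option("fetchsize", 10000)
          .load())
    # Overwriting only this day's directory makes a rerun of the job idempotent
    df.write.mode("overwrite").parquet(day_partition_path(base_path, day))
```

Using a half-open interval (`>=` start, `<` next day) avoids importing a row twice when its timestamp falls exactly on midnight.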

For optimized storage and performance, all ingested data is converted to Apache Parquet format, a columnar storage format designed for high-speed analytical queries. The dataset is partitioned by day, enabling partition pruning—a technique that allows queries to efficiently filter relevant time slices, significantly reducing query execution times. The system can use Spark's parallel processing to rapidly process queries that target billions of individual data rows, completing such operations in just a few seconds and making it an ideal solution for correlation studies, anomaly detection, and long-term trend analysis of detector performance.
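
To make the idea of partition pruning concrete, here is a small, Spark-free illustration. It is a conceptual sketch only, not Spark's actual implementation: it assumes a hypothetical list of partition directories in a `year=/month=/day=` layout and shows how a time filter can discard whole directories before any file is opened.

```python
from datetime import date


def parse_partition_day(path: str) -> date:
    """Read the date encoded in a .../year=Y/month=M/day=D directory path."""
    parts = dict(p.split("=", 1) for p in path.split("/") if "=" in p)
    return date(int(parts["year"]), int(parts["month"]), int(parts["day"]))


def prune_partitions(paths, start: date, end: date):
    """Keep only the directories whose day falls in [start, end)."""
    return [p for p in paths if start <= parse_partition_day(p) < end]


# Hypothetical directory listing, for illustration only
partitions = [
    "/data/dcs/year=2024/month=2/day=28",
    "/data/dcs/year=2024/month=2/day=29",
    "/data/dcs/year=2024/month=3/day=1",
    "/data/dcs/year=2024/month=3/day=2",
]
selected = prune_partitions(partitions, date(2024, 2, 29), date(2024, 3, 2))
print(selected)
```

Only two of the four directories survive the filter, so a query restricted to that time window would read roughly half of the data in this toy layout.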

This modern data pipeline integrates seamlessly with CERN’s Jupyter notebooks service (SWAN), providing detector experts with a Python-based interactive environment for exploratory data analysis, visualization, and machine learning applications. The combination of Apache Spark, Parquet, and Hadoop enables the scalable processing of DCS data, facilitating key analyses such as monitoring DAQ link instabilities, tracking high-voltage performance, and diagnosing hardware failures in the ATLAS New Small Wheel (NSW) detector.

The Role of Apache Spark

Apache Spark plays a pivotal role in transforming how this data is accessed and analyzed. The Spark-based data pipeline extracts data from a read-only replica of the primary production database, ensuring minimal disruption to live operations. Using JDBC connectivity, Spark jobs are scheduled to run daily, incrementally updating Parquet files stored in CERN’s Hadoop cluster.

Key optimizations include:

Partitioning: Data is partitioned by day to facilitate faster querying and improved storage efficiency.
Incremental Updates: Only new data is ingested daily, preventing redundant processing.
Columnar Storage with Parquet: Apache Parquet enables efficient data retrieval, reducing query execution time and storage costs.

Extracting Data from Oracle using Apache Spark

Below is an example of how to create a Spark DataFrame that reads from an Oracle table using JDBC:

To try this out, run Oracle Free 23ai in a container, using the image from gvenzl's Docker Hub repository (see https://github.com/gvenzl/oci-oracle-free):

docker run -d --name mydb1 -e ORACLE_PASSWORD=oracle -p 1521:1521 gvenzl/oracle-free:23-slim

Wait until the database has fully started. You can follow the progress in the startup log with:

docker logs -f mydb1

You need the Oracle JDBC driver jar, which is available on Maven Central or can be downloaded from the Oracle website. To fetch it from Maven Central, start PySpark with:

bin/pyspark --packages com.oracle.database.jdbc:ojdbc11:23.7.0.25.01

Edit with the target database username:

db_user = "system"

Database server connection string (modify for the actual setup):

db_connect_string = "localhost:1521/FREEPDB1"

Database password:

db_pass = "oracle"

Query to extract data from the target database (example query):

myquery = "SELECT rownum AS id FROM dual CONNECT BY level<=10"

Mapping the Oracle query/table to a Spark DataFrame:

df = (spark.read.format("jdbc")
           .option("url", f"jdbc:oracle:thin:@{db_connect_string}")
           .option("driver", "oracle.jdbc.driver.OracleDriver")
           .option("query", myquery)
           .option("user", db_user)
           .option("password", db_pass)
           .option("fetchsize", 10000)
           .load())

Show schema and data for testing purposes:

df.printSchema()
df.show()

For more details on using Spark to read data from Oracle databases, see this note.
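
For larger tables, a single JDBC connection can become the bottleneck. The sketch below shows one possible way to let Spark split the read into several concurrent range queries, using Spark's JDBC options `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions`. The helper names and the example values are assumptions for illustration. Note that the bounds only determine how the range is split, they do not filter rows.

```python
def parallel_jdbc_options(jdbc_url, user, password, table_or_subquery,
                          partition_column, lower_bound, upper_bound,
                          num_partitions):
    """Options for a JDBC read that Spark splits into num_partitions range queries."""
    return {
        "url": jdbc_url,
        "driver": "oracle.jdbc.driver.OracleDriver",
        # partitionColumn cannot be combined with "query", so use "dbtable"
        "dbtable": table_or_subquery,
        "user": user,
        "password": password,
        "fetchsize": "10000",
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_in_parallel(spark, options):
    """Load the table with one concurrent Oracle query per partition."""
    return spark.read.format("jdbc").options(**options).load()


opts = parallel_jdbc_options(
    jdbc_url="jdbc:oracle:thin:@localhost:1521/FREEPDB1",
    user="system",
    password="oracle",
    # Oracle does not accept AS before a table alias, hence "... ) t"
    table_or_subquery="(SELECT rownum AS id FROM dual CONNECT BY level <= 1000) t",
    partition_column="id",
    lower_bound=1,
    upper_bound=1000,
    num_partitions=4,
)
# df = read_in_parallel(spark, opts)  # needs a live Spark session and database
```

The number of partitions should be chosen with the database in mind: each partition opens its own session, so a high value can load the source system.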

Implementing Time Partitioning

To partition the data efficiently by time, custom post-processing code in PySpark is used. Below is an example of how partitioning is applied:

Import necessary functions:

from pyspark.sql.functions import col, year, month, dayofmonth

Read data from Oracle as a DataFrame:

df = (spark.read.format("jdbc")
           .option("url", f"jdbc:oracle:thin:@{db_connect_string}")
           .option("driver", "oracle.jdbc.driver.OracleDriver")
           .option("dbtable", "EVENTHISTORY")
           .option("user", db_user)
           .option("password", db_pass)
           .option("fetchsize", 10000)
           .load())

# Extract partitioning keys (year, month, day) from the 'timestamp' column

df = df.withColumn("year", year(col("timestamp"))) \
       .withColumn("month", month(col("timestamp"))) \
       .withColumn("day", dayofmonth(col("timestamp")))

# Write the DataFrame as Parquet files partitioned by year, month, and day

output_path = "hdfs://path/to/output_directory"
df.write.partitionBy("year", "month", "day").parquet(output_path)

For more details on writing data to Parquet with Spark, see this note.
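
Once the data is written with this layout, a query for a single day only needs a filter on the partition columns. The following is a minimal sketch, assuming the `year`, `month` and `day` columns created above. The helper names are hypothetical and the Spark session is passed in as a parameter.

```python
from datetime import date


def day_predicate(day: date) -> str:
    """SQL filter on the year, month and day partition columns for one date."""
    return f"year = {day.year} AND month = {day.month} AND day = {day.day}"


def read_day(spark, parquet_path: str, day: date):
    """Read one day of data by filtering on the partition columns."""
    return spark.read.parquet(parquet_path).where(day_predicate(day))


print(day_predicate(date(2024, 3, 5)))
```

Because the filter refers only to partition columns, Spark can resolve it from the directory names and skip the files of all other days.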

Analysis Framework: A User-Friendly Approach

Apache Spark as the Core Processing Engine

The Apache Spark ecosystem allows for seamless querying and processing of vast datasets. Spark DataFrames and Spark SQL APIs offer a familiar and flexible interface for data manipulation, similar to Pandas for Python users. By enabling distributed computation, Spark ensures that billions of rows can be processed within seconds.

Benefits of Spark in the ATLAS DCS framework:

Scalability and Performance: Spark efficiently uses the available cores on each node and distributes workloads across multiple nodes.
Powerful APIs: Spark natively uses the DataFrame API and also supports the SQL language. Both are powerful and expressive, which helps boost productivity.
Fault Tolerance: Spark has a proven architecture that provides automatic recovery and retries for many types of failures in a distributed environment.

Platform integration with Jupyter notebooks and Spark

Front-end analysis is conducted via Jupyter notebooks on CERN’s SWAN platform, offering researchers an interactive and intuitive interface. Key capabilities include:

Spark integration: A dedicated component, the Spark Connector, abstracts the complexities of Spark configuration, ensuring seamless interaction with the Hadoop ecosystem.
Python environment and Dynamic Visualization: The platform harnesses the robust Python ecosystem for data processing, enabling the dynamic creation of tables, charts, and plots.
Data Integration: Seamless connectivity to diverse data sources—including Oracle databases and web services—simplifies the integration process, providing comprehensive access to all relevant data.

Figure 2: Analysis of ATLAS Detector Control System (DCS) data using Python and Apache Spark. The figure highlights specific elements of the ATLAS New Small Wheel (NSW) MicroMegas (MMG) subdetector that exhibited unstable behavior, prompting further investigation. This visualization was generated using a modern data platform that integrates Jupyter notebooks, CERN’s Hadoop service, and Spark-based analytics. The approach enables large-scale processing and efficient troubleshooting of detector performance. Reproduced with permission from Advancing ATLAS DCS Data Analysis with a Modern Data Platform.

Future Enhancements

To further optimize scalability, performance, and analytical capabilities, we are exploring several key improvements:

Kubernetes for Spark Orchestration: Moving from a Hadoop-based cluster to Kubernetes-managed Spark deployments will streamline resource allocation, optimize workload scheduling, and enable dynamic scaling during peak analysis periods. This transition also facilitates a smoother shift toward cloud-based architectures.
Cloud Storage Solutions: We are evaluating cloud-based storage options such as Amazon S3, which would further ease migration to a cloud environment and enhance data accessibility and scalability.
Advanced Data Formats: We are considering the adoption of modern data formats like Apache Iceberg and Delta Lake. These formats offer improved data ingestion workflows, better query performance and support for evolving data schemas, and enhanced data management capabilities in general.
Machine Learning and AI Integration: Leveraging GPU resources available on CERN’s SWAN platform will enable advanced machine learning techniques for predictive analytics, anomaly detection, and automated troubleshooting. This integration aims to identify detector inefficiencies and potential failures in real time, ultimately improving operational reliability and reducing downtime.

These enhancements aim to future-proof the DCS data analysis framework, ensuring it remains a highly efficient, scalable, and adaptable platform for ongoing and future ATLAS detector operations.

Conclusion

The integration of Apache Spark with CERN’s Hadoop infrastructure and CERN's notebook service has significantly enhanced ATLAS DCS data processing and analysis by providing a scalable, high-performance, and user-friendly platform. This framework empowers researchers to extract meaningful insights, enhance detector performance monitoring, and streamline troubleshooting, improving operational efficiency. As the project continues to evolve, the adoption of cloud-based storage, Kubernetes orchestration, and AI-driven analytics will further enhance the platform’s capabilities, supporting the needs of the scientific and engineering community.

Acknowledgements and Links

This work is based on the article Advancing ATLAS DCS Data Analysis with a Modern Data Platform by Luca Canali, Andrea Formica and Michelle Solis. Many thanks to our ATLAS colleagues, in particular from the ADAM (Atlas Data and Metadata) team and ATLAS DCS. Special thanks to the CERN Databases and Data Analytics group for their help and support with Oracle, Hadoop, SWAN and Spark services.

Additional links and notes:

Writing data to Parquet with Spark, see this note
Using Spark with Oracle, see this note
A short course on Apache Spark

Tags

Apache Spark

Oracle

Jupyter notebook

Kepler’s Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding

canali — Thu, 27 Feb 2025 15:03:37 +0000

Johannes Kepler’s analysis of Mars’ orbit stands as one of the greatest achievements in scientific history, revealing the elliptical nature of planetary paths and establishing the foundational laws of planetary motion. In this post, you will explore how you can recreate Kepler’s revolutionary findings using Python’s robust data science ecosystem.

Our goal is not to produce a specialized scientific paper but to provide a clear, interactive, and visually appealing demonstration suitable for a broad audience. Python libraries like NumPy, Pandas, SciPy, and Matplotlib provide an efficient environment for numerical computations, data manipulation, and visualization. Jupyter Notebooks further enhance this process by providing an interactive and user-friendly platform to run code, visualize results, and document your insights clearly. Additionally, AI-assisted coding significantly simplifies technical tasks such as ellipse fitting, data interpolation, and creating insightful visualizations. This integration allows us to focus more on understanding the insights behind Kepler’s discoveries, making complex analyses accessible and engaging.

This project showcases:

A structured approach to data analysis using a handful of short Jupyter Notebooks.
How Python’s ecosystem (NumPy, Pandas, SciPy, Matplotlib) facilitates computational research.
The advantages of AI-assisted coding in accelerating development and boosting productivity.
An interactive, visually engaging reproduction of Kepler’s findings.

The full code and notebooks are available at: GitHub Repository

Jupyter Notebooks and AI-Assisted Coding: A Powerful Combination for Data Science

Jupyter Notebooks have become the standard environment for data science, offering an interactive and flexible platform for scientific computing. They can be run on local machines or on cloud services such as Google Colab, Amazon SageMaker, IBM Watson Studio, Microsoft Azure, GitHub Codespaces, Databricks, etc. CERN users can also run the notebooks on the CERN-hosted Jupyter notebooks service SWAN (Service for Web-based ANalysis), a widely popular service used by engineers and physicists across CERN for large-scale scientific analysis.

How Python and AI Tools Enhance This Project

Data Interpolation & Curve Fitting: Python libraries like SciPy and AI-assisted tools help generate optimal curve fits in seconds.
Plotting & Visualization: AI-driven code completion and Matplotlib make it easier and faster to generate plots.
Error Handling & Debugging: AI suggestions help identify and fix errors quickly, improving workflow efficiency.
Exploring Alternative Approaches: AI can suggest different computational methods, allowing for a more robust and exploratory approach to the analysis.

Why Use Jupyter Notebooks and AI-Assisted Coding?

Saves Time: Avoids writing repetitive, boilerplate code.
Enhances Accuracy: Reduces human error in complex calculations.
Boosts Creativity: Frees up cognitive resources to focus on insights rather than syntax.
Flexible & Scalable: Python notebooks can be used locally or on powerful cloud-based platforms for large-scale computations.
Widely Adopted: Used by researchers, engineers, and data scientists across academia, industry, and institutions like CERN.

Overview of the Analysis

The project is structured into a series of Jupyter notebooks, each building on the previous one to triangulate Mars' orbit and verify Kepler’s laws.

Click on the notebook links below to explore the details of each step.

Generating Mars Ephemeris (generate the measurements of Mars' celestial positions)
- Data is key to the success of this analysis. Kepler used Tycho Brahe's data; we use NASA JPL's DE421 ephemeris via the Skyfield library to generate accurate planetary positions over a period of 12 Martian years (approximately 22 Earth years), starting from January 1, 2000.
- Determine the ecliptic longitude of Mars and the Sun in the plane of Earth's orbit. Filter out observations where Mars is obscured by the Sun.
- Save the filtered ephemeris data into a CSV file (ephemeris_mars_sun.csv).
- Key attributes in the saved data are: Date, Mars Ecliptic Longitude (deg), Sun Ecliptic Longitude (deg)
Key Insight of Kepler's Analysis (understand how Earth-based observations reveal Mars’ trajectory)
- Mars completes one full revolution around the Sun in 687 days (one Mars year). During this period, Earth occupies a different position in its orbit at each observation. By selecting measurements taken exactly one Mars year apart, we capture Mars' apparent position from varied vantage points. With enough observations over several Mars years, these multiple perspectives enable us to triangulate the position of Mars.
- Figure 1, Triangulating Mars' Position:
  - Select observations spaced 687 days apart (one Mars year) so that Mars is observed at nearly the same position relative to the Sun for each measurement.
  - For each observation, compute Earth's position in the ecliptic and derive Mars' line-of-sight vectors.
  - Apply least-squares estimation to solve for Mars' ecliptic coordinates.
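
The least-squares step above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the notebooks: Earth's heliocentric positions are taken as given inputs, everything is treated in the two-dimensional ecliptic plane, and the function name `triangulate` is hypothetical. The example is kept dependency-free; in a notebook, NumPy's `linalg.lstsq` would be a natural alternative for solving the same system.

```python
import math


def triangulate(earth_positions, mars_longitudes_deg):
    """Least-squares intersection of lines of sight in the ecliptic plane.

    earth_positions: list of (x, y) heliocentric Earth positions in AU.
    mars_longitudes_deg: ecliptic longitude of Mars seen from Earth, per observation.
    """
    # Each sight line gives one equation n . M = n . E, with n normal to the line
    s_xx = s_xy = s_yy = r_x = r_y = 0.0
    for (ex, ey), lon in zip(earth_positions, mars_longitudes_deg):
        nx, ny = -math.sin(math.radians(lon)), math.cos(math.radians(lon))
        b = nx * ex + ny * ey
        s_xx += nx * nx
        s_xy += nx * ny
        s_yy += ny * ny
        r_x += nx * b
        r_y += ny * b
    det = s_xx * s_yy - s_xy * s_xy
    if abs(det) < 1e-12:
        raise ValueError("lines of sight are parallel: position is not determined")
    # Solve the 2x2 normal equations with Cramer's rule
    return ((r_x * s_yy - r_y * s_xy) / det, (s_xx * r_y - s_xy * r_x) / det)


# Synthetic check: place Mars at a known point and observe it from three
# made-up Earth positions on a unit circle
mars_true = (1.2, 0.9)
earth = [(math.cos(a), math.sin(a)) for a in (0.3, 2.0, 4.1)]
lons = [math.degrees(math.atan2(mars_true[1] - ey, mars_true[0] - ex))
        for ex, ey in earth]
mars_est = triangulate(earth, lons)
print(mars_est)
```

With noise-free synthetic data the estimate coincides with the chosen point. With real observations the lines of sight do not meet exactly, and the least-squares solution gives the point closest to all of them.
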
Computing Mars' Orbit (calculate Mars orbit by triangulating Mars' position using all available observations)
- Load the dataset (line_of_sight_mars_from_earth.csv) with Mars and Sun observations, notably the following fields: Date, Mars Ecliptic Longitude (deg), and Sun Ecliptic Longitude (deg). Compute Mars' heliocentric coordinates and estimate its orbit.
- Generalized Triangulation
  - For each start date within the first Mars year, iterate through subsequent measurements at 687-day intervals (one Mars year), so that Mars is observed at nearly the same position relative to the Sun for each measurement.
  - Triangulate Mars' position from the accumulated data when at least two valid measurements are available.
  - Gracefully handle missing data and singular matrices to ensure robust estimation.
- Compile the computed Mars positions into a results DataFrame and save the results to a CSV file (computed_Mars_orbit.csv) for further analysis.
Kepler’s Laws (verify Kepler’s three laws with real data)
- Figure 2, First Law: Demonstrate Kepler's First Law by fitting an elliptical model to confirm that Mars’ orbit is an ellipse with the Sun at one focus. The fitted parameters match accepted values, notably eccentricity e ~ 0.09 and semi-major axis a ~ 1.52 AU.
- Second Law: Demonstrate that Mars sweeps out equal areas in equal time intervals using the measured values of Mars' orbit.
- Third Law: Validate the harmonic law by comparing the ratio T^2/a^3 for Mars and Earth.
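
As a quick numerical illustration of the Third Law check, the ratio T^2/a^3 can be computed from the values quoted in this post (a period of 687 days and a semi-major axis of about 1.52 AU for Mars). The function name is hypothetical, and the Earth values (365.25 days, 1 AU) are the usual reference units rather than results of this analysis.

```python
def harmonic_ratio(period_days: float, semi_major_axis_au: float) -> float:
    """T^2 / a^3 with T in Earth years and a in AU; equals 1 for Earth by construction."""
    period_years = period_days / 365.25
    return period_years ** 2 / semi_major_axis_au ** 3


earth_ratio = harmonic_ratio(365.25, 1.0)
mars_ratio = harmonic_ratio(687.0, 1.52)
print(f"Earth: {earth_ratio:.3f}, Mars: {mars_ratio:.3f}")
```

With these rounded inputs the two ratios agree to within about one percent, which is the level of agreement the harmonic law predicts for such coarse values.
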
Estimating Earth's Orbit (use Mars' ephemeris and line-of-sight data to determine Earth’s orbit)
- Earth Position Computation:
  - For each selected observation, compute Earth's heliocentric position by solving for the Earth-Sun distance using the observed Sun and Mars ecliptic longitudes and the estimated Mars position (found in notebook 3 of this series, "Compute Mars Orbit").
  - Utilize a numerical solver (via fsolve) to ensure that the computed Earth position yields the correct LOS angle towards Mars.
- Fit Earth’s computed positions to an elliptical model and compare the results with accepted astronomical values.
- Visualize Earth’s orbit alongside the positions of Mars and the Sun.

Conclusion

Kepler’s groundbreaking work reshaped our understanding of planetary motion, and today, we can revisit his analysis with modern computational tools. By combining Jupyter Notebooks, Python’s scientific libraries, and AI-assisted coding, we demonstrate how complex data analysis can be performed efficiently and interactively.

This project serves as an example of how AI and open-source tools empower researchers, educators, and enthusiasts to explore scientific discoveries with greater ease and depth.

👉 Check out the full project and try the notebooks yourself! GitHub Repository

References

This work is directly inspired by Terence Tao's project Climbing the Cosmic Distance Ladder. In particular see the two-part video series with Grant Sanderson (3Blue1Brown): Part 1 and Part 2

Further details on Kepler's analysis can be found in Tao's draft book chapter Chapter 4: Fourth Rung - The Planets: Download here

Another insightful video on Kepler’s discoveries is How the Bizarre Path of Mars Reshaped Astronomy [Kepler's Laws Part 2] by Welch Labs.

Mars-Orbit-Workshop contains material to conduct a workshop recreating Kepler's analysis.

The original work of Kepler was published in Astronomia Nova (New Astronomy) in 1609. The book is available on archive.org. See for example this link to chapter 42 of Astronomia Nova

Figure 3: An illustration from Chapter 42 of Astronomia Nova (1609) by Kepler, depicting the key concept of triangulating Mars' position using observations taken 687 days apart (one Martian year). This is the original version of Figures 1 and 2 in this post.

Acknowledgements

This work has been conducted in the context of the Databases and Analytics activities at CERN. In particular, I'd like to thank my colleagues in the SWAN (Service for Web-based ANalysis) team.

Tags

python

Machine Learning

CERN PGDay 2025 is here!

mpotocky — Tue, 26 Nov 2024 17:10:00 +0000

Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

canali — Fri, 26 Apr 2024 12:58:43 +0000

Apache Spark is renowned for its speed and efficiency in handling large-scale data processing. However, optimizing Spark to achieve maximum performance requires a precise understanding of its inner workings. This blog post will guide you through establishing a Spark Performance Lab with essential tools and techniques aimed at enhancing Spark performance through detailed metrics analysis.

Why a Spark Performance Lab

The purpose of a Spark Performance Lab isn't just to measure the elapsed time of your Spark jobs but to understand the underlying performance metrics deeply. By using these metrics, you can create models that explain what's happening within Spark's execution and identify areas for improvement. Here are some key reasons to set up a Spark Performance Lab:

Hands-on learning and testing: A controlled lab setting allows for safer experimentation with Spark configurations and tuning, and for learning how to use the monitoring tools and interpret Spark-generated metrics.
Load and scale: Our lab uses a workload generator, running TPCDS queries. This is a well-known set of complex queries that is representative of OLAP workloads, and that can easily be scaled up for testing from GBs to 100s of TBs.
Improving your toolkit: Having a toolbox is invaluable; however, you need to practice with the tools and understand their output in a sandbox environment before moving to production.
Get value from the Spark metric system: Instead of focusing solely on how long a job takes, use detailed metrics to understand the performance and spot inefficiencies.

Tools and Components

In our Spark Performance Lab, several key tools and components form the backbone of our testing and monitoring environment:

Workload generator:
- We use a custom tool, TPCDS_PySpark, to generate a consistent set of queries (TPCDS benchmark), creating a reliable testing framework.
Spark instrumentation:
- Spark’s built-in Web UI for initial metrics and job visualization.
Custom tools:
- SparkMeasure: Use this for detailed performance metrics collection.
- Spark-Dashboard: Use this to monitor Spark jobs and visualize key performance metrics.

Additional tools for Performance Measurement include:

Flame Graphs for Spark and Grafana Pyroscope with Spark
Tools for OS performance monitoring

Demos

These quick demos and tutorials will show you how to use the tools in this Spark Performance Lab. You can follow along and get the same results on your own, which will help you start learning and exploring.

SparkMeasure - metrics collection
- Watch sparkMeasure's getting started demo and tutorial
TPCDS_PySpark - workload generator
- Watch TPCDS-PySpark demo and tutorial
Spark-Dashboard - real-time dashboards
- Watch Spark-Dashboard demo and tutorial

Figure 1: Snapshot from the Spark-Dashboard, visualizing the number of active tasks vs. measurement time, taken when running TPCDS scale 10,000 GB on a YARN cluster

How to Make the Best of Spark Metrics System

Understanding and utilizing Spark's metrics system is crucial for optimization:

Importance of Metrics: Metrics provide insights beyond simple timing, revealing details about task execution, resource utilization, and bottlenecks.
Execution Time is Not Enough: Measuring the execution time of a job, like how long it takes to run, is useful but it doesn’t show the whole picture. Say the job ran in 10 seconds. It's crucial to understand why it took 10 seconds instead of 100 seconds or just 1 second. What was slowing things down? Was it the CPU, data input/output, or something else, like data shuffling? This helps us identify the root causes of performance issues.
Key Metrics to Collect:
- Executor Run Time: Total time executors spend processing tasks.
- Executor CPU Time: Direct CPU time consumed by tasks.
- JVM GC Time: Time spent in garbage collection, affecting performance.
- Shuffle and I/O Metrics: Critical for understanding data movement and disk interactions.
- Memory Metrics: Key for performance and for troubleshooting Out Of Memory errors.
Metrics Analysis, what to look for:
- Look for bottlenecks: are there resources that are the bottleneck? Are the jobs running mostly on CPU or waiting for I/O or spending a lot of time on Garbage Collection?
- USE method: The Utilization, Saturation, and Errors (USE) method is a methodology for analyzing the performance of any system.
  - The tools described here can help you to measure and understand Utilization and Saturation.
- Can your job use a significant fraction of the available CPU cores?
  - Examine the measurement of the actual number of active tasks vs. time.
  - Figure 1 shows the number of active tasks measured while running TPCDS 10TB on a YARN cluster, with 256 cores allocated. The graph shows spikes and troughs.
  - Understand the root causes of the troughs using metrics and monitoring data. The reasons can be many: resource allocation, partition skew, straggler tasks, stage boundaries, etc.
- Which tool should I use?
  - Start with using the Spark Web UI
  - Instrument your jobs with sparkMeasure. This is recommended early in application development and testing, and for Continuous Integration (CI) pipelines.
  - Observe your Spark application execution profile with Spark-Dashboard.
  - Use available tools with OS metrics too. See also Spark-Dashboard extended instrumentation: it collects and visualizes OS metrics (from cgroup statistics) like network stats, etc.
- Drill down:
  - An example of Spark metrics analysis for TPCDS run at scale 10 TB
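
The questions in the list above can be turned into a few derived indicators once aggregated task metrics are available. The following is a minimal sketch with hypothetical input values: the function name is made up for illustration, and all times are assumed to be already aggregated over the tasks and expressed in milliseconds.

```python
def summarize_metrics(elapsed_ms, executor_run_ms, executor_cpu_ms, jvm_gc_ms, cores):
    """Derive simple utilization indicators from aggregated task metrics (all in ms)."""
    avg_active_tasks = executor_run_ms / elapsed_ms
    return {
        # Average number of tasks running concurrently over the job's wall-clock time
        "avg_active_tasks": avg_active_tasks,
        # Share of the allocated cores that were kept busy by tasks
        "core_utilization": avg_active_tasks / cores,
        # Share of task run time spent on CPU; the rest includes I/O, GC and other waits
        "cpu_fraction": executor_cpu_ms / executor_run_ms,
        # Share of task run time spent in JVM garbage collection
        "gc_fraction": jvm_gc_ms / executor_run_ms,
    }


# Hypothetical values, not measurements from the lab
report = summarize_metrics(
    elapsed_ms=100_000, executor_run_ms=12_800_000,
    executor_cpu_ms=8_960_000, jvm_gc_ms=640_000, cores=256,
)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case, on average only half of the 256 cores are busy, which would point to investigating the troughs in the active-tasks graph before tuning anything else.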

Documentation:
- For those interested in delving deeper into Spark instrumentation and metrics, the Spark documentation offers a comprehensive guide.
- SparkMeasure: This tool captures metrics directly from Spark’s instrumentation via the Listener Bus. For a detailed understanding of how it operates, refer to the SparkMeasure architecture. It specifically gathers data from Spark's Task Metrics System, which you can explore further here.
- Spark-Dashboard: This application aggregates metrics that Spark exposes through the Dropwizard metrics library (see Spark-Dashboard architecture). A complete list of the metrics can be found here.

Figure 2: This technical drawing outlines the integrated monitoring pipeline for Apache Spark implemented by Spark-Dashboard using open-source components. The flow of the diagram illustrates the Spark metrics source and the components used to store and visualize the metrics.

Lessons Learned and Conclusions

From setting up and running a Spark Performance Lab, here are some key takeaways:

Collect, analyze and visualize metrics: Go beyond just measuring jobs' execution times to troubleshoot and fine-tune Spark performance effectively.
Use the Right Tools: Familiarize yourself with tools for performance measurement and monitoring.
Start Small, Scale Up: Begin with smaller datasets and configurations, then gradually scale to test larger, more complex scenarios.
Tuning is an Iterative Process: Experiment with different configurations, parallelism levels, and data partitioning strategies to find the best setup for your workload.

Establishing a Spark Performance Lab is a fundamental step for any data engineer aiming to master Spark's performance aspects. By integrating tools like Web UI, TPCDS_PySpark, sparkMeasure, and Spark-Dashboard, developers and data engineers can gain unprecedented insights into Spark operations and optimizations.

Explore this lab setup to turn theory into expertise in managing and optimizing Apache Spark. Learn by doing and experimenting!

Acknowledgements: A special acknowledgment goes out to the teams behind the CERN data analytics, monitoring, and web notebook services, as well as the dedicated members of the ATLAS database group.

Resources

To get started with the tools mentioned in this blog:

TPCDS_PySpark
SparkMeasure
Spark-Dashboard and Dashboard Notes
Flame Graphs for Spark and Grafana Pyroscope with Spark
Tools for OS performance monitoring

Enhancing Apache Spark and Parquet Efficiency: A Deep Dive into Column Indexes and Bloom Filters

canali — Tue, 30 Jan 2024 20:50:14 +0000


In the ever-evolving landscape of big data, Apache Spark and Apache Parquet continue to introduce game-changing features. Their latest updates have brought forward significant enhancements, including column indexes and bloom filters. This blog post delves into these new features, exploring their applications and benefits. This post is based on the extended notes at:

Key Advantages of Parquet in Spark

This is not an introductory article; however, here is a quick recap of why you may want to spend time learning more about Apache Parquet and Spark. Parquet is a columnar storage file format optimized for use with data processing frameworks like Apache Spark. It offers efficient data compression and encoding schemes.

Parquet is a columnar format enabling efficient data storage and retrieval
It supports compression and encoding
Optimizations in Spark for Parquet include:
- Vectorized Parquet reader
- Filter push down capabilities
- Enhanced support for partitioning and handling large files

Other key aspects of Parquet with Spark that are important to know for what follows:

Row Group Organization: Parquet files consist of one or more 'row groups,' typically sized around 128 MB, although this is adjustable.
Parallel Processing Capabilities: Both Spark and other engines can process Parquet data in parallel, leveraging the row group level or the file level for enhanced efficiency.
Row Group Statistics: Each row group holds vital statistics like minimum and maximum values, counts, and the number of nulls. These stats enable the 'skip data' feature when filters are applied, essentially serving as a zone map to optimize query performance.

ORC: For a comparison of Apache Parquet with another popular data format, Apache ORC, refer to Parquet-ORC Comparison.

Understanding Column Indexes and Bloom Filters in Parquet

Column Indexes: Enhancing Query Efficiency

Column indexes, introduced in Parquet 1.11 and utilized in Spark 3.2.0 and higher, offer a fine-grained approach to data filtering. These indexes store min and max values at the Parquet page level, allowing Spark to efficiently execute filter predicates at a much finer granularity than the default 128 MB row group size. Particularly effective for sorted data, column indexes can drastically reduce the data read from disk, improving query performance.
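
To get a feeling for why page-level min and max values help most on sorted data, here is a minimal pure-Python sketch. It does not use Parquet at all: the page size of 1000 values, the value range and the helper names are assumptions chosen only for illustration.

```python
import random

def page_stats(values, page_size):
    """Split values into fixed-size pages and return (min, max) per page."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]

def pages_to_read(stats, target):
    """Count the pages whose [min, max] range may contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

random.seed(42)  # fixed seed so the sketch is repeatable
# Toy column: 100,000 values in the range of "seconds of the day"
values = [random.randrange(86400) for _ in range(100_000)]
target = 28801

unsorted_hits = pages_to_read(page_stats(values, 1000), target)
sorted_hits = pages_to_read(page_stats(sorted(values), 1000), target)

print(f"pages to read, unsorted data: {unsorted_hits} of 100")
print(f"pages to read, sorted data:   {sorted_hits} of 100")
```

With scattered values nearly every page has a min-max range that spans the whole domain, so no page can be skipped. After sorting, each page covers a narrow range and only the page (or the couple of pages) around the target remains.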

Bloom Filters: A Leap in Filter Operations

Parquet 1.12 (utilized by Spark 3.2.0 and higher) introduced Bloom filters, a probabilistic data structure that efficiently determines whether an element is in a set. They are particularly useful for columns with high cardinality and scenarios where filter operations are based on values likely to be absent from the dataset. Using bloom filters can lead to significant improvements in read performance.

Example: Spark using Parquet column indexes

Test dataset and preparation

The Parquet test file used below (parquet112file_sorted) is extracted from the TPCDS benchmark table web_sales.
The table (Parquet file) contains data sorted on the column ws_sold_time_sk.
It is important that the data is sorted: sorting groups together the values of the filter column ws_sold_time_sk. If the values are scattered, the column index min-max statistics of each page cover a wide range and cannot help with skipping data.
The sorted dataset has been created using spark.read.parquet(path + "web_sales_piece.parquet").sort("ws_sold_time_sk").coalesce(1).write.parquet(path + "web_sales_piece_sorted_ws_sold_time_sk.parquet")
Download the test data:
- Retrieve the test data using wget, a web browser, or any method of your choosing
- web_sales_piece.parquet
- web_sales_piece_sorted_ws_sold_time_sk.parquet

Run the tests

Fast (reads only 20k rows):

Spark reads the Parquet file using a filter and makes use of column and offset indexes:

bin/spark-shell

val path = "./" 
val df = spark.read.parquet(path + "web_sales_piece_sorted_ws_sold_time_sk.parquet")

// Read the file using a filter, this will use column and offset indexes
val q1 = df.filter("ws_sold_time_sk=28801")
val plan = q1.queryExecution.executedPlan
q1.collect

// Use Spark metrics to see how many rows were processed
// This is also available for the WebUI in graphical form
val metrics = plan.collectLeaves().head.metrics
metrics("numOutputRows").value

res: Long = 20000

The result shows that only 20000 rows were processed. This corresponds to processing just a few pages, as opposed to reading and processing the entire file. It is made possible by the min-max value statistics in the column index for column ws_sold_time_sk.

Column indexes are created by default in Spark version 3.2.x and higher.

Slow (reads 2M rows):

Same as above, but this time we disable the use of column indexes.

Note this is also what happens if you use Spark versions prior to Spark 3.2.0 (notably Spark 2.x) to read the file.

bin/spark-shell

val path = "./"
// disable the use of column indexes for testing purposes
val df = spark.read.option("parquet.filter.columnindex.enabled","false").parquet(path + "web_sales_piece_sorted_ws_sold_time_sk.parquet")

val q1 = df.filter("ws_sold_time_sk=28801")
val plan = q1.queryExecution.executedPlan
q1.collect

// Use Spark metrics to see how many rows were processed
val metrics = plan.collectLeaves().head.metrics
metrics("numOutputRows").value

res: Long = 2088626

The result is that all the rows in the row group (2088626 rows in the example) were read, as Spark could not push the filter down to the Parquet page level. This example runs more slowly than the previous one and in general performs more work (it uses more CPU cycles and reads more data from the filesystem).

Diagnostics and Internals of Column and Offset Indexes

Column indexes in Parquet are key structures designed to optimize filter performance during data reads. They are particularly effective for managing and querying large datasets.

Key Aspects of Column Indexes:

Purpose and Functionality: Column indexes offer statistical data (minimum and maximum values) at the page level, facilitating efficient filter evaluation and optimization.
Default Activation: By default, column indexes are enabled to ensure optimal query performance.
Granularity Insights: While column indexes provide page-level statistics, similar statistics are also available at the row group level. Typically, a row group is approximately 128MB, contrasting with pages usually around 1MB.
Customization Options: Both rowgroup and page sizes are configurable, offering flexibility to tailor data organization. For further details, see Parquet Configuration Options.

Complementary Role of Offset Indexes:

Association with Column Indexes: Offset indexes work in tandem with column indexes and are stored in the file's footer in Parquet versions 1.11 and above.
Scanning Efficiency: A key benefit of these indexes is their role in data scanning. When filters are not applied in Parquet file scans, the footers with column index data can be efficiently skipped, enhancing the scanning process.

Additional Resources:

For an in-depth explanation of column and offset indexes in Parquet, consider reading this detailed description.

The integration of column and offset indexes significantly improves Parquet's capability in efficiently handling large-scale data, especially in scenarios involving filtered reads. Proper understanding and utilization of these indexes can lead to marked performance improvements in data processing workflows.

Tools to drill down on column index metadata in Parquet files

parquet-cli
- example: hadoop jar target/parquet-cli-1.13.1-runtime.jar org.apache.parquet.cli.Main column-index -c ws_sold_time_sk <path>/my_parquetfile
- More details on how to use parquet-cli at Tools for Parquet Diagnostics

Example with the Java API from Spark-shell

// customize with the file path and name
val fullPathUri = java.net.URI.create("<path>/myParquetFile")

// create a Hadoop input file and open it with ParquetFileReader
val in = org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(fullPathUri), spark.sessionState.newHadoopConf())
val pf = org.apache.parquet.hadoop.ParquetFileReader.open(in)

// Get the Parquet file version
pf.getFooter.getFileMetaData.getCreatedBy

// get the column chunk metadata of the first row group
val columns = pf.getFooter.getBlocks.get(0).getColumns

// column index of the first column
val columnIndex = pf.readColumnIndex(columns.get(0))
print(columnIndex)

// offset index of the first column
val offsetIndex = pf.readOffsetIndex(columns.get(0))
print(offsetIndex)

The output on a column that is sorted looks like:

row-group 0:
column index for column ws_sold_time_sk:
Boudary order: ASCENDING
                      null count  min                                       max
page-0                        45  29                                        12320
page-1                         0  12320                                     19782
page-2                         0  19782                                     26385
page-3                         0  26385                                     31758
page-4                         0  31758                                     36234
page-5                         0  36234                                     40492
page-6                         0  40492                                     44417
page-7                         0  44417                                     47596
page-8                         0  47596                                     52972
page-9                         0  52972                                     58388
page-10                        0  58388                                     62482
page-11                        0  62482                                     65804
page-12                        0  65804                                     68647
page-13                        0  68647                                     71299
page-14                        0  71303                                     74231
page-15                        0  74231                                     77978
page-16                        0  77978                                     85712
page-17                        0  85712                                     86399

offset index for column ws_sold_time_sk:
                          offset   compressed size       first row index
page-0                     94906              4759                     0
page-1                     99665              4601                 20000
page-2                    104266              4549                 40000
page-3                    108815              4415                 60000
page-4                    113230              4343                 80000
page-5                    117573              4345                100000
page-6                    121918              4205                120000
page-7                    126123              3968                140000
page-8                    130091              4316                160000
page-9                    134407              4370                180000
page-10                   138777              4175                200000
page-11                   142952              4012                220000
page-12                   146964              3878                240000
page-13                   150842              3759                260000
page-14                   154601              3888                280000
page-15                   158489              4048                300000
page-16                   162537              4444                320000
page-17                   166981               200                340000
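
As a rough illustration of how a reader can combine the two structures, the following pure-Python sketch copies the numbers from the listing above and looks up the pages that may contain a given value. The function names are made up for this example and the logic is a simplification, not the Parquet reader's actual code. Note also that the listing may come from a different file than the one used in the tests earlier in this post.

```python
# (min, max) per page, copied from the column index listing above
column_index = [
    (29, 12320), (12320, 19782), (19782, 26385), (26385, 31758),
    (31758, 36234), (36234, 40492), (40492, 44417), (44417, 47596),
    (47596, 52972), (52972, 58388), (58388, 62482), (62482, 65804),
    (65804, 68647), (68647, 71299), (71303, 74231), (74231, 77978),
    (77978, 85712), (85712, 86399),
]
# (offset, compressed size, first row index) per page, from the offset index listing
offset_index = [
    (94906, 4759, 0), (99665, 4601, 20000), (104266, 4549, 40000),
    (108815, 4415, 60000), (113230, 4343, 80000), (117573, 4345, 100000),
    (121918, 4205, 120000), (126123, 3968, 140000), (130091, 4316, 160000),
    (134407, 4370, 180000), (138777, 4175, 200000), (142952, 4012, 220000),
    (146964, 3878, 240000), (150842, 3759, 260000), (154601, 3888, 280000),
    (158489, 4048, 300000), (162537, 4444, 320000), (166981, 200, 340000),
]

def pages_for_value(value):
    """Return the page numbers whose [min, max] range may contain value."""
    return [i for i, (lo, hi) in enumerate(column_index) if lo <= value <= hi]

def row_range(page):
    """Return (first_row, end_row) for a page, end_row exclusive.
    end_row is None for the last page, because the total row count
    of the row group is not part of the listing."""
    first_row = offset_index[page][2]
    if page + 1 < len(offset_index):
        return first_row, offset_index[page + 1][2]
    return first_row, None

pages = pages_for_value(28801)
for p in pages:
    offset, size, _ = offset_index[p]
    first_row, end_row = row_range(p)
    print(f"page-{p}: read {size} bytes at offset {offset}, rows [{first_row}, {end_row})")
```

For the filter value 28801 only page-3 qualifies, which covers 20000 rows. A value such as 71300 falls in the gap between page-13 and page-14, so no page needs to be read at all.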

Bloom filters in Parquet

With the release of Parquet 1.12, there's now the capability to generate and store Bloom filters within the file footer's metadata. This addition significantly enhances query performance for specific filtering operations. Bloom filters are especially advantageous in the following scenarios:

High Cardinality Columns: They effectively address the limitations inherent in using Parquet dictionaries for columns with a vast range of unique values.

Absent Value Filtering: Bloom filters are highly efficient for queries that filter based on values likely to be missing from the table or DataFrame. This efficiency stems from the characteristic of Bloom filters where false positives (erroneously concluding that a non-existent value is present) are possible, but false negatives (failing to identify an existing value) do not occur.

For a comprehensive understanding and technical details of implementing Bloom filters in Apache Parquet, refer to the official documentation on bloom filters in Apache Parquet
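
The "false positives are possible, false negatives are not" property can be seen in a few lines of Python. The sketch below is a textbook Bloom filter built on hashlib, written only to illustrate the idea. It is not the split-block Bloom filter that Parquet implements, and the sizes used (100,000 bits, 7 hash functions) are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal textbook Bloom filter: k bit positions per item in an m-bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest using double hashing
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "maybe present", False means "definitely absent"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=100_000, num_hashes=7)
stored = range(0, 20_000, 2)   # 10,000 even numbers are inserted
absent = range(1, 20_000, 2)   # 10,000 odd numbers are never inserted
for v in stored:
    bf.add(v)

false_negatives = sum(1 for v in stored if not bf.might_contain(v))
false_positives = sum(1 for v in absent if bf.might_contain(v))
print(f"false negatives: {false_negatives}")
print(f"false positives: {false_positives} out of {len(absent)} absent values")
```

Every inserted value is reported as possibly present, while only a small fraction of the absent values slips through. This is why a filter on a value that is not in the file can often skip reading the data altogether.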

Configuration

Important configurations for writing bloom filters in Parquet files are:

.option("parquet.bloom.filter.enabled","true") // write bloom filters for all columns, default is false
.option("parquet.bloom.filter.enabled#column_name", "true") // write bloom filter for the given column
.option("parquet.bloom.filter.expected.ndv#column_name", num_values) // tuning for bloom filters, ndv = number of distinct values
.option("parquet.bloom.filter.max.bytes", 1024*1024) // The maximum number of bytes for a bloom filter bitset, default 1 MB
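
To reason about the interplay between the expected number of distinct values and the maximum size, the classic Bloom filter sizing formula m = -n ln(p) / (ln 2)^2 gives a useful first estimate. The sketch below applies it with a false-positive probability of 0.01, which is an assumption made here for illustration. Parquet's split-block implementation sizes and rounds its filters in its own way, so treat the numbers as indicative only.

```python
import math

def classic_bloom_bits(ndv, fpp):
    """Optimal number of bits for a classic Bloom filter: m = -n ln(p) / (ln 2)^2."""
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))

ndv = 1_000_000          # expected number of distinct values
fpp = 0.01               # target false-positive probability (assumed for illustration)
max_bytes = 1024 * 1024  # cap, as in parquet.bloom.filter.max.bytes

wanted_bytes = classic_bloom_bits(ndv, fpp) // 8
print(f"classic estimate: {wanted_bytes} bytes, cap: {max_bytes} bytes")
print("the cap would limit the filter size:", wanted_bytes > max_bytes)
```

Under these assumptions, one million distinct values already ask for a bit more than 1 MB, so for columns with many more distinct values it may be worth revisiting the maximum size as well as the expected ndv.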

Write Parquet files with Bloom filters

This is an example of how to read a Parquet file without bloom filter (for example because it had been created with an older version of Spark/Parquet) and add the bloom filter, with additional tuning of the bloom filter parameters for one of the columns:

val df = spark.read.parquet("<path>/web_sales")
df.coalesce(1).write.option("parquet.bloom.filter.enabled","true").option("parquet.bloom.filter.expected.ndv#ws_sold_time_sk", 25000).parquet("<myfilepath>")

Example: Checking I/O Performance in Parquet: With and Without Bloom Filters

Understanding the impact of using bloom filters on I/O performance during Parquet file reads can be important for optimizing data processing. This example outlines the steps to compare I/O performance when reading Parquet files, both with and without the utilization of bloom filters.

This example uses Parquet bloom filters to improve Spark read performance

1. Prepare the test table


bin/spark-shell
val numDistinctVals=1e6.toInt
val df=sql(s"select id, int(random()*100*$numDistinctVals) randomval from range($numDistinctVals)")
val path = "./"

// Write the test DataFrame into a Parquet file with a Bloom filter
df.coalesce(1).write.mode("overwrite").option("parquet.bloom.filter.enabled","true").option("parquet.bloom.filter.enabled#randomval", "true").option("parquet.bloom.filter.expected.ndv#randomval", numDistinctVals).parquet(path + "spark320_test_bloomfilter")

// Write the same DataFrame in Parquet, but this time without Bloom filters 
df.coalesce(1).write.mode("overwrite").option("parquet.bloom.filter.enabled","false").parquet(path + "spark320_test_bloomfilter_nofilter")

// use the OS (ls -l) to compare the size of the files with bloom filter and without
// in my test (Spark 3.5.0, Parquet 1.13.1) it was 10107275 with bloom filter and 8010077 without

:quit

2. Read data using the Bloom filter, for improved performance

bin/spark-shell

val path = "./"
val df =spark.read.option("parquet.filter.bloom.enabled","true").parquet(path + "spark320_test_bloomfilter")
val q1 = df.filter("randomval=1000000") // filter for a value that is not in the file
q1.collect

// print I/O metrics
org.apache.hadoop.fs.FileSystem.printStatistics()

// Output
FileSystem org.apache.hadoop.fs.RawLocalFileSystem: 1091611 bytes read, ...

:quit

3. Read disabling the Bloom filter (this will read more data from the filesystem and have worse performance)

bin/spark-shell

val path = "./"
val df =spark.read.option("parquet.filter.bloom.enabled","false").parquet(path + "spark320_test_bloomfilter")
val q1 = df.filter("randomval=1000000") // filter for a value that is not in the file
q1.collect

// print I/O metrics
org.apache.hadoop.fs.FileSystem.printStatistics()

// Output
FileSystem org.apache.hadoop.fs.RawLocalFileSystem: 8299656 bytes read, ...
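
Putting the measured figures side by side makes the trade-off explicit. The short calculation below only uses the numbers reported in this example (bytes read in steps 2 and 3, and the file sizes from step 1).

```python
# Figures quoted in this example (bytes)
bytes_read_with_bloom = 1_091_611
bytes_read_without_bloom = 8_299_656
file_size_with_bloom = 10_107_275
file_size_without_bloom = 8_010_077

read_reduction = bytes_read_without_bloom / bytes_read_with_bloom
size_overhead = file_size_with_bloom - file_size_without_bloom

print(f"bytes read: {read_reduction:.1f}x less with the Bloom filter")
print(f"file size overhead: {size_overhead} bytes "
      f"({size_overhead / file_size_without_bloom:.0%} larger file)")
```

In this test, roughly 2 MB of extra storage buys a reduction of about 7.6x in the data read for a filter on an absent value.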

Reading Parquet Bloom Filter Metadata with Apache Parquet Java API

To extract metadata about the bloom filter from a Parquet file using the Apache Parquet Java API in spark-shell, follow these steps:

Initialize the File Path: define the full path of your Parquet file

bin/spark-shell
val fullPathUri = java.net.URI.create("<my_file_path>")

Create Input File: utilize HadoopInputFile to create an input file from the specified path

val in = org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(
         new org.apache.hadoop.fs.Path(fullPathUri), 
         spark.sessionState.newHadoopConf()
         )

Open Parquet File Reader: open the Parquet file reader for the input file

val pf = org.apache.parquet.hadoop.ParquetFileReader.open(in)

Retrieve Blocks and Columns: extract the blocks from the file footer and then get the columns from the first block

val blocks = pf.getFooter.getBlocks
val columns = blocks.get(0).getColumns

Read Bloom Filter: finally, read the bloom filter from the first column

val bloomFilter = pf.readBloomFilter(columns.get(0))
bloomFilter.getBitsetSize

By following these steps, you can successfully read the bloom filter metadata from a Parquet file using the Java API in the spark-shell environment.

Discovering Parquet Version

The Parquet file format is constantly evolving, incorporating additional metadata to support emerging features. Each Parquet file embeds the version information within its metadata, reflecting the Parquet version used during its creation.

Importance of Version Awareness:

Compatibility Considerations: When working with Parquet files generated by older versions of Spark and its corresponding Parquet library, it's important to be aware that certain newer features may not be supported. For instance, column indexes, which are available in the Spark DataFrame Parquet writer from version 3.2.0, might not be present in files created with older versions.

Upgrading for Enhanced Features: Upon upgrading your Spark version, it's beneficial to also update the metadata in existing Parquet files. This update allows you to utilize the latest features introduced in newer versions of Parquet.

Checking the Parquet File Version:

The following sections show how to check the Parquet version used in your files, so that you can effectively manage and upgrade your Parquet datasets.

Details at Tools for Parquet Diagnostics
parquet-cli
- example: hadoop jar parquet-cli/target/parquet-cli-1.13.1-runtime.jar org.apache.parquet.cli.Main meta <path>/myParquetFile

Hadoop API ...

example of using Hadoop API from the spark-shell CLI

// customize with the file path and name
val fullPathUri = java.net.URI.create("<path>/myParquetFile")
 
// create a Hadoop input file and open it with ParquetFileReader
val in = org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(fullPathUri), spark.sessionState.newHadoopConf())
val pf = org.apache.parquet.hadoop.ParquetFileReader.open(in)

// Get the Parquet file version
pf.getFooter.getFileMetaData.getCreatedBy

// Info on file metadata
print(pf.getFileMetaData)
print(pf.getRowGroups)

Spark extension library

The spark-extension library allows querying Parquet metadata using Apache Spark.

Example:

bin/spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5

import uk.co.gresearch.spark.parquet._
spark.read.parquetMetadata("...path..").show()
spark.read.parquetBlockColumns("...path..").show()

Updating Parquet File Versions

Upgrading your Parquet files to a newer version can be achieved by copying them using a more recent version of Spark. This section covers the steps to convert your Parquet files to an updated version.

Conversion Method:

Using Recent Spark Versions: To update Parquet files, read them with a newer version of Spark and then save them again. This process effectively updates the files to the Parquet version used by that Spark release.

For instance, using Spark 3.5.0 will allow you to write files in Parquet version 1.13.1.

Approach Note: This method is somewhat brute-force as there isn't a direct mechanism solely for upgrading Parquet metadata.

Practical Example: Copying and converting Parquet version by reading and re-writing, applied to the TPCDS benchmark:

bin/spark-shell --master yarn --driver-memory 4g --executor-memory 50g --executor-cores 10 --num-executors 20 --conf spark.sql.shuffle.partitions=400

val inpath="/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"
val outpath="/project/spark/TPCDS/tpcds_1500_parquet_1.13.1/"
val compression_type="snappy" // may experiment with "zstd"

// we need to do this in two separate groups: partitioned and non-partitioned tables

// copy the **partitioned tables** of the TPCDS benchmark
// compact each directory into 1 file with repartition
val tables_partition=List(("catalog_returns","cr_returned_date_sk"), ("catalog_sales","cs_sold_date_sk"), ("inventory","inv_date_sk"), ("store_returns","sr_returned_date_sk"), ("store_sales","ss_sold_date_sk"), ("web_returns","wr_returned_date_sk"), ("web_sales","ws_sold_date_sk"))
for (t <- tables_partition) {
  println(s"Copying partitioned table $t")
  spark.read.parquet(inpath + t._1).repartition(col(t._2)).write.partitionBy(t._2).mode("overwrite").option("compression", compression_type).parquet(outpath + t._1)
}

// copy non-partitioned tables of the TPCDS benchmark
// compact each directory into 1 file with coalesce
val tables_nopartition=List("call_center","catalog_page","customer","customer_address","customer_demographics","date_dim","household_demographics","income_band","item","promotion","reason","ship_mode","store","time_dim","warehouse","web_page","web_site")
for (t <- tables_nopartition) {
  println(s"Copying table $t")
  spark.read.parquet(inpath + t).coalesce(1).write.mode("overwrite").option("compression", compression_type).parquet(outpath + t)
}

Conclusions

Apache Spark and Apache Parquet continue to innovate and are constantly upping their game in big data. They've rolled out cool features like column indexes and bloom filters, really pushing the envelope on speed and efficiency. It's a smart move to keep your Spark updated, especially to Spark 3.x or newer, to get the most out of these perks. Also, don’t forget to give your Parquet files a quick refresh to the latest format – the blog post has got you covered with a how-to. Staying on top of these updates is key to keeping your data game strong!

I extend my deepest gratitude to my colleagues at CERN for their invaluable guidance and support. A special acknowledgment goes out to the teams behind the CERN data analytics, monitoring, and web notebook services, as well as the dedicated members of the ATLAS database team.

Further details on the topics covered here can be found at:

canali Tue, 01/30/2024 - 21:50

Tags

Apache Spark

Parquet

Performance


Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

canali — Wed, 27 Sep 2023 13:47:25 +0000


TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.

The Puzzle of the Slow Query

Set within the framework of data analysis for the ATLAS experiment's Detector Control System, our exploration uses data stored in the Parquet format and deploys Apache Spark for queries. The setup: Jupyter notebooks operating on the SWAN service at CERN, interfacing with the Hadoop and Spark service.

The Hiccup: A notably slow query during data analysis where two tables are joined. Running on 32 cores, this query takes 27 minutes—surprisingly long given the amount of data in play.

The tables involved:

EVENTHISTORY: a log of events for specific sub-detectors; each row contains a timestamp, the subsystem id and a value
LUMINOSITY: a table containing the details of time intervals called "luminosity blocks", see Luminosity block - Particle Wiki

Data size:

EVENTHISTORY is a large table, it can collect millions of data points per day, while LUMINOSITY is a much smaller table (only thousands of points per day). In the test case reported here we used data collected over 1 day, with EVENTHISTORY -> 75M records, and LUMINOSITY -> 2K records.

The join condition between EVENTHISTORY and LUMINOSITY is an expression used to match events in EVENTHISTORY with intervals in LUMINOSITY (note this is not a join based on an equality predicate). This is what the query looks like in SQL:

spark.sql("""
select l.LUMI_NUMBER, e.ELEMENT_ID, e.VALUE_NUMBER
from eventhistory e, luminosity l
where e.ts between l.starttime and l.endtime
""")

An alternative version of the same query written using the DataFrame API:

eventhistory_df.join(
    luminosity_df,
    (eventhistory_df.ts >= luminosity_df.starttime) &
    (eventhistory_df.ts <= luminosity_df.endtime)
).select(luminosity_df.LUMI_NUMBER,
         eventhistory_df.ELEMENT_ID,
         eventhistory_df.VALUE_NUMBER)

Cracking the Performance Case

WebUI: The first point of entry for troubleshooting this was the Spark WebUI. We could find there the execution time of the query (27 minutes) and details on the execution plan and SQL metrics under the "SQL/ DataFrame" tab. Figure 1 shows a relevant snippet where we could clearly see that Broadcast nested loop join was used for this.

Execution Plan: The execution plan is the one we wanted for this query, that is the small LUMINOSITY table is broadcasted to all the executors and then joined with each partition of the larger EVENTHISTORY table.

Figure 1: This shows a relevant snippet of the execution graph from the Spark WebUI. The slow query discussed in this post runs using broadcast nested loops join. This means that the small table is broadcasted to all the nodes and then joined to each partition of the larger table.
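
To make the cost of this kind of join more tangible, here is a toy pure-Python version of a nested loop join on a range predicate. It only mimics the logic (every event is compared with every interval) and is not how Spark implements the operator. The sample rows are invented for illustration.

```python
def nested_loop_interval_join(events, intervals):
    """Join events (ts, element_id, value) with intervals
    (lumi_number, starttime, endtime) on starttime <= ts <= endtime."""
    matches, comparisons = [], 0
    for ts, element_id, value in events:
        for lumi_number, starttime, endtime in intervals:
            comparisons += 1  # the join predicate is evaluated for every pair
            if starttime <= ts <= endtime:
                matches.append((lumi_number, element_id, value))
    return matches, comparisons

# Invented sample data: timestamps are plain integers here
events = [(5, "A", 1.0), (15, "B", 2.0), (25, "A", 3.0), (99, "C", 4.0)]
intervals = [(1, 0, 9), (2, 10, 19), (3, 20, 29)]

matches, comparisons = nested_loop_interval_join(events, intervals)
print(matches)
print(f"predicate evaluations: {comparisons}")

# Same reasoning applied to the table sizes quoted in this post
predicate_evaluations_1day = 75_000_000 * 2_000
print(f"predicate evaluations for 1 day of data: {predicate_evaluations_1day:,}")
```

With 75M events and 2K luminosity blocks the predicate is evaluated on the order of 150 billion times, so any extra work inside the predicate is multiplied by that factor.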

CPU utilization measured with Spark Dashboard

Spark Dashboard instrumentation provides a way to collect and visualize Spark execution metrics. This makes it easy to plot the CPU used during the SQL execution. From there we could see that the workload was CPU-bound.

The Clue: Profiling with Flame Graphs and Pyroscope

Stack profiling and Flame Graph visualization are powerful techniques for investigating CPU-bound workloads. We use them here to find where the CPU cycles are consumed, and thus what makes the query slow.

First, a short recap of stack profiling with Flame Graph visualization and of the tools we can use to apply it to Apache Spark workloads:

Flame Graphs provide information on the "hot methods" consuming CPU
Flame Graphs and profiling can also be used to profile time spent waiting (off-cpu) and memory allocation

Grafana Pyroscope simplifies data collections and visualization, using agents and a custom WebUI. Key motivations for using it with Spark are:

Streamlined Data Collection & Visualization: The Pyroscope project page offers a simplified approach to data gathering and visualization with its custom WebUI and agent integration.
Java Integration: The Pyroscope java agent is tailored to work seamlessly with Spark. This integration shines especially when Spark is running on various clusters such as YARN, K8S, or standalone Spark clusters.
Correlation with Grafana: Grafana’s integration with Pyroscope lets you juxtapose metrics with other instruments, including the Spark metrics dashboard.
Proven Underlying Technology: For Java and Python, the tech essentials for collecting stack profiling data, async-profiler and py-spy, are time-tested and reliable.
Functional & Detailed WebUI: Pyroscope’s WebUI stands out with features that allow users to:
- Select specific data periods
- Store and display data across various measurements
- Offer functionalities to contrast and differentiate measurements
- Showcase collected data for all Spark executors, with an option to focus on individual executors or machines
Lightweight Data Acquisition: The Pyroscope java agent is efficient in data gathering. By default, stacks are sampled every 10 milliseconds and uploaded every 10 seconds. We did not observe any measurable performance or stability impact of the instrumentation.
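
As a reminder of what the profiler data looks like before it becomes a Flame Graph, the sketch below aggregates a handful of sampled stacks into the "folded" form (one line per distinct stack, with a count). The frame names are hypothetical and the code is independent of Pyroscope and async-profiler. At one sample every 10 milliseconds, each busy thread contributes about 100 such samples per second.

```python
from collections import Counter

def fold_stacks(samples):
    """Count identical call stacks; each sample is a list of frames, root first."""
    return Counter(";".join(frames) for frames in samples)

def self_samples_by_frame(folded):
    """Count the samples in which each frame was on top of the stack,
    that is, the on-CPU 'hot methods'."""
    hot = Counter()
    for stack, count in folded.items():
        hot[stack.split(";")[-1]] += count
    return hot

# Hypothetical samples, as a profiler could collect them from one thread
samples = [
    ["run", "join", "evalPredicate", "castToTimestamp"],
    ["run", "join", "evalPredicate", "castToTimestamp"],
    ["run", "join", "evalPredicate", "castToTimestamp"],
    ["run", "join", "evalPredicate"],
    ["run", "scan", "readParquet"],
]

folded = fold_stacks(samples)
for stack, count in folded.most_common():
    print(stack, count)

hot = self_samples_by_frame(folded)
print("hottest frame:", hot.most_common(1)[0])
```

A Flame Graph is essentially a drawing of these folded stacks: the width of each box is proportional to the number of samples that include that frame.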

Spark Configuration

To use Pyroscope with Spark we used some additional configurations. Note this uses a specialized Spark Plugin from this repo. It is also possible to use java agents. The details are at:

How-to profile Apache Spark jobs using Grafana Pyroscope

This is how we profiled and visualized the Flame Graph of the query execution:

1. Start Pyroscope

Download from https://github.com/grafana/pyroscope/releases
CLI start: ./pyroscope -server.http-listen-port 5040
Or use docker: docker run -it -p 5040:4040 grafana/pyroscope
Note: customize the port number, I used port 5040 to avoid confusion with the Spark WebUI which defaults to port 4040 too

2. Start Spark with custom configuration, as in this example with PySpark:

# Get the Spark session
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DCS analysis").master("yarn")
         .config("spark.jars.packages",
                 "ch.cern.sparkmeasure:sparkplugins_2.12:0.3,io.pyroscope:agent:0.12.0")
         .config("spark.plugins", "ch.cern.PyroscopePlugin")
         .config("spark.pyroscope.server", "http://pyroscope_hostname:5040")
         .getOrCreate()
         )

Figure 2: This is a snapshot from the Grafana Pyroscope dashboard with data collected during the execution of the slow query (join between EVENTHISTORY and LUMINOSITY). The query runs in 27 minutes, using 32 cores. The dashboard shows the top executed methods and the Flame Graph. Notably, a large fraction of the execution time appears to be spent in SparkDateTimeUtils performing date-datatype conversion operations. This is a crucial finding for the rest of the troubleshooting and the proposed fix.

The Insight

Using profiling data from Pyroscope, we pinpointed the root cause of the query's sluggishness. Spark was expending excessive CPU cycles on data type conversion operations during the evaluation of the join predicate. Upon revisiting the WebUI and delving deeper into the execution plan under the SQL/DataFrame tab, we discovered, almost concealed in plain view, the specific step responsible for the heightened CPU consumption:

(9) BroadcastNestedLoopJoin [codegen id : 2]

Join condition: ((ts#1 >= cast(starttime_dec#57 as timestamp)) AND (ts#1 <= cast(endtime_dec#58 as timestamp)))

The extra operations of "cast to timestamp" appear to be key in explaining the issue.

Why do we have date format conversions?

By inspecting the schema of the involved tables, it turns out that in the LUMINOSITY table the fields used for joining with the timestamp are of type Decimal.

To recap, profiling data, together with the execution plan, showed that the query was slow because it forced data type conversion over and over for each row where the join condition was evaluated.

The fix:

The solution we applied for this was simple: we converted to use the same data type for all the columns involved in the join, in particular converting to timestamp the columns starttime and endtime of the LUMINOSITY table.
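
The effect of the fix can be illustrated with a small pure-Python analogy: converting the interval bounds inside the join loop versus converting them once up front. This is only a sketch of the reasoning, with invented sample data and helper names. It does not reproduce Spark's code generation or its cast implementation.

```python
from datetime import datetime, timezone
from decimal import Decimal

def to_timestamp(epoch_seconds):
    """Convert a Decimal epoch value (seconds) to a timezone-aware datetime."""
    return datetime.fromtimestamp(float(epoch_seconds), tz=timezone.utc)

def join_cast_per_comparison(events, intervals_dec):
    """Convert both interval bounds every time the predicate is evaluated."""
    matches, casts = [], 0
    for ts in events:
        for lumi, start_dec, end_dec in intervals_dec:
            start, end = to_timestamp(start_dec), to_timestamp(end_dec)
            casts += 2
            if start <= ts <= end:
                matches.append((lumi, ts))
    return matches, casts

def join_cast_once(events, intervals_dec):
    """Convert the interval bounds once, then join on matching data types."""
    intervals_ts = [(lumi, to_timestamp(s), to_timestamp(e))
                    for lumi, s, e in intervals_dec]
    casts = 2 * len(intervals_ts)
    matches = [(lumi, ts) for ts in events
               for lumi, s, e in intervals_ts if s <= ts <= e]
    return matches, casts

# Invented data: three events and three 10-second intervals on 2023-01-01 UTC
events = [datetime(2023, 1, 1, 0, 0, s, tzinfo=timezone.utc) for s in (5, 15, 25)]
intervals_dec = [
    (1, Decimal("1672531200"), Decimal("1672531209")),
    (2, Decimal("1672531210"), Decimal("1672531219")),
    (3, Decimal("1672531220"), Decimal("1672531229")),
]

slow_matches, slow_casts = join_cast_per_comparison(events, intervals_dec)
fast_matches, fast_casts = join_cast_once(events, intervals_dec)
print(f"same result: {slow_matches == fast_matches}")
print(f"conversions: {slow_casts} per-comparison vs {fast_casts} converted once")
```

The number of conversions drops from 2 x N x M to 2 x M, where N is the number of events and M the number of intervals. With N in the tens of millions, this is the difference that the Flame Graph made visible.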

Results: improved performance 70x:

The results are that the query after the change runs in 23 sec, compared to the previous runtime of 27 minutes. Figure 3 shows the Flame graph after the fix was applied.

Figure 3: This is a snapshot of the Grafana Pyroscope dashboard with data collected during the execution of the query after tuning. The query takes only 23 seconds compared to 27 minutes before tuning (see Figure 2)

Wrapping up

Stack profiling and Flame Graph visualization are more than jargon: they can be game-changers. Our deep dive showed how they helped improve the performance of an Apache Spark query by 70x. Using Grafana Pyroscope with Spark, we demonstrated a holistic approach to gathering, analyzing, and leveraging stack profile data.

A hearty thank you to my colleagues at CERN for their guidance. A special nod to the CERN data analytics, monitoring, and web notebook services, and to the ATLAS database team.

canali Wed, 09/27/2023 - 15:47

Tags

Apache Spark

Performance

flame graph

Performance Comparison of 5 JDKs on Apache Spark

canali — Fri, 11 Aug 2023 14:39:59 +0000

Performance Comparison of 5 JDKs on Apache Spark

Blog article:

Dive into a comprehensive load-testing exploration using Apache Spark with CPU-intensive workloads. This blog provides a comparative analysis of five distinct JDKs' performance under heavy-duty tasks generated through Spark. Discover a meticulous breakdown of our testing methodology, tools, and insightful results. Keep in mind, our observations primarily indicate the test toolkit and system's performance rather than offering a broad evaluation of the JDKs.

In this post, we'll also emphasize:

The rationale behind focusing on CPU and memory-intensive workloads, especially when handling large Parquet datasets.
The load testing tool's design: stressing CPU and memory bandwidth with large Parquet files.
Key findings from our tests, offering insights into variations across different JDKs.
Tools and methods employed for the most accurate measurements, ensuring our results are as reflective of real-world scenarios as possible.

Join us on this journey to decipher the intricate landscape of JDKs in the realm of Apache Spark performance!

On the load testing tool and instrumentation

What is being measured:

This is a microbenchmark of CPU and memory bandwidth; the tool is not intended to measure the performance of Spark SQL.
It follows the general ideas of active benchmarking: a load generator produces a CPU and memory-intensive load, while the load is measured with instrumentation.

Why testing with a CPU and memory-intensive workload:
In real life, CPU and memory-intensive workloads are often the most critical ones, in particular when working with large datasets in Parquet format. Moreover, workloads that include I/O time from object storage can introduce a lot of variability in the results that does not reflect the performance of Apache Spark but rather of the object storage system. Working on a single large machine also reduces the variability of the results and makes it easier to compare the performance of different test configurations.

The test kit:
The testing toolkit used for this exercise is described at test_Spark_CPU_memory.

The tool generates CPU and memory-intensive load, with a configurable number of concurrent workers.
It works by reading a large Parquet file. The test setup is such that the file is cached in system memory, so the tool mostly stresses CPU and memory bandwidth.

Instrumentation:
The workload is mostly CPU-bound, therefore the main metrics of interest are CPU time and elapsed time. Using sparkMeasure, we can also collect metrics on the Spark executors, notably the executors' cumulative elapsed time, CPU time, and time in garbage collection.

Download test data:
The test data used to generate the workload is a large Parquet table, store_sales, taken from the open source TPCDS benchmark. The size of the test data is 200 GB, and it is stored in multiple Parquet files. You can also use a subset of the files in case you want to scale down the benchmark.
The files are cached in the filesystem cache, so that the test kit mostly stresses CPU and memory bandwidth. Note that this requires 512 GB of RAM on the test system; if you have less RAM, reduce the dataset size.

Download using: wget -r -np -nH --cut-dirs=2 -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/TPCDS/store_sales.par…

Test results:
Tests were run using the script spark_test_JDKs.sh, which runs test_Spark_CPU_memory.py with different JDKs and prints out the results. The output of three different tests was collected and stored in txt files that can be found in the Data folder.

Test system:
A server with dual CPUs (AMD Zen 2 architecture), 16 physical cores each, 512 GB RAM, ~300 GB of storage space.

Spark configuration:
We run Apache Spark in local mode (that is, on a single machine, not scaling out on a cluster) for these tests, with 64 GB of heap memory and 20 cores allocated to Spark. The large heap allocation, which still fits in the available RAM, is meant to reduce garbage collection overhead.
The number of cores for Spark (that is, the maximum number of concurrent tasks executed by Spark) is set to 20, which loads about 60% of the 32 physical cores during test execution. The workload keeps those CPUs busy processing Parquet files, while the rest of the CPU power remains available for accessory load, notably garbage collection activities, the OS, and other processes.

Example performance test results:
This shows how you can use the toolkit to run the performance tests and collect performance measurements:

$ export JAVA_HOME=.... # Set the JDK that will be used by Spark
$ ./test_Spark_CPU_memory.py --num_workers 20 # Run the 3 tests using 20 concurrent workers (Spark cores)

Allocating a Spark session in local mode with 20 concurrent tasks
Heap memory size = 64g, data_path = ./store_sales.parquet
sparkmeasure_path = spark-measure_2.12-0.23.jar
Scheduling job number 1
Job finished, job_run_time (elapsed time) = 43.93 sec
...executors Run Time = 843.76 sec
...executors CPU Time = 800.18 sec
...executors jvmGC Time = 27.43 sec
Scheduling job number 2
Job finished, job_run_time (elapsed time) = 39.13 sec
...executors Run Time = 770.83 sec
...executors CPU Time = 755.55 sec
...executors jvmGC Time = 14.93 sec
Scheduling job number 3
Job finished, job_run_time (elapsed time) = 38.82 sec
...executors Run Time = 765.22 sec
...executors CPU Time = 751.68 sec
...executors jvmGC Time = 13.32 sec

Notes:
The elapsed time and the run time decrease with each test run. The improvement is most noticeable from the first to the second run, because various internal Spark structures are being "warmed up" and cached. In all cases, data is read from the filesystem cache, except for the first warm-up runs, which are discarded. Therefore, the test kit mostly stresses CPU and memory bandwidth. For the test results and comparisons, we use the values measured at the 3rd run of each test and average over the available test results for each category.
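To make the selection of the 3rd run concrete, here is a minimal sketch that parses the sample output shown above and extracts the metrics of the last run. The function name parse_runs is hypothetical and not part of the test kit; the parsing assumes the output format stays exactly as in the sample.

```python
import re

# Sample output copied from the example run above.
SAMPLE_OUTPUT = """\
Scheduling job number 1
Job finished, job_run_time (elapsed time) = 43.93 sec
...executors Run Time = 843.76 sec
...executors CPU Time = 800.18 sec
...executors jvmGC Time = 27.43 sec
Scheduling job number 2
Job finished, job_run_time (elapsed time) = 39.13 sec
...executors Run Time = 770.83 sec
...executors CPU Time = 755.55 sec
...executors jvmGC Time = 14.93 sec
Scheduling job number 3
Job finished, job_run_time (elapsed time) = 38.82 sec
...executors Run Time = 765.22 sec
...executors CPU Time = 751.68 sec
...executors jvmGC Time = 13.32 sec
"""

METRIC_RE = re.compile(r"(elapsed time|Run Time|CPU Time|jvmGC Time)\)? = ([\d.]+) sec")


def parse_runs(text):
    """Return one dict of metrics (in seconds) per scheduled job."""
    runs = []
    for line in text.splitlines():
        if line.startswith("Scheduling job number"):
            runs.append({})  # start collecting metrics for a new job
            continue
        match = METRIC_RE.search(line)
        if match and runs:
            runs[-1][match.group(1)] = float(match.group(2))
    return runs


runs = parse_runs(SAMPLE_OUTPUT)
third_run = runs[2]  # the first runs are treated as warm-up
cpu_fraction = third_run["CPU Time"] / third_run["Run Time"]
print(third_run)
print(f"CPU time / run time = {cpu_fraction:.3f}")
```

For this sample, CPU time accounts for roughly 98% of the executors' run time in the 3rd run, consistent with a CPU-bound workload.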

JDK comparison tests

The following tests compare the performance of 5 different JDKs, running on Linux (CentOS 7.9), on a server with dual Zen 2 CPUs, 16 physical cores each, 512 GB RAM, 300 GB of storage space for the test data. The Apache Spark version is 3.5.0 and the test kit is test_Spark_CPU_memory.py. The JDKs tested are:

Adoptium jdk8u392-b08
Adoptium jdk-11.0.21+9
Adoptium jdk-17.0.9+9
Oracle jdk-17.0.9
Oracle graalvm-jdk-17.0.9+11.1

The OpenJDK builds were downloaded from Adoptium Temurin JDK; the Oracle JDKs were downloaded from Oracle JDK.
The Adoptium Temurin OpenJDK builds are free to use (see website).
Notably, the Oracle download page also reports that the JDK binaries are available at no cost under the Oracle No-Fee Terms and Conditions, and the GraalVM Free Terms and Conditions, respectively, see Oracle's webpage for details.

Test results and measurements

Test results summarized in this table are from the test output files, see Data. The values reported here are taken from the test reports, measured at the 3rd run of each test, as the run time improves when running the tests a couple of times in a row (as internal structures and caches are warming up, for example). The results are further averaged over the available test results (6 test runs) and reported for each category.

JDK and Metric name	OpenJDK Java 8	OpenJDK Java 11	OpenJDK Java 17	Oracle Java 17	GraalVM Java 17
JDK	Adoptium jdk8u392-b08	Adoptium jdk-11.0.21+9	Adoptium jdk-17.0.9+9	Oracle jdk-17.0.9	Oracle graalvm-jdk-17.0.9+11.1
Elapsed time (sec)	45.4	39.3	42.0	41.9	34.1
Executors' cumulative run time (sec)	896.1	775.9	829.7	828.6	672.3
Executors' cumulative CPU time (sec)	851.9	763.4	800.6	796.4	649.5
Executors' cumulative Garbage Collection time (sec)	42.6	12.3	29.4	32.5	23.0

Performance data analysis

From the metrics and elapsed time measurements reported above, the key findings are:

Java 8 has the slowest elapsed time, Java 11 and 17 are about 10% faster than Java 8, GraalVM is about 25% faster than Java 8.
The workload is CPU bound.
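The percentages quoted above can be reproduced from the elapsed times in the table. The following is a short sketch of that arithmetic; the dictionary and variable names are only illustrative.

```python
# Elapsed time (sec) per JDK, copied from the results table.
elapsed = {
    "OpenJDK Java 8": 45.4,
    "OpenJDK Java 11": 39.3,
    "OpenJDK Java 17": 42.0,
    "Oracle Java 17": 41.9,
    "GraalVM Java 17": 34.1,
}

baseline = elapsed["OpenJDK Java 8"]

# Reduction in elapsed time relative to Java 8, in percent.
reduction_pct = {
    jdk: round(100 * (baseline - seconds) / baseline, 1)
    for jdk, seconds in elapsed.items()
}

for jdk, pct in reduction_pct.items():
    print(f"{jdk}: {pct}% less elapsed time than Java 8")
```

With these numbers, Java 11 comes out at about 13%, the two Java 17 builds at about 8%, and GraalVM at about 25% less elapsed time than Java 8.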

The instrumentation metrics provide additional clues on understanding the workload and its performance:

Run time reports the cumulative elapsed time for the executors.
CPU time reports the cumulative time spent on CPU.
Garbage Collection time is the time spent by the executors on JVM garbage collection, and it is a subset of the "Run time" metric.
From the measured values (see table above) we can conclude that the executors spend most of the time running tasks "on CPU", with some time spent on garbage collection.
We can see some fluctuations in Garbage Collection time, with Java 8 having the longest GC time. Note that the G1GC algorithm was used in all the tests (its use is set as a configuration by the load generation tool test_Spark_CPU_memory.py).
GraalVM 17 stands out as having the shortest executors' run time. We can speculate that this is due to the GraalVM just-in-time compiler and the Native Image feature, which provide several optimizations compared to the standard HotSpot JVM. (Before rushing to install GraalVM for your Spark jobs, please note that there are other factors at play here, including that the Native Image feature is an optional early adopter technology; see Oracle documentation for details.)
Java 8 shows the worst performance in terms of run time and CPU time, and it also has the longest Garbage Collection time. This is not surprising, as Java 8 is the oldest of the JDKs tested here, and it is known to have worse performance than newer JDKs.
Java 11 and Java 17 have similar performance, with Java 11 being a bit faster than Java 17 (of the order of 5% for this workload). The origin of this difference is not investigated here.
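The statement that executors spend most of their time on CPU can be checked against the table by dividing CPU time and Garbage Collection time by the cumulative run time. This is a small sketch of that check; the names are illustrative.

```python
# (run time, CPU time, GC time) in seconds, copied from the results table.
metrics = {
    "OpenJDK Java 8": (896.1, 851.9, 42.6),
    "OpenJDK Java 11": (775.9, 763.4, 12.3),
    "OpenJDK Java 17": (829.7, 800.6, 29.4),
    "Oracle Java 17": (828.6, 796.4, 32.5),
    "GraalVM Java 17": (672.3, 649.5, 23.0),
}

# Fraction of the executors' run time spent on CPU and in garbage collection.
cpu_share = {jdk: cpu / run for jdk, (run, cpu, gc) in metrics.items()}
gc_share = {jdk: gc / run for jdk, (run, cpu, gc) in metrics.items()}

for jdk in metrics:
    print(f"{jdk}: CPU {cpu_share[jdk]:.1%}, GC {gc_share[jdk]:.1%} of run time")

longest_gc = max(gc_share, key=gc_share.get)
print("Largest GC share:", longest_gc)
```

In all five cases the CPU share is above 95% of the run time and the GC share is below 5%, with Java 8 showing the largest GC share.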

Active benchmarking and sanity checks

The key idea of active benchmarking is that while the load testing tool is running, we also take several measurements and metrics using a variety of monitoring and measuring tools, for OS metrics and application-specific metrics. These measurements are used to complement the analysis results, provide sanity checks, and in general to help understand the performance of the system under test (why is the performance that we see what it is? why not higher/lower? Are there any bottlenecks or other issues/errors limiting the performance?).

Spark tools: the application-specific instrumentation used for these tests consisted of the Spark WebUI and sparkMeasure, which allowed us to confirm that the workload is CPU-bound and to measure the CPU time and garbage collection time.

Java FlameGraph: Link to a FlameGraph of the execution profile taken during a test run of test_Spark_CPU_memory.py. The FlameGraph shows that the workload is CPU-bound, and that the time is spent in the Spark SQL code, in particular in the Parquet reader. FlameGraphs are a visualization tool for profiling the performance of applications, see also Tools_FlameGraphs.md.

OS Tools (see also OS monitoring tools): Another important aspect was to ensure that the data was cached in the filesystem cache, to avoid the overhead of reading from disk. For this, tools like iostat and iotop were used to monitor disk activity and confirm that I/O on the system was minimal, implying that data was read from the filesystem cache.
A more direct measurement was taken using cachestat, a tool found in the perf-tools collection and in the bcc tools, which measures how many reads hit the filesystem cache. The hit rate was 100% after the first couple of runs, which populated the cache (and were not taken into consideration for the test results).
CPU measurements were taken using top, htop, and vmstat to monitor CPU usage and ensure that the CPUs were not saturated.

Other sanity checks: we checked that the intended JDK was used in a given test, using top and jps, for example.
Another important check concerns the stability of the performance test measurements. We notice fluctuations in the execution time across runs with the same parameters. For this reason, the load-testing tool is run on a single machine rather than on a cluster, where these differences are amplified. Moreover, the tests are run multiple times, and the results reported are averages. We estimated the errors in the metrics measurements due to these fluctuations to be less than 3%; see also the raw test results available at Data.

Related work

The following references provide additional information on the topics covered in this note.

test_Spark_CPU_memory.py used to test CPU performance on two different architectures
- see CPU and Memory testing with Spark and pdf
CPU load-testing kit
Metrics collection for Apache Spark performance troubleshooting: sparkMeasure
Active benchmarking
A short list of OS monitoring tools
A note on how to specify a custom Java Home/Java version for Spark executors on YARN

Conclusions

This blog post presents an exploration of load-testing methodologies using Apache Spark and a custom CPU and memory-intensive testing toolkit. The focus is on comparing different JDKs and producing insights into their respective performance when running Apache Spark jobs under specific conditions (CPU and memory-intensive load when reading Parquet files). Upon evaluating Apache Spark's performance across different JDKs in CPU and memory-intensive tasks involving Parquet files, several key findings emerged:

JDK's Impact: The chosen JDK affects performance, with significant differences observed among Java 8, 11, 17, and GraalVM.
Evolution of JDKs: Newer JDK versions like Java 11 and 17 showcased better outcomes compared to Java 8. GraalVM, with its specific optimizations, also stood out.
Developer Insights: Beyond personal preference, JDK selection can drive performance optimization. Regular software updates are essential.
Limitations: Our results are based on specific test conditions. Real-world scenarios might differ, emphasizing the need for continuous benchmarking.
Guidance for System Specialists: This study offers actionable insights for architects and administrators to enhance system configurations for Spark tasks.

In essence, the choice of JDK, coupled with the nature of the workload, plays a significant role in Apache Spark's efficiency. Continuous assessment is crucial to maintain optimal performance.

Acknowledgements

canali Fri, 08/11/2023 - 16:39

Tags

Apache Spark

Java

Performance

Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

canali — Thu, 22 Jun 2023 09:20:34 +0000

Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

Blog article:

This blog post is about building a getting-started example for semantic search using vector databases and large language models (LLMs), an example of retrieval augmented generation (RAG) architecture. You can find the accompanying notebook at this link. See also the SWAN gallery.

CERN users can run the notebooks using the SWAN platform and GPU resources.

Other options for running the notebooks in the cloud with a GPU include Google's Colab.

Goals and Scope

Our primary goal is to demonstrate the implementation of a search engine that focuses on understanding the meaning of documents rather than relying solely on keywords.

The proposed implementation uses resources currently available to CERN users: Jupyter notebooks with GPUs, Python packages from the open source ecosystem, a vector database.

Limitations: it's important to note that this example does not cover building a fully-fledged search service or chat engine. We leave those topics for future work; here we limit the discussion to a getting-started example and a technology demonstrator.

Understanding Key Concepts

Semantic search: Semantic search involves searching for meaning rather than just literal matches of query words. By understanding the context and intent behind the query, semantic search engines can provide more accurate and relevant results.

Vector Database: A vector database is a specialized type of database designed to handle vector embeddings. These embeddings represent data in a way that captures essential semantic information. They are widely used in applications such as large language models, generative AI, and semantic search.

Large Language Models (LLMs): LLMs are powerful language models built using artificial neural networks with a vast number of parameters (ranging from tens of millions to billions). These models are trained on extensive amounts of unlabeled text data using self-supervised or semi-supervised learning techniques.

Implementation details

Building a semantic search prototype has become more accessible due to recent advancements in natural language processing and applied ML/AI. Using off-the-shelf components and integrating them effectively can accelerate the development process. Here are some notable key ingredients that facilitate this implementation:

Large Language Models (LLMs) and embedding Libraries:
- The availability of powerful LLMs such as OpenAI GPT-3.5 and GPT-4, Google's PaLM 2, and of embedding libraries, significantly simplifies the implementation of semantic search and natural language processing in general. These models provide comprehensive language understanding and generation capabilities, enabling us to extract meaning from text inputs.
Platforms:
- Platforms and cloud services such as Hugging Face offer valuable resources for working with ML models: they provide pre-trained models, tokenization utilities, and interfaces to interact with LLMs, reducing the implementation complexity.
Open Source Libraries like LangChain:
- Open source libraries like LangChain provide a convenient way to integrate and orchestrate the different components required for building applications in the semantic search domain. These libraries often offer pre-defined pipelines, data processing tools, and easy-to-use APIs, allowing developers to focus on the core logic of their applications.
Vector Databases and Vector Libraries:
- Vector libraries play a crucial role in working with semantic embeddings. They provide functionalities for vector manipulation, similarity calculations, and operations necessary for processing and analyzing embedding data. Additionally, vector databases are recommended for advanced deployments, as they offer storage and querying capabilities for embeddings, along with metadata storage options. Several solutions are available in this area, ranging from mature products offered as cloud services to open source alternatives.

Back-end: prepare the embeddings and indexes in a vector database

To ensure factual accuracy and preserve the original document references, we will prepare the embeddings and indexes in a vector database for our semantic search query engine. Additionally, we aim to enable indexing of private documents, which necessitates storing the embeddings rather than relying on the LLM model directly.

Transforming document chunks into embedding vectors is a crucial step in the process. There are specialized libraries available that utilize neural networks for this task. These libraries can be accessed as cloud services or downloaded to run on local GPU resources. The accompanying notebook includes an example demonstrating this process.

A second important part is storing the embeddings. For this, a vector library or a vector database can be quite useful. A library like FAISS is a good choice if you have a small number of documents and/or are just prototyping. A vector DB can provide more features than a simple library, in particular when handling large amounts of documents. In the accompanying notebook we use the FAISS library and, as an alternative option, OpenSearch k-NN indexing. Note that several other vector database products can be readily "substituted" to offer comparable and, in some cases, extended functionality.

Note: CERN users have the option to contact the OpenSearch service to request an instance of OpenSearch equipped with the plugin for k-NN search. This can be a valuable resource for your semantic search implementation.

Figure 1: A schematic diagram of how to prepare a set of documents for semantic search. The documents are split into chunks; for each chunk, embeddings are computed with a specialized library and then stored in a vector database.
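The first step of this preparation, splitting documents into chunks, can be sketched in plain Python. This is a minimal illustration, assuming fixed-size character windows with some overlap; the function name split_into_chunks and the default sizes are hypothetical, and the accompanying notebook may use a different splitting strategy.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Toy document of 50 characters, split into windows of 20 with an overlap of 5.
document = "abcdefghij" * 5
chunks = split_into_chunks(document, chunk_size=20, overlap=5)
for number, chunk in enumerate(chunks):
    print(number, chunk)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing a few more embeddings.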

The accompanying notebook shows how embedding and indexing can be done when using FAISS as the vector library, together with the equivalent code when using OpenSearch as the vector DB.

Semantic querying using similarity search and vector DB indexes

This uses a key functionality of vector libraries and vector databases: similarity search. The general idea is to create a vector embedding for the query and find in the database of embedded vectors the closest elements to the query. For large document collections this can be slow, so vector libraries and databases implement specialized indexes and algorithms for this, for example approximate k-nearest neighbors search.

Figure 2: A diagram of the similarity query process. The query is converted into embeddings and similarity search via the specialized indexes is performed using a vector database or vector library. Algorithms such as k-nearest neighbors are used to find the matching document chunks for the given query.
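The idea behind similarity search can be shown with an exact, brute-force nearest-neighbour search in plain Python. This is only a sketch: the three-dimensional vectors are hypothetical stand-ins for real embeddings, and vector libraries and databases replace this linear scan with specialized (often approximate) indexes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (similarity, chunk_id) pairs closest to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical toy "embeddings" for three document chunks.
index = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.7, 0.7, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]

results = top_k(query_vector, index, k=2)
for similarity, chunk_id in results:
    print(f"{chunk_id}: {similarity:.3f}")
```

The linear scan compares the query with every stored vector, which is why it becomes slow for large collections and why specialized indexes are used in practice.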

Semantic search provides a list of relevant documents for a user query, listing the page and text chunk reference for each match.

Grand Finale: a Large Language Model for natural language query capabilities

Semantic search returns a list of relevant document snippets. As the last (optional) step, we want to convert that list into a coherent text answer, and for this we can use an LLM. The technique is simple: we feed the query and the relevant pieces of text to the LLM and take the answer from the model. This requires a rather sophisticated LLM. The best ones currently work as cloud services (some are free and some charge per use), while models available for free download currently require rather powerful GPUs to run locally.
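Feeding the query and the retrieved text to the LLM comes down to assembling a prompt. Here is a minimal sketch of that step; the function name build_rag_prompt, the prompt wording, and the placeholder snippets are all hypothetical, and the call to the model itself is left out because it depends on the chosen LLM service or library.

```python
def build_rag_prompt(question, snippets):
    """Combine the user question with retrieved snippets into a single prompt."""
    context = "\n\n".join(
        f"[{number}] (page {snippet['page']}) {snippet['text']}"
        for number, snippet in enumerate(snippets, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the snippet numbers you used. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder snippets standing in for the output of the semantic search step.
snippets = [
    {"page": 12, "text": "Placeholder text of the first retrieved chunk."},
    {"page": 47, "text": "Placeholder text of the second retrieved chunk."},
]

prompt = build_rag_prompt("What are the main computing challenges?", snippets)
print(prompt)
```

Keeping the page reference next to each snippet lets the answer point back to the original document.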

This is the final result: a system capable of querying the indexed text(s) using natural language. In the following example we apply it to replying to queries about the future of LHC computing, based on the document A Roadmap for HEP Software and Computing R&D for the 2020s.

Conclusions

In this blog post, we have demonstrated how to build a beginner's semantic search system using vector databases and large language models (LLMs). Our example has utilized Jupyter notebooks with GPUs, Python packages, and a vector database, proving that a semantic search engine that queries documents for meaning, instead of just keywords, can be feasibly built using existing resources.

In our implementation, we demonstrated how embeddings and indexing can be performed using FAISS as the vector library, or alternatively using OpenSearch as the vector database. We then moved on to the semantic query process using similarity search and vector DB indexes. To finalize the results, we utilized an LLM to convert the relevant document snippets into a coherent text answer.

Though the example provided is not intended to function as a fully-developed search service, it serves as an excellent starting point and technological demonstrator for those interested in semantic search engines. Additionally, we acknowledge the potential of these methods to handle private documents and produce factually accurate results with original document references.

We believe the combination of semantic search, vector databases, and large language models holds great potential for transforming how we approach information retrieval and natural language processing tasks.

The accompanying notebook, providing step-by-step code and more insights, is accessible on GitHub and via the CERN SWAN Gallery. For researchers and developers interested in delving into this exciting area of applied ML/AI, it offers a working example that can be run using CERN resources on SWAN, and also can run on Colab.

Acknowledgements

I would like to express my sincere gratitude to my colleagues at CERN for their invaluable assistance and insightful suggestions, in particular I'd like to acknowledge the CERN data analytics and web notebook services, the OpenSearch service, and the ATLAS database and data engineering teams. Their expertise and support have played a crucial role in making this collection of notebooks possible. Thank you for your contributions and dedication.

canali Thu, 06/22/2023 - 11:20

Exploratory Notebooks for Deep Learning, AI, and Data Tools: A Beginner's Guide

canali — Thu, 01 Jun 2023 14:53:43 +0000

Exploratory Notebooks for Deep Learning, AI, and Data Tools: A Beginner's Guide

Blog article:

Are you looking for resources to get up to speed with popular Deep Learning and Data processing frameworks? This blog entry provides a curated collection of notebooks that will help you kickstart your journey.

You can find the notebooks at this link. See also the SWAN gallery.

CERN users can run the notebooks on the SWAN platform, using GPU resources.

Other options for running the notebooks in the cloud with a GPU include Google's Colab.

Getting started with Deep Learning

These notebooks showcase a digit recognition classifier using the MNIST dataset, which serves as a "Hello World!" for Deep Learning. Choose from the following options to get started:

Deep Learning and basic Data pipelines

Learn how to integrate Deep Learning frameworks with basic data pipelines using Pandas to feed data into the DL training step. These notebooks implement a Particle classifier using various DL frameworks. The data is stored in Parquet format, offering efficient data reading.

More advanced Data pipelines

Take your data processing skills to the next level with these notebooks, which demonstrate advanced data pipelines suitable for large datasets. Discover how to leverage the Petastorm library to read data from Parquet files with TensorFlow and PyTorch, as well as utilizing the TFRecord format with TensorFlow.

Additional complexity with models and data

Building upon the previous examples, these notebooks introduce more complex models and larger datasets for the Particle classifier. Explore the capabilities of TensorFlow, GRU, Transformer, and TFRecord with:

TensorFlow for the Inclusive Classifier, with GRU and TFRecord
- Description: This notebook focuses on training with data stored in TFRecord format.
- TensorFlow is configured to run on a GPU, and a GRU-based model architecture is employed.
TensorFlow for the Inclusive Classifier, with Transformer and TFRecord
- Description: This notebook focuses on training with data stored in TFRecord format.
- TensorFlow is configured to run on a GPU, and a Transformer-based model architecture is employed.

AI Tools Examples

This section contains Jupyter notebook examples of AI tools, including LLMs, Transformers, vector databases. The notebooks are intended to be run using GPU resources.

Transformers library

Explore the powerful Transformers library from Hugging Face, widely used for LLM, Natural Language Processing (NLP), image, and speech tasks.

Large language models

These notebooks provide examples of how to use LLMs in notebook environments for tests and prototyping

Semantic search with Vector Databases and LLM

Semantic search allows you to query a set of documents by meaning. This example shows how to create vector embeddings, store them in a vector database, and perform semantic queries enhanced with an LLM.

Semantic search with Vector Databases and LLM

Data Tools Examples

This section offers example notebooks featuring popular frameworks and libraries for handling data. Please note that it does not cover scale-out data solutions such as Spark and Dask.

For Apache Spark see SparkTraining

If you require access to relational databases for testing, CERN users can reach out to Oracle and DBOD services. You can also set up test databases using container technology. Here's how:

Running a test Oracle instance on a container:

Run Oracle Free on a container from gvenzl dockerhub repo https://github.com/gvenzl/oci-oracle-free
- docker run -d --name mydb1 -e ORACLE_PASSWORD=oracle -p 1521:1521 gvenzl/oracle-free:latest
- Wait until the DB is started (this may take a few minutes). Check progress with: docker logs -f mydb1
- install the Python library for connecting to Oracle: pip install oracledb

Setting up a PostgreSQL instance for testing using a Docker image:

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
wait till the DB is started, check logs at: docker logs -f some-postgres
install the Python library for connecting to PostgreSQL: pip install psycopg2-binary
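Once the containers are running, a first connection test from Python could look like the sketch below. It assumes the passwords set in the docker commands above and the images' default accounts: the system user and a pluggable database named FREEPDB1 for Oracle Free, and the postgres user and database for PostgreSQL. These defaults and the helper names are assumptions to verify against the image documentation. The drivers are imported inside the functions, so the snippet loads even when only one of them is installed.

```python
# Connection parameters: passwords from the docker commands above,
# user and database names assumed from the images' defaults.
ORACLE_PARAMS = {
    "user": "system",
    "password": "oracle",
    "dsn": "localhost:1521/FREEPDB1",
}
POSTGRES_PARAMS = {
    "host": "localhost",
    "port": 5432,
    "user": "postgres",
    "password": "mysecretpassword",
    "dbname": "postgres",
}


def query_oracle(sql="select sysdate from dual"):
    """Run a query against the test Oracle container and return all rows."""
    import oracledb  # installed with: pip install oracledb

    with oracledb.connect(**ORACLE_PARAMS) as connection:
        with connection.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()


def query_postgres(sql="select now()"):
    """Run a query against the test PostgreSQL container and return all rows."""
    import psycopg2  # installed with: pip install psycopg2-binary

    connection = psycopg2.connect(**POSTGRES_PARAMS)
    try:
        with connection.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()
    finally:
        connection.close()
```

Calling query_oracle() or query_postgres() with the corresponding container up should return a single row with the current database time.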

Pandas and numpy examples

Database examples

Conclusions and acknowledgments

This blog entry provides a valuable collection of exploratory notebooks for individuals who are new to deep learning and data processing. With a focus on popular frameworks and libraries, these notebooks cover a range of topics including digit recognition, transformers for various tasks, integrating deep learning with data pipelines, advanced data processing techniques, and examples of data tools. Whether you are a CERN user or prefer cloud-based platforms like Google's Colab, these notebooks will help you quickly grasp the fundamentals and get started on your deep learning and data processing journey.

I would like to express my sincere gratitude to my colleagues at CERN for their invaluable assistance and insightful suggestions, in particular I'd like to acknowledge the CERN data analytics and web notebook services and ATLAS database and data engineering teams. Their expertise and support have played a crucial role in making this collection of notebooks possible. Thank you for your contributions and dedication.

canali Thu, 06/01/2023 - 16:53

Tags

Deep learning

Jupyter notebook

CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers

canali — Thu, 04 May 2023 12:38:32 +0000

CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers

Blog article:

This document describes some basic CPU load testing exercises on three different types of database servers used by the Oracle Service at CERN. It reports on the tests performed, tools used for data gathering, data analysis, findings, and lessons learned.

Motivations

CPU usage is important for data processing: We observe that workloads on Oracle database services at CERN are often CPU-bound. Database workloads for transactional processing perform many random read operations. In the past, this mostly stressed the I/O subsystem. These days we deploy databases with large buffer caches (400 GB or more of data block caches), and most operations are CPU-bound, reading data from the buffer cache.

Server consolidation, quality of service and licensing: We deploy on commodity hardware under various constraints, striking a balance between consolidating workloads and isolating critical workloads from different user communities. Moreover, Oracle licensing costs, which are proportional to the number of deployed CPUs, are a key input to the efforts to streamline CPU deployments across the DB service.

Description and limitations of the tests

The tests reported here are extremely limited in scope, as they focus only on CPU performance and use two specific and “narrow” workloads. However, I believe they provide some indications of the CPU performance behavior and the overall CPU capacity of the installed servers. The comparison between three different server models is the original motivation of this work, as we wanted to understand how newer models can be deployed to replace old ones. This work is not a benchmark of the tested systems.

Tools used for load testing

The first workload generator and testing tool is a simple script that burns CPU cycles in a loop and is executed by multiple workers running in parallel. Two implementations have been used, one in Python and one in Rust compiled to binary. Both provide similar results.

The second workload generator is SLOB, a tool that runs on top of Oracle databases for testing and specifically stresses “Logical IO”, that is reading blocks from the Oracle buffer cache (memory).

Links to the code, measured data, and data analyses using notebooks:

CPU load testing kit - Python version	Kit for load testing and measuring CPU-intensive workloads, Python version.
CPU load testing kit - Rust version	Kit for load testing and measuring CPU-intensive workloads, Rust version.
Oracle CPU load testing using SLOB	Load testing Oracle using the SLOB test kit.

Key findings

- RAC55 is the newest server model of the three tested and shows the highest CPU per-thread performance and highest CPU total throughput at saturation.

- RAC55 has about 1.5x single-thread performance increase compared to RAC52.

- RAC55 and RAC54 show similar single-thread performance, but this is valid only for low load (up to 8 concurrent workers), as RAC54 has only 8 physical cores vs 16 cores in RAC55. Moreover, RAC54 provides considerably less total CPU throughput compared to RAC52 and RAC55.

- RAC55 has about 2.0x more total CPU throughput at saturation compared to RAC52 despite having only 16 physical cores compared to 20 physical cores in RAC52.

Description of the platforms

CPU load tests have been performed on three dedicated test servers representative of the production database servers in March 2023: RAC52, RAC54, and RAC55.

The servers were installed with RHEL 7.9 and Oracle tests used Oracle 19c (v. 19.17). We omit the configuration of networking and I/O, as not relevant for these tests. We don't report the exact CPU models in this doc.

RAC52 configuration:

20 physical cores (2 sockets, 10 physical cores each), 40 logical cores visible on the OS due to hyperthreading
CPU nominal frequency: 2.20 GHz
CPU from 2016, L1 caches: 32K + 32K, L2 cache 256K, L3 cache 25600K
RAM: DDR4, 512 GB

RAC54 configuration:

8 physical cores (2 sockets, 4 physical cores each), 16 logical cores visible on the OS due to hyperthreading
CPU nominal frequency: 3.80 GHz
CPU from 2019, L1 caches: 32K + 32K, L2 cache 1024K, L3 cache 16896K
RAM: DDR4, 768 GB

RAC55 configuration:

16 physical cores (2 sockets, 8 physical cores each), 32 logical cores visible on the OS due to hyperthreading
CPU nominal frequency: 3.7 GHz
CPU from 2019, L1 caches: 32K + 32K, L2 cache 512K, L3 cache 32768K
RAM: DDR4, 1 TB

Test 1 – Concurrent workers burning CPU cycles in a loop and in parallel

The workload generator and testing tool is a simple Python script burning CPU cycles in a loop.

The script is executed running on a configurable number of concurrent workers. The script measures the time spent executing a simple CPU-burning loop.

This provides a simple way to generate CPU load on the system.

Example of how the data was collected with the testing tool written in Rust and compiled to binary:

./test_cpu_parallel --num_workers 8 --full --output myout.csv

See the code and instructions on how to run it at this link

The advantage of this approach is that the testing tool is easy to write and can be easily automated.

The weak point of testing this way is that the test workload is somewhat “artificial” and disconnected from the server's actual purpose as a DB server. For example, the CPU-burning loop used for this test is mostly instruction-intensive on the CPU and does not spend much time on memory access.
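As a rough illustration of the approach described in this section, the following minimal Python sketch runs the same CPU-burning job in a configurable number of worker processes and collects each worker's job execution time. It is not the code of the testing kit linked above: the function names (burn_cpu, timed_burn, run_load_test), the loop body, and the default iteration count are assumptions made for this example.

```python
import time
from multiprocessing import Pool


def burn_cpu(iterations: int) -> int:
    """Burn CPU cycles with integer arithmetic; the loop touches very little memory."""
    acc = 0
    for i in range(iterations):
        acc = (acc + i * i) % 1_000_003
    return acc


def timed_burn(iterations: int) -> float:
    """Run the CPU-burning loop once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    burn_cpu(iterations)
    return time.perf_counter() - start


def run_load_test(num_workers: int, iterations: int = 5_000_000) -> list[float]:
    """Run the same job concurrently in num_workers processes (hypothetical helper).

    Returns the job execution time measured by each worker.
    """
    with Pool(processes=num_workers) as pool:
        return pool.map(timed_burn, [iterations] * num_workers)
```

Calling run_load_test(8) from a script's main section would run eight concurrent workers and return eight job times. The sketch uses processes rather than threads because in standard CPython builds the global interpreter lock prevents threads from burning CPU cycles in parallel.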

Measurements and results:

The following figures represent the same data in different ways to highlight different performance and scalability characteristics.

Figure 1 – Raw data

- The figure reports the testing job execution time, measured for varying server load on the three tested servers.

- A common pattern is that at low load (see data with just a few parallel workers) the job run time is almost constant.

- An important difference is that the job run time is different on the different platforms, in order of increasing performance: RAC52, RAC54, RAC55 (the newest server and the fastest).

- Another pattern is that the job run time starts to increase linearly at higher load.

- The job execution time curve starts to bend upwards as the load increases. Typically, we see this happening when the number of parallel workers is greater than the number of physical cores on the server (20 cores on RAC52, 8 cores on RAC54 and 16 cores on RAC55).

Figure 2 - Speed

- This plot reports the number of jobs per minute per worker

- Data points can be interpreted as a measure of the “speed of the CPU” for a new job coming into the system given a defined system load

- We see that the “effective CPU speed” decreases as the load increases, with sudden changes at the points where the number of parallel workers is equal to the number of physical cores

- The CPU speed per thread is also different depending on the CPU architecture, in order of increasing performance: RAC52, RAC54, RAC55 (the newest server and the fastest).

Figure 3 - Capacity

- This plot shows the number of jobs executed per minute summed over all the running worker threads.

- As the load increases the server capacity increases, reaching a maximum value at number of workers = number of logical cores (40 for RAC52, 16 for RAC54, 32 for RAC55).

- This allows us to compare the “Total CPU capacity” of the three servers. In order of increasing capacity: RAC54 has the lowest capacity (it is the server with the fewest cores), then RAC52, and finally RAC55 has the highest CPU throughput (it is the newest server).

Figure 4 - Scalability

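- This shows the speedup, a measure of scalability. For the scope of this plot, speedup at a load of N workers is calculated as N * (job execution time at load 1) / (job execution time at load N), that is the total job throughput at load N relative to the throughput of a single worker.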

- We see almost linear scalability for low loads (up to the number of physical cores), then a slower increase up to the number of logical cores, and, eventually, the speedup reaches saturation

- RAC54 and RAC55 appear to scale almost linearly up to the number of physical cores (respectively 8 and 16)
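The quantities plotted in Figures 2 to 4 can be derived from the raw job execution times with simple arithmetic. The sketch below shows one way to compute them, taking speedup as the total throughput at load N relative to the throughput of a single worker, N * t(1) / t(N). The function names are illustrative and the sample timings are made-up placeholder values, not measurements from the tested servers.

```python
def jobs_per_minute_per_worker(job_time_s: float) -> float:
    """Speed (Figure 2): jobs completed per minute by a single worker."""
    return 60.0 / job_time_s


def capacity(num_workers: int, job_time_s: float) -> float:
    """Capacity (Figure 3): jobs per minute summed over all workers."""
    return num_workers * jobs_per_minute_per_worker(job_time_s)


def speedup(num_workers: int, job_time_s: float, job_time_single_s: float) -> float:
    """Scalability (Figure 4): total throughput at load N relative to load 1."""
    return num_workers * job_time_single_s / job_time_s


# Placeholder data: {number of workers: mean job execution time in seconds}.
sample_times = {1: 10.0, 4: 10.0, 8: 12.5, 16: 20.0}

for n, t in sample_times.items():
    print(n, jobs_per_minute_per_worker(t), capacity(n, t), speedup(n, t, sample_times[1]))
```

With these placeholder values the speedup equals the number of workers while the job time stays constant, and falls below it once the job time starts to grow.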

Notes:

- Of the tested servers, RAC55 appears the fastest on per-thread CPU performance at low and high loads and the one with the highest CPU capacity.

- The difference in performance between RAC52 (oldest) and RAC55 (newest) is roughly x1.5 in per-CPU thread performance and x2 in overall CPU capacity at high load.

- RAC54 performs similarly to RAC55 but only at low loads (<= 8 concurrent workers)

Test 2 – Parallel workers running “SLOB tests”, measuring Oracle logical IO throughput

The second workload generator is SLOB, a tool by Kevin Closson, that runs on top of Oracle databases for load testing and specifically stresses Physical and Logical IO. In the configuration used for these tests we only stressed Logical IO, that is accessing blocks from the Oracle buffer cache (memory).

The tool creates test tables on the database and performs block IO reading from the test tables with a tunable number of concurrent workers.

Measurements and results:

The following figures represent the same data in different ways to highlight different performance and scalability characteristics.

The data and the graphs are available in notebooks at this link

Figure 5 – Raw Data and Capacity

- The figure shows how the cumulative Oracle logical IO throughput increases with the number of parallel workers for the three servers tested.

- The common trend is that the Logical IO throughput increases with load up to the number of logical CPUs (16 for RAC54, 32 for RAC55, and 40 for RAC52).

- Measurements are “noisy” so we should take about 10% as the error margin on the collected data points.

- There are differences in performance and total throughput, with RAC55 being the most performant and having the highest throughput.

- At low load (<= 8 concurrent processes), RAC54 and RAC55 have similar performance.

- At high load, RAC55 has about 20% more capacity/throughput of Logical IO than RAC52.

Figure 6 - Speed

- The figure shows Oracle logical IO throughput per worker as a function of load.

- The performance of logical IOs decays with increasing load.

- Logical IO performance appears close to constant up to the number of physical cores of the server (8 for RAC54, 16 for RAC55 and 20 for RAC52) and then decays for higher load, saturating when the number of logical cores is reached.

- Measurements are “noisy” so we should take about 10% as the error margin on the collected data points.

- RAC55 shows the highest performance overall for Logical I/O throughput.

- At load below 16 concurrent workers, RAC55 appears 1.5x faster than RAC52. The gap closes to about 20% at high load.

Figure 7 - Scalability

- The figure shows the speedup, a measure of scalability, as a function of load. For this plot it is calculated as the ratio (cumulative Logical I/O at load n) / (cumulative Logical I/O at load 1). Linear scalability would be represented by a line with speedup = number of parallel workers.

- A general trend observed in the data is that the scalability curves start close to the ideal linear scalability and then bend downwards due to contention.

- RAC55 has the best scalability behavior of the three servers tested. At low load (less than 8 concurrent workers) RAC55 and RAC54 have similar behavior.
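For the SLOB data, the per-worker throughput of Figure 6 and the speedup of Figure 7 follow directly from the cumulative Logical I/O throughput. The sketch below is a minimal illustration: the function names are assumptions, the scaling efficiency (speedup divided by the number of workers) is an extra convenience metric not plotted in the figures, and the sample values are placeholders in arbitrary units, not data measured on the tested servers.

```python
def per_worker_throughput(cumulative_lio: float, num_workers: int) -> float:
    """Speed (Figure 6): Logical I/O throughput per worker."""
    return cumulative_lio / num_workers


def lio_speedup(cumulative_lio: float, cumulative_lio_single: float) -> float:
    """Scalability (Figure 7): cumulative throughput at load n over throughput at load 1."""
    return cumulative_lio / cumulative_lio_single


def scaling_efficiency(cumulative_lio: float, cumulative_lio_single: float, num_workers: int) -> float:
    """Fraction of ideal linear scalability achieved (1.0 means speedup = number of workers)."""
    return lio_speedup(cumulative_lio, cumulative_lio_single) / num_workers


# Placeholder data: {number of workers: cumulative Logical I/O throughput, arbitrary units}.
sample_lio = {1: 1.0e6, 8: 7.6e6, 16: 14.0e6, 32: 20.0e6}

for n, lio in sample_lio.items():
    print(n, per_worker_throughput(lio, n), lio_speedup(lio, sample_lio[1]),
          scaling_efficiency(lio, sample_lio[1], n))
```

A scaling efficiency that drops well below 1.0 corresponds to the scalability curve bending away from the ideal linear line.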

Conclusions

This work collects a few tests and measurements on stress testing and CPU loading on three different platforms of interest for the CERN Oracle database service.

The tests performed are narrow in scope, just addressing the CPU load.

Two different testing tools have been used for these tests: testing with a simple CPU-burning script loop run in parallel, and testing with an Oracle-specific workload generator for Logical I/O.

The tools used, as well as the measured data and their analyses, are available at this link.

We find that the newest server model (RAC55) has the highest CPU per-core performance, scalability, and overall CPU throughput.

This work has been done in the context of the CERN databases and analytics services and the ATLAS data engineering efforts. Many thanks to my colleagues for their help and suggestions.

canali Thu, 05/04/2023 - 14:38

Tags

Performance

Testing

Oracle

Databases at CERN blog - Powering particle physics

ATLAS DCS Analysis with Apache Spark and Jupyter Notebooks

The Data Pipeline: From Storage to Analysis

Data Storage in Oracle Databases

Leveraging CERN’s Hadoop Service

The Role of Apache Spark

Extracting Data from Oracle using Apache Spark

Implementing Time Partitioning

Analysis Framework: A User-Friendly Approach

Apache Spark as the Core Processing Engine

Platform integration with Jupyter notebooks and Spark

Future Enhancements

Conclusion

Acknowledgements and Links


Kepler’s Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding

Jupyter Notebooks and AI-Assisted Coding: A Powerful Combination for Data Science

How Python and AI Tools Enhance This Project

Why Use Jupyter Notebooks and AI-Assisted Coding?

Overview of the Analysis

Generating Mars Ephemeris (generate the measurements of Mars' celestial positions)

Key Insight of Kepler's Analysis (understand how Earth-based observations reveal Mars’ trajectory)

Computing Mars' Orbit (calculate Mars orbit by triangulating Mars' position using all available observations)

Kepler’s Laws (verify Kepler’s three laws with real data)

Estimating Earth's Orbit (use Mars' ephemeris and line-of-sight data to determine Earth’s orbit)

Conclusion

References

Acknowledgements


CERN PGDay 2025 is here!

Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

Why a Spark Performance Lab

Tools and Components

Demos

How to Make the Best of Spark Metrics System

Lessons Learned and Conclusions

Resources


Enhancing Apache Spark and Parquet Efficiency: A Deep Dive into Column Indexes and Bloom Filters

Key Advantages of Parquet in Spark

Understanding Column Indexes and Bloom Filters in Parquet

Column Indexes: Enhancing Query Efficiency

Bloom Filters: A Leap in Filter Operations

Example: Spark using Parquet column indexes

Diagnostics and Internals of Column and Offset Indexes

Key Aspects of Column Indexes:

Complementary Role of Offset Indexes:

Additional Resources:

Tools to drill down on column index metadata in Parquet files

Bloom filters in Parquet

Write Parquet files with Bloom filters

Example: Checking I/O Performance in Parquet: With and Without Bloom Filters

Reading Parquet Bloom Filter Metadata with Apache Parquet Java API

Discovering Parquet Version

Checking the Parquet File Version:

Updating Parquet File Versions

Conclusions


Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

The Puzzle of the Slow Query

Cracking the Performance Case

The Clue: Profiling with Flame Graphs and Pyroscope

Spark Configuration

The Insight

Related work and links

Wrapping up


Performance Comparison of 5 JDKs on Apache Spark

On the load testing tool and instrumentation

JDK comparison tests

Test results and measurements

Performance data analysis

Active benchmarking and sanity checks

Related work

Conclusions

Acknowledgements


Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

Goals and Scope

Understanding Key Concepts