Blog

Unlocking Apache Spark Performance: Three Open-Source Tools We Use at CERN

Apache Spark is incredibly powerful, but anyone who has worked with it long enough knows the feeling:

Why is this job suddenly slower today?
Why are executors running out of memory?
Why is one stage taking 90% of the runtime?
What exactly is Spark doing behind the scenes?

CERN PGDay 2026 is here!

After a successful first edition in 2025, CERN PGDay returns in 2026 as a regular gathering for PostgreSQL users and enthusiasts in Suisse Romande (western Switzerland). Co-organized by CERN and SwissPUG, the event offers a chance to connect, share ideas, and exchange experiences in the vibrant Geneva region — home to many international organizations across the public, private, and scientific sectors.


Why I’m Loving Spark 4’s Python Data Source (with Direct Arrow Batches)

TL;DR: Apache Spark 4 lets you build first-class data sources in pure Python. If your reader yields Arrow RecordBatch objects, Spark ingests them with reduced Python↔JVM serialization overhead. I used this to ship a ROOT data format reader for PySpark.

Troubleshoot I/O & Wait Latency with OraLatencyMap and PyLatencyMap

I recently chased an Oracle

ATLAS DCS Analysis with Apache Spark and Jupyter Notebooks

Kepler’s Mars Orbit Analysis with Python Notebooks & AI-Assisted Coding

Johannes Kepler’s analysis of Mars’ orbit stands as one of the greatest achievements in scientific history, revealing the elliptical nature of planetary paths and establishing the foundational laws of planetary motion. In this post, you will explore how you can recreate Kepler’s revolutionary findings using Python’s robust data science ecosystem.

CERN PGDay 2025 is here!

CERN PGDay 2025 builds on the experience of past PostgreSQL events at CERN and a newly established collaboration with SwissPUG. It will create an opportunity for PostgreSQL users and enthusiasts to meet in the French-speaking part of Switzerland in order to network and exchange their experiences.


Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

Apache Spark is renowned for its speed and efficiency in handling large-scale data processing. However, optimizing Spark to achieve maximum performance requires a precise understanding of its inner workings. This blog post will guide you through establishing a Spark Performance Lab with essential tools and techniques aimed at enhancing Spark performance through detailed metrics analysis.

Enhancing Apache Spark and Parquet Efficiency: A Deep Dive into Column Indexes and Bloom Filters

In the ever-evolving landscape of big data, Apache Spark and Apache Parquet continue to introduce game-changing features.

Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.

Performance Comparison of 5 JDKs on Apache Spark

Dive into a comprehensive load-testing exploration using Apache Spark with CPU-intensive workloads.

Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

This blog post walks through a getting-started example of semantic search using vector databases and large language models (LLMs), an instance of the retrieval-augmented generation (RAG) architecture. You can find the accompanying notebook at this link. See also the SWAN gallery.

Exploratory Notebooks for Deep Learning, AI, and Data Tools: A Beginner's Guide

Are you looking for resources to get up to speed with popular deep learning and data processing frameworks? This blog entry provides a curated collection of notebooks that will help you kickstart your journey.

You can find the notebooks at this link. See also the SWAN gallery.

CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers

This document describes some basic CPU load testing

Introduction to Apache Spark APIs for Data Processing

Making histograms with Apache Spark and other SQL engines


Can High Energy Physics Analysis Profit from Apache Spark APIs?

We are in a golden age for distributed data processing, with an abundance of tools and solutions emerging from industry and open source. High Energy Physics (HEP) experiments at the LHC stand to profit from all this progress, as they are data-intensive operations with several hundreds of Petabytes of data to collect and process.

Distributed application cache for Kubernetes running Java Hibernate applications with Oracle Coherence Community Edition

While you work on a data set, it is important that it stays easily and quickly accessible. Hibernate second-level caching with Coherence offers applications a resource-optimized solution that keeps frequently used data in memory, either by distributing it among different application instances or by sharing it with one or more dedicated cache machines. This article describes what we learned from using Oracle Coherence Community Edition for Hibernate second-level caching and gives a general overview of how this product can be used with Java applications running on Kubernetes.

Author: Viktor Kozlovszky


Integrating ORDS with 3rd-party SSO

In today’s post I will describe the process of integrating the OIDC implicit flow with ORDS running on Tomcat against a Keycloak service. It may sound complicated, but we’ll break it down into individual components so we know what we’re talking about.