Blog

Blog

Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

Submitted by canali on

Apache Spark is renowned for its speed and efficiency in handling large-scale data processing. However, optimizing Spark to achieve maximum performance requires a precise understanding of its inner workings. This blog post will guide you through establishing a Spark Performance Lab with essential tools and techniques aimed at enhancing Spark performance through detailed metrics analysis.

Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

Submitted by canali on

TL;DR Explore a step-by-step example of troubleshooting Apache Spark job performance using flame graph visualization and profiling. Discover the seamless integration of Grafana Pyroscope with Spark for streamlined data collection and visualization.

 

Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

Submitted by canali on

This blog post is about building a getting-started example for semantic search using vector databases and large language models (LLMs), an example of retrieval augmented generation (RAG) architecture. You can find the accompanying notebook at this link. See also the SWAN gallery.

Exploratory Notebooks for Deep Learning, AI, and Data Tools: A Beginner's Guide

Submitted by canali on

Are you looking at some resources to get you up to speed with popular Deep Learning and Data processing frameworks? This blog entry provides a curated collection of notebooks that will help you kickstart your journey.

You can find the notebooks at this link. See also the SWAN gallery.

Can High Energy Physics Analysis Profit from Apache Spark APIs?

Submitted by canali on

We are in a golden age for distributed data processing, with an abundance of tools and solutions emerging from industry and open source. High Energy Physics (HEP) experiments at the LHC stand to profit from all this progress, as they are data-intensive operations with several hundreds of Petabytes of data to collect and process.

Distributed application cache for Kubernetes running Java Hibernate applications with Oracle Coherence Community Edition

Submitted by vkozlovs on

While working on a data set it is important that it stays easily and quickly accessible. Hibernate second-level caching with Coherence offers applications a resource optimized solution that keeps frequently used data in memory, by distributing it among different application instances, or sharing it with one or more dedicated cache machines. This article describes the knowledge that we gained through using the Oracle Coherence Community Edition for Hibernate second-level caching and gives a general overview of how this product can be used with Java applications running on Kubernetes.

Author: Viktor Kozlovszky

ORDS - Managing APEX static images

Submitted by jgraniec on

In today’s post, we’ll be talking about the possible ways to manage the static images/CSS/JS that come shipped with APEX, when running on ORDS. They are separate resources (not contained in the DB like some other APEX images) necessary for your APEX applications look and behave the way they’re intended to. If you and your users browse your internet using lynx (see image below) feel free to skip this one. Otherwise - dig in! 

Creating PDFs in APEX after ORDS 19.1

Submitted by jgraniec on

Creating PDFs in APEX after ORDS 19.1

Until 19.1 ORDS provided a built-in printing engine based on Apache FOP which allowed you to download a PDF version of your reports and XLS-FO templates in a very easy manner. However in ORDS 18.4.0 release notes we could find information that this feature is deprecated and will be removed in future release. This is exactly what happened with the release of ORDS 19.2.

So what actually happened? 

This is from Oracle’s release notes of ORDS 19.2:

Distributed Deep Learning for Physics with TensorFlow and Kubernetes

Submitted by canali on

Summary: This post details a solution for distributed deep learning training for a High Energy Physics use case, deployed using cloud resources and Kubernetes. You will find the results for training using CPU and GPU nodes. This post also describes an experimental tool that we developed, TF-Spawner, and how we used it to run distributed TensorFlow on a Kubernetes cluster.

 

Benefits of a multi-layer system

Submitted by vkozlovs on

Introduction

Designing a multi-layer system is not rocket science, the difficulty can lie in selecting the right technologies. The main concept behind the design is to have better control and fine tuning of the components. This blog post will discuss the benefits & limitations of implementing this type of design and our practical experience gained from using it for the Open Days reservation system, which helped to welcome 75.000 people on our site and was hosted on the Oracle cloud using their cloud services.

Disclaimer

The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.

CERN Social Media Guidelines

 

Blogroll