Introduction to Apache Spark APIs for Data Processing

Welcome to the website of the course on Apache Spark by CERN IT. The course is self-paced and open, and it gives a short introduction to the architecture and key abstractions used by Spark. Theory and demos cover the main Spark APIs: the DataFrame API, Spark SQL, Streaming, and Machine Learning. You will also learn how to deploy Spark on CERN computing resources, notably using the CERN SWAN service. Most tutorials and exercises are in Python and run on Jupyter notebooks.

Apache Spark is a popular engine for data processing at scale. Spark provides an expressive API and a scalable engine that integrates well with the Hadoop ecosystem as well as with cloud resources. Spark is currently used by several projects at CERN, notably by IT monitoring, the security team, the BE NXCALS project, and teams in ATLAS and CMS. Moreover, Spark is integrated with the CERN Hadoop service, the CERN Cloud service, and the CERN SWAN web notebooks service.

 

Accompanying notebooks


Course lectures and tutorials

 

Acknowledgements and feedback

Author and contact for feedback and questions: Luca Canali - Luca.Canali@cern.ch

CERN-IT Spark and data analytics services

Former contributors: Riccardo Castellotti, Prasanth Kothuri

Many thanks to CERN Technical Training for their collaboration and support

License: CC BY-SA 4.0

Published in November 2022

Disclaimer

The views expressed in this blog are those of the authors and cannot be regarded as representing CERN’s official position.
