pyspark

Why I’m Loving Spark 4’s Python Data Source (with Direct Arrow Batches)

TL;DR: Apache Spark 4 lets you build first-class data sources in pure Python. If your reader yields Arrow RecordBatch objects, Spark ingests them with reduced Python↔JVM serialization overhead. I used this to ship a ROOT data format reader for PySpark.

Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo

Topic: This post describes a data pipeline for a machine learning task of interest in high energy physics: building a particle cla

IPython/Jupyter SQL Magic Functions for PySpark

Topic: This post is about a simple implementation with examples of IPython custom magic functions for running SQL in Apache Spark usin

Integrating Hadoop and Elasticsearch – Part 2 – Writing to and Querying Elasticsearch from Apache Spark

Introduction

In part 2 of the 'Integrating Hadoop and Elasticsearch' blog post series, we look at bridging Apache Spark and Elasticsearch. I assume that you have access to Hadoop and Elasticsearch clusters and face the challenge of connecting these two distributed systems. Since Spark code can be written in Scala, Python, and Java, we cover the setup, configuration, and code snippets for all three languages, both in batch mode and interactively.
