How to Build a Real-Time Streaming Pipeline with Spark Structured Streaming

In today’s data-driven world, real-time insights are a necessity. Whether it's monitoring financial transactions, tracking user behavior, or detecting fraud, businesses depend on fresh data flowing through streaming pipelines. One of the most powerful tools for this job is Apache Spark Structured Streaming.

In this post, we’ll walk through the core concepts and architecture, then build a real-time data streaming pipeline step by step using Spark Structured Streaming.

🚀 What is Spark Structured Streaming? 

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It lets you write streaming queries using familiar DataFrame and SQL APIs — and it handles the complexities of incremental data processing behind the scenes. 

Unlike traditional Spark Streaming (based on DStreams), Structured Streaming provides:

  • End-to-end exactly-once guarantees
  • Event-time processing
  • Support for windowing, watermarking, and stateful operations
  • Unified API for batch and stream processing (see the sketch below)
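
To make that last point concrete, here's a minimal sketch of the unified API (the /data/events path is just an illustrative placeholder, and the streaming read borrows the batch schema, since file-based streaming sources need an explicit schema). The batch and streaming versions of the same query differ only in read vs readStream:

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnifiedAPIDemo").getOrCreate()

# Batch: read a static directory of JSON files.
batch_df = spark.read.format("json").load("/data/events")

# Streaming: the same query shape, but new files are processed as they land.
stream_df = (spark.readStream
    .format("json")
    .schema(batch_df.schema)   # streaming file sources require an explicit schema
    .load("/data/events"))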

🧱 Core Architecture of Structured Streaming 

At its core, Spark Structured Streaming treats a stream as an unbounded table. New data keeps appending to this table, and your query processes it incrementally.

Key Concepts (each maps onto a line in the skeleton below):

  • Input Source: Kafka, File Stream, Socket, etc.
  • Query Logic: SQL or DataFrame transformations
  • Sink: Console, Kafka, Parquet, Delta, JDBC, etc.
  • Trigger: Controls the batch interval (fixed, continuous, or once)
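
To see how these pieces fit together, here's a minimal, self-contained sketch that uses Spark's built-in rate source (so it runs without Kafka): the source, the query logic, the sink, and the trigger each appear on their own line.

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("UnboundedTableDemo").getOrCreate()

# Input source: the built-in "rate" source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Query logic: ordinary DataFrame transformations, applied incrementally.
evens = stream.filter(col("value") % 2 == 0)

# Sink + trigger: print each micro-batch to the console every 10 seconds.
query = (evens.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start())

query.awaitTermination()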

🔧 Tools & Technologies We’ll Use 

To build a real-time pipeline, we'll use:

  • Apache Spark (3.x+)
  • Apache Kafka (as the message broker)
  • Spark Structured Streaming
  • Parquet or Delta Lake (as sink)
  • Optional: Apache Airflow or Docker

🛠️ Step-by-Step: Building a Real-Time Streaming Pipeline 

✅ Step 1: Start Kafka and Create a Topic 

If you’re using Docker, you can spin up Kafka quickly with a docker-compose.yml that defines the broker (plus ZooKeeper, if you're not running Kafka in KRaft mode):

docker-compose up -d

Create a topic:

kafka-topics.sh --create --topic user-events --bootstrap-server localhost:9092

✅ Step 2: Send Real-Time Messages to Kafka 

You can simulate event data with the console producer:

kafka-console-producer.sh --topic user-events --bootstrap-server localhost:9092

Paste messages like:

JSON

{"user_id": 101, "event": "click", "timestamp": "2025-03-30T10:05:00"}

✅ Step 3: Read from Kafka in Spark

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("RealTimePipeline").getOrCreate()

# Expected shape of the JSON payload on the topic.
schema = StructType() \
    .add("user_id", StringType()) \
    .add("event", StringType()) \
    .add("timestamp", TimestampType())

# Subscribe to the Kafka topic; each record arrives as binary key/value columns.
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "user-events") \
    .load()

# Decode the value bytes, parse the JSON, and flatten it into typed columns.
json_df = kafka_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

✅ Step 4: Process the Stream (Optional Windowing)

python

from pyspark.sql.functions import window

# Tolerate events arriving up to 5 minutes late, then count events
# per 10-minute window and event type.
windowed_df = json_df \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(window(col("timestamp"), "10 minutes"), col("event")) \
    .count()
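
Before wiring up a file sink, it can help to eyeball the aggregation with the console sink. This is a development-only sketch that reuses windowed_df from above; "update" mode prints counts as they change instead of waiting for the watermark to close each window:

python

# Print updated window counts to the console (development/debugging only).
debug_query = (windowed_df.writeStream
    .format("console")
    .outputMode("update")
    .option("truncate", "false")   # show the full window struct
    .start())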

✅ Step 5: Write to a Sink (Parquet)

python

# Append finalized windows as Parquet files; the checkpoint directory lets the
# query resume exactly where it left off after a restart.
query = windowed_df.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "/tmp/output") \
    .option("checkpointLocation", "/tmp/checkpoints") \
    .start()

query.awaitTermination()
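
By default, Structured Streaming kicks off the next micro-batch as soon as the previous one finishes. If you want to control the cadence, add a trigger to the same writer — a sketch, reusing windowed_df from above:

python

# Fixed-interval micro-batches: start a new batch every minute.
query = (windowed_df.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/tmp/output")
    .option("checkpointLocation", "/tmp/checkpoints")
    .trigger(processingTime="1 minute")
    .start())

# Alternative (Spark 3.3+): .trigger(availableNow=True) processes everything
# currently available and then stops — handy for scheduled, batch-style runs.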

🧪 Testing Your Pipeline 

Open your Kafka producer and start typing events. 

Simultaneously, watch the output folder (/tmp/output in our example) for Parquet files appearing in near real time. Keep in mind that with append mode and a watermark, a window's counts are only written once the watermark has passed the end of that window, so the first files can take a little while to appear.
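
You can also inspect the results from a separate batch session and poll the running query for progress — a small sketch, using the paths and the query handle from the earlier steps:

python

# Batch read of the streaming output (run this in another session or notebook cell).
spark.read.parquet("/tmp/output").orderBy("window").show(truncate=False)

# Lightweight monitoring hooks on the running StreamingQuery.
print(query.status)        # e.g. whether a trigger is active or waiting for data
print(query.lastProgress)  # rows processed, batch duration, current watermark, ...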

🧠 Best Practices for Production

  • Use Delta Lake: It offers ACID transactions, versioning, and efficient upserts (see the sketch after this list).
  • Add Monitoring: Use Spark UI, logs, and metrics to track performance.
  • Handle Late Data: Set watermarks that reflect how late your events can realistically arrive.
  • Secure Your Pipeline: Use SSL for Kafka, and consider fine-grained access controls.
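
As a rough sketch of the Delta Lake recommendation above (this assumes the delta-spark package is on the classpath and the session is configured with Delta's SQL extensions; the table path is illustrative):

python

# Same windowed aggregation as before, written to a Delta table instead of raw Parquet.
(windowed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints-delta")
    .start("/tmp/delta/user_event_counts"))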

💼 Real-World Use Cases

  • E-commerce: Real-time cart abandonment tracking
  • Banking: Fraud detection based on transaction patterns
  • IoT: Monitoring sensor streams and triggering alerts
  • Streaming Media: User engagement analytics

✅ Final Thoughts 

Structured Streaming brings the power of Spark’s distributed engine to real-time data — without sacrificing developer simplicity or reliability. With just a few lines of code, you can build robust streaming pipelines that scale with your business needs. 

Whether you're a data engineer, analyst, or ML practitioner, learning to build real-time streaming pipelines with Spark is a game-changing skill in 2025 and beyond.