How to Build a Real-Time Streaming Pipeline with Spark Structured Streaming

In today’s data-driven world, real-time insights are a necessity. Whether it's monitoring financial transactions, tracking user behavior, or detecting fraud, businesses depend on fresh data flowing through streaming pipelines. One of the most powerful tools for this job is Apache Spark Structured Streaming.

In this post, we’ll walk through the core concepts and architecture, then build a real-time data streaming pipeline step by step using Spark Structured Streaming.

🚀 What is Spark Structured Streaming? 

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It lets you write streaming queries using familiar DataFrame and SQL APIs — and it handles the complexities of incremental data processing behind the scenes. 

Unlike traditional Spark Streaming (based on DStreams), Structured Streaming provides:

  • End-to-end exactly-once guarantees
  • Event-time processing
  • Support for windowing, watermarking, and stateful operations
  • Unified API for batch and stream processing (see the sketch below)
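
To make that last point concrete, here's a minimal sketch of the unified API (the /data/events path is just an illustrative placeholder, and the streaming read borrows the batch schema, since file-based streaming sources need an explicit schema). The batch and streaming versions of the same query differ only in read vs readStream:

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnifiedAPIDemo").getOrCreate()

# Batch: read a static directory of JSON files.
batch_df = spark.read.format("json").load("/data/events")

# Streaming: the same query shape, but new files are processed as they land.
stream_df = (spark.readStream
    .format("json")
    .schema(batch_df.schema)   # streaming file sources require an explicit schema
    .load("/data/events"))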

🧱 Core Architecture of Structured Streaming 

At its core, Spark Structured Streaming treats a stream as an unbounded table. New data keeps appending to this table, and your query processes it incrementally.

Key Concepts (each maps onto a line in the skeleton below):

  • Input Source: Kafka, File Stream, Socket, etc.
  • Query Logic: SQL or DataFrame transformations
  • Sink: Console, Kafka, Parquet, Delta, JDBC, etc.
  • Trigger: Controls the batch interval (fixed, continuous, or once)
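
To see how these pieces fit together, here's a minimal, self-contained sketch that uses Spark's built-in rate source (so it runs without Kafka): the source, the query logic, the sink, and the trigger each appear on their own line.

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("UnboundedTableDemo").getOrCreate()

# Input source: the built-in "rate" source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Query logic: ordinary DataFrame transformations, applied incrementally.
evens = stream.filter(col("value") % 2 == 0)

# Sink + trigger: print each micro-batch to the console every 10 seconds.
query = (evens.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start())

query.awaitTermination()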

🔧 Tools & Technologies We’ll Use 

To build a real-time pipeline, we'll use:

  • Apache Spark (3.x+)
  • Apache Kafka (as the message broker)
  • Spark Structured Streaming
  • Parquet or Delta Lake (as sink)
  • Optional: Apache Airflow or Docker

🛠️ Step-by-Step: Building a Real-Time Streaming Pipeline 

✅ Step 1: Start Kafka and Create a Topic 

If you’re using Docker, you can spin up Kafka quickly with a docker-compose.yml that defines the broker (plus ZooKeeper, if you're not running Kafka in KRaft mode):

docker-compose up -d

Create a topic:

kafka-topics.sh --create --topic user-events --bootstrap-server localhost:9092

✅ Step 2: Send Real-Time Messages to Kafka 

You can simulate event data with the console producer:

kafka-console-producer.sh --topic user-events --bootstrap-server localhost:9092

Paste messages like:

JSON

{"user_id": 101, "event": "click", "timestamp": "2025-03-30T10:05:00"}

✅ Step 3: Read from Kafka in Spark

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("RealTimePipeline").getOrCreate()

# Expected shape of the JSON payload on the topic.
schema = StructType() \
    .add("user_id", StringType()) \
    .add("event", StringType()) \
    .add("timestamp", TimestampType())

# Subscribe to the Kafka topic; each record arrives as binary key/value columns.
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "user-events") \
    .load()

# Decode the value bytes, parse the JSON, and flatten it into typed columns.
json_df = kafka_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

✅ Step 4: Process the Stream (Optional Windowing)

python

from pyspark.sql.functions import window

# Tolerate events arriving up to 5 minutes late, then count events
# per 10-minute window and event type.
windowed_df = json_df \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(window(col("timestamp"), "10 minutes"), col("event")) \
    .count()
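
Before wiring up a file sink, it can help to eyeball the aggregation with the console sink. This is a development-only sketch that reuses windowed_df from above; "update" mode prints counts as they change instead of waiting for the watermark to close each window:

python

# Print updated window counts to the console (development/debugging only).
debug_query = (windowed_df.writeStream
    .format("console")
    .outputMode("update")
    .option("truncate", "false")   # show the full window struct
    .start())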

✅ Step 5: Write to a Sink (Parquet)

python

# Append finalized windows as Parquet files; the checkpoint directory lets the
# query resume exactly where it left off after a restart.
query = windowed_df.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "/tmp/output") \
    .option("checkpointLocation", "/tmp/checkpoints") \
    .start()

query.awaitTermination()
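
By default, Structured Streaming kicks off the next micro-batch as soon as the previous one finishes. If you want to control the cadence, add a trigger to the same writer — a sketch, reusing windowed_df from above:

python

# Fixed-interval micro-batches: start a new batch every minute.
query = (windowed_df.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/tmp/output")
    .option("checkpointLocation", "/tmp/checkpoints")
    .trigger(processingTime="1 minute")
    .start())

# Alternative (Spark 3.3+): .trigger(availableNow=True) processes everything
# currently available and then stops — handy for scheduled, batch-style runs.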

🧪 Testing Your Pipeline 

Open your Kafka producer and start typing events. 

Simultaneously, watch the output folder (/tmp/output in our example) for Parquet files appearing in near real time. Keep in mind that with append mode and a watermark, a window's counts are only written once the watermark has passed the end of that window, so the first files can take a little while to appear.
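
You can also inspect the results from a separate batch session and poll the running query for progress — a small sketch, using the paths and the query handle from the earlier steps:

python

# Batch read of the streaming output (run this in another session or notebook cell).
spark.read.parquet("/tmp/output").orderBy("window").show(truncate=False)

# Lightweight monitoring hooks on the running StreamingQuery.
print(query.status)        # e.g. whether a trigger is active or waiting for data
print(query.lastProgress)  # rows processed, batch duration, current watermark, ...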

🧠 Best Practices for Production

  • Use Delta Lake: It offers ACID transactions, versioning, and efficient upserts (see the sketch after this list).
  • Add Monitoring: Use Spark UI, logs, and metrics to track performance.
  • Handle Late Data: Set watermarks that reflect how late your events can realistically arrive.
  • Secure Your Pipeline: Use SSL for Kafka, and consider fine-grained access controls.
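
As a rough sketch of the Delta Lake recommendation above (this assumes the delta-spark package is on the classpath and the session is configured with Delta's SQL extensions; the table path is illustrative):

python

# Same windowed aggregation as before, written to a Delta table instead of raw Parquet.
(windowed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints-delta")
    .start("/tmp/delta/user_event_counts"))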

💼 Real-World Use Cases

  • E-commerce: Real-time cart abandonment tracking
  • Banking: Fraud detection based on transaction patterns
  • IoT: Monitoring sensor streams and triggering alerts
  • Streaming Media: User engagement analytics

✅ Final Thoughts 

Structured Streaming brings the power of Spark’s distributed engine to real-time data — without sacrificing developer simplicity or reliability. With just a few lines of code, you can build robust streaming pipelines that scale with your business needs. 

Whether you're a data engineer, analyst, or ML practitioner, learning to build real-time streaming pipelines with Spark is a game-changing skill in 2025 and beyond.