How to Set Up Apache Spark on Windows, macOS, and Linux

Apache Spark is a powerful distributed computing framework for big data processing. To get started with Spark, you need to set up the environment correctly based on your operating system. In this guide, we’ll walk you through step-by-step instructions for installing and configuring Apache Spark on Windows, macOS, and Linux.

Prerequisites

Before installing Spark, make sure you have the following installed on your system:

  1. Java Development Kit (JDK 8 or later): required for running Spark
  2. Python (optional, for PySpark): needed if you plan to use Spark with Python
  3. Scala (optional, for Scala-based development): required for working with Spark in Scala

Check Java Installation

To check if Java is installed, run:

java -version

If Java is not installed, download and install it from Oracle JDK or OpenJDK.

Check Python Installation (For PySpark Users)

python --version

If Python is not installed, download it from Python’s official website.
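
Check Scala Installation (For Scala Users)

If you installed Scala for Scala-based development, you can verify it the same way:

scala -version

If it is not installed, you can get it from the official Scala website, or skip it if you only plan to use PySpark.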

Setting Up Apache Spark

1. Installing Spark on Windows

Step 1: Download Apache Spark

  • Visit the Apache Spark official website
  • Select a Spark release (the latest is recommended)
  • Choose the package type Pre-built for Apache Hadoop
  • Download the .tgz archive and extract it to a desired location (e.g., C:\spark); on Windows 10 and later you can extract it with the built-in tar command, or use a tool such as 7-Zip

Step 2: Set Environment Variables

  1. Open the Start menu, search for Environment Variables, and open it.
  2. Under System Variables, click New and add:
      • Variable name: SPARK_HOME
      • Variable value: C:\spark
  3. Edit the Path variable and add:
      • %SPARK_HOME%\bin
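
To confirm the variables are visible, open a new Command Prompt window and run:

echo %SPARK_HOME%

This should print C:\spark (or whichever folder you chose).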

Step 3: Verify Installation

Open Command Prompt (cmd) and run:

spark-shell

This should launch the Spark shell, indicating a successful installation.
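
If you installed Python for PySpark, you can also confirm that the Python shell starts:

pyspark

This should open an interactive session with a SparkSession already available as spark.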

2. Installing Spark on macOS

Step 1: Install Homebrew (If Not Installed)

Homebrew simplifies package installations on macOS. To install Homebrew, run:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Install Apache Spark

With Homebrew installed, install Spark by running:

brew install apache-spark
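
Homebrew installs Spark inside the formula's own directory rather than a standard location; you can check where with:

brew --prefix apache-spark

The Spark files themselves live in the libexec folder under that path, which is what SPARK_HOME should point to in the next step.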

Step 3: Set Up Environment Variables

Edit your shell configuration file (~/.zshrc or ~/.bash_profile):

echo "export SPARK_HOME=$(brew --prefix apache-spark)/libexec" >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Step 4: Verify Installation

Run:

spark-shell

If Spark starts successfully, your setup is complete.

3. Installing Spark on Linux (Ubuntu/Debian)

Step 1: Install Java and Python

Ensure Java and Python are installed:

sudo apt update
sudo apt install default-jdk python3
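
You can confirm both are available before continuing:

java -version
python3 --version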

Step 2: Download Apache Spark

Navigate to the Apache Spark Downloads Page and download the latest pre-built version for Hadoop, then extract it. Replace 3.x.x in the commands below with the actual version number you downloaded:

wget https://downloads.apache.org/spark/spark-3.x.x-bin-hadoop3.tgz
mkdir ~/spark
tar -xvzf spark-3.x.x-bin-hadoop3.tgz -C ~/spark --strip-components=1

Step 3: Set Environment Variables

Edit your shell profile (~/.bashrc or ~/.zshrc):

echo 'export SPARK_HOME=~/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
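
To check that the new PATH entry is picked up, run:

which spark-shell

This should print a path under ~/spark/bin.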

Step 4: Verify Installation

Run:

spark-shell

If you see the Spark welcome message, your installation is successful!

Running a Simple Spark Application

Once Spark is installed, let’s run a simple script.

For Scala (Spark Shell):

spark-shell

Then run:

val data = Seq("Hello", "Apache", "Spark")
val rdd = spark.sparkContext.parallelize(data)
rdd.collect().foreach(println)

For Python (PySpark):

pyspark

Then run:

data = ["Hello", "Apache", "Spark"]
rdd = spark.sparkContext.parallelize(data)
for word in rdd.collect():
    print(word)
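
On macOS and Linux, you can also try submitting one of the example applications that ships with the Spark distribution (the examples jar filename varies by Spark and Scala version, hence the wildcard):

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 10

If everything is configured correctly, the output includes an approximate value of Pi.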

Conclusion

Setting up Apache Spark on Windows, macOS, and Linux is straightforward with the right steps. Once installed, you can start experimenting with data processing, machine learning, and real-time analytics using Spark.

Next Steps:

  • Explore Spark SQL for querying structured data.
  • Learn about Spark Streaming for real-time analytics.
  • Use MLlib for machine learning in Spark.
📥 Want to master Apache Spark? Check out our courses at www.smartdatacamp.com! 🚀