How to Set Up Apache Spark on Windows, macOS, and Linux

Apache Spark is a powerful distributed computing framework for big data processing. To get started with Spark, you need to set up the environment correctly based on your operating system. In this guide, we’ll walk you through step-by-step instructions for installing and configuring Apache Spark on Windows, macOS, and Linux.

Prerequisites

Before installing Spark, make sure you have the following installed on your system:

  1. Java Development Kit (JDK 8 or later): required for running Spark
  2. Python (optional, for PySpark): needed if you plan to use Spark with Python
  3. Scala (optional, for Scala-based development): required for working with Spark in Scala

Check Java Installation

To check if Java is installed, run:

java -version

If Java is not installed, download and install it from Oracle JDK or OpenJDK.

Check Python Installation (For PySpark Users)

python --version

If Python is not installed, download it from Python’s official website.
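
Check Scala Installation (For Scala Users)

If you installed Scala for Scala-based development, you can verify it the same way:

scala -version

If it is not installed, you can get it from the official Scala website, or skip it if you only plan to use PySpark.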

Setting Up Apache Spark

1. Installing Spark on Windows

Step 1: Download Apache Spark

  • Visit the Apache Spark official website
  • Select a Spark release (the latest is recommended)
  • Choose the package type Pre-built for Apache Hadoop
  • Download the .tgz archive and extract it to a desired location (e.g., C:\spark); on Windows 10 and later you can extract it with the built-in tar command, or use a tool such as 7-Zip

Step 2: Set Environment Variables

  1. Open the Start menu, search for Environment Variables, and open it.
  2. Under System Variables, click New and add:
      • Variable name: SPARK_HOME
      • Variable value: C:\spark
  3. Edit the Path variable and add:
      • %SPARK_HOME%\bin
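
To confirm the variables are visible, open a new Command Prompt window and run:

echo %SPARK_HOME%

This should print C:\spark (or whichever folder you chose).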

Step 3: Verify Installation

Open Command Prompt (cmd) and run:

spark-shell

This should launch the Spark shell, indicating a successful installation.
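
If you installed Python for PySpark, you can also confirm that the Python shell starts:

pyspark

This should open an interactive session with a SparkSession already available as spark.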

2. Installing Spark on macOS

Step 1: Install Homebrew (If Not Installed)

Homebrew simplifies package installations on macOS. To install Homebrew, run:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Install Apache Spark

With Homebrew installed, install Spark by running:

brew install apache-spark
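
Homebrew installs Spark inside the formula's own directory rather than a standard location; you can check where with:

brew --prefix apache-spark

The Spark files themselves live in the libexec folder under that path, which is what SPARK_HOME should point to in the next step.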

Step 3: Set Up Environment Variables

Edit your shell configuration file (~/.zshrc or ~/.bash_profile):

echo "export SPARK_HOME=$(brew --prefix apache-spark)/libexec" >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Step 4: Verify Installation

Run:

spark-shell

If Spark starts successfully, your setup is complete.

3. Installing Spark on Linux (Ubuntu/Debian)

Step 1: Install Java and Python

Ensure Java and Python are installed:

sudo apt update
sudo apt install default-jdk python3
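
You can confirm both are available before continuing:

java -version
python3 --version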

Step 2: Download Apache Spark

Navigate to the Apache Spark Downloads Page and download the latest pre-built version for Hadoop, then extract it. Replace 3.x.x in the commands below with the actual version number you downloaded:

wget https://downloads.apache.org/spark/spark-3.x.x-bin-hadoop3.tgz
mkdir ~/spark
tar -xvzf spark-3.x.x-bin-hadoop3.tgz -C ~/spark --strip-components=1

Step 3: Set Environment Variables

Edit your shell profile (~/.bashrc or ~/.zshrc):

echo 'export SPARK_HOME=~/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
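
To check that the new PATH entry is picked up, run:

which spark-shell

This should print a path under ~/spark/bin.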

Step 4: Verify Installation

Run:

spark-shell

If you see the Spark welcome message, your installation is successful!

Running a Simple Spark Application

Once Spark is installed, let’s run a simple script.

For Scala (Spark Shell):

spark-shell

Then run:

val data = Seq("Hello", "Apache", "Spark")
val rdd = spark.sparkContext.parallelize(data)
rdd.collect().foreach(println)

For Python (PySpark):

pyspark

Then run:

data = ["Hello", "Apache", "Spark"]
rdd = spark.sparkContext.parallelize(data)
for word in rdd.collect():
    print(word)
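
On macOS and Linux, you can also try submitting one of the example applications that ships with the Spark distribution (the examples jar filename varies by Spark and Scala version, hence the wildcard):

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 10

If everything is configured correctly, the output includes an approximate value of Pi.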

Conclusion

Setting up Apache Spark on Windows, macOS, and Linux is straightforward with the right steps. Once installed, you can start experimenting with data processing, machine learning, and real-time analytics using Spark.

Next Steps:

  • Explore Spark SQL for querying structured data.
  • Learn about Spark Streaming for real-time analytics.
  • Use MLlib for machine learning in Spark.
📥 Want to master Apache Spark? Check out our courses at www.smartdatacamp.com! 🚀