Preview Apache Spark Interview Question and Answer (100 FAQ)

Spark Interview Question Set 1

Introduction

How to add an index column to a Spark DataFrame?

What are the differences between Apache Spark and Apache Storm?

How to limit the number of retries on Spark job failure in YARN?

Is there any way to get the Spark application ID while running a job?

How to stop a running Spark application?

In Spark standalone mode, how to compress Spark output written to HDFS?

Is there any way to get the current number of partitions of a DataFrame?

How to get good performance with Spark?

Why does a job fail with “No space left on device”, but df says otherwise?

Where are logs in Spark on YARN? How to view those logs?

Spark Interview Question Set 2

How to prevent Spark Executors from getting Lost when using YARN client mode?

In which situations would you use client mode versus cluster mode?

How to print the contents of RDD?

What is the difference between Apache Spark and Apache Flink?

How to remove the parentheses from output?

What are possible reasons for receiving TimeoutException: [n seconds]?

How to open/stream .zip files through Spark?

How to read multiline JSON in Apache Spark?

How to replace NULL values in a Spark DataFrame?

How does Spark partitioning work on files in HDFS?

Scenario Based Question (Memory Management)

Scenario Based Question (Cache)

Scenario Based Question (Cluster)

Scenario Based Question (Recovery)

Let’s say you have a large 100 GB table and a small 1 GB table. How do you join them?

Spark Interview Question Set 3

How to read an AWS S3 file in Spark?

How to find the moving average of a time series using Apache Spark?

How to change column types in a Spark SQL DataFrame?

I have a big RDD (1 GB) in a YARN cluster and can't use collect(). How to handle this?

Is there any way for Spark to create primary keys?

How to add a constant column in a Spark DataFrame?

What does Stage Skipped mean in Apache Spark web UI?

How to concatenate columns in an Apache Spark DataFrame?

Processing a CSV file produces multiple output files; how to get a single file?

Explain sortByKey() operation.

Spark Interview Question Set 4

List the advantages of Parquet files in Apache Spark.

Do you need to install Spark on all nodes of a YARN cluster while running Spark

What is PageRank?

What does MLlib do?

What is GraphX?

What do you understand by receivers in Spark Streaming?

Name some companies that are already using Spark Streaming.

Name some sources from which the Spark Streaming component can process real-time data.

What are the key features of Apache Spark that you like?

What are the various data sources available in Spark SQL?

Spark Interview Question Set 5

What is the difference between map and flatMap and a good use case for each?

How to read multiple text files into a single RDD?

Does Spark SQL support subqueries?

Have you ever encountered Spark java.lang.OutOfMemoryError? How to fix this issue?

How do I skip a header from CSV files in Spark?

What happens to RDD when one of the nodes on which it is distributed goes down?

How to improve performance for data that we want to use again and again?

How does the Spark Streaming API work?

What is a write-ahead log (journaling)?

What are the advantages of DataFrame?

Spark Interview Question Set 6

What are DataFrames?

What is the Spark Driver?

What are the benefits of Spark over MapReduce?

What does a Spark Engine do?

Explain the difference between Spark SQL and Hive.

What are the various levels of persistence in Apache Spark?

Which one will you choose for a project: Hadoop MapReduce or Apache Spark?

What is a DStream?

What is the significance of the sliding window operation?

How can you minimize data transfers when working with Spark?

Spark Interview Question Set 7

Is it possible to run Apache Spark on Apache Mesos?

Can you use Spark to access and analyse data stored in Cassandra databases?

Explain transformations and actions in the context of RDDs.

What is Apache Spark Streaming?

How can you define Spark Accumulators?

What is a Broadcast Variable?

What is Data locality / placement?

Which cluster managers can be used with Spark?

What is speculative execution of tasks?

What is a stage, with regard to Spark job execution?

Spark Interview Question Set 8

What is DAGScheduler and how does it perform?

Please define executors in detail.

Please explain how workers work when a new job is submitted to them.

What are the workers?

Define Spark architecture?

What is checkpointing?

What is the difference between groupByKey and reduceByKey?

What is Shuffling?

What is the difference between the cache() and persist() methods of an RDD?

What is the coalesce transformation?

Spark Interview Question Set 9

Data is spread across all the nodes of a cluster; how does Spark try to process this data?

How would you control the number of partitions of an RDD?

What does a lazily evaluated RDD mean?

How do you define RDD?

How do you evaluate your Spark application?

How do you disable INFO messages when running a Spark application?

What is the advantage of broadcasting values across a Spark cluster?

Is it possible to have multiple SparkContexts in a single JVM?

What is the default level of parallelism in Spark?

What are the ways to configure Spark properties, and how are they ordered?

Spark Interview Question Set 10

Which kinds of data processing are supported by Spark?

Why is Spark good at low-latency iterative workloads?

We understand Spark Streaming uses micro-batching. Does this increase latency?

Does Spark require modified versions of Scala or Python?

Do I need Hadoop to run Spark?

How can I run Spark on a cluster?

Does my data need to fit in memory to use Spark?

How large a cluster can Spark scale to?

How does Spark relate to Apache Hadoop?

Who is using Spark in production?

What are mount points in Databricks, and why do you use them?

What is the difference between partitioning and bucketing in Apache Spark?

Do you know the top five secrets of performance tuning Apache Spark?
